As a matter of general principle, I’m leery of Google’s ongoing efforts to make itself an arbiter of content, particularly since its search engine has become utterly unusable for anything more complicated than finding the location of a store or doing price comparisons. I now regularly get first-page results where more than eight of the items don’t match my search parameters. So I’ve abandoned Google.
Google has teamed up with ProPublica, apparently to give the initiative a veneer of legitimacy, to develop a “hate crimes” database. As reported last week in TechCrunch:
In partnership with ProPublica, Google News Lab is launching a new tool to track hate crimes across America. Powered by machine learning, the Documenting Hate News Index will track reported hate crimes across all 50 states, collecting data from February 2017 onward.
“Powered by machine learning” is meant to give the sheen of tech gee-whizzery and overawe skeptics. But machine learning is not magical. Amusingly, even the normally sober Wikipedia entry bothers to point out machine learning is overhyped:
As of 2016, machine learning is a buzzword, and according to the Gartner hype cycle of 2016, at its peak of inflated expectations.[13] Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, machine-learning programs often fail to deliver.
The program is given sample inputs (a training set), from which it estimates its parameters, and it then applies those learned parameters to new data.
The problem with machine learning, or any type of AI, is that it becomes a black box: the system derives additional decision parameters beyond those in its original training, and they are not inspectable. So the logic behind any particular decision cannot be made explicit.
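To make that concrete, here is a minimal, purely illustrative sketch in Python using scikit-learn; it bears no relation to the actual Google/ProPublica pipeline, whose internals are not public. A toy classifier is trained on a handful of hand-labelled articles, and its "decision" about any new article then rests entirely on weights learned from that training set, with no explicit logic anyone can point to:

```python
# Minimal sketch (not the Google/ProPublica pipeline): a toy text classifier trained
# to flag "hate crime" reports. The point is that the decision logic lives in learned
# weights, not in rules anyone wrote down.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-labelled training examples (the "training set" referred to above).
texts = [
    "Vandals spray-painted slurs on the synagogue wall",
    "Police investigate assault described by victim as racially motivated",
    "City council debates new parking regulations",
    "Local bakery wins award for best sourdough",
]
labels = [1, 1, 0, 0]  # 1 = reported hate crime, 0 = unrelated news

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)  # parameters are estimated from the sample inputs

# The classifier will now score any new article...
print(model.predict_proba(["Man charged after threatening worshippers"])[0])

# ...but the "reasoning" is just a weight per vocabulary term, learned from whatever
# the training data happened to contain. Nothing explains *why* an article was flagged.
print(len(model.named_steps["logisticregression"].coef_[0]), "learned weights")
```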
Our Clive explains the inherent impediments to establishing sound training parameters:
The entire premise is also bogus (from the brief and not especially coherent information given in the piece). This is because it completely glosses over the fundamental but very knotty problem of source data credibility. How do you establish traceability from the source dataset(s) to the output reporting?
If, say, for the sake of example, some conclusion was drawn based on treating Naked Capitalism as a trusted and reliable source, then I would be satisfied about the validity of the information. But as we’ve seen first-hand with Prop or Not and its self-substantiated assertion to be a resource for deciding what is real vs. fake news or independent vs. influenced reporting, ultimately this assessment of source reliability is a judgement call. Who, exactly, is making those judgements? On what basis? By what rule set?
During the Big Data fad of 2012, my TBTF attempted to get a grasp on the mushrooming proliferation of disconnected, non-standardised data it had at its disposal. We spent a great deal of time and money attempting to implement ASG Technologies Metadata Repository platform. The results were a disaster. Different source data was of different, highly variable quality and reliability. So we had to try to define rules as to what weighting we could assign to what data sources. We needed rules, because otherwise it was a purely subjective decision either by an individual or, worse, a group. There was absolutely no consistency.
But no-one could come up with an agreed set of data-quality and source-weighting rules. Even though some organisation-wide rules were defined, they were frequently ignored because, like a lot of committee decisions which attempt to find a consensus, they suffered from the “not invented here” phenomenon. So ProPublica too will have to pick its poison: either a decisive but autocratic evaluation of what counts as good, trustworthy source data (and thus susceptible to individual biases), or a mushy, broad-church, inclusive approach which will tell you nothing, since it gives the same weight to Reddit and ZeroHedge as sources as it does to the London Review of Books.
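For what it is worth, the kind of “rule set” Clive is describing can be sketched in a few lines. Everything below (the source names, the weights, the scoring scheme) is hypothetical, but it shows why the weighting decision, however it is made, ends up driving the conclusions:

```python
# Illustrative sketch only: what a "source weighting rule set" might look like, and why
# the choice of weights, ultimately a judgement call, determines the output.
from collections import defaultdict

# Who decided these numbers? On what basis? By what rule set?
source_weights = {
    "nytimes.com":   0.9,
    "lrb.co.uk":     0.8,
    "reddit.com":    0.3,
    "zerohedge.com": 0.2,
}

reports = [
    {"source": "reddit.com",  "incident": "A"},
    {"source": "reddit.com",  "incident": "A"},
    {"source": "nytimes.com", "incident": "B"},
]

def score(reports, weights, default=0.5):
    """Aggregate evidence per incident. The default weight for unknown sources is
    itself a policy decision someone has to make."""
    totals = defaultdict(float)
    for r in reports:
        totals[r["incident"]] += weights.get(r["source"], default)
    return dict(totals)

# With the weights above, incident B (one NYT report) outranks incident A (two Reddit
# posts); flatten the weights to 1.0 across the board and the ranking reverses.
print(score(reports, source_weights))
print(score(reports, {k: 1.0 for k in source_weights}))
```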
Another reader, who has set up and operated ginormous, complex databases, took a look at the project and didn’t like what he saw:
A quick check shows their analysis isn’t working. Click “last week” as a filter. A large number of the entries listed as people are actually types of people. Greg Taylor, second on the list, is an Indiana politician arguing it’s past time the state created a hate crime law. Jerome Vanghaluwe is the man the thugs are accusing of being the real driver: he is the original owner of the car and was nowhere near Charlottesville. Jerome is planning to sue.
I don’t know which Google AI engine they’re using, but it isn’t working well for even simple identification. Types of people (police, Orthodox Jews) are co-mingled with real, named individuals. There’s also a sourcing problem, which might be how the falsely accused Jerome managed to crawl towards the top. Are they going to weigh the NYT and Infowars equally? Or are they going to entirely exclude the former, for copyright reasons, and the latter, because it’s fiction (even though a great many people believe it)?
I can see why Google and FB would want to study and sort news, but this project is too young to have left the lab. Open sourcing the code is a good idea: the process, when it works, makes for stronger code. Running it and reporting results when it’s not fully baked, knowing people will rely on those results, is irresponsible.
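The mis-identification this reader describes (named individuals and categories of people dumped into one list) is easy to reproduce with any off-the-shelf entity recogniser. Here is a minimal sketch using spaCy; we have no information that Google’s engine works this way, so this is purely an analogy for the failure mode:

```python
# Purely illustrative: reproducing the "people vs. types of people" confusion with
# spaCy's stock English model (not the engine Google actually uses, which is unknown).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Greg Taylor said police and Orthodox Jews were being targeted; "
        "Jerome Vanghaluwe was nowhere near Charlottesville.")

for ent in nlp(text).ents:
    print(ent.text, ent.label_)

# Named individuals typically come back labelled PERSON, while "Orthodox Jews" comes
# back as NORP (a nationality/religious/political group) and "police" may not be
# tagged at all. If an indexing pipeline then pours every extracted entity into a
# single "people" facet without checking the label, named individuals and categories
# of people get co-mingled, which is exactly the behaviour described above.
```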
As this source pointed out later, the results were skewed: “this week” reflected only one day of data. It appears they are not feeding the system more data beyond the initial recordset. And that dataset seemed far too small to do the job: 4,000 news reports that only vaguely mentioned hate crimes and, on top of that, were not well tuned.
Frankly, this is bizarre, and it is even more bizarre to expose what is meant to be a major project while it is so obviously delivering terrible results. It looks as if Google, and perhaps even more so ProPublica, wanted to get in front of the “hate crimes/speech” bandwagon after Charlottesville to head off other potential competitors for funding and journalists’ attention.