The Haystack Fallacy – or Why Big Data Provides Little Security


Ann Rudinow Saetnan

Arguments for Europe's Data Retention Directive and revelations from Edward Snowden indicate something of a Big Data fetish on the part of security forces. This urge for Big Data solutions is also being nurtured within the fields of marketing and of computer science. Examples from the analysis of search and transaction data ("Google Flu Trends" and the "Target pregnancy diagnosis") have entered the canon of university textbooks, where they play a role reminiscent of origin myths. And Chris Anderson's 2008 article in Wired ("The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired 16.07) - somewhat ironically - serves as a theoretical justification for developing new algorithmic tools for pattern recognition and applying these far beyond the field of marketing, for instance in the natural sciences.
Anderson offers four arguments as to why Big Data - new statistical techniques for analyzing the vast amounts of transaction and location data gleaned from the myriad sites where we now communicate and act electronically - have (or should have) rendered established scientific methods in statistics obsolete:
- First, according to Anderson, N now equals, or will soon equal, all. It is becoming impossible to live one's life "off the grid". Thanks to smart technologies at home and in public spaces, more and more of what we do leaves electronic traces that are collected by equipment and/or service providers according to the user agreements we must sign in order to obtain those goods and/or services. According to Anderson, the sum total of these data potentially tells all one might wish to know about all of us.
- Second, and as a corollary to the first argument, analysis of data mined from this mass is no longer a sample. Thus the statistical tests designed to estimate sampling error are no longer relevant.
- Third, since all is now known and now included in the database, there is no longer any need for causal or predictive modeling. A pattern observed is a pattern in reality, not a hypothesis about a pattern in some microcosm that might, or might not, prove generalizable to a larger context.
- Fourth, Anderson claims that correlation within the mass of meta-data is all we need to know. What we need - all we need - is algorithmic robots that can scan the vast mass of locational and transactional data for correlational patterns. Content data is not necessary. Pattern says all we need to know about content, not only for advertising but also for policing, disease tracking, science ...

This is, recognizably and arguably, the ethos underlying the massive tapping of data flows by security organizations such as the NSA and GCHQ. Likewise it is, or was, the vision underlying the European Data Retention Directive. We are told that security forces need to collect all our metadata if they are to keep us safe from terrorism, from organized crime, from child abuse and child pornography, etc. And computational service providers seem eager to promise that, with enough data and once they develop the right algorithms, they will be able to predict and prevent crimes.
This paper will, in simple terms, debunk the Big Data arguments one by one. Even though crunching through masses of transaction data for purchasing patterns and using these as a basis for marketing may improve profit margins, not one of the arguments for the obsolescence of scientific method holds true. Nor do the origin stories hold up under critical examination.

Origin story 1, Google Flu Trends: Yes, Google did manage, in one year of tracking and predicting flu cases based on numbers of internet searches using terms such as "cough" and "fever", to declare and estimate the scope of a flu epidemic some time before the Centers for Disease Control reached similar conclusions based on diagnosed cases. The explanation offered is that people now tend to search the web for remedies and diagnostic help before going to see their doctor. However, that first year's success has not been repeated. In subsequent years, Google has consistently got its flu case estimates wrong by substantial margins.

Origin story 2, Target pregnancy diagnoses: This is the story of the man who came to Target to complain that they were targeting his teenage daughter with baby ads ... only to come back some days later to apologize: Target had got it right. His daughter was pregnant. This features as a success story in computer science textbooks. But we are not told how many such ads were sent out. Target may not even know how many of those were tossed away as irrelevant. Even when Target "got it right", the arrival of special ads for baby photos, nursery furniture, diapers etc. was sometimes unwelcome. Even for a healthy, happy pregnancy, Target claiming to know of it before the woman told friends and family could easily have felt creepy. Target has since taken to disguising targeted ads by including them in generic booklets of coupons and specials. But this part of the story is not told in the textbooks. The origin story in the textbooks is a fairy tale of success, not a "true grit" story of questionable ethics and mixed results.

Theorem 1, N = all: No, N is not all. N may be most, but those missing are missing for systematic reasons that introduce bias into the database.

Theorem 2, not a sample: Yes, it is. Even if N = all for the time period covered by the database, that database is still a sample from an endless time continuum. There is good reason for caution in using some statistical tests: the larger the database, the easier it is to achieve p < 0.001, even for patterns so subtle as to be meaningless, or for fleeting correlations gone like a mist in the next cycle of time. But while traditional statistical tests are not a measure of truth, they are a measure of the robustness (or lack thereof) of a finding.

Theorem 3, causal models not necessary: Yes, they are. If we are using the analysis to advise on policy, we must accept that policy exists in a causal world. All rational action presumes causality, in that we act in order to achieve an effect. If we are seeking out patterns as evidence of crime, we must accept that causality and intentionality are key legal principles.

Theorem 4, correlation is sufficient: That would be convenient, but the fact is that the larger the database, the more likely we are to find spurious correlations. The paper presentation will demonstrate this using a website devoted to collecting them.

The paper then goes on to demonstrate why, even with improbably precise algorithms, applying this pattern-recognition method to security issues creates dangerous and anti-democratic situations without providing the promised level of security. This part of the paper is based on a simple format for calculating the predictive value of positive and negative findings. What the calculation demonstrates is that the predictive value of an algorithm's results depends less on the algorithm's precision than on the size of the haystack relative to the number of needles it contains. Because of the vastly larger amount of hay relative to needles in the stack, even a mere 1% error in sensitivity (i.e. 99% of needles found, only 1% missed) and in specificity (i.e. 99% of straw correctly left aside, only 1% misidentified as needles) may well leave you with less than 1% needles in your needle-identified stack.

In conclusion, this result is discussed in terms of the consequences of misidentification in different contexts. In an advertising context, the consequences may be trivial. In a security context, both false negatives and false positives may be intolerable.

Keywords: Big Data, security, statistics, databases.
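The haystack arithmetic can be sketched as a short calculation via Bayes' rule. The figures below (99% sensitivity, 99% specificity, and 100 "needles" in a population of one million) are illustrative assumptions, not numbers taken from the paper:

```python
# A minimal sketch of the haystack calculation, with assumed
# illustrative numbers: a screening algorithm with 99% sensitivity
# and 99% specificity, applied to a population of 1,000,000
# containing only 100 true "needles".

def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive flags that are true positives (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

population = 1_000_000
needles = 100
prevalence = needles / population  # 0.0001

ppv = positive_predictive_value(0.99, 0.99, prevalence)
print(f"Predictive value of a positive flag: {ppv:.2%}")
# Roughly 0.98% -- fewer than 1 in 100 flagged individuals is a needle,
# while about 10,000 innocents are misidentified.
```

Note that improving the algorithm barely helps: it is the prevalence term, the ratio of needles to hay, that dominates the result.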