Common Failings of Big Data Analysis




There are numerous examples of how Big Data can be used to forecast the public's reaction when it comes to box office receipts, sales of consumer goods and the outcome of certain events such as American Idol. However, even in the case of predicting something as ridiculously inane as American Idol, there are qualifications that need to be made about the use of the data collected. "As many authors have pointed out, there are several challenges one must face when dealing with data of this nature: intrinsic biases, uneven sampling across location of interest etc." (American Idol).

While the experiment on American Idol is largely viewed as a success, it concludes only that the open source data available on the web can be used to make educated guesses about the outcome of societal events. Surely an educated guess is nothing to get excited about.

This section of the paper attempts to bring to light the failings in the analysis of data sourced from social media such as Twitter, or from the terms used in Google searches. We focus on three distinct areas in which these sources of information have been used to try to predict the future outcome of some event. These areas are:

  • Elections
  • Flu Trends
  • Stock Market trends


Shortly after the 2010 US general elections, flamboyant statements made the news media headlines, from those arguing that Twitter is not a reliable predictor to those claiming the opposite (How not to predict elections).

It has been claimed that Twitter can predict the outcome of elections with great accuracy. Given the significant differences in demographics between likely voters and users of social networks, questions arise as to the underlying principle that could enable these predictions (How not to predict elections).

As reported in "How not to predict elections", the accuracy of these claims is recorded in terms of the percentage of correctly guessed electoral races, without any further qualification at all.

When these predictions are reported, they are often not compared against results arrived at by more traditional means. For instance, in the 2008 US congressional elections the incumbent won 91.6% of the time, and in 2010 the incumbent won 84% of the time. Using only the rule of thumb that the incumbent wins about nine times out of ten, any member of the public could walk in off the street and correctly predict roughly 90% of US congressional elections at very little cost.

A. Livne, M. Simmons, E. Adar and L. Adamic, in "The Party Is Over Here: Structure and Content in the 2010 Election", used tweets sent by electoral candidates to build a model that, they claimed, would predict whether "a candidate will win with accuracy of 88%." Taken out of context this might seem strong, but compared with the strike rate of using incumbency as the only parameter, it seems a lot of work for little in the way of tangible results. Or, as "How not to predict elections" put it, "even when predictions were better than chance they were not competent when compared to the trivial method of predicting through incumbency".
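The comparison against the incumbency baseline is easy to make concrete. The following toy sketch (all race data and function names here are invented for illustration, not drawn from the papers above) scores a hypothetical tweet-based model against the trivial "incumbent always wins" rule on a slate of ten races:

```python
# Toy comparison of a hypothetical tweet-based predictor against the
# trivial incumbency baseline. All race data below is invented for
# illustration; a real evaluation would use actual election results.

def baseline_accuracy(races):
    """Accuracy of always predicting that the incumbent wins."""
    correct = sum(1 for r in races if r["winner"] == "incumbent")
    return correct / len(races)

def model_accuracy(races):
    """Accuracy of the (hypothetical) model's own picks."""
    correct = sum(1 for r in races if r["model_pick"] == r["winner"])
    return correct / len(races)

# Hypothetical sample: 10 races, the incumbent wins 9 of them.
races = (
    [{"winner": "incumbent", "model_pick": "incumbent"}] * 8
    + [{"winner": "incumbent", "model_pick": "challenger"}]
    + [{"winner": "challenger", "model_pick": "challenger"}]
)

print(f"incumbency baseline: {baseline_accuracy(races):.0%}")
print(f"tweet-based model:   {model_accuracy(races):.0%}")
```

In this invented sample both methods score 90%, which is the point of the critique: a sophisticated model must beat the free baseline, not merely match it.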

Tumasjan et al., who carried out similar work in Germany, found that Twitter is used to spread political opinions and discuss politics, that the sentiment profiles of politicians and parties reflect nuances of the election campaign, and that the mere volume of messages reflects the election result and "even comes close to traditional election poles". It seems as if pollsters have nothing to worry about in terms of employment.

A major issue with social media data where general elections are concerned is that the people tweeting cannot be identified as likely voters. To identify likely voters, a correct sample from Twitter would have to establish the age range, voting eligibility and prior voting patterns of its users (How not to predict elections). Obtaining this information is not possible without violating the privacy of the users, a particularly hot topic of debate for social media providers at the moment. There are certainly voters who do not tweet. Consider the age profile of likely voters in the US: in 2000, 36% of citizens aged between 18 and 24 voted, 50% of citizens between 25 and 50 voted, and 68% of those over 35 voted. While we have no supporting figures, we will put our reputations on the line and say that as age increases in today's population, the proportion of social media users probably declines, while the proportion of likely voters does the exact opposite. This cannot be good for the accuracy of election predictions based on data gathered from social media.
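The demographic mismatch argued above can be put in rough numbers. The turnout rates below are the 2000 figures quoted in the text; the age distribution of the Twitter sample is an invented assumption purely for illustration:

```python
# Rough sketch of the sampling-bias argument. Turnout rates are the
# 2000 figures quoted in the text; the share of tweeters in each age
# band is an invented assumption, not real data.

turnout = {"18-24": 0.36, "25-50": 0.50, "older": 0.68}

# Assumed (hypothetical) age distribution of a random Twitter sample:
# heavily skewed towards the younger, lower-turnout bands.
tweeter_share = {"18-24": 0.45, "25-50": 0.45, "older": 0.10}

# Expected fraction of the sampled tweeters who are likely voters.
likely_voter_rate = sum(tweeter_share[g] * turnout[g] for g in turnout)
print(f"expected likely-voter rate in the sample: {likely_voter_rate:.1%}")
```

Under these assumed shares, fewer than half of the sampled tweeters would be likely voters, which illustrates why raw tweet volume is a shaky proxy for the electorate.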

It should also be noted that it is easy to manipulate social media data. Far be it from me to suggest that politicians are capable of such tactics, but as this headline and excerpt from the Technology Review of June 2012 demonstrate, there are those who will stop at nothing to win: "Twitter Mischief Plagues Mexico's General Election. The top contenders in Mexico's presidential campaign are engaged in a Twitter spam war, with armies of "bots" programmed to cast aspersions on opposing candidates and disrupt their social-media efforts. This large-scale political spamming could foreshadow online antics that campaigners may increasingly resort to in other countries."

Flu Trends:

Google Flu Trends (GFT) was launched in November 2008 and is based on the fact that Google users regularly use Google to search for advice on health issues. By analysing users' search terms, Google attempts to predict flu trends.

The Swine Flu pandemic of 2009 provided the first opportunity to evaluate the performance of GFT models during a non-seasonal influenza outbreak. GFT missed it. In addition, GFT overestimated the prevalence of flu in the 2012–2013 season and overshot the actual level in 2011–2012 by more than 50%. From 21 August 2011 to 1 September 2013, GFT reported overly high flu prevalence in 100 out of 108 weeks (The Parable of Google Flu Trends).

In February 2013, GFT made headlines, but not for a reason that Google executives or the creators of the flu tracking system would have wanted (The Parable of Google Flu Trends). Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (D. Butler, Nature 494, 155 (2013); D. R. Olson et al., PLOS Comput. Biol. 9, e1003256 (2013)). This happened despite the fact that GFT was built to predict CDC reports (The Parable of Google Flu Trends).

In The Parable of Google Flu Trends, Lazer et al. refer to "Big Data hubris" as the implicit assumption that Big Data are a substitute for, rather than a supplement to, traditional data collection, and go on to highlight that quantity alone does not mean one can ignore foundational issues such as measurement, construct validity and dependencies among data. As in the previous section on elections, it seems that data gathered through social media does not yet compare to the tried and tested methods.


GFT's main problem appears to be that it relies on the public knowing what the symptoms of flu actually are: someone who googles flu symptoms may simply have a cold.

Conclusion:

While people who tweet, express an opinion or search about a product or a movie are more than likely part of the target market, and therefore a good indicator of a future purchase, the same cannot be said of elections. Where the customer has self-selected as a customer, a voter has not.