Don't live a lie

IIM Calcutta dean Biju Abraham highlights the pros and cons of the big data revolution discussed by 'Everybody Lies' author Seth Stephens-Davidowitz

Published on Mar 03, 2018

The election of Donald Trump and the victory of the ‘Leave’ campaign in the UK referendum on EU membership were both unexpected. Most opinion polls had predicted that Hillary Clinton would win and that Britain would vote to remain within the EU. Both failed predictions raised a pertinent question: did pollsters ask the wrong questions? Not so, says Seth Stephens-Davidowitz in his book Everybody Lies: What the Internet Can Tell Us About Who We Really Are. They asked the right questions, but the respondents often lied when they answered. The pollsters, he feels, should have looked at online data, which reveals voters’ behaviour and preferences far more accurately. The book is a paean to the revelatory power of the big data contained in Google Trends.

The Internet, Stephens-Davidowitz argues, has revolutionised data science to such an extent that it is in effect a ‘digital truth serum’ that enables us to understand human behaviour with an accuracy unimaginable until a few years ago. From revealing outcomes of elections long before polling day to identifying regions where child abuse has increased, the Internet, he believes, will transform data analysis and enable us to discern people’s attitudes and behaviour with much greater precision than ever before. His premise is simple. People often lie when asked about their preferences, but they reveal their true selves when they are surfing the web. Analysing what they search for on the Internet reveals not just their true beliefs, but also how they are likely to vote, what they are likely to buy and how their social behaviour might evolve.

Stephens-Davidowitz, it must be admitted, is persuasive. For example, he compares the order in which candidate names appear in Google searches with the actual outcome. There were far more searches for ‘Trump Clinton’ than ‘Clinton Trump’ in the US Midwestern states that finally decided the election by giving Donald Trump the electoral college votes needed to win. He also demonstrates that although data reported by child protection services in the US did not show an increase in cases of child abuse during the period of high unemployment following the 2007 financial crisis, searches relating to child abuse increased substantially during that period. These were possibly searches by children and by adults suspecting abuse. He attributes the failure to report abuse to job losses and overwork among those likely to report child abuse, such as teachers and police officers, and among those who deal with abuse, such as child protection workers. The author argues that you don’t need armies of pollsters and government officials to collect and analyse data to figure out socio-economic trends. The data is already held by Internet search firms such as Google. All that is needed is smart researchers who can frame the right questions to analyse the data and come up with the right answers.

Stephens-Davidowitz also discusses the limitations of Internet search data. The first is the ‘dimensionality problem’: the same data can portend different outcomes depending on context. While an increase in searches for ‘iPhones’ is a good indicator of an increase in sales of iPhones, increasing searches for ‘GOOGL’ (the stock code for Google) might only signal increased trading in the stock. The stock might very well go down, rather than up, during heavy trading. The second is that while search patterns might indicate group or societal behaviour, they say very little about specific individuals. Not all those who search for “how to kill your girlfriend” or “how to commit suicide” are intent on committing murder or suicide, and it would be foolish for government agencies to target them based on their online search activity.

The book, despite a ‘Conclusion’ that promises much, is left unfinished (a choice based on data which reveals that very few readers read an entire book!) but is a very good introduction to the potential and limitations of big data analysis. If you are not very familiar with big data and its benefits, this is a book that can explain its immense potential.