Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
Let’s face it – no matter how true to ourselves and those around us we claim to be, to some extent or another, we still lie. Whether it’s inflating our benevolent characteristics in a survey about our lifestyle or not telling the full story about the sorts of weird stuff we do when no one is looking, everybody practices a bit of deceit in his day-to-day life.
But with the increasingly vast amounts of data that are collected through, for instance, Google searches, we can go beneath the surface and see the truth. With such an impossibly large accumulation of data on countless aspects of human existence, a pool of information collectively known as big data, it is now possible to analyze revealing patterns in our behavior and identify preferences that we never knew about before.
So what does big data have to offer? A great deal, from reporting on the state of our health, to revealing strange human quirks, to helping conduct endless randomized, controlled experiments.
Data science is more intuitive than you think.
You’ve heard the term, but what exactly is big data? The clue is in the name. Big data refers to an immense volume of data. A volume that’s so vast, in fact, that the human mind can hardly comprehend it. In other words, big data is data for which computational power is required to recognize patterns. Paradoxically, however, despite its remarkable scale, data science has an intuitive aspect to it. After all, if you think about it, we’re all data scientists in a way.
However, although data science is an intuitive process, intuition itself isn’t actually science. That’s why utilizing gathered data correctly is essential to refining one’s worldview. Data provides us with the material to confirm or rebut our initial gut feelings. It helps us identify more precise patterns and predictions than personal experience alone ever could. Although a gut feeling may get us far, data refines even the most intuitive person’s perspective.
Data science is a useful tool. But what makes it special is not the amount of collected data, but rather that the data is useful; in other words, it’s the kind of data that can reveal patterns or make predictions.
Google is a case in point. Larry Page and Sergey Brin’s search engine, founded in 1998, became such a giant not simply because they were able to collect lots of data. Rather, what set Google apart was that the collected data could be used efficiently. Before Google, when you typed “Bill Clinton” into a search engine, you’d just be given websites that contained the phrase most frequently. Often you’d get a load of irrelevant hits.
Brin and Page’s algorithm worked differently. They figured out that a website was likely more relevant to someone if it had more links from other sites that took a user to it. So, Bill Clinton’s official White House website, which was the target of thousands of links, would be more useful than, for example, a site with only a hundred links, even though it might mention him by name more often.
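The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the page names, texts and links below are invented, and simply counting inbound links is a simplification of the idea described above, not Google’s actual algorithm.

```python
from collections import Counter

# Hypothetical toy web: every page name, text and link here is made up.
pages = {
    "whitehouse.example": {"text": "Bill Clinton is the president", "links": []},
    "fanpage.example": {
        "text": "Bill Clinton Bill Clinton Bill Clinton",
        "links": ["whitehouse.example"],
    },
    "news.example": {
        "text": "Bill Clinton spoke today",
        "links": ["whitehouse.example"],
    },
    "blog.example": {
        "text": "A post about Bill Clinton",
        "links": ["whitehouse.example", "fanpage.example"],
    },
}


def rank_by_phrase_count(pages, phrase):
    """Older approach: pages that repeat the phrase most often come first."""
    phrase = phrase.lower()
    return sorted(
        pages,
        key=lambda name: pages[name]["text"].lower().count(phrase),
        reverse=True,
    )


def rank_by_inbound_links(pages, phrase):
    """Link-based approach: among matching pages, most linked-to comes first."""
    phrase = phrase.lower()
    # Count how many links point at each page.
    inbound = Counter(
        target for page in pages.values() for target in page["links"]
    )
    matching = [name for name in pages if phrase in pages[name]["text"].lower()]
    return sorted(matching, key=lambda name: inbound[name], reverse=True)


print(rank_by_phrase_count(pages, "Bill Clinton"))
print(rank_by_inbound_links(pages, "Bill Clinton"))
```

On this toy data, ranking by phrase count puts the fan page first, while ranking by inbound links puts the White House page first.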
They aggregated all that data about links and were able to spot patterns and predict what information was most relevant to the user. So why is big data so powerful? Google’s approach is a good example of the first reason: big data is entirely novel. It offers us a constant stream of new information.
Before the days of big data, you had to wait for the Bureau of Labor Statistics to collect and calculate the current unemployment rates by way of phone surveys, or wait for a report from the Centers for Disease Control and Prevention to learn about infection rates for certain illnesses.
But now, you could potentially use Google’s big data to track both of these – and that’s exactly what Google engineer Jeremy Ginsberg did. He showed that flu-related Google searches, such as “flu symptoms,” are indicative of the spread of influenza, and can be used to track the disease across geographical areas and over time.
Big data doesn’t lie.
Recently, graduates at the University of Maryland were surveyed about their grade point averages, or GPA. Among the respondents, two percent admitted they had graduated with a GPA lower than 2.5 on a four-point scale. However, according to official records, the number was much higher, at 11 percent. While this is just one example, it demonstrates a universal truth about surveying: people lie.
But why? Well, it’s natural that we want to look good, both to ourselves and others, so people adapt their answers to create a more positive view of themselves. This behavior of giving answers that make us look better is called social desirability bias.
In addition, respondents often want to impress the person administering the survey. We want to make a good impression, whether we’re anonymous or not. To take an extreme example, if you were answering questions from someone who looked like your dad, you might be unwilling to detail college drug experiences.
It’s a human propensity to tell untruths, which makes surveys unreliable when it comes to trying to understand behavior, thoughts, desires and beliefs.
This brings us to the second reason why big data is so powerful: it doesn’t lie. Because it’s collected through unfiltered online behavior, it will always reveal the truth. After all, people are far less likely to lie or skew results when entering terms into search engines, where no questioner is involved.
Let’s consider the subject of anal play. Would most people admit in a survey or interview that they liked using the occasional piece of fruit in their sexual fantasies? It depends on the survey, but probably not.
But when the author analyzed data from the porn site PornHub, he found that some women were searching for “anal apple.” This just goes to show that big data can reveal some surprising things about people that they might not have wanted to share directly with another human being.
Big data allows us to understand small data subsets, too.
It’s difficult to get your head around just how big big data is. Each day, staggering amounts of data fly through Google alone, not to mention other search engines or other websites in general. The volume of this data means we can now do things we could never do before.
This is the third great power of big data: the size of the datasets means we can zoom in on a subset and reliably extract information from it.
Let’s consider a real-world example. Harvard professor Raj Chetty wanted to investigate whether people thought the American dream was still alive. He decided to use big data to help form an answer to a more precise question: can people whose parents are poor grow up to become rich themselves?
Chetty’s team used tax records gathered by the US Internal Revenue Service. In total, they had more than one billion tax observations.
The data was revealing. It showed that when compared with other developed countries like Denmark and Canada, the situation in the United States wasn’t great for poor people. A poor American stood a 7.5 percent chance of becoming rich. But for Danes and Canadians, the chances were 11.7 percent and 13.5 percent, respectively.
That was the big picture, but the beauty of big data is that Chetty could zoom in on different states, cities, towns and neighborhoods.
When he did so, he found the data revealed that the American dream did exist – but only in a few places. In San Jose, California, a poor American stood a 12.9 percent chance of getting rich, which is better than in Denmark. But for an American growing up in Charlotte, North Carolina, the chances were only 4.4 percent.
It’s this ability to zoom in that demonstrates how big data can give us a nuanced understanding of the world, wherever and at whatever scale we choose.
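One way to see why sheer size matters when zooming in is the standard error of an estimated proportion, which shrinks as the number of observations grows. The sketch below uses the 7.5 percent figure from above; the subset sizes are hypothetical, chosen only for illustration, and the formula assumes independent observations.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent observations."""
    return math.sqrt(p * (1 - p) / n)


p = 0.075  # share of poor Americans who become rich, as quoted in the text
for n in (100, 10_000, 1_000_000):  # hypothetical subset sizes
    print(f"n={n:>9,}: {p:.3f} +/- {standard_error(p, n):.4f}")
```

With only a hundred observations, the uncertainty is a sizable fraction of the estimate itself; with a million, it becomes negligible. That is why a dataset of over a billion observations can be sliced down to a single city and still yield a usable estimate.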
Big data makes A/B tests easier and cheaper to run.
Every day, we’re bombarded with stories about correlations. A food is linked with a disease. A habit is linked with success. These correlations seem credible at first, but correlation doesn’t necessarily imply a cause-and-effect relationship.
In reality, to learn about the causal effect of something, you’d need to run randomized, controlled experiments, commonly called A/B tests. For instance, a study might report that people who drink alcohol only moderately are usually healthier. But does that imply that drinking moderately causes our health to improve? Not necessarily.
In order to test whether drinking moderately improves health, you’d need to have a pool of randomly selected individuals split into two groups. One group would drink, say, a glass of red wine each day, while the other group wouldn’t drink at all. After a year, the two groups would be compared. If the first group were healthier than group two, that would imply that drinking moderately was a cause of improved health.
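The experiment described above can be sketched in Python as follows. This is a minimal illustration, not a real study design: the participant names and the `assign_groups` and `difference_in_means` helpers are hypothetical, and a real analysis would also test whether the difference is statistically significant.

```python
import random
from statistics import mean


def assign_groups(participants, seed=0):
    """Randomly split participants into two equally sized groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_means(treatment_outcomes, control_outcomes):
    """Average outcome in the treatment group minus that in the control group."""
    return mean(treatment_outcomes) - mean(control_outcomes)


# Hypothetical participant pool.
participants = [f"person_{i}" for i in range(10)]
wine_group, no_wine_group = assign_groups(participants)
print("Daily glass of wine:", wine_group)
print("No alcohol:", no_wine_group)
```

Because chance alone decides who lands in which group, the two groups should be similar in everything except the drinking itself, which is what allows a later difference in health scores to be read as an effect of the drinking.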
Big data makes it a lot easier to conduct A/B tests, and this is the fourth reason why big data is so powerful.
Before big data came along, running A/B tests was highly demanding. For instance, to test a commercial’s impact, you’d have to recruit participants, survey them and analyze the results. But now, data scientists can write a program that analyzes data from A/B tests.
Barack Obama’s 2008 presidential campaign famously employed this approach. Obama campaign directors wanted to design a website that would entice people to sign up and donate. They used different combinations of pictures and text, and were then able to analyze the relevant data to deduce which layout was most successful.
By now, we’ve established a number of positive aspects of big data. Next, let’s turn to the negatives.
Big data isn’t great with too many variables or nonquantifiable concerns.
Although there are definite advantages to big data, it isn’t flawless. Its biggest limitation becomes patently clear in datasets with many variables: it’s difficult to extract reliable answers because the number of variables obscures possible findings.
Let’s consider the work of the behavioral geneticist Robert Plomin. In 1998, he thought he had found a gene, IGF2r, that was indicative of people’s IQ. He had obtained a dataset compiled from several hundred students containing information about DNA and IQ levels. Plomin compared the DNA of students with low and high IQs, and found that IGF2r was two times as likely to occur in students with high IQs.
Unfortunately, the correlation was a fluke. When Plomin repeated the dataset comparison a few years later, the correlation between IQ and the occurrence of IGF2r was nowhere to be seen.
It’s easy to see why. The human genome consists of thousands of genes; if correlations do occur, it’s always possible that they just happen through chance. There are so many variables that patterns can occur randomly.
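A rough calculation shows how quickly chance findings pile up. The sketch below assumes that each variable is tested independently and that a result counts as a "finding" when it would occur by chance 5 percent of the time; both the independence and the 5 percent threshold are illustrative assumptions, not figures from Plomin’s study.

```python
def chance_of_fluke(num_variables: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_variables independent tests
    looks significant purely by chance."""
    return 1 - (1 - alpha) ** num_variables


def expected_flukes(num_variables: int, alpha: float = 0.05) -> float:
    """Expected number of chance 'findings' among num_variables tests."""
    return num_variables * alpha


for n in (1, 10, 100, 1000):
    print(
        f"{n:>5} variables: chance of a fluke = {chance_of_fluke(n):.3f}, "
        f"expected flukes = {expected_flukes(n):.1f}"
    )
```

Under these assumptions, testing a single gene rarely produces a false alarm, but testing thousands makes at least one spurious correlation close to certain.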
There’s another problem with big data. It often lacks so-called small data, the kind of data that is about the human experience. Big data can measure a lot, but sometimes measurable answers aren’t necessarily what we’re after.
For instance, Facebook can easily measure clicks and likes using big data. But doing so would tell the company nothing about people’s experience with the site.
In circumstances like these, small data is essential. Facebook gathers this sort of data through other methods, namely by using smaller-scale surveys to ask users about their opinions and experiences on the site. Facebook also employs psychologists and sociologists to help the company get a sense of nonmeasurable user experiences.
This goes to show that big data isn’t perfect – and the problems go deeper still.
Governments shouldn’t use big data to target individuals.
Every time you type in a Google search or shop for a product online, you’re contributing to big data. But what about the ethical considerations? What if the government had access to this data? What could they do with it?
Say someone typed “I want to kill myself” into a search engine. Should the local police be notified?
In cases like these, authorities simply don’t and can’t act on an individual level – and with good reason. Every month, there are 3.5 million suicide-related Google searches in the United States. By contrast, the number of suicides in the country is lower than 4,000 a month. This would imply a huge waste of police resources if they decided to locate the individual in question every time a suicidal thought was typed into a computer.
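The arithmetic behind this can be worked through directly. The two monthly figures come from the text; the simplifying assumption that every suicide is preceded by exactly one such search is ours, and it is deliberately generous.

```python
searches_per_month = 3_500_000   # suicide-related Google searches (from the text)
max_suicides_per_month = 4_000   # upper bound: the text says "lower than 4,000"

# How many searches there are for every suicide, at the very least.
searches_per_suicide = searches_per_month / max_suicides_per_month

# Generous assumption: each suicide is preceded by exactly one such search.
# Even then, this is the largest share of searches that could precede one.
max_share = max_suicides_per_month / searches_per_month

print(f"At least {searches_per_suicide:.0f} searches per suicide")
print(f"At most {max_share:.2%} of searches precede a suicide")
```

Even under that generous assumption, the police would have to follow up on hundreds of searches for each case in which a life was actually at risk.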
But there’s also an ethical dimension we shouldn’t forget. Should governments even be allowed to possess and use search data pertaining to individuals? After all, this would amount to an invasion of privacy.
These ethical considerations haven’t stopped governments from using big data on a regional level, particularly because more and more evidence points to a correlation between online searches and subsequent action.
For example, researchers Christine Ma-Kellams, Flora Or, Ji Hyun Baek and Ichiro Kawachi found in a 2016 study that suicide-related Google searches are significantly correlated with actual suicide rates. But that correlation was only valid at the state level.
So how could state authorities and police departments use such data? Well, they could use it in suicide prevention programs in specific, local areas, such as at state or municipal levels. They could disseminate information in radio and TV commercials, with tips on where to go or whom to call if people needed help.
It goes to show that, as well as revealing all sorts of interesting things about humans, big data can also be used productively in real-life situations.
People rarely fill out surveys honestly, which skews our understanding of the world. But with the rise of big data – that is, the collection of incredibly large amounts of data from, for example, Google searches – we are able to spot patterns in human behavior and identify preferences that we never knew about before.