Consent and Privacy in the Context of Big Data

Author: Bruno Rigonato Mundim

One way to elucidate the meaning of Big Data is to invoke the three following characteristics: volume, speed and variation. The first one refers to the amount of data; the second, to the time needed to analyze data; the third, to the several types of data that can be combined (Johnson 2018, 165). The conjunction of these three characteristics gives the Big Data phenomenon a surprising capacity to find patterns and connections through data that, if approached any differently, would be restricted to a very specific purpose or considered trivial. In this sense, most of the available data for the analysis of an individual has been collected for purposes not directly connected to that of the massive data analysis. Data collected in an online shopping website, for instance, with the specific purpose of enabling a specific commercial transaction, can later be used to identify patterns of consumption in certain demographics, thus allowing the creation of custom advertisement acutely aimed at a certain target audience.

For data scientists, the enormous volume of information enabled by Big Data holds the potential to offer knowledge for purposes that do not even exist yet, but can emerge through the identification of patterns revealed by the data analysis. For instance, cross-referencing seemingly disconnected data may reveal that people who go to couples therapy are relatively likely to default on their debts; if so, financial institutions may use this connection when they review a loan application. This point serves as an example of the third characteristic mentioned in the paragraph above: variation. With Big Data, data travels through different contexts, so that information we would willingly provide in one context can later be used in a setting in which we would not agree to provide it. For example, in a doctor's appointment we would hardly hesitate to describe our eating habits, but we would find it strange for such information to be requested in a job interview. Moreover, while some data may seem innocuous on its own, when analyzed and cross-referenced in a certain way it may expose connections whose consequences were not foreseen by those who consented to provide it.

When we take into account this fluid, inter-contextual aspect of Big Data, the matter of privacy deserves particular attention. One way to try to guarantee anonymity is by suppressing personal data. For instance, a hospital, when disclosing patient data to a research institute that intends to carry out a statistical survey on a specific disease, could eliminate information such as names, phone numbers, addresses or ID numbers. However, by connecting data from different contexts - from a drugstore purchase to a social media profile - in many cases it is possible to reidentify, with considerable accuracy, the individuals whose personal information was previously suppressed in a context where anonymity was a goal.

Ohm (2010, 17) describes an event that took place in Massachusetts and illustrates how the combination of data sourced from different contexts can be used to identify information that would otherwise remain anonymous. The Group Insurance Commission (GIC), a Massachusetts government agency, had purchased health insurance for state employees. Around the mid-1990s, the GIC decided to make available, to any researcher who requested it, the records that summarized all hospital visits made by the insured employees. In order to protect patient privacy, all information considered an explicit identifier, such as names, addresses and Social Security numbers, was removed. When the data was released, William Weld, then-governor of Massachusetts, reassured the population that patient privacy would be preserved since all identifying information had been removed. However, this was not enough to prevent Latanya Sweeney, a graduate student at the time, from finding out which of the hospital records disclosed by the GIC belonged to Governor Weld. Knowing that he lived in Cambridge, Sweeney bought the city's electoral register for twenty dollars. Among other things, it listed the address, ZIP code, date of birth, and sex of every registered voter. After cross-referencing this data with the GIC records, which contained the date of birth, ZIP code and sex of each patient, Sweeney concluded that only six people in Cambridge shared the governor's date of birth, of whom three were men and only one lived in his ZIP code. She then sent to the governor's office his own medical records containing diagnoses and prescriptions.
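The linkage described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the record types, the function name reidentify, and every name, date and diagnosis below are invented, and the sketch does not reproduce Sweeney's actual procedure or data. It merely shows how three quasi-identifiers (ZIP code, date of birth and sex) can be enough to attach a name to a record from which explicit identifiers were removed.

```python
from collections import defaultdict
from typing import NamedTuple


class VoterRecord(NamedTuple):
    """A row of a public voter list (hypothetical structure)."""
    name: str
    zip_code: str
    birth_date: str
    sex: str


class HospitalRecord(NamedTuple):
    """A 'de-identified' hospital record: no name, but quasi-identifiers remain."""
    zip_code: str
    birth_date: str
    sex: str
    diagnosis: str


def reidentify(voters, hospital_records):
    """Link hospital records to names via ZIP code, birth date and sex."""
    # Index the public voter list by the three quasi-identifiers.
    names_by_key = defaultdict(list)
    for v in voters:
        names_by_key[(v.zip_code, v.birth_date, v.sex)].append(v.name)

    matches = []
    for h in hospital_records:
        candidates = names_by_key.get((h.zip_code, h.birth_date, h.sex), [])
        # Only a unique candidate counts as a re-identification.
        if len(candidates) == 1:
            matches.append((candidates[0], h.diagnosis))
    return matches


# All names, dates and diagnoses are invented for illustration.
voters = [
    VoterRecord("Alice Example", "02138", "1950-01-15", "F"),
    VoterRecord("Bob Example", "02138", "1950-01-15", "M"),
    VoterRecord("Carl Example", "02139", "1950-01-15", "M"),
    VoterRecord("Dana Example", "02139", "1962-07-04", "F"),
    VoterRecord("Erin Example", "02139", "1962-07-04", "F"),
]
hospital_records = [
    HospitalRecord("02138", "1950-01-15", "M", "hypothetical diagnosis A"),
    HospitalRecord("02139", "1962-07-04", "F", "hypothetical diagnosis B"),
    HospitalRecord("02140", "1971-03-09", "M", "hypothetical diagnosis C"),
]

print(reidentify(voters, hospital_records))
```

In this toy data only the first hospital record is linked to a name: the second matches two voters and stays ambiguous, and the third matches nobody. Removing any one of the three fields from the released records would make the first match ambiguous as well, which is the trade-off between usefulness and anonymity at stake here.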

The date of birth, sex and ZIP code of each patient could have been removed from the database disclosed by the GIC, which would have reinforced the guarantee of anonymity. However, such information can also be highly relevant to medical research. This, as Ohm (2010, 4) emphasizes, reveals a tension around data privacy: data can be either useful or completely anonymous, but never both.

Examples like this show that the potential consequences of massive data analysis remain unclear to us, especially due to the lack of precise determination of how data should move around - in other words, it is difficult to restrict it to a certain context. Therefore, data anonymity does not fully exempt us from ethical problems related to privacy. Information that we sometimes don't mind revealing in a given context can become a key element in a data analysis aimed at breaking the anonymity of other information provided in another context [1].

Aware of how easily data technology can combine information from completely different databases, Nissenbaum (2009) argues that privacy must be understood in light of what she understands as contextual integrity. As each context has specific regulations over the data it possesses, privacy is violated once the data sourced from a given context ends up in a context where it is not appropriate. It may not seem strange, for example, for data related to medical expenses to be used for income tax credits, but if such information is accessed by insurance companies who use it to create custom health insurance prices, we may feel that our privacy has been invaded.

Unrestricted data combinations that are left unchecked as they flow through diverse contexts can feed questionable purposes that directly affect our lives. Data is combined with the goal of detecting patterns and connections in order to outline profiles and predict certain behaviors, offering a valuable source of information that can be employed by managers in their decision-making - from a loan application review to a health insurance assessment, approval of a lease contract or production of an advertising campaign.

Another emblematic case involving US-based retailer Target offers a look into how data analysis is capable of obtaining sensitive information about our behavior (Duhigg 2012). This story is set in a Target store in the Minneapolis area. A man furiously addresses the manager: “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager had no clue of what the man was talking about, but after seeing the coupons, he quickly realized that large amounts of ads targeted at pregnant women were being sent to the young girl. He apologized. Days later, when he called to apologize once more, an embarrassed father picked up the phone and revealed: “I had a talk with my daughter. It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

At the time, the advertising campaign was run by mathematician and statistician Andrew Pole, who gave an interview to Charles Duhigg, the author of the news article mentioned above. Pole had an extensive database of transactions made by Target shoppers - whose records were individually associated with a code called the Guest ID - as well as data bought from other sources: “If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we’ve sent you or visit our Web site [sic], we’ll record it and link it to your Guest ID,” Pole said. “We want to know everything we can.” The data collection does not stop there, adds Duhigg:

Also linked to your Guest ID is demographic information like your age, whether you are married and have kids, which part of town you live in, how long it takes you to drive to the store, your estimated salary, whether you’ve moved recently, what credit cards you carry in your wallet and what Web sites [sic] you visit. Target can buy data about your ethnicity, job history, the magazines you read, if you’ve ever declared bankruptcy or got divorced, the year you bought (or lost) your house, where you went to college, what kinds of topics you talk about online, whether you prefer certain brands of coffee, paper towels, cereal or applesauce, your political leanings, reading habits, charitable giving and the number of cars you own.

With this data at hand, Pole and his team, in order to find a relevant pattern, analyzed the purchase history of all women who had signed up for Target's baby registry. They noticed that pregnant women purchased larger amounts of unscented lotion and that in the first twenty weeks of pregnancy the future moms usually stocked up on supplements such as calcium, zinc and magnesium. They also realized that, when someone increased their purchases of unscented soap, large packs of cotton balls, hand towels and hand sanitizers, it was a sign that they were close to giving birth.

The analysis of this data led to a list of twenty-five products that formed a score that ranked the possibility of a pregnancy. The score was precise enough to enable a forecast of when the woman was due, allowing Target to tailor its ads to specific phases of the pregnancy. One of the store employees who spoke to Duhigg gave the following example: "Take a fictional Target shopper named Jenny Ward, who is twenty-three, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s pregnant and that her delivery date is sometime in late August."
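To make the idea of a product-based score more concrete, here is a minimal sketch in Python. It is not Target's model: the article does not say which statistical technique was used or what weight each product carried. A simple logistic score is swapped in purely for illustration, and the names PRODUCT_WEIGHTS, BASELINE and pregnancy_score, as well as every number, are hypothetical.

```python
import math

# Hypothetical weights: the article names some products but reports no coefficients.
PRODUCT_WEIGHTS = {
    "unscented lotion": 1.5,
    "calcium supplement": 0.9,
    "zinc supplement": 0.9,
    "magnesium supplement": 0.9,
    "cotton balls (large pack)": 0.6,
}

# Hypothetical intercept: without any signal, the score should stay low.
BASELINE = -4.0


def pregnancy_score(purchases):
    """Map a shopper's purchases to a score strictly between 0 and 1."""
    # Each distinct signal product adds its weight; unknown products add nothing.
    evidence = sum(PRODUCT_WEIGHTS.get(p, 0.0) for p in set(purchases))
    # The logistic function squashes the total into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-(BASELINE + evidence)))


basket = ["unscented lotion", "zinc supplement", "magnesium supplement", "bright blue rug"]
print(round(pregnancy_score(basket), 2))
```

The point of the sketch is that no single purchase is revealing on its own. It is the accumulation of individually trivial items that pushes the score up, which is why shoppers can hardly anticipate what their basket says about them.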

Anticipating these events increases Target's chances of winning the loyalty of pregnant customers before its competitors do. As the article explains, because birth records are public, as soon as a child is registered the parents are targeted by a myriad of ad campaigns from a number of companies. By being able to estimate the pregnancy phase of its customers - more specifically the second trimester, when most motherhood-related purchases happen - Target would be able to reach them first. And once they arrived at the store in search of baby merchandise, these shoppers would be exposed to the retailer's entire collection, ranging from DVDs to pet supplies, which boosted the chances of turning them into regular customers.

An interesting issue that the Target case highlights is the fact that overall we have no clear notion of how we are being monitored, although we do know that we are being monitored at all times - with every click, email, credit card purchase, etc. For this reason, Johnson (2018, 168) argues that a theory of surveillance based on the metaphor of the panopticon is less than appropriate in the era of Big Data. According to this metaphor, as in a panopticon prison, individuals adjust their behavior to the rules of those who observe them. Although they don't know if they are in fact being monitored, the possibility alone already has the power to impose a certain conduct on them. Those who are observed have a clear comprehension that each behavior leads to a specific consequence, which allows them to adjust their conduct in order to prevent certain sanctions. This clarity is lost when an observation purpose is paired with statistical analyses searching for patterns in an enormous amount of data. After all, it may not be a mystery that companies have access to our shopping list, but the fact that choosing a certain kind of body lotion says something about our family situation may seem too far-fetched to us.

The challenge, therefore, is that we are unable to identify which of our data classifies us in a pattern investigated by a given data analysis. In light of what data science has been able to do, we don't know whether our preference for a specific coffee brand or our Spotify history may influence a job interview or a visa approval, for instance. If we knew, we could try to manipulate this data in our favor.

CONSENT

Behavioral research is at its peak, as pointed out by Eric Siegel, a consultant and president of the conference Predictive Analytics World (Duhigg 2012): "We’re living through a golden age of behavioral research. It’s amazing how much we can figure out about how people think now." Much of this is due to the possibilities offered by Big Data, which, through ever more sophisticated data analyses, sheds light on how our habits influence our decisions. However, the average user of the technology that collects and feeds large databases has little clarity about what can be done with the data they provide, despite the fact that it can be used against them. Overall, the source of information regarding the use of data is restricted to privacy policies, which users must consent to if they wish to use a certain product or service. Consent plays the role of a moral converter, i.e., it has the power to make permissible things that, if not consented to, would be morally reprehensible. It is therefore paramount to investigate the conditions in which consent is given to the collection of personal data.

Bullock (2018) considers the moral purpose of consent from two perspectives, which she calls procedural and substantive. According to the former, the power of morally transformative consent is determined by the autonomy of an individual: as long as consent is given by an autonomous, willing individual, such concession is legitimate, although certain ensuing consequences may compromise the well-being of the grantor. From the latter perspective, despite the concession having been ratified by an autonomous, willing individual, the violation of the individual's well-being is the determining factor; therefore, consenting to something that compromises your own well-being - even though you are aware of it - invalidates the morally transformative role of consent.

Let us illustrate these points through an example (one that is somewhat extreme, but clearly marks the limitations of each perspective). It is not unreasonable to presume that deliberately inflicting physical pain on someone - although the perception of pain involves a high level of subjectivity - is a morally (and legally) reprehensible attitude. That being said, if somebody consents to being tortured, how can we understand this? From a procedural perspective, as long as consent stems from the autonomous will of its grantor, it exempts the torturer from all liability. In other words, consent fulfills its morally transformative role. However, if we consider the same example from a substantive perspective, it does not matter if an autonomous, willing individual has consented to being tortured, because such an act violates their well-being.

From a procedural perspective, still according to Bullock (2018, 86), valid consent has three procedural requisites: it must be voluntary, informed and decisionally competent. Consent is no longer voluntary when a decision is influenced through inadequate means such as manipulation or coercion. For instance, a person who consents to sexual relations at knifepoint cannot be said to have made a voluntary decision. However, not all forms of influence on consent tamper with the voluntary aspect of the action. Rational persuasion, such as when a doctor explains to a patient the benefits of a surgical procedure, is often accepted as a legitimate method to obtain valid consent. Regarding the second requirement, consent is understood as uninformed when the grantor is unaware of what they are consenting to or has false beliefs about the implications of their decision. Consider, for instance, someone who purchases an internet plan but later finds out that they must pay additional fees to make full use of the service. Consent is invalid because, had the person been aware of such fees, they would not have purchased the service - in other words, the information that was missing or overlooked is relevant enough to annul consent. Finally, a person is considered decisionally incompetent to consent when they are incapable of understanding what is at stake and how the conditions apply to them, and therefore of making a choice. This can be due to mental disability, the effects of narcotics or age, among other things.

Based on this more general, philosophical notion of consent, we can try to elucidate what happens in information technology. In this realm, consent primarily relates to permission to use our personal data in exchange for the use of a given service. Among the three procedural criteria, information seems to be the most sensitive one. First, there is practical infeasibility: if we were to read all the privacy policies that are commonly presented to us, so as to express informed consent only afterwards, we would have to dedicate an average of 244 hours per year to this task (Custers 2016). In addition, these texts are often difficult to understand, as they involve technical details or contain legal jargon. It is thus no surprise that, as several studies point out [2], people hardly ever read privacy policies.

Taking into account the constant advancements made by information science, there is yet another problem related to the same criterion: the difficulty of predicting the potential uses of the data that is provided. Even if we have a clear understanding of a privacy policy and agree to it, we cannot predict what information technology can do with data that at first seemed innocuous to us. On this matter, let us come back to Pole, the Target statistician behind the ads directed at pregnant women (Duhigg 2012): “If we send someone a catalog and say, ‘Congratulations on your first child!’ and they’ve never told us they’re pregnant, that’s going to make some people uncomfortable. We are very conservative about compliance with all privacy laws. But even if you’re following the law, you can do things where people get queasy.”

Etzioni (2012, 931), while describing a process for which he suggests the name "privacy violating triangulation", explains how data under the protection of privacy laws can be obtained through "innocent facts" not safeguarded by legislation:

A piece of seemingly benign information — for instance, the number of days a person failed to show up for work, or if the person made special purchases, such as a wig — suggests volumes about one's medical condition. By building a portfolio of many such apparently innocuous facts, one could infer a great deal, effectively violating the realm of privacy surrounding individuals' most sensitive information.

Thus, in addition to the fragility of privacy-related laws, we must also take into account our incapacity to achieve a precise notion of what we are really consenting to when we agree to a privacy policy, even if we make every effort to reach an informed decision.

Faced with these problems, individuals who submit to the demands of consenting to privacy policies display noticeable disengagement: the "Agree" boxes are checked only to gain access to the service or product as quickly as possible, which deprives this act of the moral status of an autonomous, conscious decision. This brings us back to the procedural criterion of voluntariness. It is no exaggeration to say that there is a kind of tacit coercion in the way consent proposals are presented. Pressured by the practical difficulties pointed out above and the threat of losing access to the desired (and often needed) product or service, people are induced to consent to privacy policies - not to mention the cases in which they are offered additional advantages such as discounts and gifts. In addition, even if we come to an informed opinion, we cannot express it in any detail, because consent is either given or withheld as a whole. Custers et al. (2018, 253) highlight the difference between accepting terms and conditions and agreeing to their content. While the former relates to legally valid consent, the latter questions the effectiveness and meaning of consent mechanisms. We often accept a privacy policy only to dodge the obstacles that prevent us from fulfilling a pressing desire, which is different from really agreeing to those terms in an informed, voluntary decision.

Regarding the last procedural criterion of valid consent, which concerns the competence of the grantor, the use of children's personal data is a strong example. How can these cases be dealt with, considering the level of maturity necessary to assess the implications of the use of the data and thus give valid, decisionally competent consent? Motivated mainly by the growth of digital marketing techniques targeted at children, COPPA (Children's Online Privacy Protection Act) [3] determined that the minimum age to give consent is thirteen. Below that age, the use of data is only legal following verified parental consent, which means that the institution that will access the data must commit to a "reasonable effort" to guarantee that the consent has in fact been given by the parents. However, not only does this fail to guarantee effective protection of such data, due to the issues pointed out above, but it may also prompt invasive monitoring by parents, which is yet another form of privacy violation (Custers et al. 2018, 251).

It is preferable to adhere to a definition of consent closest to what Bullock calls procedural. The other perspective, as it involves the well-being of the grantor, imposes challenges that deserve separate treatment. Could the invalidation of consent, when it violates an individual's well-being, be in itself a violation of well-being, since the individual's autonomy is affected? The question of euthanasia, for instance, allows us to envision the countless problems that may ensue from this notion. After all, objectively describing the notion of well-being can be a highly complex task.

Conclusions aside, certain measures that could at least mitigate the problems discussed in this article are worth mentioning. One of them is the right to be forgotten, which consists of mechanisms to withdraw consent and consequently delete any data that has been provided up to that point (Custers et al. 2018, 252). Another is preventing data collected in a given context from migrating to other contexts, which could hinder the extraction of information through cross-referencing. Partial consent would grant greater autonomy to individuals and be less coercive, since it would break with the logic of "all or nothing" by making sections of a service available as consent is broadened (Custers et al. 2018, 255). In addition, setting an expiration date for consent could protect us from future, undesired consequences, considering that technological progress makes it difficult to predict what can be done with our data (Custers 2016).
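The measures listed above can be read as properties of a consent record. The following Python sketch is one hypothetical way to model them and is not taken from Custers or from any existing system: the class Consent, its fields and the purpose and context labels are all assumptions made for illustration. It combines partial consent (a set of allowed purposes), confinement to the context of collection, an expiration date, and withdrawal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Consent:
    """Hypothetical consent record combining the four measures discussed above."""
    purposes: set[str]        # partial consent: only these purposes are allowed
    context: str              # the context in which the data was collected
    granted_at: datetime
    valid_for: timedelta      # expiration: consent lapses after this period
    withdrawn: bool = False   # set once the person withdraws consent

    def withdraw(self) -> None:
        """Cancel consent; a real system would also have to delete the data."""
        self.withdrawn = True

    def permits(self, purpose: str, context: str, now: datetime) -> bool:
        """Return True only if this specific use is covered by the consent."""
        if self.withdrawn:
            return False
        if now >= self.granted_at + self.valid_for:
            return False  # consent has expired
        if context != self.context:
            return False  # data may not migrate to another context
        return purpose in self.purposes


granted = datetime(2024, 1, 1)
consent = Consent({"order_fulfilment"}, "online_shop", granted, timedelta(days=365))

print(consent.permits("order_fulfilment", "online_shop", granted + timedelta(days=30)))
print(consent.permits("advertising", "online_shop", granted + timedelta(days=30)))
print(consent.permits("order_fulfilment", "credit_scoring", granted + timedelta(days=30)))
```

Under these assumptions, the default answer to any use that was not explicitly granted is "no", which inverts the all-or-nothing logic of a single "Agree" box. Note that such a record only governs declared uses: it offers no protection against information that is inferred from other people's data.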

Finally, even if we are careful about disclosing our personal data, it may still be inferred indirectly. This can happen either from our own behavior online - "Easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender" (Kosinski et al., 2013) - or from people connected to us who are not as careful (take, for instance, friends' posts that reveal information that we would not reveal ourselves). As stated by Custers et al. (2018, 252), "The use of Big Data increasingly enables predicting characteristics of people who withheld consent on the basis of the information available from people who did consent."