(Image: “Pensiero” by Ilaria Parente)

A rising number of voices claim that we can now understand society, literature, and art using ‘big data’ analytics, fostering epochal perspectives (or ‘epochalistic’ ones, as Savage puts it) on how the recent ‘data deluge’ will impact society and research, and possibly change our lives – as discussed by Mayer-Schonberger and Cukier in Big data.

Some of those claims follow the line set by Anderson when he suggested “the end of theory”, minimizing the possible issues of these methods and subordinating them to a higher common good or economic advantage. Questions have been raised about how companies perform big data analytics and how they use the results. Other authors are more cautious, others extremely skeptical. In The data revolution, Kitchin highlights how current applications of big data analytics raise issues of quantification that are as old as science. In Critical Questions for Big Data, boyd and Crawford highlight the risk of “re-inscribing established divisions in the long running debates about scientific method and the legitimacy of social science and humanities”. In a recent interview, Frédéric Gros discusses how Foucault’s legacy fits into the discussion with regard to categorization, norms, and the expression of freedom in the era of traceability and algorithmic profiling.

In my opinion, big data analytics fosters the de-contextualization of information, and thus introduces some level of informational entropy (i.e., a state of uniformity, as discussed by Floridi – here, I am not referring to thermodynamic entropy, nor to Shannon’s uncertainty); it should therefore be handled with extreme care. For instance, when large text collections are analyzed for word occurrence (e.g., Google Ngram), words in the text are treated as equivalent and uniform – whereas they might not be, not just because of their context in the text, but also depending on when, where, by whom, or why they were written (see e.g., A conversation with data by Gibbs and Cohen).
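
As a minimal sketch of this point (the snippets and source labels below are invented for illustration), a simple word-frequency count of the kind underlying n-gram analyses collapses every occurrence of a word into one number, discarding who wrote it, when, where, and why:

```python
from collections import Counter
import re

# Two invented snippets written in very different contexts: a plain
# frequency count treats every occurrence of "revolution" as interchangeable,
# discarding author, date, place, and purpose.
documents = {
    "pamphlet_1789": "The revolution is the only path to liberty.",
    "tech_blog_2015": "This phone is a revolution in battery life, a true revolution.",
}

counts = Counter()
for source, text in documents.items():
    counts.update(re.findall(r"[a-z']+", text.lower()))

print(counts["revolution"])  # 3 -- the aggregate says nothing about context
```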

The field of natural language processing has long studied the problem of recognizing named entities (i.e., people, organizations, events, etc.) in text. Being able to automatically categorize some words can be extremely insightful, and the software tools developed to perform this task claim very high accuracy in specific fields (e.g., journalistic text corpora). Unfortunately, their applicability to general text is challenging, and it has been shown that their scores drop significantly when they are simply applied to different corpora in the same field. Recent studies have tried to apply those same tools to tweets, achieving rather disappointing results. Ad-hoc tools are being developed to parse tweets, obtaining far better results, but on rather small corpora counting a few thousand tweets.
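
For example, here is a minimal sketch of running one such off-the-shelf named entity recognizer, spaCy with its small English model; the model choice and the example sentence are mine, and the accuracy figures reported on news corpora do not necessarily transfer to text like this.

```python
# Minimal off-the-shelf NER sketch with spaCy (requires the pre-trained
# "en_core_web_sm" model, which is largely trained on newswire-like text).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Washington met Denzel Washington near the Washington Monument.")

# Print each recognized entity with its predicted type (PERSON, GPE, FAC, ...);
# on out-of-domain text such as tweets, these labels degrade noticeably.
for ent in doc.ents:
    print(ent.text, ent.label_)
```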

However, these tools still can’t resolve ambiguity. For instance, correctly assigning a toponym (i.e., a place name) recognized in a text to a location is a major issue in geographic information retrieval. A good example is ‘Washington’, which can refer to both Washington D.C. and Washington State (geo-geo ambiguity), but also to George Washington or Denzel Washington (non-geo-geo ambiguity), as well as to the government of the United States (metonymical reference ambiguity). Other issues are related to vague and vernacular toponyms, which can’t easily be associated with geometric boundaries, such as ‘east coast’, ‘Alps’, and ‘downtown’. Even concepts like ‘nearby’ or ‘north of’ can be very challenging to handle computationally.
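
To make the problem concrete, here is a toy gazetteer lookup (the entries and coordinates are illustrative, not taken from any real gazetteer): recognizing the string ‘Washington’ still leaves several competing interpretations, and choosing among them requires context that the text alone may not provide.

```python
# Toy gazetteer: a single recognized toponym maps to several candidates
# (entries and coordinates are illustrative only).
GAZETTEER = {
    "washington": [
        {"kind": "city",    "name": "Washington, D.C.",      "lat": 38.9, "lon": -77.0},
        {"kind": "state",   "name": "Washington State",      "lat": 47.4, "lon": -120.5},
        {"kind": "person",  "name": "George Washington"},      # non-geo-geo ambiguity
        {"kind": "metonym", "name": "US federal government"},  # metonymical reference
    ],
}

def resolve(toponym):
    """Return every candidate interpretation; picking one needs context."""
    return GAZETTEER.get(toponym.lower(), [])

for candidate in resolve("Washington"):
    print(candidate["kind"], "->", candidate["name"])
```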

More complex methods from the field of artificial intelligence can be used (e.g., deep learning), aiming at a full semantic understanding of a text. That level of understanding will probably not be achieved, at least in the very near future, as it would require correctly associating the words in a text with a set of related concepts that maximizes the coherence of the overall interpretation. This is a very hard problem, given the vagueness and ambiguity issues discussed above. Moreover, these methods would also require a solid knowledge base, which can be large, but cannot be exhaustive of the whole of human knowledge and cultures – which are themselves vague, contradictory, and sometimes antithetical. Even assuming one monolithic set of ‘truths’, and assuming a knowledge base could be formalized in terms of axioms and functions, Gödel’s first incompleteness theorem would apply, so the knowledge base could be either consistent or complete, but not both. That is, either it contains contradictions, or there are truths about the encoded knowledge that can’t be proved within the knowledge base itself. That’s not to say that humans, by contrast, can understand every text put in front of them, but rather that no company knows you better than you know yourself – and that such memes and practices increase the risk of self-censorship.

That is to say, the results of big data analytics aren’t reasoned understandings of information, but descriptive statistics or (frequently linear) models of measured values. These can still be valuable for some purposes, but they might also be misleading, depending on how they are produced and used. It’s a matter of how much approximation and error one is comfortable with in one’s analysis, particularly as those analyses are now frequently applied back directly to the source of the data, in the form of recommender and filtering systems.
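
As a deliberately naive illustration of how such summaries can mislead (the values are made up), a linear fit over a quadratic relationship still yields a correlation coefficient close to 1: the single number looks reassuring, while the model it summarizes extrapolates badly.

```python
# Made-up data: the true relationship is quadratic, yet the linear summary
# statistics look almost perfect.
xs = list(range(1, 11))
ys = [x * x for x in xs]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

r = cov / (var_x ** 0.5 * var_y ** 0.5)   # Pearson correlation, ~0.97
slope = cov / var_x                        # least-squares slope
intercept = mean_y - slope * mean_x

print(f"Pearson r = {r:.2f}")
print(f"linear fit: y = {slope:.1f} * x + ({intercept:.1f})")
```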

Nonetheless, big data analyses frequently result in definitive statements, claiming objectivity and thus a higher level of legitimacy in establishing ‘truths’ about society and culture. However, although possibly very detailed in describing certain phenomena, big data can’t capture the whole of a domain, nor capture it in full resolution. There is a risk of measuring things just because they can be measured, or because they are already available, rather than because of their relevance for a study. This can lead to “partial orders, localised totalities” and “to gaze in some directions and not others” (references to Latour, by Amin and Thrift, cited by Kitchin in Code/Space).

Anderson suggested the idea of letting “the data speak for themselves”, which leaves no room for interpretation and scientific reasoning. The “end of theory” also promotes the controversial idea that algorithms come without underlying assumptions, whereas algorithms search for patterns following structured methods imposed by a programmer, and are thus based on assumptions, hypotheses, and theories.

Bowker highlights the risk of data-program-data cycles, where “if something is not in principle measurable, or it is not being measured, it doesn’t exist”, and the dangers posed to identity and definition of the self: “as people we are […] becoming our own data [and] if you are not data you don’t exist […] and it doesn’t matter how often you declare yourself alive”.

However, I think that big data and the related analytical approaches can be an interesting step to take during a research project, as part of a broader research path.

These quantitative methods can be used to explore large datasets, searching for commonalities as well as outliers. This can drive further research questions, or challenge established ‘truths’, complementing but not superseding the qualitative methods used in social science and humanities research. Social media content in particular should not be understood as a window into people’s minds, but rather as a mode of expression that has its own affordances, rules, limitations, demographics, and geographies – that is, as a phenomenon per se, rather than as evidence of phenomena, as suggested by Wilson in Morgan Freeman is dead and other big data stories.

Domain-specific knowledge might be the key to decreasing the level of entropy created by the de-contextualization of information, as a better understanding of the domain in which an algorithm is applied could lead to a better understanding of its results and limitations, or to a more domain-specific approach. As advocated by Lock in The Spatial Humanities and by Huggett in Core or Periphery? in the field of digital humanities, after the ‘computational turn’ in the arts, humanities, and social sciences we might really need a ‘humanist turn’ in big data analytics and computer science. A better understanding of society, literature, and art might not come from more data or faster computational tools by themselves, but it might spring from more cross-disciplinary collaboration.