Reading the Cloud: Scenes of Life in the Age of Big Data
By José Luis de Vicente
On November 26, 2009, the well-known website Wikileaks, which publishes confidential information on the Internet, released some particularly sensitive data: a set of half a million private messages intercepted from commercial text messaging services and pagers, a pre-cell phone technology that is now virtually obsolete. The messages, several years old by the time they were published, had all circulated through US telephone networks on one specific day, a date instantly recognizable to anybody on the planet: September 11, 2001.
Analyzing the data set released by Wikileaks proves enlightening, revealing as it does the hybrid nature of the data that circulate through digital infrastructures. As well as person-to-person messages (“PLEASE CALL YOUR DAUGHTER IMMEDIATELY”), there are user alerts generated by systems, and messages that are not understandable to humans – traces of the invisible communication between two computer systems.
In the course of the morning of that November 26, Wikileaks recreated the events of 9/11 in real time, releasing the messages minute by minute. The expressions of confusion and horror that had circulated through pagers on that day in September 2001 were, improbably, re-embodied in a system that did not yet exist on the day of the attacks: Twitter. Users of this popular microblogging network filtered through the thousands of messages to unearth the most striking, labeling them with an identifying hashtag: “#911txts”.
True to form, Wikileaks never revealed the source of the data or the person behind the leak. But thanks to that unidentified person, an invaluable document will be available to historians of the future: a second-by-second account, from thousands of individual points of view, of the defining event of the early years of the twenty-first century.
The fact that half a million private messages sent over an outdated technology were still being held in some government datacenter or telecommunications company could not be more contemporary. Data generation and storage is a booming industry, and people’s day-to-day social activity has become one of the largest sources of information that can be stored and preserved.
“Data tsunami” is a term that captures the widespread feeling that our capacity to generate and store information has grown exponentially over the last ten years, to almost unmanageable magnitudes. The quantitative leap between the first iPod launched on the market in 2001 (5 GB of capacity, space for 1,200 songs) and its equivalent in 2010 (160 GB, 40,000 songs) applies equally to any digital system that collects, processes and stores information.
Information storage is now so cheap that, rather than working out which information is worth keeping and which isn’t, the most profitable option is simply to store all of it, forever, with data protection laws as the only restriction. On a social network like Facebook, with more than 400 million users, every individual interaction, every photo tagged, and every gesture expressed with an “I like this” is preserved indefinitely.
Once we have crossed a certain threshold, the size of a data set places us in a different world. When the unit of measurement is the petabyte (a million gigabytes, or the amount of information processed by Google servers every 72 minutes), we are faced with a scenario that is new to scientific methodology. The challenges entailed in structuring and representing information in the age of “Big Data” are new, but so are the conclusions we can arrive at by analyzing it.
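The scale of these figures is easy to check with back-of-the-envelope arithmetic. The sketch below, in Python, turns the 72-minute petabyte into a daily rate and computes the iPod growth factor mentioned above; the input numbers are the ones cited in this essay, not independent measurements.

```python
# Back-of-the-envelope arithmetic using the figures cited in the text.

MINUTES_PER_DAY = 24 * 60      # 1440 minutes in a day
PB_INTERVAL_MINUTES = 72       # one petabyte processed every 72 minutes

petabytes_per_day = MINUTES_PER_DAY / PB_INTERVAL_MINUTES
print(petabytes_per_day)       # 20.0 petabytes per day

# The iPod leap: 5 GB (2001) to 160 GB (2010)
growth_factor = 160 / 5
print(growth_factor)           # 32.0, a 32-fold increase in nine years
```

In other words, at the cited rate the servers would process some twenty petabytes, twenty million gigabytes, every single day.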
Data mining – the process of detecting meaningful patterns in information structures – has thus become an essential technique for interpreting reality. Collectives like RYBN see the mass availability of data as a raw material that they cannot ignore. In “Antidatamining”, they subvert the techniques and tools employed by financial data analysts, using them to generate a visual interpretation of key processes in the systemic functioning of the global economy. A similar fascination with the visual codes of data structures can also be seen in “Still Living,” a series of installations by Antoine Schmitt that sublimate the signifier – pie charts, bar graphs, curves – by deleting all reference to the signified.
If you save your e-mail messages in a Gmail account rather than on your hard drive, use Spotify to play the songs you listen to, or store your photos on Flickr, your data lives in “the Cloud”. “The Cloud” is the term coined by the Internet industry for services in which data is stored not on the user’s computer but on remote servers, instantly accessible from any device. The metaphor is particularly misleading, because this cloud is neither diaphanous nor intangible.
The infrastructure required to keep the Cloud functioning comprises dozens of industrial premises scattered around the world: enormous warehouses that contain hundreds of processors, storage systems, and electrical and cooling infrastructure. Companies like Google, Facebook and Microsoft are known to own numerous datacenters in different countries, although details of their exact numbers and locations are confidential. These datacenters are the factory architecture of the information society: data factories that try to slip under the radar, their owners preferring not to give out too much information about them.
What we do know is that data distribution and preservation is a booming industry, and that its principal raw material is power. According to New York Times journalist Tom Vanderbilt, a datacenter has more in common with a machine than with an ordinary building: it consists mostly of electrical infrastructure that keeps the equipment operating, and cooling infrastructure that prevents it from overheating.
Even though it is in the interests of those who own and design datacenters to optimize their energy efficiency, their power consumption is nonetheless very high. Current estimates suggest that the Cloud already consumes two per cent of the world’s electricity, more than a country like Sweden. Around sixty per cent of this power is used simply to lower the temperature of the equipment.
In Helsinki, a former bomb shelter located below Uspenski Orthodox Cathedral will soon house a new datacenter. Rather than try to eliminate the heat emitted by its hundreds of continuously operating servers, Finnish company Academica is developing a system of pipes that will capture this waste energy and use it to heat a residential area of a thousand homes. Under the Cathedral, comments written in blogs, orders placed in online sales systems and holiday snaps will keep the residents warm at night.
The Self, Quantified
Nicholas Felton is a New York graphic designer with something of a cult following on the Internet. Every year, after the Christmas holidays, Felton meticulously crafts a document that he releases through his website: the Feltron Report. The Report is an exhaustive record of Felton’s activities throughout the preceding year – the number of photographs he took, the restaurants he ate at most often, the songs he played most on his iPod. In 2009, he decided to publish a special monograph on his encounters with other people over the course of the year. And so, the 2009 Feltron Report specifies that 255 of the 1,761 meetings between Nicholas and a relative, friend or work colleague took place in a restaurant, while another 60 were at a museum, art gallery or concert venue. Thirty different movies and twenty-five music groups were mentioned in their conversations. The brand of beer consumed most often was Stella Artois.
Nicholas Felton is not alone in his love of methodically recording his actions. As a matter of fact, a whole economy of digital products and services has sprung up around making life easier for people who want to compile and analyze data about themselves. Digital culture pioneer Kevin Kelly is one of these people, which is why he co-hosts The Quantified Self, a blog that offers “tools for knowing your own mind and body.” Visitors can find resources like Fitday.com, an online journal where dieters can keep a record of their progress as they try to lose weight; My Every Move, a location-based application for GPS-enabled cell phones that can tell you exactly where you were at any specific moment; Monthly Info, which helps women keep track of their menstrual cycle; and Bedpost, on which users can store the when, how and who-with of their sex lives. Even Nicholas Felton has tried his luck in the self-statistics industry, releasing Daytum, a startup that offers a simple, flexible tool for being Nicholas Felton.
All of this probably seems eccentric at best, and certifiable at worst. But in a sense, without making any special effort, all users of the Social Web are already Nicholas Felton to some degree. Probably without even noticing, they simply outsource this task of quantifying and measuring the patterns of their daily lives.
Bruce Schneier, perhaps the world’s best-known computer security expert, explains how the generation of personal information is an inevitable consequence of the omnipresence of digital architectures:
Welcome to the future, where everything about you is saved. A future where your actions are recorded (...) and your conversations are no longer ephemeral. A future brought to you not by some 1984-like dystopia, but by the natural tendencies of computers to produce data.
Data is the pollution of the information age. It’s a natural by-product of every computer-mediated interaction. It stays around forever, unless it’s disposed of.
Thus, when we add contacts to our Facebook account we are giving structure to our social life. When we positively rate a song on an on-demand radio service like Last.fm, we contribute to building a model of our music preferences, which the system will compare with those of all other users. If we enter details of our travel plans for the next few months in a Dopplr account, the system will calculate our environmental footprint, and it will also collect information about what kinds of people visit certain cities, and when.
What we are doing is “parsing” our everyday life – giving it a format that machines can understand. There is probably hidden value in the overall aggregation of all these individual decisions and stray data. The study and analysis of this vast storehouse of social processes may, perhaps, teach us something about ourselves.
In his “conversation maps”, Warren Sack plots and represents the connections between individual contributions on the Internet linked to a particular topic of discussion. His data visualizations connect threads with dozens of participants that spread through blogs, USENET groups or mailing lists. This cartographic work highlights the fact that the web can also operate as a public realm in which “the public of the network society can understand itself as a political body”.
Shadows and Tracks
“The consequences for the social sciences will be enormous: they can finally have access to masses of data that are of the same order of magnitude as that of their older sisters, the natural sciences”.
Bruno Latour, “Beware, Your Imagination Leaves Digital Traces”
At the 2006 Venice Biennale of Architecture, the architect Carlo Ratti publicly unveiled a map of Rome that revealed an entirely new dimension to the city. In “Real Time Rome” we see movement markers and differently colored areas projected onto the city’s streets, squares and avenues. These projections reflect the activity taking place on the city’s cell phone network as its inhabitants go about their daily lives. Its fluctuations show day-to-day rhythms, which are reflected in the layer of electromagnetic signals that hover above the city. In a later work, Ratti and his research team from MIT SENSEable City Lab showed the same network during an eagerly awaited football match, the final of the 2006 World Cup that pitted Italy against France. Through the variations in the intensity of cell phone activity, it is not difficult to read the game and the city’s behavior as though it were an organism: phone calls are negligible during the actual game, numerous at half-time, and very high at the instant of Zidane’s famous headbutt on Materazzi, and at the moment of Italy’s victory.
SENSEable City Lab and other similar research units are testing the hypothesis posed by Bruno Latour: if our everyday social activity generates a trail of data, and these trails are stored, preserved and organized, what can we learn about our collective behaviors and the laws that underlie them?
One optimistic response is offered by the Google Flu Trends project, which had a major media impact when it was made public in November 2008. In this initiative, researchers at the search giant compared official data from the Centers for Disease Control and Prevention on flu infection rates in each city in the United States with the flu-related searches carried out by Internet users on Google. The comparison showed a clear correlation – early flu symptoms produce a burst of searches related to the illness – but it also showed that the data reached Google two weeks earlier than through official channels. This revealed the potential of large-scale aggregation of personal data as a method for accessing collective knowledge that cannot be reached through other means.
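The logic behind Flu Trends can be illustrated with a minimal, hypothetical sketch: two weekly series, one of search volumes and one of officially reported cases, where the searches lead the cases by two weeks. The figures below are invented for illustration and have nothing to do with Google's actual data or model; the point is only that shifting the search series by the lag reveals the correlation.

```python
# A minimal sketch of the Flu Trends idea, using invented weekly figures
# (hypothetical data, not Google's or the CDC's).
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical series: search volume peaks two weeks before reported cases.
searches = [10, 12, 30, 80, 120, 90, 40, 20]
cases    = [5,  6,  8,  15, 35,  85, 115, 88]

# Align the two series by shifting the case counts back by the two-week lag,
# then measure how strongly the shifted series track each other.
lag = 2
r = pearson(searches[:-lag], cases[lag:])
print(r)  # a strong positive correlation once the lag is applied
```

With the lag applied the two invented series correlate almost perfectly; without it, the peaks fall in different weeks and the correlation drops. That gap between the two alignments is, in miniature, the two-week head start the project claimed.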
But as soon as personal data are preserved and aggregated, all kinds of uncomfortable questions start to crop up: who owns them, what they are used for, who has the right to access them. While Google defends the need to extend the maximum period for legal data retention so as to allow potentially valuable research like Google Flu Trends to be carried out, governments in authoritarian states like Dubai demand that telephone companies hand over copies of their customers’ SMS messages in order to convict unfaithful wives for endangering public morality.
When we cross a certain threshold, we enter a different world. And while Internet rights activists engage in an important battle to ensure that the industry growing around the preservation of social processes does not bring about the end of privacy, almost all the questions concerning its long-term implications are yet to be answered.
 At the time of writing (March 2010), the data is no longer available on Wikileaks, as the service was forced to withdraw most of its content because it could not afford its bandwidth expenses. The project is seeking donations to help cover its maintenance fees.
 Recently, the online magazine The Rumpus published a conversation with an unidentified Facebook employee. It is available at: http://therumpus.net/2010/01/conversations-about-the-internet-5-anonymous-facebook-employee/
The following fragment is of particular interest:
The Rumpus: On your servers, do you save everything ever entered into Facebook at any time, whether or not it’s been deleted, untagged, and so forth?
Facebook Employee: That is essentially correct at this moment. The only reason we’re changing that is for performance reasons. When you make any sort of interaction on Facebook — upload a photo, click on somebody’s profile, update your status, change your profile information —we definitely store snapshots, which is basically a picture of all the data on all of our servers. I want to say we do that every hour, of every day of every week of every month.
 Tom Vanderbilt, “Data Center Overload”. http://www.nytimes.com/2009/06/14/magazine/14search-t.html
 Robin Pagnamenta, “Computer power provides heat for Helsinki” http://business.timesonline.co.uk/tol/business/industry_sectors/natural_resources/article7022488.ece
 Bruce Schneier, “Privacy in the Age of Persistence” http://www.schneier.com/blog/archives/2009/02/privacy_in_the.html
 “Airline Crew in Dubai Jailed for Sexting” http://www.myfoxspokane.com/dpps/news/dpgoh-airline-crew-in-dubai-jailed-for-sexting-fc-20100318_6628380