Yesterday marked the tech industry’s advocacy action, The Day We Fight Back. The campaign aims to raise awareness of the National Security Agency’s (NSA’s) mass surveillance program, which, according to The Washington Post, is collecting content and metadata from e-mails, chats and social networks; harvesting contact lists; and gathering billions of records each day on cellphone locations.
The campaign also celebrated the defeat two years ago of the Stop Online Piracy Act (SOPA) in the United States, which demonstrated the power of community action to defend citizen data and Internet content rights. SOPA would have made it possible to cut off access to sites that allow user-generated content.
One of the worrisome aspects of the legal situation in the United States—and, increasingly, in the international legal context as well—is that government and company rights to data surveillance are being enshrined in prescriptive legislation while individual citizen and independent business rights are considered less of a priority.
The most obvious disparity in these rights is at the center of The Day We Fight Back’s mission statement: to fight against the unconstitutional surveillance of U.S. and international citizens by the NSA.
But just as insidious is the way in which some large corporations and governments are using poorly written legislation to enforce criminal sanctions against individual developers sourcing data from the Internet. The Computer Fraud and Abuse Act (CFAA) allows criminal prosecution of anyone who “exceeds authorized access.” The legislation was a key factor in the prosecution of Aaron Swartz, the developer activist who took his own life after being criminally charged under the CFAA. Swartz devoted his young life’s work to supporting open access to data and scientific research, and was a lead player in the formation of Creative Commons and Reddit. His contribution to an open Internet was also recognized in yesterday’s advocacy actions.
However, a year after Swartz’s tragic death, others are still falling victim to the legal nightmares of the CFAA. Andrew Auernheimer is currently serving a jail sentence for having created a tool that automatically scraped data from an AT&T Web site. He changed his computer’s settings so that the AT&T Web site thought he was using an iPad and sent him to an unsecured Web page where, because of a site security flaw, he could use an automated number-generating script to troll for individual e-mail addresses. (He did not use the individual e-mails himself, but he did alert journalists to the security breach.)
As Electronic Frontier Foundation (EFF) Staff Attorney Hanni Fakhoury told me for ProgrammableWeb’s upcoming in-depth series on Web scraping, if Auernheimer had used an iPad, written down all the e-mail addresses by hand with a pad and pencil, and then entered them into a computer, he could not have been charged. But because he automated the process with a Web scraping script (and declared he was using an iPad when he was not), his actions breached the “authorized access” provisions. So AT&T took him to court, where his activity was recognized as a criminal act, and Auernheimer is now serving a 41-month prison sentence. The EFF is currently working on an appeal of the case.
Unfortunately, this is not just a U.S.-centered problem. Fakhoury points out that international businesses using a similar approach to scrape Web data could be charged under the CFAA. Furthermore, in recent weeks, French blogger and entrepreneur Olivier Laurelli was fined $4,100 for accessing Google-indexed government documents that had been made public but, apparently, were meant to sit behind a login authentication page.
Given the criminal implications of the CFAA in the United States, and the prosecution of data scraping developers in other countries, ProgrammableWeb wanted to help entrepreneurs and businesses who scrape and mine public data to be aware of their rights, their obligations, and the best ways to protect themselves. After consulting with several U.S. and international legal experts, we suggest the following process for developers seeking to use data for social good and for businesses looking to innovate with public data sources. Please note that this is not legal advice but merely a summary of best-practice recommendations as discussed with legal advocates and public data rights activists.
First and foremost, we encourage ProgrammableWeb readers to stay informed about data rights issues and to consider participating in activities of the EFF. Follow the foundation on Twitter, and check the EFF blog regularly.
We also encourage you to review the terms of service of Web sites and apps that are collecting your data and ensure that you have the right to access your own personal data and can see how the Web site or app plans to use the data it collects about you.
For entrepreneurs and developers with a hankering to play with data from the Web, we suggest the following process for scraping it (whether that is done directly from Web sites or via their APIs):
1. Check the copyright notice on the Web site and API from where the data is being sourced. Save a copy with date cited.
2. Check the terms of service, and save a copy with date cited.
3. If the data is released under a Creative Commons license, or the terms of service of the API/Web site allow for reuse of site data in a way that matches your planned use, contact the data owner to introduce yourself and explain how you plan to use the data. Keep a copy of all correspondence.
4. Collect your data and store it in a spreadsheet. Keep a log of the process used to scrape the data.
5. Identify all publication sources where the scraped data will be presented (via your own API, your own Web site, blogs and other generated content). Ensure that the source of the data is acknowledged as stipulated in the original owner’s copyright notice and terms of service. Inform the original data owner of the reuse and attribution.
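For developers who want to automate the record keeping in steps 1, 2 and 4, the following is a minimal Python sketch, not a definitive implementation and not legal advice. It assumes you have already obtained the text of a copyright notice or terms of service by some means of your own. The function names `save_dated_copy` and `log_step`, the file layout, and the example URL are hypothetical and chosen only for illustration; they do not come from any particular library or from the legal experts consulted.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_dated_copy(text: str, label: str, source_url: str, archive_dir: Path) -> dict:
    """Save a snapshot of a notice (steps 1 and 2) and return a dated record of it.

    Hypothetical helper: the record layout is an assumption, not a standard.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Name the file after the content hash so a changed notice gets a new file.
    snapshot_path = archive_dir / f"{label}-{digest[:12]}.txt"
    snapshot_path.write_text(text, encoding="utf-8")
    return {
        "label": label,
        "source_url": source_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
        "snapshot": snapshot_path.name,
    }


def log_step(log_path: Path, action: str, **details) -> dict:
    """Append one JSON line describing an action to the scrape log (step 4)."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"at": datetime.now(timezone.utc).isoformat(), "action": action, **details}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

A typical use would be to call `save_dated_copy(terms_text, "terms-of-service", "https://example.com/terms", Path("scrape-records"))` once before collecting any data, then call `log_step` for each page or API request made. An append-only log of one JSON object per line is used here because it is easy to review later alongside the saved correspondence.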
In early April, we will be publishing the first part of our in-depth look at Web scraping and APIs. This series starts by exploring some of the legal and process issues involved in scraping Web data and creating API endpoints from it. In Part Two, we will review available Web tools that scrape data and store it via an API interface. Future parts of the series will explore coding tools that create APIs from scraped data. Finally, we will explore business models and commercial opportunities that can be generated by using public data scraped from the Web. Throughout the series, we urge readers to return to the above process to best ensure your data rights protection.
Data rights are set to become an even bigger issue this year. In addition to the issues raised above and by the organizers of The Day We Fight Back, we are also entering uncharted territory with wearable tech, connected homes and devices, smart cities, and the Internet of Things. In each of these domains, while personal and community-level data is being collected (and perhaps already being watched by the NSA), how developers can access this data has yet to be clarified. The challenge for many tech start-ups and data business innovators will be not only how to protect the personal data they collect, but also how they share it with the data owner and how they define rights of access for third parties that might want to aggregate anonymized data items from these sources.
By Mark Boyd. Mark is a freelance writer focusing on how we use technology to connect and interact. He writes regularly about API business models, open data, smart cities, Quantified Self and e-commerce. He can be contacted via e-mail, on Twitter or on Google+.