Civic-minded hackers from all over the world recently organized themselves for action as part of the International Open Government Hackathon. The team from Portland, Oregon decided to use a platform called ScraperWiki, which can grab data from government websites and turn it into more consumable formats via the ScraperWiki API. Their work is an excellent example of developer ingenuity in unlocking data that is hard to use but still very useful.
The ScraperWiki work highlights a problem for governments that want to make data available to the public. Some data is currently available only in obscure formats or is locked in HTML pages that web applications cannot easily consume. “ScraperWiki is a great way to demonstrate to governments that programmers will put in work to clean up messy data, in whatever format it is released. The more important issue is authorizing the release of the datasets, not worrying about what format they are released in,” said event organizer Max Ogden. Ogden describes awesomely simple web scraping with ScraperWiki in the video embedded below.
In an ideal world, these datasets would be available in JSON or XML, but any number of barriers can keep data from being released this way. Budget priorities could dictate that other things come first. Some agencies just don’t realize the value of the data to developers or the public beyond a table on a static page. If the data represents a potential opportunity for recovering costs, some officials might be reluctant to make it easier for developers to use freely.
The ScraperWiki platform assumes that data wants to go beyond static web pages. ScraperWiki sets data free by establishing rules and automating data capture from the original site. Developers and hackers can write their scraper code in Ruby, PHP, or Python. The resulting data is then available through an API. The main requirement for successful scraping is that the input must have some sort of pattern or structure that can be parsed. This might sound like an imperfect solution, but if the data is valuable enough, it makes sense for developers to spend the time extracting it.
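To make the "pattern or structure" requirement concrete, here is a minimal sketch of the general idea in Python, using only the standard library. It is not ScraperWiki's own API or the Portland team's code. The sample HTML, the TableParser class, and the table_to_records function are all hypothetical names invented for illustration, and a real scraper would download the page rather than use an inline string. The sketch assumes the data sits in a simple HTML table whose first row holds the column names.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this HTML from a site.
SAMPLE_HTML = """
<table>
  <tr><th>Route</th><th>Stop</th><th>Departs</th></tr>
  <tr><td>4</td><td>Main St</td><td>08:15</td></tr>
  <tr><td>12</td><td>Oak Ave</td><td>08:40</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Turn a simple HTML table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = table_to_records(SAMPLE_HTML)
print(json.dumps(records, indent=2))
```

The regular row-and-cell layout is what makes this work: each table row becomes one JSON object that a web application can consume directly. A page without that kind of repeating structure would need hand-written rules for every case, which is where the real effort in scraping goes.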
Web page scraping has been a staple of civic hackers for years. In 2009 the New York Metropolitan Transportation Authority (MTA) sent a takedown notice to Chris Schoenfeld for scraping transit schedules for his mobile bus app. The MTA leadership eventually reversed that decision and opened its data to developers for building public applications. The 2009 New York case and the recent work by the Portland hackers show that willpower and ingenuity are powerful forces in helping the public get more from its data.