An upcoming PDF Liberation Hackathon is aimed at raising developers’ skills in unlocking data from PDF sources. The hackathon will be held onsite in Washington, D.C. and in San Francisco on Jan. 17 – 19, 2014, and international developers can also compete remotely. ProgrammableWeb spoke with organizer Marc Joffe about how API-focused developers can participate.
For any developer trying to wrangle data out of online source documents and into a format that is machine-readable and programmable, one of the most frustrating challenges is how to extract data from PDFs. This problem is set to grow as more open data from governments is supplied in PDF format. (Recent reports on ProgrammableWeb confirm that when European and U.S. government departments first start opening their data, many begin by putting existing data online in whatever format they already have it, including PDF.)
The cash prizes are small, but they are not the main draw for this challenge. Instead, developers will get a rare opportunity to work together on one of the most disheartening aspects of data mining: extracting data from PDFs.
“We will have several cash prizes in amounts up to $500,” Joffe confirmed to ProgrammableWeb. “Winning entries will be featured on the hackathon page and any pure open source winners will be featured on Sunlight’s Developer Community page. This is a great way for entrants to get publicity and establish a reputation within the open data and open government communities.”
Peter Murray-Rust ran a similar hackathon in France earlier this year and explained on his blog the benefits of participating in this type of challenge:
“Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don’t have to boil the ocean by yourself. You find that others are working on the same challenge and that’s enormously liberating. They face the same problems and often solve them in different ways or have different priorities.”
To prepare for the hackathon, the organizers have launched a comprehensive list of PDF extraction tools and software – predominantly open source – that competitors can use. Competitors are invited to make use of the tools directly or to build features and plugins to create problem-specific solutions, depending on the source dataset. Joffe explains:
“I think that the tools created at the hackathon will be potentially useful by themselves without further extension. Say, for example, that a team works on extracting non-profit executive compensation information from IRS Form 990s (in the United States). [Writer's note: This form shows the salaries of key personnel working in not-for-profit institutions. Particularly in sectors like healthcare, the salaries of CEOs of not-for-profit/charitable organizations can mimic high-end private sector pay scales and mining data in IRS Form 990s can reveal these imbalances.] New 990s are being published all the time, so the tool could be applied on an ongoing basis just to this class of documents. The same could be said for precinct level crime reports in New York City and government financial audit reports. Obviously, anything coming out of the hackathon will be rough, but I hope a number of new application-specific extraction repositories will be released and then incrementally improve after the event.”
Two problems with data extracted from PDFs are that it may still need additional cleaning, and that it is still not in a machine-readable, API-accessible format that can be easily plugged into visualizations or kept up to date in real time. “Data may contain errors – especially if they were produced via OCR. It is best to perform range checks on numeric outputs and to test for internal consistency (for example, if the PDF contains both a total and all the elements making up that total, the elements can be summed and then compared to the total),” Joffe said. “Also, a number of commercial players are offering PDF extractions on a SaaS basis. These include ABBYY, BCL Technologies and IDR Solutions. DocumentCloud offers an open source PDF upload, text extraction and display service – but only to journalists. Developers participating in the hackathon can implement their extraction solutions as web APIs, thereby saving prospective users from the pain of installation and deployment.”
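The two checks Joffe describes, a range check on each numeric value and a comparison of summed line items against the reported total, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the default bounds, the tolerance and the sample figures are all assumptions made for the example, not part of any hackathon tool.

```python
import math


def validate_extracted(line_items, reported_total,
                       min_value=0.0, max_value=1e12, tolerance=0.01):
    """Run basic sanity checks on numbers extracted from a PDF.

    line_items: dict mapping a label to the extracted numeric value.
    reported_total: the total printed in the same document.
    The bounds and tolerance are hypothetical defaults; pick values
    that make sense for the class of documents being processed.
    """
    # Range check: flag any value outside the plausible interval.
    out_of_range = [label for label, value in line_items.items()
                    if not (min_value <= value <= max_value)]

    # Internal consistency: the elements should add up to the total.
    computed_total = math.fsum(line_items.values())
    total_matches = abs(computed_total - reported_total) <= tolerance

    return {
        "out_of_range": out_of_range,
        "computed_total": computed_total,
        "total_matches": total_matches,
    }


# Made-up figures: a clean extraction, then one where OCR is imagined
# to have misread the "3" in 300.0 as an "8".
clean = validate_extracted(
    {"salaries": 1200.0, "rent": 300.0, "supplies": 50.5}, 1550.5)
garbled = validate_extracted(
    {"salaries": 1200.0, "rent": 800.0, "supplies": 50.5}, 1550.5)
```

A record that fails either check is not necessarily wrong, but it is a good candidate for manual review before the data is published or served through an API.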
Data sources to be explored in the challenge include government financial data, crime statistics, campaign contribution disclosures, Congressional financial disclosures, regulatory filings and non-profit disclosures in IRS Form 990. Challenges will be published on the PDF Liberation website three days before the hackathon commences, giving time for international teams to assemble and compete alongside those entrants attending one of the hackathon sites in the US.
With his main focus being on public sector financial data, Joffe is planning his challenge around data sources that he comes across in his work daily. “My challenge will involve local government financial audits. There will also be challenges related to crime statistics, campaign finance and IRS form 990. But the challenges need not all be financial. It is also worth pointing out that participants do not have to use one of the pre-selected document categories. If they are interested in other classes of PDFs, they can hack those as well, and may get creativity points in the final judging. Also, a team could simply make enhancements to an existing tool like Tabula. These enhancements need not relate to a specific class of PDFs; they could just benefit all Tabula users,” Joffe said.
Developers can register to compete via the PDF Liberation website. Developers interested in data mining are also encouraged to check out the PDF extraction resource list being provided by the organizers on the event website.
By Mark Boyd. Mark is a freelance writer focusing on how we use technology to connect and interact. He writes regularly about API business models, open data, smart cities, Quantified Self and e-commerce. He can be contacted via email, on Twitter, or on Google+.