Editor’s note: This guest post comes from Jim Hendler, a professor, web researcher, and Semantic Web evangelist working at Rensselaer Polytechnic Institute. You can see more of his team’s ongoing research at Tetherless World.
A recent article by Tim Berners-Lee, “Putting Government Data online”, has drawn significant attention to the datasets published on the US data.gov website. Since Berners-Lee discusses the Semantic Web techniques that can be used to bring those data into RDF space (something we are now working on), we would like to share our initial investigation of the contents of these government datasets.
I. Translate dataset into RDF
The catalog of the datasets on data.gov, http://www.data.gov/details/92, is published in CSV format as part of data.gov. We converted it into RDF using simple CSV parsing. We kept the translation minimal: (i) the properties are created directly from the column names; (ii) each table row is mapped to an instance of pmlp:Dataset; (iii) all non-header cells are mapped to literals – we don’t create new URIs at this point. The output of our work is published on the Tetherless World website at:
(We are now starting to do more integration work – extracting multiple objects from single tables, linking into the Linked Open Data cloud, etc. – and will publish a new version when that is done; the purpose of this first pass was simply to make the catalog more available to the RDF community.)
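The minimal translation described above can be sketched in a few lines of Python. The namespace URIs and the entry-numbering scheme below are illustrative assumptions, not necessarily those of our published output:

```python
import csv, io

# Assumed namespace URIs -- illustrative, not necessarily the ones
# used in our published conversion.
PMLP = "http://inference-web.org/2.0/pml-provenance.owl#"
BASE = "http://data-gov.tw.rpi.edu/raw/92/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def csv_to_ntriples(csv_text):
    """Translate a CSV table into N-Triples, one pmlp:Dataset per row:
    (i)   properties are created directly from the column names;
    (ii)  each table row becomes an instance of pmlp:Dataset;
    (iii) every non-header cell becomes a plain literal (no new URIs).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    lines = []
    for i, row in enumerate(reader):
        subj = "<%sentry_%d>" % (BASE, i)
        lines.append("%s <%s> <%sDataset> ." % (subj, RDF_TYPE, PMLP))
        for col, cell in zip(header, row):
            # Property URI derived from the column name.
            prop = "<%s%s>" % (BASE, col.strip().replace(" ", "_"))
            # Escape backslashes and quotes for N-Triples literals.
            lit = '"%s"' % cell.replace("\\", "\\\\").replace('"', '\\"')
            lines.append("%s %s %s ." % (subj, prop, lit))
    return "\n".join(lines)

sample = "URL,Agency\nhttp://www.data.gov/details/92,EPA\n"
print(csv_to_ntriples(sample))
```

The sample row above yields three triples: the rdf:type assertion plus one literal-valued property per column.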
II. Browse and query the RDF graph
III. Observations on the RDF graph
Using this service we can answer some basic questions about the data.gov datasets:
1. How many datasets are published, and how many among them can be easily converted into RDF?
There are 332 datasets, which can be partitioned by type: raw data catalogs (301) and tool catalogs (31).
Not all of the datasets have a link to downloadable data, because some offer only browseable data via their own websites; others publish datasets in multiple formats. As of today, the online static files associated with the datasets are distributed as follows: 204 datasets offer a CSV dump, 10 offer an XML dump, and 21 offer an XLS dump.
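For instance, the count of datasets offering a CSV dump can be obtained with a query along the following lines. The property name csv_link is hypothetical – in our minimal translation, properties simply take their names from the catalog’s column headers – and the COUNT aggregate assumes an endpoint that supports SPARQL aggregate extensions:

```
PREFIX pmlp: <http://inference-web.org/2.0/pml-provenance.owl#>
PREFIX raw:  <http://data-gov.tw.rpi.edu/raw/92/>

# Count the catalog entries that link to a CSV dump.
SELECT (COUNT(?d) AS ?n)
WHERE {
  ?d a pmlp:Dataset .
  ?d raw:csv_link ?url .
  FILTER (?url != "")
}
```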
2. How are the datasets categorized?
3. What are some of the key items in the dataset?
4. What are the sources of the datasets?
The majority of the datasets are published by the EPA; they contain environmental data partitioned by US state for three individual years. The rest come from other government agencies – the distribution is as follows:
IV. Getting the datasets linked
Although the datasets are not explicitly linked, we see a number of opportunities for connecting them to one another (and into the Linked Open Data datasets):
We are committed to getting more of the data.gov data online (in RDF) soon, and then investigating data integration and knowledge discovery. To get our datasets linked into the Linked Data cloud, we will use SPARQL to extract entities and our Semantic MediaWiki as a platform for capturing the owl:sameAs mappings. Publishing at scale is also challenging, as some of these are very large datasets – e.g., the “2005-2007 American Community Survey Three-Year PUMS Population File” has a 1.1 GB zipped CSV file. Moreover, some datasets are not directly available as a single file but only via a web service. Our current plan is to make RDF documents available for download soon, and to work on bringing more of these datasets into live, SPARQLable forms as we can.
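The mapping step can be sketched as follows. Every URI below – both our generated entity URIs and the DBpedia targets – is an illustrative assumption rather than a final identifier; the real pairs would be the ones captured on our Semantic MediaWiki:

```python
# Sketch: serialize owl:sameAs mappings (e.g., captured on a wiki)
# as N-Triples, linking our entity URIs to Linked Open Data URIs.
# All URIs here are hypothetical examples.
BASE = "http://data-gov.tw.rpi.edu/raw/"
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

# Illustrative mapping pairs: our URI -> LOD-cloud URI.
mappings = {
    BASE + "state/New_York": "http://dbpedia.org/resource/New_York",
    BASE + "agency/EPA":
        "http://dbpedia.org/resource/United_States_Environmental_Protection_Agency",
}

def sameas_ntriples(pairs):
    """Serialize URI pairs as owl:sameAs statements in N-Triples."""
    return "\n".join("<%s> <%s> <%s> ." % (ours, OWL_SAMEAS, theirs)
                     for ours, theirs in sorted(pairs.items()))

print(sameas_ntriples(mappings))
```

Once loaded alongside the converted catalog, these statements let a SPARQL endpoint join our records with whatever the LOD cloud already knows about the same entities.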