What’s in Data.gov?

Guest Author, July 20th, 2009

Editor’s note: This guest post comes from Jim Hendler, a professor, web researcher, and Semantic Web evangelist working at Rensselaer Polytechnic Institute. You can see more of his team’s ongoing research at Tetherless World.

A recent article by Tim Berners-Lee, “Putting Government Data online”, has attracted significant interest in the datasets published at the US data.gov website. As Berners-Lee discusses the Semantic Web techniques that can be used to get those data into RDF space (something we are now working on), we would like to share our initial investigation of the contents of these government datasets.

I. Translate dataset into RDF

The catalog of the datasets in data.gov, http://www.data.gov/details/92, is published in CSV format as part of data.gov. We converted it into RDF using simple CSV parsing. We kept the translation minimal: (i) the properties are created directly from the column names; (ii) each table row is mapped to an instance of pmlp:Dataset; (iii) all non-header cells are mapped to literals; we don’t create new URIs at this point. The output of our work is published on the Tetherless World website at:

http://data-gov.tw.rpi.edu/raw/92/catalog.rdf

(We are now starting to do more integration work, such as extracting multiple objects from single tables and linking into the linked open data cloud, and will publish a new version when that is done. The purpose of this first work was simply to make the catalog more available to the RDF community.)
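
As a rough illustration of the three translation rules above, here is a minimal Python sketch that turns CSV text into N-Triples. It is not the code used to produce catalog.rdf. The PML-P namespace URI is our best guess at what the pmlp prefix expands to, and the property and row base URIs, the column names, and the sample row are all hypothetical.

```python
import csv
import io
import re

# Assumption: this is the namespace behind the "pmlp" prefix.
PMLP = "http://inference-web.org/2.0/pml-provenance.owl#"
# Hypothetical base URIs for generated properties and row instances.
PROP_BASE = "http://example.org/data-gov/vocab#"
ROW_BASE = "http://example.org/data-gov/dataset/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def property_uri(column_name):
    """Rule (i): derive a property URI directly from a column name."""
    local = re.sub(r"[^A-Za-z0-9]+", "_", column_name.strip()).strip("_").lower()
    return PROP_BASE + local


def escape_literal(value):
    """Escape a string so it is a valid N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Yield one N-Triples line per generated triple."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = f"<{ROW_BASE}{index}>"
        # Rule (ii): every table row is an instance of pmlp:Dataset.
        yield f"{subject} <{RDF_TYPE}> <{PMLP}Dataset> ."
        for column, cell in row.items():
            if column and cell:  # skip empty cells and ragged extras
                # Rule (iii): cells become literals, never new URIs.
                yield (f"{subject} <{property_uri(column)}> "
                       f'"{escape_literal(cell)}" .')


# Hypothetical two-column sample; the real catalog has different columns.
sample = "Title,Agency\nSome environmental data,EPA\n"
for triple in csv_to_ntriples(sample):
    print(triple)
```

Emitting N-Triples keeps the sketch free of third-party dependencies; an RDF library would be the more natural choice for producing RDF/XML like the published file.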

II. Browse and query the RDF graph

As an example, we can browse the dataset in Tabulator and then use a SPARQL web service to query it. For instance, we use a SPARQL query to list the datasets published in CSV format:

http://onto.rpi.edu/sw4j/sparql?queryURL=http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql
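
The request above simply passes the location of a stored query file to the service in a queryURL parameter. A small Python sketch of assembling such a request follows. The function name is ours, and the sketch only builds the URL; it does not assume the service is still reachable or say anything about its response format.

```python
from urllib.parse import urlencode

# Both URLs are taken from the request shown above.
SERVICE = "http://onto.rpi.edu/sw4j/sparql"
QUERY_FILE = "http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql"


def build_request_url(service, query_url):
    """Attach the stored query's location as the queryURL parameter."""
    # safe=":/" leaves the nested URL unescaped, matching the form above.
    return service + "?" + urlencode({"queryURL": query_url}, safe=":/")


request_url = build_request_url(SERVICE, QUERY_FILE)
print(request_url)
```

Swapping in a different .sparql file location is then enough to run another stored query against the same service.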

III. Observations on the RDF graph

Using this service we can answer some basic questions about the data.gov datasets:

1. How many datasets are published, and how many among them can be easily converted into RDF?

There are 332 datasets, which can be partitioned by type: raw data catalog (301) and tool catalog (31).

Not all of the datasets have a link to downloadable data, because some offer only browsable data via their own websites. Others publish datasets in multiple formats. As of today, the online static files associated with the datasets are distributed as follows: 204 datasets offer a CSV format dump, 10 datasets offer an XML format dump, and 21 datasets offer an XLS format dump.
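
A tally like the one above can be reproduced by counting, for each format, the catalog rows that carry a download link. The Python sketch below shows the idea on a tiny made-up sample. The column names (csv_link, xml_link, xls_link) and the rows are hypothetical, not the real catalog headers, and a dataset offering several formats is counted once per format.

```python
import csv
import io
from collections import Counter

# Hypothetical mapping from format label to the catalog column holding its link.
FORMAT_COLUMNS = {"CSV": "csv_link", "XML": "xml_link", "XLS": "xls_link"}


def count_formats(csv_text):
    """Count how many rows offer a non-empty link for each format."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for fmt, column in FORMAT_COLUMNS.items():
            if (row.get(column) or "").strip():
                counts[fmt] += 1
    return counts


# Made-up sample: dataset C has no static download at all.
sample = (
    "title,csv_link,xml_link,xls_link\n"
    "A,http://example.org/a.csv,,\n"
    "B,http://example.org/b.csv,http://example.org/b.xml,\n"
    "C,,,\n"
)
counts = count_formats(sample)
for fmt in FORMAT_COLUMNS:
    print(fmt, counts[fmt])
```

Because formats overlap and some datasets have no static file, the per-format counts are not expected to sum to the total number of datasets.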

2. How are the datasets categorized?

[Table: datasets by category (image datagov_table1 not reproduced)]

3. What are some of the key items in the dataset?

4. What are the  sources of the datasets?

The majority of the datasets are published by the EPA, and they contain environmental data partitioned by US state for three individual years. Others come from other government agencies; the distribution is as follows:

IV. Getting Datasets linked

Although the datasets are not explicitly linked, we see a number of opportunities for connecting these datasets to others (and into the Linked Open Data datasets):

  • A large percentage of the files have some sort of geo-tagging, so they can be linked to DBpedia or GeoNames (and then presented via map services).
  • Some datasets are subsets of other datasets, e.g. the EPA’s “2005 Toxics Release Inventory data for the state of Georgia” is a subset of “2005 Toxics Release Inventory National data file of all US States and Territories”, which makes for easier “internal” linking of the datasets.
  • A number of the datasets contain temporal information, e.g. the IRS’s “Tax Year 1992 Private Foundations Study”, …, “Tax Year 2005 Private Foundations Study”, which provides an opportunity for mashups using timelines and such.
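
To make the geographic and temporal opportunities above concrete, here is a small Python sketch that pulls candidate link targets out of dataset titles: years for timelines and DBpedia resources for places. The three-entry lookup table is purely illustrative and the function name is ours; a real linker would need a full gazetteer and disambiguation, and the choice of predicate for the resulting links is left open.

```python
import re

# Illustrative lookup only; a real linker would use a complete gazetteer.
STATE_TO_DBPEDIA = {
    "Georgia": "http://dbpedia.org/resource/Georgia_(U.S._state)",
    "Ohio": "http://dbpedia.org/resource/Ohio",
    "Texas": "http://dbpedia.org/resource/Texas",
}

# Four-digit years in the 1900s or 2000s.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def link_hints(title):
    """Return the years and candidate DBpedia places mentioned in a title."""
    years = [int(y) for y in YEAR_PATTERN.findall(title)]
    places = [
        uri for name, uri in STATE_TO_DBPEDIA.items()
        if re.search(r"\b" + re.escape(name) + r"\b", title)
    ]
    return {"years": years, "places": places}


for title in (
    "2005 Toxics Release Inventory data for the state of Georgia",
    "Tax Year 1992 Private Foundations Study",
):
    print(title, "->", link_hints(title))
```

Matching on titles alone is crude: a bare name like “Georgia” could also refer to the country, which is why the output is best treated as a list of candidates for review rather than final links.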

V. Conclusions

We are committed to getting more of the data.gov data online soon (in RDF), and then investigating data integration and knowledge discovery. In order to get our datasets linked to the linked data cloud, we will use SPARQL for extracting entities and our Semantic MediaWiki as a platform to capture the owl:sameAs mappings. Scalable dataset publishing is also challenging, as some of these are very large datasets, e.g. “2005-2007 American Community Survey Three-Year PUMS Population File” has a 1.1 GB zipped CSV file. Moreover, some datasets are not directly available in one file but only via a web service. Our current plan is to make RDF documents available for download soon, and to work on bringing more of these datasets into live, SPARQLable forms as we can.


10 Responses to “What’s in Data.gov?”

July 20th, 2009
at 3:41 am
Comment by: stoimen

Quite interesting!

July 20th, 2009
at 8:38 am
Comment by: Darklg Web (darklgweb) 's status on Monday, 20-Jul-09 12:38:28 UTC - Identi.ca

[...] http://blog.programmableweb.com/2009/07/20/whats-in-datagov/ [...]

July 21st, 2009
at 4:02 pm
Comment by: of cabbages and kings… » Putting government data online

[...] advisor to the government on how to make its data more widely available, it is interesting to see what has been happening in the US, and also to read some of Tim’s early thoughts on what we should be [...]

July 22nd, 2009
at 2:31 am
Comment by: Data.gov Revealed : Beyond Search

[...] ran an analysis by Jim Hendler called “What’s in Data.gov?” I must admit that I have not set aside the time necessary to figure out what this new government [...]

July 23rd, 2009
at 11:39 am
Comment by: Kenneth Udut

I’ve been fascinated by the RDFization of so much data.

I’ve been hesitant to do too much with RDF just yet on my site, as it looks like excessive code to Google (and I’m still worried about rankings and things), but I’m getting everything ready for a big “push” for when they get the kinks worked out of their systems and fully embrace linked data (it’s just a matter of uncommenting a few lines of code here and there).

:: fingers crossed ::

excellent job converting the gov data – I love it and hope I can make some good use of it.

August 26th, 2009
at 2:29 am
Comment by: 3 Finalists Rise to the Top in Apps For America Contest

[...] contest came with a challenge to expose the contents of Data.gov, the initiative to increase public access to government [...]

November 30th, 2009
at 3:06 am
Comment by: How News Sites Are Using Maps

[...] more and more public data being made available now, such as the United States’ Data.gov. And, as O’Reilly Radar noted when EveryBlock was acquired by MSNBC, data is journalism. And [...]

February 13th, 2011
at 9:45 am
Comment by: Quest for a broad categorization of publishable government data | apoikola

[...] from EU member states and international list of data catalogues). This has created interest in analysing the content, comparing the catalogues and building unifying search interfaces to [...]

April 1st, 2011
at 10:24 am
Comment by: Data 2.0 Coming Next Week as U.S. Unplugs Data.gov

[...] with seven other sites, according to ReadWriteWeb. When Data.gov launched in 2009, we looked at what’s in Data.gov. The site was unveiled with much optimism about open government data, something long spearheaded by [...]

June 7th, 2012
at 9:36 am
Comment by: US Census API Makes Giant Data Dumps Are Optional

[...] the new open data movement. In a way, it was far ahead of its time, providing data long before Data.gov aimed to open up goverment. However, one of the issues as a developer I’ve always had with government data is that it [...]
