Drafting Guidelines for Government Data Catalogs

2010-03-29

Originally posted on the Sunlight Labs blog. Written with David James.

A major focus of the Sunlight Labs is to push government to publish its data online. In recent months, we’ve gained in-depth familiarity with government data catalogs through our work on the National Data Catalog. The most prominent example of a data catalog is data.gov. Since its launch last year, a handful of states and cities have followed suit with their own efforts. As more data catalogs come online, we want to make sure their contents are open and exchangeable. We want to determine how to best structure the data catalog itself, and we want to ensure that the metadata it contains – the data about the data – exists in the most accessible way possible.

Last week, Clay posted three challenges for the community to tackle, and this is challenge #3. We’re looking to start this conversation now and move towards consensus within a few months. I was at TransparencyCamp digging deeper into this topic, which puts us on the path to making recommendations that governments can adopt quickly.

Resource-Oriented Architecture

A data catalog lives on the Web, so it makes sense for it to embrace the architecture of the Web. A simple Web site with a few pages is good for citizen use, but doesn’t lend itself to interoperability. As such, we strongly recommend following the Resource-Oriented Architecture (ROA) guidelines. When applied to the concept of a government data catalog, ROA means:

Resources that represent the data sources, agencies, and jurisdictions (if there is more than one) should exist.
Each resource should have a unique URI, so that each one can be addressed individually.
Resources should be made available in both human-friendly (HTML) and machine-readable formats (such as XML or JSON).
Content negotiation should be used to serve the resource correctly depending on the user agent.
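
As a rough illustration of the last point, here is a minimal Python sketch of content negotiation: it reads an HTTP Accept header and picks one of the representations a catalog might serve. The list of supported media types and the function names are our own assumptions, and the matching is deliberately simplified (it ranks by q-value only and ignores the specificity rules in the HTTP spec).

```python
# Assumption: the catalog can serve each resource as HTML, JSON, or XML.
SUPPORTED = ["text/html", "application/json", "application/xml"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # The sort is stable, so header order breaks ties between equal q-values.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_representation(accept_header, supported=SUPPORTED):
    """Return the supported media type the client prefers, or None."""
    if not accept_header:
        return supported[0]  # no stated preference: fall back to HTML
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "application/*" -> "application/"
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
    return None  # the server would answer 406 Not Acceptable


print(choose_representation("application/json"))
print(choose_representation("text/html,application/xml;q=0.9,*/*;q=0.8"))
```

A browser sending a typical Accept header would get the HTML page, while a script asking for application/json would get the same resource, at the same URI, as JSON.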

Defining a Vocabulary

Let’s turn to the metadata that describes a particular government data source. Existing government data catalogs like data.gov have already established some best practices. But looking back through the Internet’s history, it’s helpful to consider the work of the Dublin Core, which over a decade ago published a set of fifteen metadata elements for describing online resources. The Dublin Core spec applies to a wide range of things published online, from videos to academic papers to datasets. The fifteen elements forming the spec give us a good starting point, and remind us how similar online government data is to any other kind of resource published online. Some elements are not fully applicable (Language and Contributor, for example). Others can be broken up into several elements: Coverage, for instance, can mean a physical geographic area, a political jurisdiction, or a period of time.

Going from theoretical to practical, let’s look at some sample data pages for existing data catalogs:

Looking at those four examples, we begin to see a lot of similarities among the catalogs and with Dublin Core. From these, we propose a preliminary vocabulary:

Element Name	Description
Title	Title of data source.
Description	Short description of data source.
URL	Permanent, unique URL that contains this metadata. Can be self-referential.
Type	Is the data source a dataset, API, or online database?
Downloads	File format/URL pairs that point to data files.
Created	Creation date of the data source.
Released	Release date of the data source.
Last Updated	Update date of the data source.
Update Frequency	How often the data is updated.
Creator	Entity (agency, department, or organization) that created the data.
Publisher	Entity that published the data.
Maintainer	Entity that maintains the data.
Jurisdiction	Political jurisdiction of the data.
Time Period	The time period the data refers to.
Grouping	Can the data be grouped with a larger set of similar data? Recommended for data sets scoped to a time period or jurisdiction.
License	The license under which the data set is released.
Documentation	Any documentation, such as a data dictionary, or a reference (URL) to that documentation.

This vocabulary forms a base-level set of metadata. Individual data catalogs can and should publish more elements as appropriate, but serious effort should go into ensuring that the seventeen elements above are present.
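
To make the vocabulary concrete, here is a small Python sketch that checks a catalog entry against the elements in the table. The snake_case key names, the sample entry, and the example.gov URLs are all made up for illustration; the vocabulary itself does not prescribe them.

```python
# Assumption: each element name from the table is lowercased and
# snake_cased to form a dictionary key.
VOCABULARY = [
    "title", "description", "url", "type", "downloads", "created",
    "released", "last_updated", "update_frequency", "creator",
    "publisher", "maintainer", "jurisdiction", "time_period",
    "grouping", "license", "documentation",
]


def missing_elements(entry):
    """Return the vocabulary elements that are absent or empty in entry."""
    return [name for name in VOCABULARY
            if entry.get(name) in (None, "", [], {})]


# A hypothetical, partially filled-in entry.
sample = {
    "title": "Example City Budget",
    "description": "Line-item budget for a fictional city.",
    "url": "https://catalog.example.gov/sources/budget",
    "type": "dataset",
    "downloads": [
        {"format": "csv", "url": "https://catalog.example.gov/files/budget.csv"},
    ],
    "license": "",  # present but empty, so it counts as missing
}

print(missing_elements(sample))
```

A catalog publisher could run a check like this before release to see which base-level elements still need to be filled in.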

The Formats

With a vocabulary defined, we can move to the actual data formats to represent an entry in a government data catalog. We don’t need to choose just one. Instead, we can map our general vocabulary to several existing data format standards:

XML: Maps easily to the vocabulary defined above. Loved by the enterprise.
JSON: XML’s lighter-weight alternative, loved by modern Web developers (and the Sunlight Labs).
CSV: A good compromise between machine-friendliness and human-readability.
Microformats: Can act as an easy-to-implement solution since these can be placed on plain ol’ Web sites with only a little bit of effort.
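
As a sketch of how one entry might map onto the first three formats, the Python below serializes the same dictionary as JSON, XML, and CSV using only the standard library. The element names, the "source" root element, the trimmed-down entry, and the way Downloads is flattened for CSV are all assumptions on our part, not a settled schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical entry using a handful of the vocabulary elements.
entry = {
    "title": "Example Budget Data",
    "url": "https://catalog.example.gov/sources/1",
    "type": "dataset",
    "downloads": [
        {"format": "csv", "url": "https://catalog.example.gov/files/budget.csv"},
        {"format": "xml", "url": "https://catalog.example.gov/files/budget.xml"},
    ],
}


def to_json(entry):
    """JSON handles the nested format/URL pairs directly."""
    return json.dumps(entry, indent=2, sort_keys=True)


def to_xml(entry):
    """One child element per vocabulary element; downloads are nested."""
    root = ET.Element("source")
    for key, value in entry.items():
        if key == "downloads":
            parent = ET.SubElement(root, "downloads")
            for d in value:
                node = ET.SubElement(parent, "download", format=d["format"])
                node.text = d["url"]
        else:
            ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def to_csv(entries):
    """CSV is flat, so downloads become space-separated format=url pairs."""
    fields = ["title", "url", "type", "downloads"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for e in entries:
        row = {k: e.get(k, "") for k in fields if k != "downloads"}
        row["downloads"] = " ".join(
            "{}={}".format(d["format"], d["url"])
            for d in e.get("downloads", [])
        )
        writer.writerow(row)
    return buf.getvalue()


print(to_json(entry))
print(to_xml(entry))
print(to_csv([entry]))
```

The CSV case shows the one real wrinkle: the Downloads element is a list of pairs, so a flat format needs some convention for packing it into a single cell.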

Lastly, and possibly most importantly, we need a data format to represent updates to entries in a data catalog. For that, we recommend the Atom Syndication Format. An individual entry in the Atom feed would contain the unique URL identifier and any updated elements. The entry should use any one of the four formats above, with a bias towards XML, since Atom itself is XML.
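
Here is a minimal Python sketch of what one such Atom entry could look like, built with the standard library. The Atom elements (id, title, updated, link, content) come from the Atom spec; the un-namespaced "source" payload, its child element names, and the example.gov URL are our own placeholders. The surrounding feed element, and the author information a complete feed needs, are left out.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Use an explicit "atom" prefix so the un-namespaced payload below
# does not get pulled into the Atom namespace when serialized.
ET.register_namespace("atom", ATOM_NS)


def atom(tag):
    """Qualify a tag name with the Atom namespace."""
    return "{%s}%s" % (ATOM_NS, tag)


def atom_entry(source_url, title, updated, changed):
    """Build an Atom entry carrying only the elements that changed."""
    entry = ET.Element(atom("entry"))
    ET.SubElement(entry, atom("id")).text = source_url  # unique URL identifier
    ET.SubElement(entry, atom("title")).text = title
    ET.SubElement(entry, atom("updated")).text = updated
    ET.SubElement(entry, atom("link"), href=source_url, rel="alternate")
    content = ET.SubElement(entry, atom("content"), type="application/xml")
    # Hypothetical payload: one child per updated vocabulary element.
    source = ET.SubElement(content, "source")
    for name, value in changed.items():
        ET.SubElement(source, name).text = value
    return entry


update = atom_entry(
    "https://catalog.example.gov/sources/1",
    "Example Budget Data",
    "2010-03-29T12:00:00Z",
    {"last_updated": "2010-03-29"},
)
print(ET.tostring(update, encoding="unicode"))
```

A consumer polling the feed would match the entry's id against the URL it already holds for that data source and apply just the changed elements.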

So in summary, the preliminary proposal is:

Ensure that you have enough metadata to closely follow the vocabulary.
If publishing the catalog as a Web site, use Microformats on the data detail pages.
If publishing an API, use XML or JSON in conjunction with Resource-Oriented principles.
If publishing the data catalog as a bulk download, use XML, JSON, or CSV.
To publish updates of entries in the catalog, use Atom.

We’ll be fleshing out examples of these data formats in the Sunlight Labs Wiki over the coming weeks. I’ll be working with the good people of Socrata, who announced their SODA API at TransparencyCamp on Saturday, and any other folks interested.