Your Input Wanted on Data

Originally posted on the Sunlight Labs blog.

Here at Sunlight Labs, we’ve focused a lot on the recent bid for version 2.0 of Recovery.gov. This morning on the Labs mailing list, Rusty Talbot of Synteractive, one of the winning contractors, asked for input on the best way for Recovery.gov to publish its data.

Rusty wrote:

“The Recovery, Accountability, & Transparency Board wishes to have an open discussion with all interested developers about how data should be made available via Recovery.gov.

As you are all aware, a new version of Recovery.gov will be released soon. From a data standpoint, the initial release of the new site will replicate existing functionality. However, the Board aims to set a new standard of transparency with this site and would therefore like to make the data available in the most convenient and straightforward way (or ways) possible so you can use and analyze official, up-to-date Recovery Act data. We need your input to achieve this goal.

Please let us know how the site could best meet your needs in terms of machine-readable data format(s) and standards, APIs, guidance, training, etc.”

This is a great opportunity for all of us who work hard to make government data more open and accessible. Back in April, we at the Sunlight Foundation, as a member of the Coalition for an Accountable Recovery, published a report to the OMB giving recommendations on data feed publishing (starts on page 8 of the document).

As one of the project leads for the upcoming National Data Catalog, I’ve been thinking a lot about what makes government data truly useful for software developers, researchers, and journalists. Here’s my more technical take:

  • First, publish as much machine-readable raw data as possible. The data should be in a widely used format, such as XML or JSON, and should be accessible from predictable, permanent URLs. Documentation clearly describing the schema and data types is essential.
  • As the Recovery Act programs are ongoing, document when, where, and how new data will be published. Set up a regular schedule for updates. Communicating updates could be as simple as setting up a mailing list or feed.
  • After the raw data is made accessible and proves to be useful, consider building an API that complements, rather than duplicates, the raw data sets. APIs are useful for developers who wish to get smaller, discrete sets of data without having to crunch through the entire raw data set. An emphasis should be placed on querying functionality that can be integrated into other apps.
  • The API should also speak XML and/or JSON, while following REST principles. Libraries to consume such APIs exist in all modern programming languages. Again, clear documentation is vitally important.
  • When it comes to data, the RAT Board and Recovery.gov should eat their own dog food. That is, the data used internally by the RAT Board for analysis, tracking, and powering the website should be exactly the same data published to the public. This may not be technically possible right now, but it should be a goal because doing so would be the single best way to ensure that the data being published serves real-world needs.
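
To make a couple of these points concrete, here is a minimal Python sketch of what "predictable, permanent URLs" for raw data and a complementary REST-style query might look like. Everything specific in it is an assumption made up for illustration: the base URL, the "awards" dataset name, the URL layout, the field names, and the sample records. None of it reflects an actual Recovery.gov URL scheme or API.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL, used only for illustration.
BASE_URL = "https://data.example.gov"


def raw_data_url(dataset, year, month, fmt="json"):
    """Build a predictable URL for one month's raw data file."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    return f"{BASE_URL}/raw/{dataset}/{year:04d}/{month:02d}.{fmt}"


def api_query_url(dataset, **filters):
    """Build a REST-style query URL that asks for a small slice of the data."""
    # Sort the filters so the same query always yields the same URL.
    query = urlencode(sorted(filters.items()))
    return f"{BASE_URL}/api/v1/{dataset}.json?{query}"


def filter_records(records, state=None, min_amount=0):
    """Roughly what such a query could do on the server: filter raw records."""
    return [
        r for r in records
        if (state is None or r["state"] == state) and r["amount"] >= min_amount
    ]


# Made-up records standing in for the contents of a raw JSON data file.
SAMPLE = json.loads("""
[
  {"award_id": "A-1", "state": "VA", "amount": 250000},
  {"award_id": "A-2", "state": "VA", "amount": 40000},
  {"award_id": "A-3", "state": "MD", "amount": 900000}
]
""")

if __name__ == "__main__":
    print(raw_data_url("awards", 2009, 7))
    print(api_query_url("awards", state="VA", min_amount=100000))
    print(filter_records(SAMPLE, state="VA", min_amount=100000))
```

The point of the split is that anyone who wants everything can download the monthly raw files, while an app that only needs, say, large awards in one state can ask the API for just that slice instead of crunching the full data set.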

So those are my thoughts. What are yours?
