In 2009 the United States, under Federal Chief Information Officer (CIO) Vivek Kundra, launched an ambitious website and service. Data.gov serves as a repository for information collected and managed by the federal government, and is available for use by the public.
Leaders and developers in technology have long called for a culture of open data, meaning transparency and portability of data generated and used by institutions. The Internet has transformed every sphere of society largely through its foundation on the free movement of information almost regardless of traditional borders, interests, and practical barriers. Most proponents of data transparency accept that barriers should always remain for reasons of privacy and security, but they argue for as much availability and interchange of information as possible. They claim information flow is a powerful engine for generating new business and public good in the knowledge economy.
You can imagine that such an argument would garner the attention of the U.S. government, which looks to nurture business and increase public good. In addition, the data in question belongs to the taxpayer, who funds the agencies that control the data. This is an extension of the open data argument in business, where customers demand access to the data relevant to their accounts. The launch of Data.gov signaled the recognition of these facts by one of the largest organizations in the world, and opened up some exciting possibilities for businesses, media, and concerned citizens in general.
Data.gov is not the first open data initiative in the U.S. In 2000, the National Institutes of Health (NIH), working with the Food and Drug Administration (FDA), launched ClinicalTrials.gov, a site that made available information pertaining to the public clinical trials that are part of the regulatory process for any medical therapy. NIH initially provided ClinicalTrials.gov as a seed site that then grew in scope under expanding regulatory guidance of the FDA. The site has now become a rich trove of information related to the development of drugs, whether privately or publicly funded. Another site, Science.gov, has made U.S. government-sponsored scientific information and research results available since 2002.
With the growing importance of such pioneering sites, open data was widely discussed in governments worldwide in the late 2000s, particularly in the U.S. and in the United Kingdom, which launched Data.gov.uk shortly after its U.S. counterpart. An important catalyst in the U.S. was the Open Government Initiative (OGI), put in place by President Barack Obama on his first day in office, January 20, 2009.
The Office of Management and Budget (OMB) prepared and released a Concept of Operations (CONOPS) document to give shape to Data.gov, and continues to evolve this document as a blueprint for the site. Pursuant to the OGI, the OMB also released, in December 2009, a memo entitled "Open Government Directive" to Federal agencies. The memo articulated a strong default position of openness that agencies should adopt toward their data, mentioning, for example, Attorney General Eric Holder's new guidelines that openness is the Federal Government's default position for matters relating to the Freedom of Information Act (FOIA). The memo included the following instruction:
Within 45 days, each agency shall identify and publish online in an open format at least three high-value data sets (see attachment section 3.a.i) and register those data sets via Data.gov. These must be data sets not previously available online or in a downloadable format.
With this simple stroke, the OMB mandated the growth of valuable information within Data.gov. This memo is a very interesting executive-level document for study by any government institution interested in an open data policy, and I shall return to it in this article.
Data.gov is made possible by the Electronic Government Fund (EGF) budget item, but it was established purely through executive order, and thus is not guaranteed funding through congressional appropriations. The EGF was considerably reduced by the 2011 Federal budget, which resulted in some cutbacks to Data.gov and departure of key staff, such as Program Executive Sanjeev Bhagowalia. Nevertheless, the project has adapted and almost surprisingly flourished despite these setbacks. A Data.gov "next generation" was launched, which moved most of the site infrastructure to the cloud, to reduce maintenance costs and also to enable more dynamic processing of the data on the site itself, rather than simple, static download.
Data.gov hosts data in several ways. It hosts raw data and geospatial data. The latter is data especially suited for use in mapping applications and mashups. This article focuses on raw data. Data.gov hosts some data sets as links and access modules to external government sites, and it hosts some data sets fully. In the former case (termed "external datasets") Data.gov is hosting just the metadata, and in the latter case (termed "datasets"), it hosts both data and metadata. If you are an agency looking to publish on Data.gov, which approach you take will depend on whether you already have a platform in place to host data yourself and whether you want to gain the advantage of Data.gov's interactive data set features. Either way, you'll gain the advantage of the Data.gov catalog.
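The two hosting modes described above can be pictured as a small data structure. The following Python sketch is only an illustration: the class `CatalogEntry`, its field names, and the placeholder URLs are assumptions made for this example, not Data.gov's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    """Hypothetical model of one record in a data catalog."""
    title: str
    agency: str
    # Set when the catalog hosts the data itself (a "dataset").
    hosted_data_url: Optional[str] = None
    # Set when the data lives on the agency's own site (an "external dataset").
    external_url: Optional[str] = None

    @property
    def is_external(self) -> bool:
        """True when the catalog holds only the metadata."""
        return self.hosted_data_url is None

    @property
    def supports_interactive_view(self) -> bool:
        """Interactive features need the data to be hosted in the catalog."""
        return not self.is_external


if __name__ == "__main__":
    entries = [
        CatalogEntry("Hosted example", "Example Agency",
                     hosted_data_url="http://example.invalid/data"),
        CatalogEntry("External example", "Example Agency",
                     external_url="http://example.invalid/agency-site"),
    ]
    for entry in entries:
        print(entry.title, "external" if entry.is_external else "hosted")
```

Either kind of entry carries metadata, which is what lets both appear in the same catalog.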
At the heart of Data.gov is the catalog, which allows users and applications to browse, explore, search, and filter data sets. Figure 1 is a screen shot from the catalog of raw data. You can see names, descriptions, hit counts and types for each data set. On the left hand side you have options for filtering by data set type or federal agency. You can also search data set metadata.
Figure 1. Screen shot from Data.gov raw data catalog
If an agency chooses to have Data.gov host the data as well as the metadata, it gains the benefit of Data.gov's interactive Web display of the data. Such interactive data sets are displayed online in a table, allowing searching, sorting, filtering, and display in charts and graphs. This allows many users to get the information they need without having to download the actual raw data and process it themselves. Figure 2 is a screen shot from the interactive view of one of the data sets, "Tax Year 2007 County Income Data". You can see the first 21 of 3193 rows, with some of the columns, from "State Abbreviation" to "Wages Income." You can also see the tools at the top to filter, visualize, or export the data.
Figure 2. Screen shot from Data.gov interactive data set view
Figure 3 is a screen shot from the same data set as Figure 2, this time illustrating the filtering features. You can see the filtering criteria on the right-hand side.
Figure 3. Screen shot from filtering in Data.gov interactive data set view
The interactive data set feature also allows a user to export the full data set, or a subset from an applied filter. Figure 4 is a screen shot of the dialog to export the rows that have been selected by the filter shown in Figure 3.
Figure 4. Screen shot of export from Data.gov interactive data set view
After you click a format, the export immediately downloads through your browser. Listing 1 is a clipping of the first 2 rows from the resulting XML.
Listing 1. The first 2 rows from the resulting XML
<?xml version="1.0"?>
<response>
  <row>
    <row _id="249" _uuid="A198E315-1C23-4004-A6F2-97321F9AC9ED" _position="249"
         _address="http://explore.data.gov/views/d2bg-b3vp/rows/249">
      <state_code>8</state_code>
      <county_code>0</county_code>
      <state_abbreviation>CO</state_abbreviation>
      <county_name>COLORADO</county_name>
      <total_number_of_tax_returns>2106989</total_number_of_tax_returns>
      <adjusted_gross_income_in_thousands_>128175529</adjusted_gross_income_in_thousands_>
      <wages_and_salaries_incomes_in_thousands_>92308039</wages_and_salaries_incomes_in_thousands_>
      <dividend_incomes_in_thousands_>2775567</dividend_incomes_in_thousands_>
      <interest_income_in_thousands_>3872386</interest_income_in_thousands_>
    </row>
    <row _id="250" _uuid="1576D628-29CC-4D44-BCA5-76750198AEF0" _position="250"
         _address="http://explore.data.gov/views/d2bg-b3vp/rows/250">
      <state_code>8</state_code>
      <county_code>1</county_code>
      <state_abbreviation>CO</state_abbreviation>
      <county_name>Adams County</county_name>
      <total_number_of_tax_returns>180985</total_number_of_tax_returns>
      <adjusted_gross_income_in_thousands_>8622959</adjusted_gross_income_in_thousands_>
      <wages_and_salaries_incomes_in_thousands_>7195284</wages_and_salaries_incomes_in_thousands_>
      <dividend_incomes_in_thousands_>64424</dividend_incomes_in_thousands_>
      <interest_income_in_thousands_>138173</interest_income_in_thousands_>
    </row>
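To give a sense of how a program might consume an export like Listing 1, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The embedded sample is trimmed from Listing 1 and closed off with the end tags that the clipping omits. The function name `read_rows` and the choice of fields are assumptions made for illustration and are not part of Data.gov.

```python
import xml.etree.ElementTree as ET

# Sample trimmed from Listing 1, with closing tags added so it is well-formed.
SAMPLE = """<?xml version="1.0"?>
<response>
  <row>
    <row _id="249">
      <state_abbreviation>CO</state_abbreviation>
      <county_name>COLORADO</county_name>
      <total_number_of_tax_returns>2106989</total_number_of_tax_returns>
    </row>
    <row _id="250">
      <state_abbreviation>CO</state_abbreviation>
      <county_name>Adams County</county_name>
      <total_number_of_tax_returns>180985</total_number_of_tax_returns>
    </row>
  </row>
</response>"""


def read_rows(xml_text):
    """Return one dict per data row, mapping element names to text values."""
    root = ET.fromstring(xml_text)
    rows = []
    # Data rows are the inner <row> elements nested inside the outer <row>.
    for row in root.findall("./row/row"):
        record = {child.tag: (child.text or "").strip() for child in row}
        record["_id"] = row.get("_id")
        rows.append(record)
    return rows


if __name__ == "__main__":
    for record in read_rows(SAMPLE):
        print(record["county_name"], int(record["total_number_of_tax_returns"]))
```

All values arrive as text, so numeric columns need an explicit conversion such as `int(...)` before any arithmetic.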
The data export dialog also gives you options to print and to access the data set as an API through an external program. For example, you can retrieve the data as JavaScript Object Notation (JSON), suitable for many Web applications, with just an HTTP GET to the data set's URL.
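As a rough illustration of such a request, the following Python sketch issues an HTTP GET and decodes a JSON response using only the standard library. The URL shown is a placeholder, not a real Data.gov endpoint. The shape of the returned JSON is not described here, so the function simply returns whatever structure the server sends. The `opener` parameter is an assumption added so the function can be exercised without network access.

```python
import json
import urllib.request

# Placeholder only: substitute the API URL shown in the export dialog.
DATASET_JSON_URL = "http://example.invalid/dataset.json"


def fetch_json(url, opener=urllib.request.urlopen):
    """Send an HTTP GET to the URL and return the decoded JSON document."""
    # A Request without a body is sent as a GET.
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json` with the real URL from the export dialog would hand the parsed data straight to your application code.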
If you are a government agency in the U.S. or anywhere else, and you are looking for the best ways to empower citizens and increase the recognition of the value that you provide, you can learn a lot from what Data.gov has accomplished.
First, you might well consider the value of open data. Holding on tightly to the data used in doing the people's work rarely comes with any inherent advantages. Increasingly, citizens the world over are looking for transparency and utility from their governments. It is worth making a clear-eyed assessment of whether to adopt a policy of making data available unless there are well-understood reasons for not doing so.
If you have decided to participate in the move towards open data, you will need to reconsider your overall information technology architecture. It becomes more and more important to establish a clear chain of custody for data, from when it is collected to when it is stored, tracking provenance and other metadata. Where possible, ensure that data export to simple, standard formats is part of any software requisition. Consider a solution such as the IBM Government Industry Framework (see Resources), which enables smarter government across the board, including support for increased transparency.
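One way to picture the chain of custody described above is as a small provenance record that travels with each data set and is written out alongside a plain CSV export. The Python sketch below is purely illustrative: the field names (`source_agency`, `collected_on`, `custody_log`) and the function `export_with_provenance` are assumptions for this example, not part of any Data.gov or IBM specification.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Provenance:
    """Hypothetical metadata recording where a data set came from."""
    source_agency: str
    collected_on: str  # ISO 8601 date string
    custody_log: list = field(default_factory=list)

    def record_step(self, actor, action):
        """Append one hand-off or processing step to the custody log."""
        self.custody_log.append({"actor": actor, "action": action})


def export_with_provenance(rows, provenance):
    """Return (csv_text, json_metadata) for a non-empty list of dict rows."""
    buffer = io.StringIO()
    # Column order follows the keys of the first row.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue(), json.dumps(asdict(provenance), indent=2)


if __name__ == "__main__":
    prov = Provenance(source_agency="Example Agency", collected_on="2011-01-01")
    prov.record_step("intake", "collected")
    prov.record_step("publishing", "exported to CSV")
    csv_text, metadata = export_with_provenance(
        [{"county": "Adams County", "returns": 180985}], prov)
    print(csv_text)
    print(metadata)
```

Keeping the metadata in a separate JSON document leaves the CSV itself as simple as possible for downstream tools.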
Other recent computing trends are relevant as well. Open source software very often has strong support for open, standard data formats. Cloud computing offers lower infrastructure and maintenance costs if you are willing to accept some risk from changing your approach to security. If you do decide on open data, you will enjoy an implicit reduction of these risks, since data that is already public needs less protection against disclosure.
Data.gov is barely two years old, and it has already traveled a winding road. With almost 400,000 data sets and well over 1,000 known applications, it has proved a success by any measure, which is even more remarkable when you consider its budgetary difficulties. The fact that Data.gov was able to change course when the going got tough and move to the cloud illustrates the ancillary benefits in flexibility that come about from pursuing the bold strategy of smart, transparent, open government that the U.S. undertook in 2009.
- It's well worth reading the "Open Government Directive" memo issued by Peter R. Orszag,
Director of the OMB in support of Data.gov on December 8, 2009.
- Read Sanjeev Bhagowalia's farewell Weblog entry for more about the
transition that Data.gov seems to have navigated after the budget cuts of
- Bookmark XML.gov, a government site "to facilitate
the efficient and effective use of XML through cooperative efforts among
government agencies, including partnerships with commercial and industrial
- Learn more about "Cloud computing by government agencies" by
Shahid N Shah (developerWorks, Aug 2010).
- In the Government industry area on developerWorks,
find the resources you need to advance your efforts to serve the public
- Stay current with developerWorks technical events and
webcasts focused on a variety of IBM products and IT industry
- Follow developerWorks on Twitter.
- Watch developerWorks on-demand demos ranging from
product installation and setup demos for beginners, to advanced
functionality for experienced developers.
- Collaborate with and support the federal
community through the Federal Strategy
& Technology Institute.
- Visit the Analytics
Solution Center to participate with industry experts and Federal
agencies in analytics solutions and technology demos for solving mission
Get products and technologies
- Visit Data.gov for thousands of open data
sets from the U.S. Federal Government.
- Adopt the IBM Government Industry Framework to help
you integrate systems and processes across a broad range of public service
- Evaluate IBM products in the way that suits
you best: Download a product trial, try a product online, or use a product
in a cloud environment.
- Get involved in the developerWorks
community. Connect with other developerWorks users while exploring
the developer-driven blogs, forums, groups, and wikis.
Uche Ogbuji is a partner at Zepheira, where he oversees the creation of sophisticated web catalogs and other richly contextual databases. He has a long history of pioneering work in advanced web technologies such as XML, semantic web, and web services, and in open source projects such as Akara, an open source platform for web data applications. He is a computer engineer and writer born in Nigeria, living and working near Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his weblog, Copia.