Mapping and geospatial datasets in Data.gov

Take U.S. Government data and put it on the map

Few things inspire insights in business analytics better than effective maps. A great deal of supply-chain management, sales, and overall business strategy follows the nuances of geography. This extends to broader uses of analytics such as in health care administration, socially aware enterprise, political astuteness, educational support and more. The U.S. Federal Government provides a surprisingly high volume of great data from a variety of sources, which can be a vital ingredient to effective analytics. Learn more about geospatial data sets in Data.gov, including how to load them into Google Earth and adapt them to other analytic and general tools.

Uche Ogbuji, Partner, Zepheira, LLC

Uche Ogbuji is a partner at Zepheira, where he oversees creation of sophisticated web catalogs and other richly contextual databases. He has a long history of pioneering in advanced web technologies such as XML, semantic web and web services, and in open source projects such as Akara, an open source platform for web data applications. He is a computer engineer and writer born in Nigeria, living and working near Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his weblog, Copia.



03 July 2012


In my article "Data.gov for government agencies" I introduced the U.S. Federal Government's relatively new website and platform for making data open and available to the public. There have been some interesting developments in Data.gov since then, including the open-source release of the Data.gov code base (see Resources).

From the early days of Data.gov, however, there has been particular interest in the sharing of geographic information. There is a vast amount of information about managing territory, primarily over land, and to some extent over sea. This information ranges from large-scale land management at the federal level to cadastral property, and government services at the state and even county level. Such information is often presented in the form of maps and map layers, and there is a separate section of Data.gov to handle such geospatial datasets. In fact, at the time of writing there are almost 450,000 geospatial datasets versus fewer than 5,000 non-geospatial ("raw") data sets.

Clearly people think first of a government context when they see such data. But Data.gov can free government data from this strict public sector context. Businesses looking to plan everything from supply-chain management or marketing campaigns to store openings can use such data to determine the best strategies in combination with other factors, from financial to market conditions. In other words, the geographical data in Data.gov can become an important ingredient in business analytics. In this article, learn how to use geographical datasets of Data.gov, with an eye on business analytics.

Finding the data

From the home page of Data.gov, you'll find links to the geospatial datasets right next to those for "Raw Data." The geospatial catalog looks similar to that for Raw Data, as you can see in the screen shot, Figure 1. Obviously it would take great effort to explore over 400,000 datasets by looking at them in sets of 10 or even 100 per page, so the search capability is key. Unfortunately, it doesn't seem that you can search spatially, for example to find data sets with map regions that include Latitude: 40.014986, Longitude: -105.270546. Of course you can do a metadata search, for example, for "Boulder, Colorado," but there are many reasons why that would not be as useful.

Figure 1. Screen shot from the geospatial catalog of Data.gov

Data.gov is integrating a separate site, Geo.Data.gov, which provides more geospatially aware features, including what looks to be search by intersecting or fully-contained map region, but this did not seem to be working at the time of writing. Geo.Data.gov is the evolution of the Geospatial One-Stop (GOS) portal, which dates from 2003 as part of a U.S. e-government initiative. It is accessible through the web geo-portal and through popular Geographic Information Systems (GIS) tools. Currently, the latter option may be best for spatial searching.
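To make the difference between searching by intersecting and by fully contained map region concrete, here is a minimal sketch in Python. It assumes each dataset's extent is a simple latitude/longitude bounding box. The `BBox` type, the function names, and the sample extents are my own illustrative assumptions, not part of any Data.gov or Geo.Data.gov interface, and the sketch ignores regions that cross the 180° meridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """A map region as a bounding box, in decimal degrees."""
    west: float
    south: float
    east: float
    north: float


def contains_point(box: BBox, lat: float, lng: float) -> bool:
    """True if the point lies inside (or on the edge of) the box."""
    return box.south <= lat <= box.north and box.west <= lng <= box.east


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two regions overlap at all."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


def fully_contains(outer: BBox, inner: BBox) -> bool:
    """True if inner lies entirely within outer."""
    return (outer.west <= inner.west and inner.east <= outer.east
            and outer.south <= inner.south and inner.north <= outer.north)


# Hypothetical extents, for illustration only (not taken from any dataset)
colorado = BBox(west=-109.06, south=36.99, east=-102.04, north=41.01)
front_range = BBox(west=-105.7, south=39.5, east=-104.6, north=40.6)
four_corners = BBox(west=-110.0, south=36.0, east=-108.0, north=38.0)

print(contains_point(colorado, 40.014986, -105.270546))
print(intersects(colorado, four_corners), fully_contains(colorado, four_corners))
```

A region that merely overlaps the search area passes the `intersects` test but fails `fully_contains`, which is why the two search modes can return very different result sets.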

Most of the geospatial data sets are in ESRI shapefile format, which is still the dominant format in GIS. A few are in the form of access to map servers provided by the source agencies, in which case you get a limited preview of the maps right within Data.gov. With the rise of online mapping it would be nice to see more formats such as Keyhole Markup Language (KML), an open standard maintained by the Open Geospatial Consortium Inc. (OGC). Only a very tiny percentage of geospatial datasets in Data.gov are available in KML; I wanted to get a sense of just how many. The obvious approach would seem to be by file type, but in the advanced search, you can only select PDF, text, and Microsoft® Office formats. I just did a full-text search for "KML" and another for "KMZ" (a compressed KML file) and got a few hundred total hits.

Beyond the obvious search

Of course, less important than searching for datasets by format is finding data sets by topic and characteristic that support your particular need in analytics. Data.gov is not exactly great for serendipitous discovery, so it's important to pay attention to the communities of interest, such as Health.Data.Gov or BusinessUSA. You can reach these from the "Communities" tab on Data.gov. That way you have a chance of learning of useful datasets informally indexed by interest.

I also discovered that many of the data sets with KML are not placed in the geospatial section (though most of those using shapefiles are), but rather in the "raw" data section, so if you do use search features you might want to do so site-wide.


Loading a geospatial data set

As I've mentioned, geospatial datasets on Data.gov are available in ESRI shapefiles, with some KML and a few other minor formats. I'll be focusing on KML in this article. The great thing about KML is the flexibility it provides for lightweight analytical applications. You can use it with Google Earth or other 3D virtual globe software such as Microsoft Live Earth and NASA World Wind.

To serve as an example, I'll select the "Map of Federal Lands Highway (FLH) American Recovery and Reinvestment Act (ARRA) projects", which is a mouthful, as are so many government-related names, but it just specifies sites where funding for a variety of projects has been provided by certain government and state agencies. See Resources for the Data.gov page. The external KML file download is in the upper right of the page. The actual file downloaded is flh-arra-projects.kmz, which is a zip archive of the plain XML KML file. You can load it as is into Google Earth (see Resources). Once you download, install, and launch Google Earth, you should be able to just drag and drop the KMZ file onto the running app window. The coordinates will be displayed as in Figure 2.

Figure 2. FLH/ARRA dataset loaded into Google Earth
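Because a KMZ file is an ordinary zip archive, you can also pull the KML out of it programmatically rather than through Google Earth. The following is a minimal sketch using Python's standard `zipfile` module. The function name `read_kml_from_kmz` is my own, and the sketch assumes that the first archive member whose name ends in `.kml` is the document you want; I have not relied on any particular member name inside flh-arra-projects.kmz.

```python
import zipfile


def read_kml_from_kmz(kmz_source):
    """Return the raw bytes of the first .kml member in a KMZ archive.

    kmz_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(kmz_source) as archive:
        for name in archive.namelist():
            # Other members (such as a legend image) are skipped
            if name.lower().endswith(".kml"):
                return archive.read(name)
    raise ValueError("no .kml file found in archive")
```

For example, `read_kml_from_kmz("flh-arra-projects.kmz")` should return the KML bytes, which you can then save to a file or hand to an XML parser.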

There is metadata associated with each set of coordinates, which in common KML style is to be found embedded in the HTML descriptions. Google Earth loads each such HTML snippet when you click any of the place markers, as shown in Figure 3.

Figure 3. FLH/ARRA dataset in Google Earth, with pop-up dialog from one of the points

The KMZ bundle includes a KML file and an image for the legend, which you can see in the lower right of Figure 2. Listing 1 is a snippet from the KML, representing one of the place marks.

Listing 1. Snippet representing one of the place marks
<Placemark>
  <name>1</name>
  <description><![CDATA[State Code = 1668<br>
  ARRA Funding =Agency<br>
  Client Agency = BLM<br>
  State Project number = AZ BLM 93(1)<br>
  Project Name = Monolith Gardens Trailhead Turnoff<br>
  Congressional District = 02<br>
  County Name = Mohave<br>
  State = AZ<br>
  Project description = Add a paved right turn lane to US Highway 93
   at the driveway to the Monolith Gardens Trailhead<br>
  Program = Title 16/B<br>
  Status = Awarded<br>
  Contracting Method = Full & Open<br>
  Ecconomically Distressed Area = Y<br>
  Length (Mile) = 0.470<br>
  Date = 2/23/2010<br>
  Amount = $59,341.34<br>
  Contractor = CH2M Hill, Inc., 9191 South Jamaica Street, Englew<br>
  Federal Aid Project Number = 1516049300001<br>
  Service DEscription = Construction<br>
  Latitude = 35.20605<br>
  Longitude = -114.09483<br>
  <br><br><br>]]></description>
    <styleUrl>#BLM</styleUrl>
    <Point>
        <coordinates>-114.09483,35.20605,0</coordinates>
    </Point>
</Placemark>

I've added line breaks for article formatting purposes, but these will also help you see the simple, dumb structure used to encode the metadata in the description element. It's not very friendly for machine processing and could be better even for human readers, but I was able to hack the information out of it using a simple script. My interest in doing so was in supplementing the very neat Google Earth view with another tool that's quite handy for analytics.


Experimenting with the data

I wrote a simple program, seen in Listing 2, which parses fields from the description and combines these with the map coordinates to create a simple JavaScript Object Notation (JSON) representation of the dataset. The code is in Python, provided with comments for convenience; you needn't fully understand it.

Listing 2. Program that creates a simple JSON representation of the dataset
#import from standard Python library
import sys
from datetime import datetime

#import from the amara library
from amara.bindery import parse
from amara.thirdparty import json
from amara.lib import U

#Parse the KML document
kmldoc = parse(sys.stdin)

#Set up XML namespace declarations
PREFIXES = { u'kml': u'http://www.opengis.net/kml/2.2' }

items = []
#Iterate over all the placemarks in the document
for pm in kmldoc.xml_select(u'//kml:Placemark', prefixes=PREFIXES):
    coords = U(pm.Point.coordinates)
    #Flip coordinates around to lat,long
    lng, lat, alt = coords.split(u',')
    fixed_coords = u','.join((lat, lng))

    #Each placemark's fields are encoded in the description, separated by HTML linebreaks
    desc = U(pm.description)
    fields = desc.split(u'<br>')

    #The object representing this record's fields
    item = {}

    #Parse the fields
    for field in fields:
        if u'=' in field:
            key, val = field.split(u'=', 1)
            item[key.strip()] = val.strip()

    #Fix up a few of the fields
    item['id'] = item[u'Project Name']
    item['label'] = item[u'Project Name']
    item['latlong'] = fixed_coords
    date = item.get(u'Date')
    if date:
        d = datetime.strptime(date, '%m/%d/%Y')
        item[u'Date'] = d.isoformat()

    #Add each field to the record set
    items.append(item)

#Emit the JSON
json.dump({ u'items': items}, sys.stdout, indent=4)

More important is the output of the script, which is a set of records, each similar to Listing 3.

Listing 3. Example of script output from running Listing 2
    {
        "Contractor": "CH2M Hill, Inc., 9191 South Jamaica Street, Englew", 
        "State": "AZ", 
        "Program": "Title 16/B", 
        "Latitude": "35.20605", 
        "Service DEscription": "Construction", 
        "id": "Monolith Gardens Trailhead Turnoff", 
        "Status": "Awarded", 
        "Project Name": "Monolith Gardens Trailhead Turnoff", 
        "label": "Monolith Gardens Trailhead Turnoff", 
        "Client Agency": "BLM", 
        "Length (Mile)": "0.470", 
        "Ecconomically Distressed Area": "Y", 
        "Date": "2010-02-23T00:00:00", 
        "County Name": "Mohave", 
        "State Code": "1668", 
        "Longitude": "-114.09483", 
        "Amount": "$59,341.34", 
        "ARRA Funding": "Agency", 
        "Congressional District": "02", 
        "Project description": "Add a paved right turn lane to US
         Highway 93 at the driveway to the Monolith Gardens Trailhead", 
        "latlong": "35.20605,-114.09483", 
        "Federal Aid Project Number": "1516049300001", 
        "State Project number": "AZ BLM 93(1)", 
        "Contracting Method": "Full & Open"
    },
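For readers who don't have Amara installed, the same conversion can be sketched with only the Python 3 standard library. Treat this as an approximation of Listing 2 under stated assumptions, not a drop-in replacement: the names `kml_to_items` and `main` are my own, placemarks without point coordinates are skipped, and runs of whitespace inside values are collapsed to single spaces.

```python
import json
import sys
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace mapping for KML 2.2, as in Listing 2
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def kml_to_items(kml_data):
    """Convert KML (str or bytes) into a list of flat dicts, one per placemark."""
    root = ET.fromstring(kml_data)
    items = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        coords = pm.findtext("kml:Point/kml:coordinates",
                             default="", namespaces=KML_NS)
        parts = coords.strip().split(",")
        if len(parts) < 2:
            continue  # assumption: skip placemarks without a point
        # KML orders coordinates as longitude,latitude[,altitude]
        lng, lat = parts[0], parts[1]

        item = {}
        desc = pm.findtext("kml:description", default="", namespaces=KML_NS)
        # Fields are "key = value" pairs separated by HTML line breaks
        for field in desc.split("<br>"):
            if "=" in field:
                key, val = field.split("=", 1)
                item[key.strip()] = " ".join(val.split())

        name = item.get("Project Name", "")
        item["id"] = name
        item["label"] = name
        item["latlong"] = lat + "," + lng
        if item.get("Date"):
            item["Date"] = datetime.strptime(item["Date"], "%m/%d/%Y").isoformat()
        items.append(item)
    return items


def main():
    """Read KML on standard input and write JSON to standard output."""
    json.dump({"items": kml_to_items(sys.stdin.buffer.read())},
              sys.stdout, indent=4)
```

Calling `main()` makes the sketch behave as a filter in the same way as Listing 2, reading KML on standard input and writing JSON to standard output.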

With this JSON representation I can use my preferred tool for analytics, in this case the US Library of Congress's Viewshare.org, which was developed by my company, Zepheira.

Figure 4 is a screenshot of this data set visualized through Viewshare. Viewshare software supports maps, timelines, and other such views, and makes it easy to answer such analytic questions as "How many state government projects were awarded in Colorado in 2009?"

Figure 4. Screenshot of FLH/ARRA data set visualized by Viewshare

If you look closely, you'll notice that the maps in Viewshare are different from those in Google Earth. Viewshare uses OpenLayers, which illustrates the flexibility in mapping tools that is available now.
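If you prefer to answer this kind of analytic question in code rather than in a visual tool, the JSON records lend themselves to simple filtering. The following is a minimal sketch, assuming records shaped like Listing 3. The helper names (`parse_amount`, `year_of`, `select`) are my own, and apart from the abridged Listing 3 record, the sample records are invented purely for illustration.

```python
from datetime import datetime
from decimal import Decimal


def parse_amount(text):
    """Turn a string such as "$59,341.34" into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


def year_of(item):
    """Calendar year of the record's ISO-format Date field, or None."""
    date = item.get("Date")
    return datetime.fromisoformat(date).year if date else None


def select(items, year, where):
    """Records from the given year whose fields equal every value in where."""
    return [
        it for it in items
        if year_of(it) == year
        and all(it.get(key) == value for key, value in where.items())
    ]


records = [
    # Abridged from Listing 3
    {"Project Name": "Monolith Gardens Trailhead Turnoff", "State": "AZ",
     "Status": "Awarded", "Date": "2010-02-23T00:00:00",
     "Amount": "$59,341.34"},
    # Hypothetical records, invented for illustration only
    {"Project Name": "Example Road Repair", "State": "CO",
     "Status": "Awarded", "Date": "2009-08-14T00:00:00",
     "Amount": "$120,000.00"},
    {"Project Name": "Example Bridge Study", "State": "CO",
     "Status": "Planned", "Date": "2009-11-02T00:00:00",
     "Amount": "$45,500.50"},
]

co_2009 = select(records, 2009, {"State": "CO", "Status": "Awarded"})
total = sum(parse_amount(it["Amount"]) for it in co_2009)
print(len(co_2009), total)
```

Converting the `Amount` strings to `Decimal` rather than `float` keeps currency totals exact, which matters once you start summing across many projects.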


Conclusion

As I've mentioned, most of the data sets on Data.gov are geospatial, which emphasizes the importance of place in government programs, and this importance translates to the general usefulness of the plentiful data the U.S. Government produces these days. As I've demonstrated in this article, with just a bit of work you can use this data in the analytics or other software of your choice.

Resources

Learn

Get products and technologies

Discuss
