I noted previously that I'm working with a team on a new IBM Redbook initiative around Big Data Governance. We're delving into the five game-changing big data use cases and the governance implications for each of them. I've always been a big proponent of the axiom "Know your Data", and to that end I've been looking at some of the distinct types of data in the Big Data Information Landscape to cut through the mystery of what information quality may mean in this new context.
An Internet of Things, or Sensors, Sensors, and more Sensors!
There's nothing like an Internet of Things to help drive Big Data. It seems practically any mobile device can become a sensor these days, not to mention the range of RFID tags and machine sensors for weather, water, traffic, and so on. An iPhone 4 includes eight distinct sensors, such as an accelerometer, a GPS, a compass, and a gyroscope. Sensors like these are driving new initiatives such as Smarter Cities. A good example is SFpark, which helps drivers find parking spaces in San Francisco through 8,200 parking sensors.
But what's in this Sensor data?
From a data quality or governance perspective, there's obviously a large range of possible data generated, but I was curious to see what some examples actually look like. I started browsing something publicly available, specifically data from the National Weather Service. The data comes from roughly 1,800 tracking stations and is generated at hourly intervals on a daily basis. While it's feasible to look at some raw text data, there are two primary forms of data available: RSS and XML (and the RSS is just more truncated XML). You can get individual station data or zip files of all the data for a given time period. Overall, it makes for a nice starting point in getting to "Know your Data"!
Just thinking about the weather
I grabbed some zip files of both XML and RSS for three days at a couple of time intervals and extracted the files. I found 4165, 4169, and 4171 files respectively by date, each of the format XXXX.xml or XXXX.rss. Just at this level, I had some immediate thoughts on information quality measures:
- Did I pull the right file type?
- Do the contents match the stated file type?
- And given the lack of date in the file name, is this data I've already picked up?
Nothing unusual at this level -- if anything, it's business as usual.
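These file-level checks are easy to automate. Here's a hypothetical Python sketch (the function name and the checks are my own illustration, not any particular tool): it sniffs whether a .xml or .rss file actually contains XML, and hashes the content to catch data we've already picked up, since the file names carry no date.

```python
import hashlib
from pathlib import Path

def check_station_file(path: Path, seen_hashes: set[str]) -> list[str]:
    """File-level quality checks before parsing station data.

    Flags extension/content mismatches and re-downloaded duplicates
    (the file names alone carry no date to tell us).
    """
    issues = []
    text = path.read_text(errors="replace").lstrip()

    # Do the contents match the stated file type? Both .xml and .rss
    # should begin with an XML declaration or a root element.
    if path.suffix in (".xml", ".rss") and not text.startswith("<"):
        issues.append(f"{path.name}: content does not look like XML")

    # Is this data we've already picked up? Hash the content, since
    # the file name cannot answer that on its own.
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest in seen_hashes:
        issues.append(f"{path.name}: duplicate of previously fetched data")
    seen_hashes.add(digest)
    return issues
```

In practice you'd run this over every file in the extracted zip and log the issues before any parsing begins.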
Opening up the XML file for station KSTP (St. Paul, MN -- station call letters I was very familiar with growing up), the file is run-of-the-mill XML.
There's a Location, a Station ID, Latitude/Longitude, Observation Time, Temperature, and various Wind measurements -- all nicely structured content, which means I could check for Completeness (does the data exist?), Format (does the data conform to the expected structure?), and Validity (is it in the right value set or range?). Checking out a subsequent day's record, I found some variation in the fields provided. That's typical for XML, where you can choose to include or omit certain fields, so additional checks could be made for Consistency against the XML schema or Consistency over time intervals.
Though it doesn't occur in these samples, sensor data can certainly include diagnostic or error codes. For instance, a temperature value of -200.0 could indicate that the sensor hit an error condition and used that available field to pass on the diagnostic. Depending on whether the sensor is external or internal, this may be an item to note as Incomplete, or it may trigger some notification/alert process.
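To make those checks concrete, here's a rough Python sketch run against a single observation. Element names like station_id and temp_f mirror what the NWS feed provides, but the plausible temperature range and the -200.0 sentinel are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Assumed for illustration: which fields are required, which values
# are sensor error codes, and what counts as a plausible temperature.
REQUIRED = ["station_id", "latitude", "longitude", "observation_time", "temp_f"]
SENTINELS = {-200.0}           # assumed diagnostic/error code
TEMP_RANGE_F = (-80.0, 135.0)  # assumed plausible surface temperatures

def assess_observation(xml_text: str) -> list[str]:
    issues = []
    root = ET.fromstring(xml_text)

    # Completeness: does the data exist?
    for field in REQUIRED:
        if root.findtext(field) is None:
            issues.append(f"missing field: {field}")

    raw = root.findtext("temp_f")
    if raw is not None:
        # Format: does the value conform to the expected structure?
        try:
            temp = float(raw)
        except ValueError:
            return issues + [f"temp_f not numeric: {raw!r}"]
        # A diagnostic code masquerading as a reading.
        if temp in SENTINELS:
            issues.append(f"temp_f={temp}: probable sensor error code")
        # Validity: is it in the right range?
        elif not TEMP_RANGE_F[0] <= temp <= TEMP_RANGE_F[1]:
            issues.append(f"temp_f={temp}: outside plausible range")
    return issues
```

An empty result means the record passed Completeness, Format, and Validity; anything returned is a candidate for the notification/alert process mentioned above.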
Have you ever seen the rain?
It's quite possible that an individual station reports weather that appears Complete, correctly Formatted, Valid, and Consistent, and still has quality issues. Some additional factors to consider:
- Are there data points for all intervals or expected intervals? This is a measure of Continuity for the data and can be applied for both individual sensors or groups of sensors.
- Is there Consistency of data across proximate data points? If St. Paul, MN and Bloomington, MN both show temperatures of 84.0 F, but Minneapolis, MN shows a temperature of 34.0 F, the latter is probably an error, as you would not expect that sharp a temperature variation in such close proximity.
- Is there repetition/duplication of data across multiple recording intervals? There could certainly be the same data from a given sensor over multiple time periods, but is there a point at which these become suspicious and suggest an issue with the sensor?
- Is there repetition/duplication of data across multiple sensors? There could be the same temperature, humidity, and wind for St. Paul, MN and Minneapolis, MN, but do you expect the exact same measurements between the two hour after hour? The samples I looked at certainly show some marginal variation consistent with different recording points.
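The proximity check lends itself to a simple cross-station comparison. A minimal sketch, assuming an arbitrary 20 F threshold and a comparison against the median of the other nearby stations (both assumptions are mine, not a standard):

```python
from statistics import median

def flag_proximity_outliers(readings: dict[str, float],
                            max_delta_f: float = 20.0) -> list[str]:
    """Cross-check nearby stations' temperatures.

    A reading far from the median of its neighbors is suspect.
    The 20 F threshold is an illustrative assumption; a real system
    would tune it to the region and the season.
    """
    flagged = []
    for station, temp in readings.items():
        neighbors = [t for s, t in readings.items() if s != station]
        if neighbors and abs(temp - median(neighbors)) > max_delta_f:
            flagged.append(station)
    return flagged
```

With the St. Paul/Bloomington/Minneapolis example above, only the 34.0 F reading stands apart from its neighbors and gets flagged.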
Given the volume of data points and the velocity or frequency of delivery, these may be as important as measures for Completeness or Validity if they are critical to analytic use. All of these can be monitored and followed over time as well, giving additional insight into trends of information quality.
The answer is blowin' in the wind
With some understanding of the data content and potential points of data quality failure, I come back to the value or fitness for purpose of the data. If I'm evaluating the impact of the weather on my store-based sales vs. my online sales, I may want to correlate the hourly weather readings of stations close to my stores and close to my customers' billing addresses. Hourly gaps may impact this analysis, but I may be able to smooth over such gaps with other nearby sensor readings.
If I'm evaluating daily sales only leading up to Christmas, I may only care about the aggregate weather for the day such as Min/Max Temperature and Total Precipitation. Two or three out of 24 possible data points may be quite sufficient for my needs, and the impact of specific data quality issues from a given sensor drops with an increase in available data points for the time period or the general area. And conversely, if I only have one sensor with very sporadic data near a given store or customer, the impact of data quality issues grows significantly.
This suggests that the weight of given data quality measures is not constant for sensor data, but varies depending on how the data is used. And one additional quality measure may be an identification of how well the sensor data coverage fits the data I wish to analyze against it (i.e., if I'm evaluating a store in an area where no sensors exist, I've got nothing to evaluate against).
What else from sensors?
The Internet of Things, the instrumentation of many, many devices, will have a profound impact on the variety, volume, and velocity of incoming data to evaluate. Certainly this is just one example of the type of information available from sensors. However, in stepping through familiar data such as weather observations, not only do our common information quality measures hold up, but there are additional measures that can be put in place for ongoing monitoring. What becomes interesting is how the aggregation of such data may shift the quality requirements and the associated impact.
Do you have other examples? I'm curious how well these information quality measures hold with the range of sensor-based data that is emerging.
As always, the postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.