I regularly read the articles on FastCompany's twin sites, Co.DESIGN (http://www.fastcodesign.com/) and Co.LABS (http://www.fastcolabs.com/). I enjoy their mix of informative and eclectic articles, and the former in particular features some intriguing infographics.
As I caught up after my vacation, one infographic that stood out was "The United States of Burgers" (http://www.fastcodesign.com/1673006/infographic-the-united-states-of-burgers#1), a somewhat whimsical look at the most popular fast food burger joints by city, put together by PeekAnalytics (http://www.peekanalytics.com/burgerjoints/). They note on their site, "For the past month, PeekAnalytics tracked millions of Tweets of fast food burger chains. This map shows which restaurant was the most popular in over 12,000 cities across the USA." You can look at a nation awash in burger joint logos, cull it back to the ones of interest (and quickly see the dominance of golden arches, Burger Kings, and Wendy's across the country), or zoom into particular states or even individual cities.
What's in the Graphic?
For those of us looking at Big Data and quality of information, there are some useful insights to gain from even this fun little graphic. After all, this is the crux of social media feeds -- culling out data that you can pair with your own internal data, such as products and product sales.
First, consider what we do know from the statement above and the infographic itself:
- The data source is Twitter
- The data covers a one-month time period -- stated as the past month (probably June 2013, since the article appeared in July 2013)
- The data is relevant for 12,000 cities in the US
- There were millions of Tweets included
- The Tweets had some mention of 26 named brands of fast food burger joints.
We can look at the graphic at various levels and potentially ascertain various facts:
- There are more McDonalds logos than Sonic logos
- Krystal has clusters of popularity in North Carolina and Georgia
- La Crosse, WI prefers Burger King, but Eau Claire, WI prefers What-a-Burger and River Falls, WI prefers Hardees
- If I'm travelling east on the interstate across central New Mexico, I'm not going to find much choice unless I take a left turn at Albuquerque
But consider what we don't inherently know:
- What were the collection criteria?
- Who made the tweets?
- How were the brands identified?
- How was popularity measured?
- How was geographic location assessed?
- What's really reflected in the marker for a city?
Is the data 'nutritionally' valuable?
As a data scientist or analyst, it's necessary to dig into what we don't know. Some of the important questions reflect basic aspects of data governance and quality, some reflect evaluation of the analysis in a larger context. Just thinking about the former, I could consider the following:
- Is there a bias in the collection method? These tweets are made by people who have Twitter accounts and like to express either where they have been or an opinion about the burger joint. Or does the collector of the data have an interest in the data? While the data may be usable, it may not have sufficient quality to give realistic insight.
- Was all relevant data collected? What if we failed to include critical hashtags? Maybe a common reference to McDonalds is #MickeyDs, and failing to include it significantly skews the results. How would we understand and record a level of completeness in the data?
- Was context of the tweet captured and reflected? Did we discriminate between positive and negative comments or does that even matter? Should we record and measure something of this dimension (and what would we call it if we did)?
- Was geography based on the Twitter user, the place of the tweet, or the place of business? Can we even tell? Potentially we could match the geographic coordinates to our known brand locations in this case to capture some distance measure. Perhaps the consistency of geographic coordinates to business location would help ensure better quality?
- Should a distinction be made between cities where there is a clear preponderance of tweets for one brand and those where the differences between the top brands are statistically insignificant? Or do absences of certain brands reflect the lack of those brands in the city? At this point, there is a fine line between what may reflect a quality of data dimension (a measure of statistical significance) and an analytical or business dimension.
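To make the last question concrete, here is a minimal sketch of one way to separate a clear leader from a statistical tie. It assumes we already have per-city tweet counts by brand (the counts and the function names are hypothetical, and nothing here reflects how PeekAnalytics actually measured popularity). It applies an exact two-sided binomial test to the top two brands only, which is one of several reasonable choices.

```python
from math import comb
from typing import Dict, Tuple


def leader_p_value(top: int, runner_up: int) -> float:
    """Two-sided exact binomial test on the top two brands.

    Null hypothesis: a tweet naming either brand is equally likely
    to name the leader or the runner-up (p = 0.5).
    """
    n = top + runner_up
    if n == 0:
        return 1.0
    k = max(top, runner_up)
    # Probability of a split at least this lopsided in one direction...
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # ...doubled for the two-sided test and capped at 1.
    return min(1.0, 2 * tail)


def describe_city(counts: Dict[str, int], alpha: float = 0.05) -> Tuple[str, bool]:
    """Return (leading brand, whether its lead is significant at alpha)."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    leader, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return leader, leader_p_value(top, second) < alpha


# Hypothetical counts for two cities, for illustration only.
big_city = {"McDonalds": 60, "Burger King": 20, "Wendy's": 5}
small_town = {"Hardees": 6, "McDonalds": 5}
print(describe_city(big_city))
print(describe_city(small_town))
```

With only a handful of tweets, as in the second city, a one-tweet lead tells us very little, and a map marker for that city probably deserves a caveat.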
A dash of local knowledge
Local knowledge can help considerably in looking at the data. I noted earlier that the most commonly referenced burger place in River Falls, WI was Hardees. I know River Falls well: it's where I grew up, and I still visit periodically. With a population of 15,000+ and a university, it's now the largest suburb of the greater St. Paul/Minneapolis region. The city currently has three fast food burger places: McDonalds and Burger King on the north edge of the city heading towards the interstate, and a Dairy Queen near the university. The closest Hardees is in Baldwin, WI, some 20 miles away, and there is another in Black River Falls, WI, some 115 miles away. If I remember correctly, there was once a Hardees in River Falls near the university, but that was some years ago now.
What other questions can we ask based on this local knowledge?
Why is Hardees the most referenced burger place when there isn't one there?
- Does it reflect a comparison or preference for what is not there such as wishful thinking or nostalgia?
- Does it reflect a proposal to bring a Hardees franchise into the city?
- Is there something else in the tweets that we should correlate against to determine usefulness or value?
- Were the tweets inappropriately linked to River Falls, WI when they were actually for Black River Falls, WI (a potential failure or error in geospatial or matching logic)?
- Should a filter or correlation of actual burger franchises have been applied against the data? Or is it valuable to see the range of references regardless of whether a burger joint actually exists in the city?
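The River Falls versus Black River Falls question is easy to illustrate. The sketch below is purely hypothetical: we don't know how the tweets were actually matched to cities, and the gazetteer, function names, and sample tweet text are invented for illustration. It shows how a naive substring match would credit a Black River Falls tweet to River Falls, and how preferring the longest matching place name avoids that particular error.

```python
import re
from typing import List, Optional

# Hypothetical gazetteer of place names, for illustration only.
GAZETTEER = ["River Falls, WI", "Black River Falls, WI", "Baldwin, WI"]


def naive_match(text: str, city: str) -> bool:
    """Substring match: simple, but blind to longer names that contain `city`."""
    return city.lower() in text.lower()


def longest_match(text: str, gazetteer: List[str] = GAZETTEER) -> Optional[str]:
    """Return the longest gazetteer entry found in `text`, or None."""
    hits = []
    for place in gazetteer:
        # Require the name to start and end at a word boundary.
        pattern = r"(?<!\w)" + re.escape(place) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(place)
    return max(hits, key=len) if hits else None


# Invented tweet text, not taken from the actual dataset.
tweet = "Stopped at Hardees in Black River Falls, WI on the drive home"
print(naive_match(tweet, "River Falls, WI"))  # the naive check says yes
print(longest_match(tweet))                   # the longer name wins
```

The same idea applies to coordinate-based matching: whatever the logic is, a few known tricky cases like this one make useful regression checks.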
The detail or the aggregate?
If we're gathering this information on a regular basis for ongoing analysis, it may be as important for us to look at data quality from an aggregate perspective as from a detail perspective. Yes, we can measure whether the individual tweet has relevant geographic coordinates and usable hashtags and some set of useful text expressions, but the aggregate may be more meaningful with social media data if it meets the right parameters and fits our needs.
- How many records did we receive this month?
- Were there shifts in geography for the month?
- Were there shifts in positive or negative views for the month?
- Were there shifts in references for one burger franchise vs. another?
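As a rough sketch of how these month-over-month checks might look, assume each tweet has already been reduced to a record with a brand, a state, and a sentiment label. The field names, the 10-point threshold, and the sample records are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, str]


def share(records: List[Record], field: str) -> Dict[str, float]:
    """Fraction of records per distinct value of `field`."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def shifts(prev: List[Record], curr: List[Record], field: str,
           threshold: float = 0.10) -> Dict[str, float]:
    """Values of `field` whose share moved by more than `threshold`."""
    before, after = share(prev, field), share(curr, field)
    changes = {}
    for key in set(before) | set(after):
        delta = after.get(key, 0.0) - before.get(key, 0.0)
        if abs(delta) > threshold:
            changes[key] = round(delta, 3)
    return changes


def volume_change(prev: List[Record], curr: List[Record]) -> Optional[float]:
    """Relative change in record count, or None if there is no baseline."""
    return (len(curr) - len(prev)) / len(prev) if prev else None


# Invented sample records for two months, for illustration only.
june = ([{"brand": "McDonalds", "state": "WI", "sentiment": "pos"}] * 6
        + [{"brand": "Burger King", "state": "WI", "sentiment": "pos"}] * 4)
july = ([{"brand": "McDonalds", "state": "WI", "sentiment": "neg"}] * 3
        + [{"brand": "Burger King", "state": "MN", "sentiment": "pos"}] * 9)

print(volume_change(june, july))
for field in ("state", "sentiment", "brand"):
    print(field, shifts(june, july, field))
```

Absolute share changes are only one possible measure. A production check would likely also want a minimum volume per category before flagging a shift.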
Once we've identified that a given dataset has sufficient 'nutritive' value for our organization and added some local knowledge as a useful check, these aggregate measures can help indicate shifts in content that could impact how and how well we can utilize the information over time.
As always, the postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.