Use Cases

Watson Analytics That 70s Data Use Case: Exploring the Auto MPG Data Set

Flashback to the 1970s, when cars were big and heavy and used lots of gas. The Auto MPG sample data set is a collection of 398 automobile records from 1970 to 1982. It contains attributes like car name, MPG, number of cylinders, horsepower and weight. With Watson Analytics, I was able to use modern capabilities to quickly explore and predict the relationships between retro MPG, horsepower and weight data. You can use this data to practice some useful analysis techniques and visualizations that you can then apply to your own data sets.

Here’s a quick overview of the data and its relationships. I created this image in Watson Analytics Assemble with some key visualizations that I saved from Explore.

explore_datasets_analytics

Want to try this same data set? Download a copy of the data from the University of California, Irvine (UCI) Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG

The actual data is in the “auto-mpg.data” file and the column names are in the “auto-mpg.names” file.

The raw data file needs a title row before uploading, so I used a text editor to add the following column names as the first row:

mpg_cylinders_horsepower_weight_acceleration

After that, I saved the file with a .csv extension, for example “auto-mpg.data.csv”. The file looked like this.

auto_mpg_csv_files
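If you would rather script this preparation step than use a text editor, here is a minimal Python sketch. It goes one step further than the manual edit described above: besides adding the header row, it rewrites the whitespace-delimited records as comma-separated values. The column names are an assumption based on the attribute list in “auto-mpg.names”, and the function names `raw_to_rows` and `convert` are hypothetical.

```python
import csv
import shlex

# Assumed column names, following the attribute list in "auto-mpg.names".
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "car_name"]


def raw_to_rows(lines):
    """Turn raw whitespace-delimited records into rows, header first.

    shlex.split keeps the double-quoted car name together as one field.
    """
    rows = [COLUMNS]
    for line in lines:
        if line.strip():  # skip blank lines
            rows.append(shlex.split(line))
    return rows


def convert(src="auto-mpg.data", dst="auto-mpg.data.csv"):
    """Read the raw file and write a CSV file with a header row."""
    with open(src) as f:
        rows = raw_to_rows(f)
    with open(dst, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Calling `convert()` in the folder that holds the downloaded file should produce “auto-mpg.data.csv”, ready to upload.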

Exploring the data and relationships

The first thing I did was use Explore to visualize the relationship between MPG, horsepower and weight. Unfortunately, in this data more horsepower goes along with a heavier car and lower fuel efficiency (MPG). Here’s an example I created in Explore to show this downward trend, with some notes I added.

horsepower_heavier_fuel
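To put a number on a trend like this outside of Watson Analytics, you could compute a Pearson correlation coefficient for each pair of columns. The sketch below is only an illustration: the `pearson` helper is hypothetical, and the five rows are made-up values in the style of the data set, not the full file.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up (mpg, horsepower, weight) rows for illustration only.
cars = [(18, 130, 3504), (15, 165, 3693), (24, 95, 2372),
        (27, 88, 2130), (26, 46, 1835)]
mpg = [c[0] for c in cars]
horsepower = [c[1] for c in cars]
weight = [c[2] for c in cars]

print("mpg vs weight:", round(pearson(mpg, weight), 2))
print("mpg vs horsepower:", round(pearson(mpg, horsepower), 2))
print("horsepower vs weight:", round(pearson(horsepower, weight), 2))
```

A value near -1 indicates a strong downward trend and a value near +1 a strong upward trend.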

Quick Tip: Turn on data labels to see the actual car names. Use the Show item labels option to display labels at each data point.

display_labels_visualization

Another way to analyze horsepower is to look at the main attributes of a car’s engine: number of cylinders and engine size (displacement). A more powerful engine (as measured by horsepower) usually means more cylinders and a larger displacement value. This relationship is shown in the following visualization.

Here, I combined horsepower and weight while also displaying the data grouped into three clusters based on number of cylinders (4, 6, or 8). I assigned the bubble size to represent engine displacement.

horsepower_weight_cylinders

How about looking at where these cars were manufactured? Here’s a tree map visualization that breaks down some of the automobiles by where they were manufactured (field = origin). In this case, I did some preprocessing to extract the car make from the combined make and model text. I also used Refine to re-encode the values for origin (1, 2, 3) as “North America”, “Europe”, and “Asia”. More on this in a future blog dedicated to Refine.

manufacturing_region
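The two preprocessing steps mentioned above could also be done in a few lines of Python before uploading. This is a sketch under assumptions, not the Refine workflow itself: it assumes the make is simply the first word of the car name, and the field names `origin`, `car_name` and `make` are hypothetical.

```python
# Mapping taken from the re-encoding described in the text.
ORIGIN_LABELS = {1: "North America", 2: "Europe", 3: "Asia"}


def car_make(car_name):
    """Assume the make is the first word of the combined make and model text."""
    words = car_name.split()
    return words[0] if words else ""


def recode(record):
    """Return a copy of the record with a readable origin and a make field."""
    out = dict(record)
    out["origin"] = ORIGIN_LABELS[int(record["origin"])]
    out["make"] = car_make(record["car_name"])
    return out
```

For example, `recode({"origin": "3", "car_name": "toyota corona mark ii"})` yields a record whose origin is “Asia” and whose make is “toyota”. Makes that are spelled inconsistently in the raw names would still need manual cleanup.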

Verifying the trends in Predict

With Predict, I was able to verify the main trends I saw when exploring the data: weight, horsepower and engine displacement all impact MPG. The main prediction screen summarizes these impacts with color-coded visualizations for the target (MPG) and each of the main predictor attributes.

mpg_predictors_model

Here’s a closer look at the details that display when you hover over the predictors in the spiral diagram. I took some screenshots, combined them in an image editor, and then added some text to create the following basic infographic.

mpg_predictors_infographic

The Predict feature also provided some deeper and more statistical insights into these findings.

Weight has a negative impact on MPG (negative correlation)
I clicked the top predictor, wt drives mpg, to view the following main insight. This insight displays the negative correlation between MPG and weight by showing different groups of weight values. Color intensity is used to denote the related ranges of MPG values.

weight_values_mpg

More horsepower means more weight (positive correlation)
Here’s an example of the positive correlation between horsepower and weight with some added notes.

horsepower_weight_correlations

I wanted to see an approximation of the correlation, so I turned on the following option to display a smoothed line that represents the relationship between weight and MPG.

correlation_weight_mpg

A smoothed line displays as a fit to the data.

data_fit
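The post does not say how Watson Analytics computes its smoothed line. As a rough stand-in, the sketch below fits an ordinary least-squares straight line, which captures the direction and steepness of a relationship but not any curvature a smoother would show. The `fit_line` helper is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Value of the fitted line at x."""
    return slope * x + intercept
```

With weight as `xs` and MPG as `ys`, a negative slope would match the negative correlation described above.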

More horsepower also means lower MPG (negative correlation)
The flip side of lots of horsepower is usually more weight, too, which of course means lower MPG. For this visualization, I added some notes to emphasize the negative correlation.

correlation_horsepower_weight

Here’s the same visualization about horsepower and MPG, but with the smoothed line displayed to approximate the correlation.

approximate_data_correlation

Combining visualizations and infographics to communicate these findings

After using Explore and Predict, I was ready to jump into Assemble so I could combine multiple visualizations and create some infographics.

I started by creating this word cloud of all the car names in the data set, filtered for 1970 to 1974, sized by horsepower and color-coded by number of cylinders.

car_names_word_cloud

Here’s a combination I created with a packed bubble and a word cloud visualization. I filtered this data to show only cars with 8-cylinder engines from 1970 to 1974.

8_cylinder_packed_bubble_word_cloud

In this next example, I took one of the bubble plot visualizations I saved from Explore and added it into a new view. I then enhanced the visualization with images that I found on Wikipedia of some of the exact cars in the data set.

enhanced_bubble_chart

Displaying web pages in a view

Watson Analytics also enables you to include web pages in a view so you can build interactive “information and data mashups.” Here are some examples, but more on this in a future blog.

visualization_mashups

In this example, I added a word cloud and the Wikipedia search web page into the same view. I used this as a quick way to look up and research cars from the data set by copying and pasting car names into the search box of the embedded web page.

web_search_wikipedia_visualization

By blending data and web pages, I created a dynamic and interactive mashup of data and information.

dynamic_interactive_mashup

What does all this mean for you?

As I mentioned earlier, exploring publicly available data is a very good way to practice using Watson Analytics so you can confidently use it on your own data in the future. In addition, even historic data can be used to help make business decisions. In this case, the insights I found could be used today to help the auto industry build more fuel-efficient cars as emissions regulations continue to tighten. Or, the auto industry could compare this data with its own current MPG data to identify where there have been improvements and where more work needs to be done. You can access the data from https://archive.ics.uci.edu/ml/datasets/Auto+MPG

Imagine what you can do with Watson Analytics. It’s easy to get started. Just visit www.watsonanalytics.com and sign up for free.


Data set reference

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Images from Wikipedia

“1974 Pontiac Grand Safari” by Josephew at English Wikipedia. Licensed under CC BY-SA 3.0 via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:1974_Pontiac_Grand_Safari.jpg#/media/File:1974_Pontiac_Grand_Safari.jpg

“1973-1978 Honda Civic 5-door hatchback 01” by OSX – Own work. Licensed under Public Domain via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:1973-1978_Honda_Civic_5-door_hatchback_01.jpg#/media/File:1973-1978_Honda_Civic_5-door_hatchback_01.jpg

“Pontiac Catalina front” by IFCAR – Own work. Licensed under Public Domain via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Pontiac_Catalina_front.jpg#/media/File:Pontiac_Catalina_front.jpg

More Use Cases Stories

Explore

Who Gets Angriest About Airline Delays? The Answer Might Surprise You.

When I read the recent blog on network diagram visualizations in the Watson Analytics community, I was interested to see how a network diagram can show relationships even in dense datasets. Looking at the dataset in more detail, I was curious as to which origin cities had more delayed or cancelled flights, and it got me thinking about people’s reactions to the flight delays. To find out, I compared delayed flights from November 2015 through January 2016 with social media sentiment on flight delays for the same time period.
For the analysis, I chose larger cities. Using the Refine feature, I selected flights whose departure was delayed by at least 30 minutes and restricted the data to the major cities. Chicago had the most delayed flights, which may not be a surprise since it is a major hub and the weather in this timeframe is not ideal. As you can see, many hubs saw flight delays, including Atlanta, Denver and Los Angeles. Among flights delayed by more than 30 minutes, the average delay for Chicago was 84 minutes, but that was not the highest. The lowest average delay times were in Baltimore, Las Vegas and Los Angeles.
Now let’s compare this to the volume of social conversations about flight delays for the cities. I created a social media project, which captures conversations about flight delays (in English) as shown in the following diagram. I also created Themes for major cities to get a sense of what cities people are referencing when talking about delayed flights. After running the analysis on social media, I am able to see that Los Angeles, New York and Chicago have the largest conversation volume for the same time period.
Let's focus on sentiment using a network diagram. The network diagram works well for showing how negative the conversations are, and that negativity is expected because we are evaluating conversations about flight delays. More importantly, you can select any of the nodes in the diagram to highlight its relationships, as we see below. By clicking Los Angeles, I see that Los Angeles is predominantly negative; this is surprising considering that delays occur more often in Chicago. I can also look at the cities by the number of mentions and filter that based on negative sentiment as shown in the next visualization, which shows that Los Angeles is much more negative than the other major cities.
Next I want to see all of the visualizations in a single view. By pinning each of these visualizations to the collection, Watson Analytics makes it easy for me to create a dashboard with key elements from both datasets. From this dashboard, it is easy to conclude that the Los Angeles audience is much less tolerant of flight delays. If you were working in the airline industry, you might want to pay particular attention to the Los Angeles market when dealing with delayed flights.
While I have combined social data with non-social data in the past, I have not done it with this much ease. I strongly recommend that you use analytics with social data together with your transactional data. The observations that you can derive with social and non-social data can be very interesting. You will be able to vet the data and insights better than with a single data set. Try it out yourself!
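The delay-selection step described in this story (keep departures delayed by at least 30 minutes, then compare cities by count and average delay) can be sketched in plain Python. This is an illustration of the idea rather than what Refine does internally; the function name and the `(city, delay_minutes)` input shape are assumptions.

```python
from collections import defaultdict


def average_long_delays(flights, threshold=30):
    """Summarize delays of at least `threshold` minutes per origin city.

    flights: iterable of (city, delay_minutes) pairs (assumed shape).
    Returns {city: (average_delay, number_of_delayed_flights)}.
    """
    totals = defaultdict(lambda: [0, 0])  # city -> [sum of delays, count]
    for city, delay in flights:
        if delay >= threshold:
            totals[city][0] += delay
            totals[city][1] += 1
    return {city: (total / count, count)
            for city, (total, count) in totals.items()}
```

Sorting the result by count shows which cities have the most delayed flights, while sorting by average shows where the delays are longest, which is the distinction the story draws for Chicago.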

Use Cases

#GoGreenGo: Using Watson Analytics for Social Media on St. Patrick's Day

If you are Irish or just support the St. Patrick’s Day rally cry, find out what is top of mind with your compatriots. I happen to know university campuses are all abuzz with the green spirit because I drove by some this morning and saw a good number of green hats milling about. But what are the topics that are capturing the interest of the leprechauns of this day? I took a quick peek at Watson Analytics topic suggestions to find out what people are chattering about. Below you will see some interesting Topic Suggestions from Watson Analytics for Social Media for St. Patrick’s Day. After a quick analysis, the demographics show an almost 1:1 ratio of males to females, with slightly more females talking about St. Patrick’s Day. That is interesting. I was not expecting this demographic. Here is a breakdown of the things people were talking about. I am a little surprised that “luck” is in the top three conversation themes. I was expecting beer, wear green or parade to be up on this list. I guess you don’t know what you don’t know until you do the analysis. I am going to stop here, put on my green cap and shamrock, and paint the town green. I encourage you to discover your own “Pot o’ Gold” with Watson Analytics! Let me know what you find with your comments on the community forum! It’s fun, easy and insightful. If you are not already using Watson Analytics, sign up for free here!

Explore

Visualizing network data to illustrate airline delays

Watson Analytics has recently been expanded with a new set of visualizations that can help you find more informative answers to your data questions quickly. In this blog post, I highlight these capabilities in the Freemium version of Watson Analytics, which (among other things) help users better visualize network data.
Network data is a very common but underused data type. Conceptually, it consists of a collection of items and a collection of connections between pairs of items. Items, in this case, could be people on a social network site, and a connection could exist if person A has friended person B. Or, items could be warehouse locations, and a connection means there is a direct supply route between two locations. Many real-world problems can be modeled as networks and should be presented back to the user as a visual network as well. However, business intelligence tools typically don’t support querying or visualizing network data.
To illustrate how Watson Analytics can help, I’ll use a dataset obtained from the US Bureau of Transportation Statistics that describes airline departure and arrival delays for all US domestic flights, which can be downloaded for free here. This dataset has, for each US departing flight, the flight carrier, flight origin and destination, as well as the amount of and reason for the delay. In this case we’ll use part of the data for December 2014, which consists of a little more than 500 thousand rows. Since the dataset contains departure and arrival locations in separate columns, I should be able to extract a visual route map of each airline by treating a single flight as a connection between an origin and a destination city. Loading up the data in Watson Analytics and starting a new data exploration quickly gets me to the following screen.
Rather than having to shape the data into a network or load the data into an external network visualization tool, I can ask Watson Analytics for the connections between each origin and destination city in the data. Clicking the most relevant suggestion shows me all connections between the origins and destinations of all domestic US airlines. In passing, Watson Analytics has also detected that destination state forms a hierarchy with destination city and has automatically included state names for each city to help disambiguate them.
Each city (called a node) is indicated with a blue dot, and a line (also called an edge) is drawn between two cities if the dataset contains a flight connecting them. The size of each dot is proportional to the number of inbound and outbound connections, so hubs tend to be larger. The visual weight of each edge is proportional to the number of rows, but we could easily change this to the total incurred minutes of delay by changing the line weight mapping.
Although this network diagram is not very readable because there is a very dense cluster of major cities in the center, I can already see Alaskan cities on the periphery of the large central blob. This means that airports like Wrangell, AK, are at least 4 stops away from any other airport in the US. However, a good way to break apart a complex diagram like this is to filter the data down by carrier. In the graphic below I’ve extracted route information for three airlines by simply using the filter capabilities at the bottom of the user interface. You can clearly see the differences in the carriers’ network size, hubs and geographic region.
In this case I’ve used a network visualization to show connections between cities where the data contained both origin city and destination city fields. In the same way, you could use Watson Analytics to show a diagram of connections between friends in a social network, or if that interests you, connections between genes in a genetic network.
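For readers who want to see the underlying idea in code, the following Python sketch treats each flight row as a connection between an origin and a destination, as described above. It is a simplification and not how Watson Analytics builds its diagram: edges are treated as undirected, a node's size is approximated by its number of distinct neighbors, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_network(flights):
    """Build a network from (origin, destination) pairs, one per flight row.

    Returns (neighbors, edge_weight): neighbors maps each city to the set of
    cities it is connected to; edge_weight counts the rows behind each edge.
    """
    neighbors = defaultdict(set)
    edge_weight = defaultdict(int)
    for origin, dest in flights:
        neighbors[origin].add(dest)
        neighbors[dest].add(origin)
        edge_weight[frozenset((origin, dest))] += 1
    return neighbors, edge_weight


def hops(neighbors, start, goal):
    """Breadth-first search: fewest flights from start to goal, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        city, dist = queue.popleft()
        if city == goal:
            return dist
        for nxt in neighbors.get(city, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Here `len(neighbors[city])` plays the role of the dot size, `edge_weight` the role of the line weight, and `hops` gives the number of flights separating two airports, which is what makes a peripheral city sit far from the central cluster.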
A different but also useful case for network visualization is where you have two different sets of items, and you want to show how one type of item, such as a customer, relates to another type of item, such as a product. These types of networks are technically called bipartite networks because their nodes fall into two sets and you only see connections between items in different sets, not among items in the same set. In this case, I can choose to map carriers and destination cities to highlight geographic differences between carriers. I can easily modify my visualization by dragging the “Destination City” attribute to the ‘From’ slot in the visualization and dragging “Carrier” to the ‘To’ slot. One group of nodes (green) represents the carriers, the other (blue) represents cities.
This visualization clearly shows geographical differences between airlines. State-based airlines like Alaska or Hawaiian obviously cater to their remote home states. SkyWest caters to most small airports in the northwestern states, while American Eagle and ExpressJet operate in smaller airports in the southeast. Bipartite graphs are useful if you want to see how different data items associate with other types of data items. These could be age groups and products, tweets and locations, or stores and products. Computing layouts for network diagrams is computationally very expensive, but as long as there are at most a few thousand items, the network diagram can be applied to almost any dataset.
We also deployed two other visualizations that help users quickly determine the most salient items in a set. The word cloud is a well-known visualization type that uses a compact arrangement of words to show the most salient terms in a larger set. Individual words can be colored by a different attribute if needed.
In this case we can use the word cloud to show the carriers with the highest average arrival delay.
The packed bubble is a highly scalable visualization that compactly represents data items as circles, where the radius of the circle represents a value of interest. Here we show the largest arrival delays for all 300 departure cities. The visualization below shows that the average departure delay in Macon, GA, for December 2014 was 67 (!) minutes.
I hope I have shown you how to get more out of your data using these new visualization types. They’re freely available for everyone to try, so please let us know what you think of them.
Frank Van Ham, Master Inventor. Information Visualization and Visual Interaction Expert. IBM Watson Analytics Architect.