When you add a data set, IBM Watson Analytics reads the data and assesses it for data quality. The data quality score measures the degree to which the data is suitable for predictive analysis. Data sets with low quality scores may be suitable for data exploration even if they are not suitable for predictive analysis.
The overall score is the average of the data quality scores of all fields in the data set. Each field's score is determined by missing and constant values, influential categories, outliers, imbalance, and skewness. In this example from SportsData_NFL_2014_REG_PST_players.csv (which is available here), Watson Analytics excludes fields with more than 25% missing values and fields with constant values.
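To check your own data against the two exclusion rules before you upload it, a minimal Python sketch such as the following may help. It is not Watson Analytics' actual algorithm: the function name `find_excluded_fields`, the treatment of empty strings as missing values, and the sample rows are all assumptions made for illustration. Only the 25% threshold and the constant-value rule come from the text above.

```python
def find_excluded_fields(rows, missing_threshold=0.25):
    """Return {field: reason} for fields that fail either exclusion rule.

    rows is a list of dicts, one per record (for example from csv.DictReader).
    Assumption: None and blank strings count as missing values.
    """
    if not rows:
        return {}
    excluded = {}
    for field in rows[0]:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None and str(v).strip() != ""]
        missing_share = 1 - len(present) / len(values)
        if missing_share > missing_threshold:
            # More than 25% of the values are missing.
            excluded[field] = "missing"
        elif len(set(present)) <= 1:
            # Every non-missing value is identical.
            excluded[field] = "constant"
    return excluded


# Hypothetical sample data, not taken from the NFL file.
sample = [
    {"player": "A", "team": "X", "note": ""},
    {"player": "B", "team": "X", "note": ""},
    {"player": "C", "team": "X", "note": "n"},
    {"player": "D", "team": "X", "note": ""},
]
flagged = find_excluded_fields(sample)
```

Here `team` is flagged as constant and `note` is flagged as missing, while `player` passes both checks.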
You access the Data Quality Report from a prediction, using the menu in the upper-left corner.
The Data Quality Report highlights areas where you could optimize your source data. Adding more rows and columns to the data often improves the quality of the data. The more data that Watson Analytics has available to choose from, the more accurate its results are.
Note that you can choose to include a field that Watson Analytics has excluded. For example, you might want to use a field that has more than 25% missing values because you know that the field is important to your analysis. In this case, use the Predict Menu to select Field Properties, change the role of the field to input or target, and regenerate your prediction. This action may affect the quality of your prediction.
How can you influence data quality?
Do your best to clean your data before you add it to Watson Analytics. Simple list files, with one header row followed by rows of data, work best. Some of the typical issues with data sets can be resolved by:
- Removing blank rows from your data file
- Removing summary rows and columns from your data file
- Eliminating column headings and row headings that appear in the same cell
- Avoiding lookup tables
- Avoiding subtotals and aggregations
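The first two items in the list above can be partly automated before upload. The following is a minimal sketch, not a Watson Analytics feature. The function name `drop_blank_and_summary_rows` and the rule that a summary row is one whose first non-empty cell is a label such as "Total" are assumptions; adjust the labels to match your own files.

```python
def drop_blank_and_summary_rows(rows, summary_labels=("total", "subtotal", "grand total")):
    """Return rows without blank rows and without labeled summary rows.

    rows is a list of lists of cell values, header row included.
    Assumption: a summary row is identified by its first non-empty cell.
    """
    cleaned = []
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        if not any(cells):
            continue  # blank row
        first = next((cell for cell in cells if cell), "")
        if first.lower() in summary_labels:
            continue  # summary row such as "Total"
        cleaned.append(row)
    return cleaned


# Hypothetical sample data.
raw = [
    ["name", "yards"],
    ["A", "10"],
    ["", ""],
    ["B", "20"],
    ["Total", "30"],
]
cleaned = drop_blank_and_summary_rows(raw)
```

Summary columns, lookup tables, and merged headings usually need manual review, because they cannot be recognized reliably from cell contents alone.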
More tips for cleaning your data before uploading to Watson Analytics:
- Watson Analytics assumes that the first row of your file contains column headers; descriptive column headers are preferred.
- You must have a header for every column. Watson Analytics assumes that the number of columns in the header row is the number of columns of data. For example, if the first six columns have headers but there are eight columns of data, the last two columns of data are ignored.
- You cannot have empty columns inserted before the data.
- You can have empty rows above the data. Empty rows preceding the data are ignored.
- You cannot have textual rows above the header row. For example, if a title or a description of the data appears above the header row, the file is not read correctly.
- You cannot have textual rows following the data. For example, a row following the data that says “This information came from…” is considered to be part of the data.
More details are in this helpful document: Introduction to Data Loading and Data Quality, including specific conditions that apply to MS Excel and CSV files.