Exploring multivariate value distributions

You can visually explore value distributions and relationships between the columns of a table. You can also retrieve basic statistics of a table, for example, the number of NULL values or the number of different values in the respective columns. In this context, the column of a table is called a field.
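The basic statistics mentioned above can be pictured with a small sketch. It is not how the Design Studio computes them; the sample rows and the helper name field_statistics are hypothetical, and None stands in for a NULL value.

```python
# Hypothetical sample rows; None stands in for an SQL NULL value.
rows = [
    {"AGE": 54, "SEX": "male", "BLOOD_PRESSURE": 132},
    {"AGE": 61, "SEX": "female", "BLOOD_PRESSURE": None},
    {"AGE": 54, "SEX": "male", "BLOOD_PRESSURE": 145},
    {"AGE": 47, "SEX": None, "BLOOD_PRESSURE": 132},
]


def field_statistics(rows):
    """Return the NULL count and the distinct non-NULL value count per field."""
    stats = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        non_null = [v for v in values if v is not None]
        stats[field] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return stats


stats = field_statistics(rows)
for field, s in stats.items():
    print(field, s)
```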
To explore multivariate value distributions in the Design Studio, follow these steps:
  1. In the Data Source Explorer, expand a database connection and navigate to the Tables folder.

  2. Expand the Tables folder, right-click the name of a table, for example, the table HEALTHCARE.HEART in the schema HEALTHCARE, and select Distribution and Statistics > Multivariate Distribution....

    The Input Data Selection dialog is opened.

  3. In the Input Data Selection dialog, accept the default settings by clicking OK.

    The Multivariate Distribution view is opened. It shows the Field statistics table that includes statistics for each field of the selected database table. In the Field statistics table, the fields of numeric types such as INT or REAL are labeled as continuous, the fields of various char-based types are labeled as categorical, and date/time/timestamp fields are labeled as temporal.

  4. In the Field statistics table of the Multivariate Distribution view, select the check boxes of the fields to be explored, for example, AGE and BLOOD_PRESSURE.

    In the Charts for selected fields section of the Multivariate Distribution view, the value distribution charts of the fields AGE and BLOOD_PRESSURE are opened. They show the value distribution of the total population of these fields as shown in the following figure:

    Figure 1. The Multivariate Distribution view
In the Multivariate Distribution view, you can do the following tasks:
Changing the input data set
You can change the input data set to be explored by clicking Change... to open the Input Data Selection dialog.

In the Input Data Selection dialog, you can reduce the data that is used for exploring multivariate distributions. Reducing the data improves performance. For example, you might want to limit the rows that are extracted from the input table, or you might want to remove columns that are not of interest to you. After you click OK, the new set of input data is loaded for further exploration.

The filter condition that you specify in the Input Data Selection dialog is saved by the Design Studio. The next time you run multivariate distribution on the same table, the Input Data Selection dialog uses the filter condition that you set in the previous session. This also works after you close and restart the Design Studio.

Changing preferences
You can change the preference settings of the Multivariate Distribution view. For example, to change the color settings of the bar charts, follow these steps:
  1. In the toolbar above the value distribution charts at the right, click .
  2. In the Preferences dialog for multivariate distributions, modify the color settings of the bar charts and click OK.

Now the bar charts are displayed in the color of your choice.

Comparing the value distribution between the total population and a subgroup population
You can display the value distribution of a subgroup population of your interest by following these steps. The value distribution of the total population is displayed in the background so that you can easily compare the value distribution of the subgroup to the value distribution of the total population.
  1. In the Field statistics table, select the check boxes of the fields whose value distribution charts you want to have displayed in the Charts for selected fields section, for example, AGE, SEX, and BLOOD_PRESSURE.
  2. In the value distribution chart of the field AGE, click the bar that represents patients between 50 and 60 years.
  3. In the value distribution chart of the field SEX, click the bar that represents male patients.
  4. In the value distribution chart of the field BLOOD_PRESSURE, observe the value distribution of the selected patient group.

    The BLOOD_PRESSURE chart shows that male patients between 50 and 60 years of age have approximately the same blood pressure distribution as the total population, though with a tendency toward slightly higher values.

You can select any combination of bars, for example:

  • To select several adjacent bars, press SHIFT and click the first bar and the last bar of the range. You can also select multiple bars by clicking in the chart, holding down the left mouse button, dragging the mouse diagonally across several bars, and releasing the mouse button.
  • To clear the bar selection in a chart, click an empty space in the chart.
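The comparison described in this task can be sketched outside the tool as two histograms over the same intervals: one for all rows and one for the rows that match the selected bars. The patient records below are hypothetical, the interval width of 10 is an assumption, and "between 50 and 60 years" is read here as 50 <= AGE < 60.

```python
from collections import Counter

# Hypothetical patient records (AGE, SEX, BLOOD_PRESSURE); not real data.
patients = [
    (52, "male", 134), (58, "male", 147), (55, "male", 138),
    (45, "female", 122), (63, "male", 151), (57, "female", 129),
    (51, "male", 143), (39, "male", 118),
]

WIDTH = 10  # assumed interval width of the BLOOD_PRESSURE chart


def histogram(values, width=WIDTH):
    """Count values per interval, keyed by the lower bound of the interval."""
    return Counter((v // width) * width for v in values)


# Value distribution of the total population.
total = histogram(bp for _, _, bp in patients)

# Value distribution of the subgroup: male patients aged 50 to 59.
subgroup = histogram(
    bp for age, sex, bp in patients if 50 <= age < 60 and sex == "male"
)

for lower in sorted(total):
    label = f"{lower}-{lower + WIDTH}"
    print(f"{label}: total={total[lower]} subgroup={subgroup.get(lower, 0)}")
```

Printing both counts side by side for each interval plays the role of the background bars in the chart.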

Gaining detailed insight
You can get more detailed information about the multivariate distribution of a numerical field by changing the width of the bars in the chart.

For example, to get more insight about the multivariate distribution of the field BLOOD_PRESSURE, follow these steps:

  1. Select the BLOOD_PRESSURE chart.
  2. In the toolbar above the charts, click .

    Because a numerical field is selected, the Change Intervals dialog is opened.

  3. In the Change Intervals dialog, type 2 in the Interval length entry field and click the Arrow button.

    You need not specify values for all parameters. For example, if you specify only the value for the interval length and click the Arrow button, the system generates the appropriate values for the other parameters and displays them in the Validated Value box as shown in the following figure:

    Figure 2. The Change Intervals dialog
  4. In the Change Intervals dialog, click OK.

Now the value distribution chart shows the blood pressure distribution in intervals of length 2.
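The effect of a smaller interval length can be illustrated with a minimal binning sketch. The blood pressure values, the helper name bin_values, and the start parameter are hypothetical; the other parameters of the Change Intervals dialog are not modeled.

```python
from collections import Counter


def bin_values(values, interval_length, start=0):
    """Count values per interval, keyed by the lower bound of each interval."""
    counts = Counter()
    for v in values:
        index = (v - start) // interval_length
        counts[start + index * interval_length] += 1
    return dict(sorted(counts.items()))


# Hypothetical BLOOD_PRESSURE values.
blood_pressure = [130, 131, 132, 133, 133, 137, 138]

coarse = bin_values(blood_pressure, 10)  # one wide bar hides the detail
fine = bin_values(blood_pressure, 2)     # intervals of length 2

print(coarse)
print(fine)
```

With an interval length of 10, all sample values fall into a single bar. With an interval length of 2, the same values spread over several bars, which is the additional detail that the narrower intervals provide.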

If a categorical field contains a large number of values, it is difficult to display bars for all values in the chart. You can specify the maximum number of values to be displayed in the chart to get a better overview.

For example, to specify the maximum number of values to be displayed in the value distribution chart of the field PAIN_TYPE, follow these steps:
  1. Select the PAIN_TYPE chart.
  2. In the toolbar above the charts, click .

    Because a categorical field is selected, the Change Categories dialog is displayed.

  3. In the Change Categories dialog, specify the maximum number of values to be displayed and click OK.

In the value distribution chart, the most frequent values are now represented by individual bars. All other values are grouped together in another bar that is called Other.
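The grouping into an Other bar can be sketched as follows. The category labels and the helper name limit_categories are hypothetical, and the sketch assumes that the maximum number of values does not count the Other bar itself, which the text does not specify.

```python
from collections import Counter


def limit_categories(values, max_values):
    """Keep the max_values most frequent categories and sum the rest as 'Other'."""
    counts = Counter(values)
    top = counts.most_common(max_values)
    rest = sum(counts.values()) - sum(n for _, n in top)
    result = dict(top)
    if rest:
        result["Other"] = rest
    return result


# Hypothetical PAIN_TYPE values.
pain_types = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]

limited = limit_categories(pain_types, 2)
print(limited)
```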

Hiding fields
To gain a better overview of the fields that you are interested in, you can hide fields in the Field statistics table by typing a text string in the Filter entry field. Only the fields whose names start with the specified text string are displayed.

You can use an asterisk (*) as a wildcard.
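The filter behavior can be approximated with a short sketch: a field is shown if its name starts with the typed text, and each asterisk matches any sequence of characters. The helper name filter_fields is hypothetical, and case-sensitive matching is an assumption; the actual filter might behave differently in details.

```python
import re


def filter_fields(fields, text):
    """Return the fields whose names start with text; '*' matches any characters."""
    # Escape everything except the asterisk, which becomes the regex '.*'.
    parts = [re.escape(part) for part in text.split("*")]
    pattern = re.compile(".*".join(parts))
    # re.match anchors at the start only, which gives the prefix behavior.
    return [name for name in fields if pattern.match(name)]


fields = ["AGE", "SEX", "BLOOD_PRESSURE", "PAIN_TYPE"]

print(filter_fields(fields, "BLOOD"))
print(filter_fields(fields, "*TYPE"))
```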

Normalizing view
In the value distribution chart of the field BLOOD_PRESSURE:
  • 17 patients might have blood pressure values between 130 and 140
  • 9 patients might have blood pressure values between 140 and 150
This means that in the selected group, there are fewer patients with higher blood pressure than patients with lower blood pressure. However, these numbers do not imply that the patients in the selected group are healthier than the overall population. For a correct conclusion, you must compare the percentage of the patients from the selected group who have higher blood pressure values with the percentage of patients who have lower blood pressure values.

To show the percentages of patients in the selected group against the overall population, click in the toolbar. Now you can see that the percentage of the patients from the selected group with higher blood pressure is significantly higher than the percentage of patients with lower blood pressure.
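The reasoning behind the normalized view can be sketched with a small calculation. The counts for the selected group (17 and 9) come from the example above; the counts for the overall population are invented for illustration only, and the helper name normalize is hypothetical. The sketch assumes that the percentage of a bar is the share of the overall population in that interval that belongs to the selected group.

```python
# Counts for the selected group, taken from the example above.
selected = {"130-140": 17, "140-150": 9}

# Counts for the overall population; invented for illustration only.
total = {"130-140": 85, "140-150": 30}


def normalize(selected, total):
    """Return, per bar, the percentage of the overall population in the selected group."""
    return {bar: 100 * selected[bar] / total[bar] for bar in selected}


shares = normalize(selected, total)
for bar, share in shares.items():
    print(f"{bar}: {share:.1f}%")
```

With these assumed totals, the interval with the smaller absolute count has the larger percentage, which is why absolute counts alone can be misleading.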

Sorting bars in a chart of a categorical field
By default, the bars in value distribution charts of categorical fields are sorted by their string values in ascending order. You can change this sort order by right-clicking a chart, for example, PAIN_TYPE, and selecting Sort Bars > By Frequency Descending.

After the bars in the PAIN_TYPE chart are sorted according to the frequencies of the pain types in descending order, you can easily identify the most frequent pain type the patients are suffering from.
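The two sort orders can be compared in a short sketch. The pain type labels and their frequencies are hypothetical.

```python
# Hypothetical frequencies of PAIN_TYPE values.
counts = {"asymptomatic": 4, "atypical": 2, "non-anginal": 9, "typical": 5}

# Default order: by string value, ascending.
by_value = sorted(counts.items())

# Sort Bars > By Frequency Descending: the most frequent value comes first.
by_frequency_desc = sorted(
    counts.items(), key=lambda item: item[1], reverse=True
)

print([name for name, _ in by_value])
print([name for name, _ in by_frequency_desc])
```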


