You can visually explore value distributions and relationships
between the columns of a table. You can also retrieve basic statistics
for a table, for example, the number of NULL values or the number of
distinct values in each column. In this context, a column
of a table is called a field.
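The two basic statistics mentioned above can be illustrated with a minimal Python sketch. The sample rows and the function name `field_statistics` are hypothetical, and the sketch assumes that NULL values (represented here as `None`) are not counted as distinct values, as with SQL `COUNT(DISTINCT ...)`; the Design Studio may compute these numbers differently.

```python
# Hypothetical sample rows; None stands in for SQL NULL.
rows = [
    {"AGE": 54, "SEX": "M", "BLOOD_PRESSURE": 130},
    {"AGE": 61, "SEX": "F", "BLOOD_PRESSURE": None},
    {"AGE": 54, "SEX": "M", "BLOOD_PRESSURE": 142},
    {"AGE": None, "SEX": "F", "BLOOD_PRESSURE": 130},
]


def field_statistics(rows):
    """Return {field: {"nulls": n, "distinct": m}} for a list of row dicts."""
    stats = {}
    fields = rows[0].keys() if rows else []
    for field in fields:
        values = [row[field] for row in rows]
        non_null = [v for v in values if v is not None]
        stats[field] = {
            "nulls": len(values) - len(non_null),
            # Assumption: NULLs are excluded from the distinct count.
            "distinct": len(set(non_null)),
        }
    return stats


stats = field_statistics(rows)
for field, numbers in stats.items():
    print(field, numbers)
```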
To explore multivariate
value distributions in the Design Studio,
follow these steps:
- In the Data Source Explorer, expand a database
connection and
navigate to the Tables folder.
- Expand the Tables folder,
right-click the name of a table
(for example, the table HEALTHCARE.HEART in the schema HEALTHCARE),
and select .
The
Input Data Selection dialog opens.
- In the Input Data
Selection dialog, accept the default settings
by clicking OK.
The Multivariate Distribution
view is opened. It shows the Field statistics table
that includes statistics for each field of the selected database table.
In the Field statistics table, the fields of
numeric types such as INT or REAL are labeled as continuous, the fields
of various char-based types are labeled as categorical, and date/time/timestamp
fields are labeled as temporal.
- In the Field
statistics table of the Multivariate
Distribution view, select the check boxes of the fields to be explored,
for example, AGE and BLOOD_PRESSURE.
In the Charts for
selected fields section of the Multivariate Distribution
view, the value distribution charts of the fields AGE and BLOOD_PRESSURE
are displayed. They show the value distribution of these fields for the
total population, as shown in the following figure:
Figure 1. The Multivariate Distribution view
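The labeling rule that the steps above describe for the Field statistics table can be sketched in Python as follows. The function name `field_label` and the type sets are illustrative assumptions: INT, REAL, DATE, TIME, and TIMESTAMP come from the text, whereas CHAR and VARCHAR are assumed examples of "char-based" types. The actual list of types that the Design Studio recognizes is not given here.

```python
# Types named in the text, plus CHAR/VARCHAR as assumed char-based examples.
CONTINUOUS_TYPES = {"INT", "REAL"}
CATEGORICAL_TYPES = {"CHAR", "VARCHAR"}
TEMPORAL_TYPES = {"DATE", "TIME", "TIMESTAMP"}


def field_label(sql_type):
    """Map an SQL type name such as 'VARCHAR(20)' to a field label."""
    # Drop a length/precision suffix and normalize the case.
    base = sql_type.split("(")[0].strip().upper()
    if base in CONTINUOUS_TYPES:
        return "continuous"
    if base in CATEGORICAL_TYPES:
        return "categorical"
    if base in TEMPORAL_TYPES:
        return "temporal"
    return "unknown"  # Type not covered by this sketch.


for sql_type in ("INT", "varchar(20)", "TIMESTAMP"):
    print(sql_type, "->", field_label(sql_type))
```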
In the Multivariate
Distribution view, you can do the following
tasks:
- Changing the input data set
- You can
change the input data set to be explored by clicking Change... to
open the Input Data Selection dialog.
In the Input Data Selection
dialog, you can reduce the data that is used for exploring the multivariate
distribution. Reducing the data improves performance.
For example, you might want to limit the rows that are extracted from
the input table, or you might want to remove columns that are not
of interest to you. After you click OK, the
new set of input data is loaded for further exploration.
The
filter condition that you specify in the Input Data Selection dialog
is saved by the Design Studio. The next time you explore the multivariate
distribution of the same table, the Input Data Selection dialog
uses the filter condition that you set in the previous session. This
also works after you close and restart the Design Studio.
- Changing preferences
- You can change
the preference settings of the Multivariate Distribution
view. For example, to change the color settings of the bar charts,
follow these steps:
- In the toolbar above the value distribution
charts at the right,
click
.
- In the Preferences dialog for multivariate
distributions, modify
the color settings of the bar charts and click OK.
Now the bar charts are displayed in the color of your choice.
- Comparing the value distribution between the
total population
and a subgroup population
- You can display the value distribution
of a subgroup population
that interests you by following the steps below. The value distribution
of the total population is displayed in the background so that you
can easily compare the value distribution of the subgroup to the value
distribution of the total population.
- In the Field
statistics table, select the
check boxes of the fields whose value distribution charts you want
to have displayed in the Charts for selected fields section,
for example, AGE, SEX, and BLOOD_PRESSURE.
- In the value distribution
chart of the field AGE, click the bar
that represents patients between 50 and 60 years.
- In the value
distribution chart of the field SEX, click the bar
that represents male patients.
- In the value distribution chart
of the field BLOOD_PRESSURE, observe
the value distribution of the selected patient group.
The BLOOD_PRESSURE
chart shows that male patients aged between 50 and 60 years
have approximately the same blood pressure distribution as the total
population, although there is a tendency toward slightly higher values.
You can
select any combination of bars, for example:
- To select
several adjacent bars, press SHIFT and click the first
bar and the last bar of the range. You can also select multiple bars
by clicking in the chart, holding down the left mouse button, dragging
the mouse diagonally across several bars, and releasing the mouse
button.
- To clear the bar selection in a chart, click an empty
space in
the chart.
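The comparison between a subgroup and the total population can be reproduced outside the tool with a short Python sketch. The patient records, the interval width of 10, and the helper names are hypothetical, and the age range is assumed to be half-open (50 up to but not including 60).

```python
from collections import Counter

# Hypothetical patient records (not taken from HEALTHCARE.HEART).
patients = [
    {"AGE": 52, "SEX": "M", "BLOOD_PRESSURE": 135},
    {"AGE": 58, "SEX": "M", "BLOOD_PRESSURE": 148},
    {"AGE": 45, "SEX": "F", "BLOOD_PRESSURE": 122},
    {"AGE": 55, "SEX": "F", "BLOOD_PRESSURE": 131},
    {"AGE": 67, "SEX": "M", "BLOOD_PRESSURE": 144},
    {"AGE": 51, "SEX": "M", "BLOOD_PRESSURE": 139},
]


def bp_bin(value, width=10):
    """Lower bound of the interval of the given width that contains value."""
    return (value // width) * width


def distribution(records):
    """Count records per blood pressure interval."""
    return Counter(bp_bin(r["BLOOD_PRESSURE"]) for r in records)


# Equivalent of clicking the 50-60 bar in AGE and the male bar in SEX.
subgroup = [p for p in patients if 50 <= p["AGE"] < 60 and p["SEX"] == "M"]

total_counts = distribution(patients)
subgroup_counts = distribution(subgroup)

for lower in sorted(total_counts):
    print(f"{lower}-{lower + 10}: total={total_counts[lower]}, "
          f"subgroup={subgroup_counts[lower]}")
```

Selecting bars in several charts corresponds to combining the conditions with a logical AND, as in the list comprehension that builds `subgroup`.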
- Gaining
detailed insight
- You can get more detailed information about
the multivariate distribution
of a numerical field by changing the width of the bars in the
chart.
For example, to get more insight about the multivariate
distribution of the field BLOOD_PRESSURE, follow these steps:
- Select the BLOOD_PRESSURE chart.
- In the toolbar above
the charts, click
. Because a numerical field
is selected, the Change Intervals
dialog is opened.
- In the Change Intervals dialog, type 2 in
the Interval
length entry field and click the Arrow button.
You do
not need to specify values for all parameters. For example, if you specify
only the value for the interval length and click the Arrow button,
the system generates the appropriate values for the other parameters
and displays them in the Validated Value box as shown in the following
figure:
Figure 2. The Change Intervals
dialog
- In the Change Intervals dialog,
click OK.
Now the value distribution
chart shows the blood pressure
distribution in intervals of length 2.
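The effect of a smaller interval length can be sketched in Python as follows. The function `bin_counts`, its `start` parameter, and the sample values are assumptions for illustration; the parameters that the Change Intervals dialog actually derives are not specified here.

```python
from collections import Counter


def bin_counts(values, interval_length, start=0):
    """Count values per interval [lower, lower + interval_length)."""
    counts = Counter()
    for value in values:
        index = int((value - start) // interval_length)
        lower = start + index * interval_length
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical blood pressure readings.
blood_pressure = [130, 131, 132, 133, 133, 136, 137]

counts = bin_counts(blood_pressure, interval_length=2, start=130)
for lower, count in counts.items():
    print(f"{lower}-{lower + 2}: {count}")
```

With an interval length of 10, all seven readings would fall into a single bar; an interval length of 2 spreads them over three bars and shows where the values cluster.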
If a categorical field
contains a large number of values, it is difficult to display bars
for all values in the chart. You can specify the maximum number of
values to be displayed in the chart to get a better overview.
For
example, to specify the maximum number of values to be displayed in
the value distribution chart of the field PAIN_TYPE, follow these
steps:
- Select the PAIN_TYPE chart.
- In the toolbar above
the charts, click
. Because a categorical field
is selected, the Change Categories
dialog is displayed.
- In the Change Categories dialog,
specify the maximum number of
values to be displayed and click OK.
In
the value distribution chart, the most frequent values
are now represented by individual bars. All other values are grouped
together in another bar that is called Other.
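Grouping the less frequent values into an Other bar can be sketched in Python as follows. The function name `top_categories` and the PAIN_TYPE values are hypothetical, and the sketch assumes that the Other bar is shown in addition to the specified maximum number of values, which might differ from what the Design Studio does.

```python
from collections import Counter


def top_categories(values, max_values):
    """Keep the max_values most frequent values; sum the rest as 'Other'."""
    counts = Counter(values)
    top = counts.most_common(max_values)
    other = sum(counts.values()) - sum(n for _, n in top)
    result = dict(top)
    if other:
        # Assumption: 'Other' is an extra bar beyond max_values.
        result["Other"] = other
    return result


# Hypothetical PAIN_TYPE values.
pain_types = ["typical", "atypical", "typical", "none",
              "non-anginal", "typical", "atypical"]

grouped = top_categories(pain_types, max_values=2)
print(grouped)
```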
- Hiding fields
- To get a better overview of the fields that
interest you in the Field statistics table, you can
hide fields by typing a text string in the Filter entry
field. Only the fields whose names start with that text string
are displayed.
You
can use an asterisk (*) as a wildcard.
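The filter behavior described above resembles prefix matching with an optional wildcard, which can be approximated in Python with the standard `fnmatch` module. The function name `filter_fields`, the field BLOOD_SUGAR, and the case-sensitive matching are assumptions. Note also that `fnmatchcase` treats `?` and `[...]` as special characters, which the Filter entry field might not.

```python
from fnmatch import fnmatchcase


def filter_fields(field_names, text):
    """Return the field names that start with text; '*' is a wildcard."""
    # Appending '*' turns the text into a prefix pattern.
    pattern = text + "*"
    return [name for name in field_names if fnmatchcase(name, pattern)]


# BLOOD_SUGAR is a hypothetical field added for illustration.
fields = ["AGE", "SEX", "BLOOD_PRESSURE", "PAIN_TYPE", "BLOOD_SUGAR"]

prefix_matches = filter_fields(fields, "BLOOD")
wildcard_matches = filter_fields(fields, "*TYPE")
print(prefix_matches)
print(wildcard_matches)
```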
- Normalizing
view
- In the value distribution chart of the field BLOOD_PRESSURE:
- 17 patients might have blood pressure values between 130 and 140
- 9 patients might have blood pressure values between 140 and 150
This means that in the selected group, there are fewer patients
with higher blood pressure than patients with lower blood pressure.
However, these numbers do not imply that the patients in the selected
group are healthier than the overall population. For a correct conclusion,
you must compare the percentage of patients from the selected
group who have higher blood pressure values with the percentage
who have lower blood pressure values.
To show the percentages
of patients in the selected group against the overall population,
click
in the toolbar. Now you can see that the percentage of the
patients from the selected group with higher blood pressure is significantly
higher than the percentage of patients with lower blood pressure.
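The reasoning behind the normalized view can be followed with a small Python calculation. The subgroup counts 17 and 9 come from the example above; the overall counts per interval are hypothetical, and the sketch assumes that the normalized view divides each subgroup count by the overall count of the same interval.

```python
# Subgroup counts from the example in the text.
selected = {"130-140": 17, "140-150": 9}
# Hypothetical counts for the overall population.
overall = {"130-140": 85, "140-150": 30}


def normalize(selected, overall):
    """Percentage of each interval's overall count that is in the subgroup."""
    return {
        interval: 100.0 * count / overall[interval]
        for interval, count in selected.items()
        if overall.get(interval)  # Skip intervals without overall patients.
    }


percentages = normalize(selected, overall)
for interval, percent in percentages.items():
    print(f"{interval}: {percent:.1f}%")
```

Under these assumed totals, the interval with fewer subgroup patients in absolute terms (9 versus 17) has the larger share of the overall population, which is the effect that the normalized view makes visible.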
- Sorting bars in a chart of a categorical field
- By default, the bars in value distribution charts of categorical
fields are sorted by their string values in ascending order. You can
change this sort order by right-clicking a chart, for example, PAIN_TYPE,
and selecting .
After the bars
in the PAIN_TYPE chart are sorted by the frequencies of
the pain types in descending order, you can easily identify the most
frequent pain type among the patients.
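The two sort orders can be compared in a short Python sketch. The PAIN_TYPE values are hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical PAIN_TYPE values.
pain_types = ["typical", "atypical", "typical", "none",
              "non-anginal", "typical", "atypical"]

counts = Counter(pain_types)

# Default order: by string value, ascending.
by_value = sorted(counts.items())

# Alternative order: by frequency, descending (ties broken by value).
by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print("By value:    ", by_value)
print("By frequency:", by_frequency)
```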