Most data mining projects start with a data understanding
phase, in which you explore the data that is available for your analysis.
This tutorial introduces the data exploration functions in
the Design Studio. You navigate the database in the Data Source Explorer.
- To explore the properties of the table BANKCUSTOMERS, complete
these steps:
- In the Data Source Explorer, expand the database DWESAMP and
select the table BANKCUSTOMERS.
When you select a table, you can explore the properties
of this table in the Properties view.
- In the Properties view, click the Columns tab.
The Columns tab shows the name and the data type of
the columns in the selected table.
- To explore the table content, right-click the BANKCUSTOMERS
table and select .
The Sample Contents appears in the Result page, showing rows in the table.
- To understand the value distribution and the relationship
between the columns in a table, complete the following steps:
- Right-click the BANKCUSTOMERS table and select .
The Multivariate Distribution Input Data Selection window opens.
- Accept the default settings to use a random sample for
the calculation and click OK.
The field statistics of the table BANKCUSTOMERS are displayed.
- To maximize the view, double-click BANKCUSTOMERS.
The Field statistics table shows the number of rows that
contain a value, the number of rows that contain the NULL value, and
statistical information like minimum, maximum, and mean value for
each column.
- Select the check boxes next to the field names AGE,
GENDER, and MARITAL_STATUS to display the value distribution of these
columns.
By displaying multiple charts, you can explore
the relationships between the columns.
- Now you can close the BANKCUSTOMERS Field statistics
view.
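To make the Field statistics columns concrete, the following minimal Python sketch computes the same kind of summary (rows with a value, rows with NULL, and minimum, maximum, and mean for numeric columns) and a simple value distribution. It is not how the Design Studio performs the calculation. The sample rows and the helper names field_statistics and value_distribution are hypothetical and stand in for the real BANKCUSTOMERS data.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample rows standing in for BANKCUSTOMERS; the values are made up.
rows = [
    {"AGE": 34, "GENDER": "F", "MARITAL_STATUS": "married"},
    {"AGE": 71, "GENDER": "F", "MARITAL_STATUS": "widowed"},
    {"AGE": None, "GENDER": "M", "MARITAL_STATUS": "single"},
    {"AGE": 45, "GENDER": "M", "MARITAL_STATUS": None},
]


def field_statistics(rows, column):
    """Count values and NULLs in one column; add min/max/mean if it is numeric."""
    values = [row[column] for row in rows if row[column] is not None]
    stats = {"values": len(values), "nulls": len(rows) - len(values)}
    if values and all(isinstance(v, (int, float)) for v in values):
        stats.update(min=min(values), max=max(values), mean=mean(values))
    return stats


def value_distribution(rows, column):
    """Count how often each non-NULL value occurs in one column."""
    return Counter(row[column] for row in rows if row[column] is not None)


age_stats = field_statistics(rows, "AGE")
status_stats = field_statistics(rows, "MARITAL_STATUS")
gender_distribution = value_distribution(rows, "GENDER")
```

Comparing the value distributions of several columns, as the charts in the Field statistics view do, is one way to notice relationships between them.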
- To explore the distribution of all other (independent)
variables for a specific dependent variable, complete the following
steps:
- Right-click the BANKCUSTOMERS table and select .
The Compute Bivariate Distributions Target Field window opens.
- Select MARITAL_STATUS as the dependent variable and
click Finish.
A progress bar is displayed while the bivariate distributions
are calculated. When the calculation is completed, the Clustering
Visualizer opens. The Graphics view shows each value of the dependent
variable (MARITAL_STATUS) in a separate row.
- To view the value distribution of customers with MARITAL_STATUS=widowed,
click inside the MARITAL_STATUS="widowed" box.
You can see that the percentage of women is higher in the "widowed" segment
than in the total population of all customers.
- Now you can close the Clustering Visualizer.
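The comparison in the last step, the share of women among widowed customers against the share among all customers, can be sketched in a few lines of Python. This only illustrates the idea of a bivariate distribution and is not the Design Studio's implementation. The customer pairs are invented for the example, and the helper name share is hypothetical.

```python
from collections import Counter

# Hypothetical (GENDER, MARITAL_STATUS) pairs; the values are made up.
customers = [
    ("F", "widowed"), ("F", "widowed"), ("F", "widowed"), ("M", "widowed"),
    ("F", "married"), ("M", "married"), ("M", "married"),
    ("M", "single"), ("F", "single"), ("M", "single"),
]


def share(pairs, gender, status=None):
    """Fraction of customers with the given gender.

    If status is given, only customers with that MARITAL_STATUS are counted.
    Assumes the selected segment is not empty.
    """
    segment = [g for g, s in pairs if status is None or s == status]
    return Counter(segment)[gender] / len(segment)


overall_share = share(customers, "F")             # all customers
widowed_share = share(customers, "F", "widowed")  # one segment of the dependent variable
```

With this made-up data, widowed_share is larger than overall_share, which mirrors the observation that the Graphics view makes visible for the "widowed" row.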