Text analysis can extract more than one concept from the
text of a single input row.
Data mining alone does not handle set-valued input fields well.
However, you can create set-valued input fields with text analysis
and use them with data-mining algorithms.
With text analysis, you can detect concepts or keywords in the
overall records of patients, for example:
“smoker”, “physical inactivity”, “alcoholism”, “obesity”
By combining text analysis with data mining, you can extract a set of
keywords for each patient. For example, the patient with the
record ID 1 might have the following medical history:
"low fitness", "not sports", "stress", "obesity"
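As a rough illustration of how a dictionary-based lookup turns free text into a set of concepts, consider the following minimal Python sketch. It is not the product's Dictionary Lookup operator: the dictionary contents, the sample history text, and the function name `lookup_concepts` are assumptions made for this example, and the matching is a plain substring test.

```python
# Hypothetical dictionary that maps surface phrases to concepts.
# The entries are illustrative and are not taken from the product.
DICTIONARY = {
    "smoker": "smoker",
    "smokes": "smoker",
    "no sports": "physical inactivity",
    "overweight": "obesity",
    "obesity": "obesity",
    "stress": "stress",
}


def lookup_concepts(text):
    """Return the set of concepts whose phrase occurs in the text."""
    lowered = text.lower()
    return {concept for phrase, concept in DICTIONARY.items() if phrase in lowered}


# Hypothetical free-text medical history of one patient.
history = "Patient smokes, reports stress at work, no sports, overweight."
concepts = lookup_concepts(history)
print(sorted(concepts))  # ['obesity', 'physical inactivity', 'smoker', 'stress']
```

Because the result is a set, a concept that is mentioned several times in the same text appears only once for that patient.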
The following graphic shows the mining flow that extracts a set
of keywords for each patient:
Figure 1. Mining flow for set-valued input-fields
The mining flow above includes the following steps:
- The text operator Dictionary Lookup creates a set of rows for
each input row in the table HEALTHCARE.HEART.
- The preprocessing operator Item Aggregator creates an XML document
that includes a set of values that can be processed by the mining
algorithms Associations and Naive Bayes.
- The output of the Item Aggregator operator is joined with the
input table by using a join condition on the record ID. To ensure
that all input rows from the input table HEALTHCARE.HEART are included
in the join result even if text analysis does not find a concept or
a keyword in the text, a Right Outer join is used in the mining flow.
The figure below shows the result of the join operation. Like the
original input table, it contains one row of information for each
patient. However, the free-text field is transformed into a structured
field that includes a set of values.
Figure 2. Output of the Item Aggregator
- The result of the join operation is used as input for the Predictor
operator.
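The aggregation and join steps listed above can be sketched in plain Python. This is only an approximation under stated assumptions: the sample rows, the column names `age` and `symptoms`, and both function names are hypothetical, a Python set stands in for the XML document that the Item Aggregator creates, and an empty set stands in for the missing value that an outer join would produce.

```python
from collections import defaultdict

# Hypothetical rows of the input table: record ID -> structured columns.
heart = {
    1: {"age": 54},
    2: {"age": 61},
    3: {"age": 47},
}

# Hypothetical lookup output: one row per (record ID, keyword) hit.
# Record 2 has no hit at all.
lookup_rows = [
    (1, "low fitness"),
    (1, "stress"),
    (1, "obesity"),
    (3, "smoker"),
]


def aggregate_items(rows):
    """Collect all keywords that belong to one record ID into one set."""
    items = defaultdict(set)
    for record_id, keyword in rows:
        items[record_id].add(keyword)
    return dict(items)


def join_keeping_all_records(table, items):
    """Keep every input row; records without hits get an empty set."""
    return {
        record_id: {**columns, "symptoms": items.get(record_id, set())}
        for record_id, columns in table.items()
    }


joined = join_keeping_all_records(heart, aggregate_items(lookup_rows))
for record_id, row in joined.items():
    print(record_id, row["age"], sorted(row["symptoms"]))
```

Record 2 still appears in the result even though no keyword was found for it, which is the effect that the outer join has in the mining flow.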
You can considerably improve association models and classification
models by including in these models the keywords that are retrieved
by text analysis. For example, based on the
analysis of structured and unstructured data, 40% of the patients
might be eligible to be exempted from further intensive and expensive
medical supervision and control. This result cannot be achieved if
you use structured information only.
Naive Bayes classification
You can use set-valued input-fields that are created by the Item
Aggregator with the Naive Bayes classification algorithm. The Naive
Bayes classification algorithm is provided by the Predictor operator.
In the Mining Settings properties of the Predictor operator, you can
specify the Naive Bayes algorithm.
Figure 3. Specifying the Naive Bayes algorithm
Additionally, you must set the field type of the input column that
contains the output of the Item Aggregator operator to Set-Valued
Categorical.
Figure 4. Setting the field type of the input column that contains the
output of the Item Aggregator to Set-Valued Categorical
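To give an intuition for how a Naive Bayes classifier can use a set-valued field, the following Python sketch treats each possible item as a present-or-absent feature (a Bernoulli model with Laplace smoothing). This is a generic textbook formulation and not necessarily how the Predictor operator works internally. The training records, the class labels `high` and `low`, and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """records: list of (set_of_items, label) pairs. Returns the counts."""
    label_counts = Counter(label for _, label in records)
    # label -> item -> number of records of that label that contain the item
    item_counts = defaultdict(Counter)
    vocabulary = set()
    for items, label in records:
        vocabulary |= items
        item_counts[label].update(items)
    return label_counts, item_counts, vocabulary


def predict(model, items):
    """Return the label with the highest log posterior for a set of items."""
    label_counts, item_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # class prior
        for item in vocabulary:
            # Laplace-smoothed probability that a record of this class
            # contains the item.
            p = (item_counts[label][item] + 1) / (n + 2)
            score += math.log(p if item in items else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: keyword set per patient and a risk label.
records = [
    ({"smoker", "stress"}, "high"),
    ({"smoker", "obesity"}, "high"),
    ({"obesity", "stress"}, "high"),
    (set(), "low"),
    ({"stress"}, "low"),
    (set(), "low"),
]
model = train(records)
print(predict(model, {"smoker", "obesity"}))  # high
print(predict(model, set()))                  # low
```

In this formulation, a patient without any keyword is still classified, because the absence of each item also contributes to the score.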
Associations
You can use set-valued input-fields that are generated by the Item
Aggregator operator with the Associations algorithm.
The following figure shows how to set the field type of the SYMPTOMS
input column to Set-Valued Categorical in the Column properties of the
Associations operator.
Figure 5. Changing the field type of the input field SYMPTOMS to
Set-Valued Categorical
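To illustrate what an association analysis can measure on a set-valued field such as SYMPTOMS, the following Python sketch computes the standard support and confidence of a rule over a few keyword sets. The sample sets and the function names are hypothetical, and the sketch shows only the two measures, not the rule search that the Associations operator performs.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= items for items in records) / len(records)


def confidence(records, body, head):
    """Support of body and head together, divided by the support of the body.

    Assumes that the body occurs in at least one record.
    """
    return support(records, body | head) / support(records, body)


# Hypothetical SYMPTOMS values, one set per patient.
symptoms = [
    {"smoker", "stress"},
    {"smoker", "stress", "obesity"},
    {"obesity"},
    {"smoker"},
]

# Rule "smoker => stress": how often both occur together, and how often
# stress occurs among the smokers.
print(support(symptoms, {"smoker", "stress"}))
print(confidence(symptoms, {"smoker"}, {"stress"}))
```

With these four sets, the rule holds for two of the four patients and for two of the three smokers.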