Data mining — Set-valued input-fields

Text analysis can extract more than one concept from the text of a single input row.

Data mining algorithms alone do not handle set-valued input fields well. However, you can create set-valued input fields with text analysis and use them with data mining algorithms.

With text analysis, you can detect concepts or keywords in patient records, for example:

“smoker”, “physical inactivity”, “alcoholism”, “obesity”

By combining text analysis with data mining, you can extract a set of keywords for each patient. For example, the patient with the record ID 1 might have the following medical history:

"low fitness", "not sports", "stress", "obesity"
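To make the idea of extracting a keyword set from free text concrete, here is a minimal Python sketch of a dictionary-based lookup. It is not the product's Dictionary Lookup operator; the `CONCEPTS` dictionary, the `extract_concepts` function, and the sample note are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical concept dictionary: surface phrase -> concept keyword.
CONCEPTS = {
    "smoker": "smoker",
    "smokes": "smoker",
    "no sports": "physical inactivity",
    "overweight": "obesity",
    "obesity": "obesity",
    "stress": "stress",
}


def extract_concepts(text):
    """Return the set of concepts whose phrases occur in the text."""
    lowered = text.lower()
    found = set()
    for phrase, concept in CONCEPTS.items():
        # Match whole words only, so "stress" does not match "distressed".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(concept)
    return found


# Hypothetical free-text entry from a patient record.
note = "Patient is overweight, reports stress at work, no sports."
print(sorted(extract_concepts(note)))
# prints ['obesity', 'physical inactivity', 'stress']
```

A single text row yields several concepts, which is why the result is naturally a set rather than a single categorical value.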

The following graphic shows the mining flow that extracts a set of keywords for each patient:

Figure 1. Mining flow for set-valued input-fields

The graphic above shows a mining flow for set-valued input-fields.

The mining flow above includes the following steps:

1. The text operator Dictionary Lookup creates a set of rows for each input row in the table HEALTHCARE.HEART.
2. The preprocessing operator Item Aggregator creates an XML document that includes a set of values that can be processed by the mining algorithms Associations and Naive Bayes.
3. The output of the Item Aggregator operator is joined with the input table by using a join condition on the record ID. To ensure that all input rows from the input table HEALTHCARE.HEART are included in the join result even if text analysis does not find a concept or a keyword in the text, a Right Outer join is used in the mining flow.
   The figure below shows the result of the join operation. Like the original input table, it contains one row of information for each patient. However, the free-text field is transformed into a structured field that includes a set of values.
   Figure 2. Output of the Item Aggregator
4. The result of the join operation is used as input for the Predictor operator.
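The aggregation and join steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the behavior of the actual operators: the `patients` table, the `lookup_rows` data, and the function names are hypothetical, and the set is kept as a Python `set` instead of an XML document.

```python
from collections import defaultdict

# Hypothetical stand-in for the input table: record ID -> structured fields.
patients = {
    1: {"age": 54},
    2: {"age": 61},
    3: {"age": 47},
}

# Hypothetical lookup output: one row per (record ID, concept) match.
lookup_rows = [
    (1, "low fitness"), (1, "stress"), (1, "obesity"),
    (3, "smoker"),
]


def aggregate_items(rows):
    """Collapse the per-concept rows into one set of concepts per record ID."""
    grouped = defaultdict(set)
    for record_id, concept in rows:
        grouped[record_id].add(concept)
    return dict(grouped)


def outer_join(patients, item_sets):
    """Keep every patient row, even if no concept was found for it."""
    joined = []
    for record_id, fields in sorted(patients.items()):
        # Patients without any match get None, like NULL in an outer join.
        joined.append({"id": record_id, **fields,
                       "symptoms": item_sets.get(record_id)})
    return joined


result = outer_join(patients, aggregate_items(lookup_rows))
for row in result:
    print(row)
```

Patient 2 has no matching concept but still appears in the result, which is the reason for using an outer join on the record ID instead of an inner join.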

You can considerably improve association models and classification models by including the keywords that are retrieved by text analysis in these models. For example, based on the analysis of structured and unstructured data, 40% of the patients might be eligible to be exempted from further intensive and expensive medical supervision and control. This result cannot be achieved if you use structured information only.

Naive Bayes classification

You can use set-valued input-fields that are created by the Item Aggregator with the Naive Bayes classification algorithm. The Naive Bayes classification algorithm is provided by the Predictor operator.

In the Mining Settings properties of the Predictor operator, you can specify the Naive Bayes algorithm.

Figure 3. Specifying the Naive Bayes algorithm

The graphic above shows how to specify the Naive Bayes algorithm in the Mining Settings properties of the Predictor operator.

Additionally, you must set the field type of the input column that contains the output of the Item Aggregator operator to Set-Valued Categorical.

Figure 4. Setting the field type of the input column that contains the output of the Item Aggregator to Set-Valued Categorical

The graphic above shows the Column properties of the Predictor operator.
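To illustrate how a classifier can consume a set-valued field, the following Python sketch treats each possible item as a present/absent feature in a small Naive Bayes model with Laplace smoothing. This is one common way to handle set-valued input and is an assumption for illustration; it does not describe how the Predictor operator is implemented. The training rows and class labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(rows):
    """rows: list of (set_of_items, class_label) pairs."""
    class_counts = Counter(label for _, label in rows)
    item_counts = defaultdict(Counter)  # label -> item -> number of rows
    vocabulary = set()
    for items, label in rows:
        vocabulary |= items
        for item in items:
            item_counts[label][item] += 1
    return class_counts, item_counts, vocabulary


def predict(model, items):
    """Return the class label with the highest log posterior score."""
    class_counts, item_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the class
        for item in vocabulary:
            # Laplace-smoothed probability that the item is present.
            p = (item_counts[label][item] + 1) / (n + 2)
            score += math.log(p if item in items else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: keyword set per patient and a risk label.
rows = [
    ({"smoker", "obesity"}, "high risk"),
    ({"smoker", "stress"}, "high risk"),
    ({"obesity", "physical inactivity"}, "high risk"),
    ({"stress"}, "low risk"),
    (set(), "low risk"),
    ({"physical inactivity"}, "low risk"),
]
model = train(rows)
print(predict(model, {"smoker", "obesity"}))  # prints high risk
print(predict(model, set()))                  # prints low risk
```

Each member of the set contributes its own evidence, so a patient with several keywords is scored on all of them rather than on one combined category.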

Associations

You can use set-valued input-fields that are generated by the Item Aggregator operator with the Associations algorithm.

The following figure shows how to set the field type of the SYMPTOMS input column to Set-Valued Categorical in the Column properties of the Associations operator.

Figure 5. Changing the field type of the input field SYMPTOMS to Set-Valued Categorical

The graphic above shows the Column properties, where the field type of the SYMPTOMS input column is changed to Set-Valued Categorical.
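As a rough sketch of what an association analysis over a set-valued field computes, the following Python example derives simple one-to-one rules with their support and confidence. It is a brute-force illustration, not the Associations operator's algorithm; the `association_rules` function, the thresholds, and the symptom sets are hypothetical.

```python
from itertools import permutations


def association_rules(transactions, min_support, min_confidence):
    """Return (body, head, support, confidence) for single-item rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for body, head in permutations(items, 2):
        both = sum(1 for t in transactions if body in t and head in t)
        body_count = sum(1 for t in transactions if body in t)
        support = both / n                 # share of rows with both items
        confidence = both / body_count     # share of body rows with the head
        if support >= min_support and confidence >= min_confidence:
            rules.append((body, head, support, confidence))
    return rules


# Hypothetical SYMPTOMS sets, one per patient.
symptoms = [
    {"smoker", "stress"},
    {"smoker", "stress", "obesity"},
    {"obesity"},
    {"smoker"},
]
print(association_rules(symptoms, min_support=0.5, min_confidence=0.9))
# prints [('stress', 'smoker', 0.5, 1.0)]
```

Here every patient with "stress" is also a "smoker", so that rule has a confidence of 1.0, while the reverse rule reaches only 2/3 and is filtered out.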