Quality nodes

The Quality category contains the following quality filter nodes:

Language annotator
De-duplicator
Document quality
PII and HAP annotator
Redaction
Annotation filter
Data class assignment
Terms and classifications
Classify documents

All of these Quality category nodes are optional in your flow. For guidelines on how to sequence the Quality nodes in the flow, see the Next node in the flow section.

Language annotator

The Language annotator node helps ensure accurate processing by detecting the language of documents before ingestion into the language model.

With the Filter if language cannot be detected toggle, you can define how to proceed when the language of the document cannot be recognized.

When set to On, such documents are filtered out from further processing. The final status of the node is then Completed With Errors.
When set to Off (default), documents with an unrecognized language are still processed. The language name lang_name is set to "UKNOWN" and the language score lang_score is set to 0. The final status of the node is Completed With Warnings.
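The toggle behavior described above can be sketched in Python. This is only an illustration of the documented outcomes, not the node's implementation; the function name `apply_language_toggle` and the shape of the `detected` argument are hypothetical.

```python
from typing import Optional, Tuple


def apply_language_toggle(
    doc: dict,
    detected: Optional[Tuple[str, float]],
    filter_if_undetected: bool,
) -> Optional[dict]:
    """Return the annotated document, or None when it is filtered out.

    `detected` is a hypothetical (lang_name, lang_score) pair, or None
    when the language could not be recognized.
    """
    if detected is not None:
        name, score = detected
        return {**doc, "lang_name": name, "lang_score": score}
    if filter_if_undetected:
        # Toggle On: the document is dropped from further processing.
        return None
    # Toggle Off (default): keep the document with the placeholder values
    # that the text above describes.
    return {**doc, "lang_name": "UKNOWN", "lang_score": 0}
```

With the toggle off, an undetected document stays in the flow and can still be removed later with a criteria entry such as `lang_score > 0`.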

The following table lists the features that are added as operators to the Language annotator node:

Feature name	Data value returned	Description	Add value examples for criteria list
`lang_name` Language name	string	Specifies the two-letter code or name of the identified language, according to the ISO 639 specification (International Organization for Standardization). For example, `en` for English.	`lang_name='en'` Filter documents in English.
`lang_score` Language score	float	Specifies the score of the language prediction. The score is a decimal value between 0 and 1.	`lang_score > 0` Filter documents that contain the target language.

De-duplicator

The node removes duplicates of documents. No parameters are required.

Document quality

The Document quality node calculates and annotates several metrics for each document that help you assess the quality of the document. You can profile documents to extract key metadata and structure, and optimize the quality of the documents for ingestion into language models.

The following table lists the features that are added as operators to the document quality node:

Feature name	Data value returned	Description	Add value examples for criteria list
`docq_total_words` Word count	integer	The total number of words.	`docq_total_words >= 50` Filter documents where the total document word count is greater than or equal to `50`.
`docq_mean_word_len` Mean word length	double	The mean length of words.	`docq_mean_word_len = 3` Filter documents where the average word is three characters long.
`docq_symbol_to_word_ratio` Symbols to word	double	The ratio of symbols to words. Examples of symbols are emojis, ellipses (`...`), and hashes (`#`).	`docq_symbol_to_word_ratio >= 0.5` Filter documents where the ratio of symbols to words is at least 0.5, that is, at least one symbol for every two words. This query helps detect if the document is an image (as opposed to a text) file.
`docq_sentence_count` Sentence count	integer	The number of sentences.	`docq_sentence_count >= 1` Filter documents where the document contains at least one full sentence.
`docq_lorem_ipsum_ratio` Lorem ipsum to text length	double	The ratio of the number of occurrences of `lorem ipsum` to the text length. Lorem ipsum (sometimes known as lipsum) is placeholder text that is used in laying out prints, graphics, or web designs.	`docq_lorem_ipsum_ratio < 0.05` Filter documents where the lorem ipsum ratio is less than 0.05 (5%). This query helps detect if the document is a file that contains readable text (as opposed to an image file).
`docq_curly_bracket_ratio` Curly bracket to text length	double	The ratio of the number of occurrences of `{` or `}` curly brackets to the text length.	`docq_curly_bracket_ratio >= 0.5` Filter documents where the curly bracket ratio is at least 0.5 (50%). This query helps detect if the document is a file that contains code in text.
`docq_contain_bad_word` Any curse words?	boolean	Whether the text contains swear words.	`NOT (docq_contain_bad_word = false)` Filter documents that contain curse words.
`docq_bullet_point_ratio` How many lines start with a bullet point?	double	The ratio of lines that start with a bullet point.	`docq_bullet_point_ratio IN (0.1, 0.3, 0.05)` Filter documents where 10%, 30%, or 5% of the lines start with a bullet point.
`docq_ellipsis_line_ratio` How many lines end with an ellipsis?	double	The ratio of lines that end with an ellipsis (`...`).	`docq_ellipsis_line_ratio BETWEEN 0.5 AND 1` Filter documents where 50% to 100% (inclusive) of the lines end with an ellipsis.
`docq_alphabet_word_ratio` How many words have at least one alphabetic character?	double	The ratio of words that have at least one alphabetic character.	`docq_alphabet_word_ratio = 0` Filter documents that contain no words with alphabetic characters. This query helps detect if the document is a binary coded file.
`docq_contain_common_en_words` Any English words?	boolean	Whether the specific `text` contains common English words, such as `the`, `and`, `to`, `that`, `of`, `with`, `be`, and `have`.	`docq_contain_common_en_words IS NOT NULL AND docq_contain_common_en_words = true` Filter documents where the document is not an empty file and must contain common English words.
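To make a few of these metrics concrete, the following Python sketch computes rough equivalents of `docq_total_words`, `docq_mean_word_len`, `docq_curly_bracket_ratio`, `docq_alphabet_word_ratio`, and `docq_ellipsis_line_ratio`. The exact formulas that the node uses are not given here, so the whitespace tokenization, the character-based text length, and the function name `sketch_quality_metrics` are all assumptions made for illustration.

```python
def sketch_quality_metrics(text: str) -> dict:
    """Approximate a handful of document quality metrics (illustrative only)."""
    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    total_words = len(words)
    mean_word_len = sum(len(w) for w in words) / total_words if total_words else 0.0

    # Assumption: "text length" is the number of characters.
    curly_count = text.count("{") + text.count("}")
    curly_ratio = curly_count / len(text) if text else 0.0

    # Words that contain at least one alphabetic character.
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    alpha_ratio = alpha_words / total_words if total_words else 0.0

    # Non-empty lines that end with an ellipsis.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    ellipsis_lines = sum(1 for ln in lines if ln.endswith("..."))
    ellipsis_ratio = ellipsis_lines / len(lines) if lines else 0.0

    return {
        "docq_total_words": total_words,
        "docq_mean_word_len": mean_word_len,
        "docq_curly_bracket_ratio": curly_ratio,
        "docq_alphabet_word_ratio": alpha_ratio,
        "docq_ellipsis_line_ratio": ellipsis_ratio,
    }
```

Seeing the metrics as plain ratios can help when choosing thresholds for the criteria list, for example deciding what `docq_curly_bracket_ratio` value separates code-heavy files from prose in your own data.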

PII and HAP annotator

The PII and HAP annotator node identifies and annotates PII (personally identifiable information) and HAP (hate, abuse, and profanity) to maintain data privacy during model ingestion.

Select whether you want to redact only PII, only HAP, or both.

Select which PII information you want to detect:

Bank account number
Credit card number
Email address
IP address
Phone number
Social security number

With the Redaction and HAP redaction toggles, specify whether you want to remove or mask any identified PII or HAP in the document.

If you want to redact the detected PII from the document, leave the Redaction toggle on. It is on by default. After the documents are processed by the annotator, each feature operator returns a 64-bit integer count value that specifies the number of occurrences of that PII type detected in the documents. You can query the returned count value with an Annotation filter node. Query examples are listed in the table that follows, in the column labeled Add value examples for criteria list when Redaction is enabled.

If the Redaction toggle is off, after the documents are processed by the annotator node, a dictionary of all the detected PII is generated, along with metadata for each detected PII feature. The dictionary is output in JavaScript Object Notation (JSON) format. For more information, see the Understanding how the PII annotator processes documents when redaction is disabled section.

In the Masking character field, specify the character that replaces the detected PII or HAP in the document. The masking character must be a single character. The data value is replaced with a string of similar length that is composed of the masking character. For example, if the masking character is set to X, the phone number 510-555-1234 is replaced with XXXXXXXXXXXX. If no character is specified, any detected PII or HAP is not masked, but is redacted and removed from the document.
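The masking rule can be illustrated with a short Python sketch. It assumes the replacement has exactly the same length as the detected value, as in the phone number example, and that the detected span is given as character offsets; the function name `mask_span` is hypothetical.

```python
def mask_span(text: str, start: int, end: int, masking_char: str = "") -> str:
    """Mask or remove the detected value at text[start:end] (illustrative only)."""
    if len(masking_char) > 1:
        raise ValueError("The masking character must be a single character.")
    # With a masking character, build a same-length replacement string.
    # With no masking character, the replacement is empty, so the value
    # is removed from the text instead of being masked.
    replacement = masking_char * (end - start)
    return text[:start] + replacement + text[end:]
```

For example, `mask_span("Call 510-555-1234 now", 5, 17, "X")` yields `"Call XXXXXXXXXXXX now"`, while omitting the masking character removes the number entirely.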

In the HAP threshold field, specify a number between 0 and 1 that determines how much HAP content should be removed from the document. It represents a sensitivity level.

The following table lists the features that are added as operators to the PII annotator node:

Feature name	Detection method	Description	Add value examples for criteria list when Redaction is enabled
`pii_bank_account` Bank account	Pattern match and context.	A bank account number of 8 to 17 digits.	`pii_bank_account >= 1` Filter documents that contain at least one bank account number.
`pii_credit_card` Credit card	Pattern match and checksum.	A credit card number of 12 to 19 digits.	`pii_credit_card = 3` Filter documents that contain exactly three credit card numbers.
`pii_email_address` Email address	Pattern match, context, and RFC-822 validation.	An email address identifies an email box to which email messages are delivered.	`pii_email_address > 5` Filter documents that contain more than 5 email addresses.
`pii_ip_address` IP address	Pattern match, context, and checksum.	An Internet Protocol (IP) address (either IPv4 or IPv6).	`pii_ip_address <= 8` Filter documents that contain at most 8 IP addresses.
`pii_phone_number` Phone number	Custom logic, pattern match, and context.	A telephone number.	`pii_phone_number <> 0` Filter documents where the `count` of redacted phone numbers is not 0.
`pii_ssn_details` Social security number	Pattern match and context.	A social security number (SSN) with 9 digits.	`NOT (pii_ssn_details = 0)` Filter documents where the `count` of redacted social security numbers is not 0.

Understanding how the PII annotator processes documents when redaction is disabled

When you switch Redaction off, the PII annotator keeps the data in the document unmasked. Instead of returning a 64-bit integer count value, the PII annotator stores a dictionary of all the PII features and collects metadata for each PII feature.

For example, the PII annotator is given these two document inputs:

The first document contains the following text:

Your email is support@ibm.com! 5340904586541378 Only the next instance of email will be processed. test@ibm.com. Your SSN is 123-45-6789.

The second document contains the following text:

Subject: Assistance 127.0.0.0 my_name@ibm.com with credit card update [213254000]

The output generates the following dictionary in a table format that is compatible with DuckDB SQL queries. To transform and visualize the output, an equivalent JavaScript Object Notation (JSON) format is used to illustrate this example:

[
    {
        "content": "Your email is support@ibm.com! 5340904586541378 Only the next instance of email will be processed. test@ibm.com. Your SSN is 123-45-6789.",
        "pii_bank_account": [
            {
                "detection": "BankAccountNumber.CreditCardNumber.Master",
                "end": 47,
                "score": 0.8,
                "start": 31,
                "text": "5340904586541378"
            }
        ],
        "pii_credit_card": null,
        "pii_email_address": [
            {
                "detection": "EmailAddress",
                "end": 29,
                "score": 0.8,
                "start": 14,
                "text": "support@ibm.com"
            },
            {
                "detection": "EmailAddress",
                "end": 111,
                "score": 0.8,
                "start": 99,
                "text": "test@ibm.com"
            }
        ],
        "pii_ip_address": [],
        "pii_phone_number": null,
        "pii_ssn_details": [
            {
                "detection": "NationalNumber.SocialSecurityNumber.US",
                "end": 136,
                "score": 0.8,
                "start": 125,
                "text": "123-45-6789"
            }
        ]
    },
    {
        "content": "Subject: Assistance 127.0.0.0 my_name@ibm.com with credit card update [213254000]",
        "pii_bank_account": [],
        "pii_credit_card": null,
        "pii_email_address": [
            {
                "detection": "EmailAddress",
                "end": 46,
                "score": 0.8,
                "start": 30,
                "text": "my_name@ibm.com"
            }
        ],
        "pii_ip_address": [
            {
                "detection": "IPAddress",
                "end": 29,
                "score": 0.8,
                "start": 20,
                "text": "127.0.0.0"
            }
        ],
        "pii_phone_number": null,
        "pii_ssn_details": []
    }
]

Add value examples for criteria list when Redaction is disabled for this output:

pii_ssn_details IS NOT NULL Filter all documents that contain a social security number. The first document contains a social security number (123-45-6789), whereas the second document does not. As a result, the PII annotator node returns only the output of the first document, which is the first row of the JSON object.

pii_phone_number IS NULL Filter all documents that do not contain a phone number. Neither the first nor the second document contains a phone number. As a result, the PII annotator node returns the output of both documents, which are both rows of the JSON object.

array_length(pii_email_address) = 2 Filter all documents that contain exactly two email addresses. The first document contains two email addresses (support@ibm.com and test@ibm.com), whereas the second document contains only one email address and so does not match the query. As a result, the PII annotator node returns only the output of the first document, which is the first row of the JSON object.

array_length(array_filter(pii_email_address, x -> x.score >= 0.8)) > 0 Filter all documents where the confidence score for at least one email address is greater than or equal to 80%. The confidence score of a PII feature is a probability that indicates how reliably the target PII was detected. In this example, the node is 80% sure that support@ibm.com is an actual email address. The confidence score is also 80% for test@ibm.com from the first document and my_name@ibm.com from the second document. As a result, the PII annotator node returns the output of both documents, which are both rows of the JSON object.
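As a cross-check of the reasoning above, the email and phone number criteria can be mirrored as plain Python predicates over a trimmed copy of the example output. This is only an analogy to the SQL criteria, not how the node evaluates them; the `name` field and the helper function names are hypothetical additions for readability.

```python
# Trimmed copy of the example output: only the fields used below are kept.
docs = [
    {
        "name": "first",
        "pii_email_address": [
            {"text": "support@ibm.com", "score": 0.8},
            {"text": "test@ibm.com", "score": 0.8},
        ],
        "pii_phone_number": None,
    },
    {
        "name": "second",
        "pii_email_address": [{"text": "my_name@ibm.com", "score": 0.8}],
        "pii_phone_number": None,
    },
]


def phone_is_null(doc: dict) -> bool:
    # Mirrors: pii_phone_number IS NULL
    return doc["pii_phone_number"] is None


def has_two_emails(doc: dict) -> bool:
    # Mirrors: array_length(pii_email_address) = 2
    return len(doc["pii_email_address"] or []) == 2


def has_confident_email(doc: dict, threshold: float = 0.8) -> bool:
    # Mirrors: array_length(array_filter(pii_email_address, x -> x.score >= 0.8)) > 0
    return any(e["score"] >= threshold for e in doc["pii_email_address"] or [])


def matching(predicate) -> list:
    """Return the names of the documents that satisfy the predicate."""
    return [d["name"] for d in docs if predicate(d)]
```

Calling `matching(has_two_emails)` selects only the first document, while `matching(phone_is_null)` and `matching(has_confident_email)` select both, which agrees with the results described for the corresponding criteria.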

Redaction

Enter a word or a regex pattern to automatically redact matching text:

Use a plain word for simple redactions, like: password.
Use a regex for advanced patterns, for example: \d{3}-\d{2}-\d{4} for social security numbers.

Select a character to be used for redaction.
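The difference between a plain word and a regex pattern can be sketched with Python's `re` module. The sketch assumes that each match is replaced by a run of the redaction character with the same length as the match, which is not stated above; the function name `redact` and its parameters are hypothetical.

```python
import re


def redact(text: str, pattern: str, redaction_char: str = "*", is_regex: bool = False) -> str:
    """Replace every match with a same-length run of the redaction character."""
    # A plain word is escaped so that it is matched literally;
    # a regex pattern is compiled as written.
    regex = re.compile(pattern if is_regex else re.escape(pattern))
    return regex.sub(lambda m: redaction_char * len(m.group(0)), text)
```

For example, `redact("SSN 123-45-6789", r"\d{3}-\d{2}-\d{4}", "#", is_regex=True)` returns `"SSN ###########"`, and `redact("my password is x", "password")` returns `"my ******** is x"`.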

Annotation filter

The Annotation filter node adds metadata annotations to documents to guide downstream extraction and improve how the content is processed by the language model.

Criteria list

The criteria list filter specifies a list of structured query language (SQL) conditions, written without the WHERE clause, that you want to apply during the document processing pipeline.

Click Add value to add a criteria list filter entry. Then type the SQL condition without the WHERE clause. For example:

lang_name='fr' Filter documents in French.

pii_email_address IS NOT NULL AND pii_ssn_details IS NOT NULL Filter documents that contain both email and social security details.

lang_name - potential values The -potential values parameter is handy to get a reference link or list of values for an operator.

To remove a filter entry for Criteria list, select the checkbox next to each query value that you want to remove from the pipeline. Or if you want to remove all your filter entries, select the Values checkbox. Then click the Delete icon.

Logical operator

Select either the And or the Or logical operator to combine the SQL queries in your criteria list.
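One way to picture the effect of the logical operator is to join the criteria entries into a single condition. The following Python sketch does that with string concatenation; how the node actually combines the entries is not described here, so the parenthesization and the function name `combine_criteria` are assumptions.

```python
from typing import List


def combine_criteria(criteria: List[str], operator: str = "AND") -> str:
    """Join criteria list entries into one condition (illustrative only)."""
    op = operator.upper()
    if op not in ("AND", "OR"):
        raise ValueError("The logical operator must be And or Or.")
    # Parenthesize each entry so that its own AND/OR terms keep their grouping.
    return f" {op} ".join(f"({c})" for c in criteria)
```

For example, `combine_criteria(["lang_name='fr'", "lang_score > 0.5"], "Or")` produces `(lang_name='fr') OR (lang_score > 0.5)`.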

Features to drop

The features to drop filter specifies a list of feature names that you want to omit from the document processing pipeline.

Click Add value to add a features to drop filter entry. Then type a feature name from the Available features table. For example:

lang_score
docq_contain_common_en_words

To remove a filter entry for Features to drop, select the checkbox next to each feature name that you no longer want the document processing to omit from the pipeline. Or if you want to remove all your filter entries, select the Values checkbox. Then click the Delete icon.

Data class assignment

This operator assigns data classes to each document.

Terms and classifications

This operator assigns business terms and classifications to each document.

Classify documents

This operator classifies documents to assign them to the most appropriate document class. It uses predefined document classes to classify text in your documents and identify whether the data in your document matches a certain key-value pair format for correct text extraction into fields in an entity table. You can select which document classes to use in a flow. For more information, see Predefined document classes.

Next node in the flow

Keep in mind that all of the following Quality category nodes are optional in your flow:

Document quality
PII and HAP annotator
Language annotator
Annotation filter

The guideline to sequence the nodes within the set of Quality category nodes is:

If you have a Document quality node, a Language annotator node, or both, and you want to filter documents based on those features, follow with an Annotation filter node.
If you have a PII and HAP annotator node with Redaction enabled, the node redacts or masks your data and returns a count for all the redacted PII. You can optionally follow it with an Annotation filter node to add another document processing task to your flow.

Whether you add any Quality nodes or none at all, the next node must be the Transform data node, followed by the Generate output node, to complete the flow.

Learn more

Creating a data preparation flow