Quality nodes
The Quality category contains the following quality filter nodes:
- Language annotator
- De-duplicator
- Document quality
- PII and HAP annotator
- Redaction
- Annotation filter
- Data class assignment
- Terms and classifications
- Classify documents
All of these Quality category nodes are optional in your flow. For guidelines on how to sequence the Quality nodes in the flow, see the Next node in the flow section.
Language annotator
The Language annotator node helps ensure accurate processing by detecting the language of documents before ingestion into the language model.
With the Filter if language cannot be detected toggle, you can define how to proceed when the language of the document cannot be recognized.
- When set to On, such documents are filtered out from further processing. The final status of the node is then Completed With Errors.
- When set to Off (default), documents with unrecognized language are processed, the language lang_name is defined as "UKNOWN", and the language score lang_score is set to 0. The final status of the node is Completed With Warnings.
The following table lists the features that are added as operators to the language filter node:
| Feature name | Data value returned | Description | Add value examples for criteria list |
|---|---|---|---|
| lang_name (Language name) | string | Specifies the two-letter code or the name of the identified language according to the ISO 639 specification (International Organization for Standardization). For example, en for English. | lang_name='en': Filter documents in English. |
| lang_score (Language score) | float | Specifies the score of the language prediction. The score is a decimal value between 0 and 1. | lang_score > 0: Filter documents that contain the target language. |
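The toggle behavior described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the function name apply_language_toggle, the input shape (a list of dictionaries whose lang_name is None when detection failed), and the plain "Completed" status for a run without undetected languages are hypothetical and are not taken from the product.

```python
UNKNOWN_LANG = "UKNOWN"  # value quoted in this section for undetected languages


def apply_language_toggle(docs: list[dict], filter_undetected: bool = False) -> tuple[list[dict], str]:
    """Hypothetical model of the 'Filter if language cannot be detected' toggle."""
    kept = []
    saw_undetected = False
    for doc in docs:
        if doc.get("lang_name") is None:
            saw_undetected = True
            if filter_undetected:
                continue  # On: drop the document from further processing
            # Off (default): keep the document with placeholder annotations
            doc = {**doc, "lang_name": UNKNOWN_LANG, "lang_score": 0.0}
        kept.append(doc)
    if not saw_undetected:
        status = "Completed"  # assumption: status when every language was detected
    elif filter_undetected:
        status = "Completed With Errors"
    else:
        status = "Completed With Warnings"
    return kept, status
```

With the toggle off, a document without a detected language stays in the result and can still be excluded later with a criterion such as lang_score > 0.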
De-duplicator
The De-duplicator node removes duplicate documents. No parameters are required.
Document quality
The Document quality node calculates several metrics for each document and annotates the document with them, so that you can assess its quality. You can use the metrics to profile documents, extract key metadata and structure, and optimize the quality of the documents for ingestion into language models.
The following table lists the features that are added as operators to the document quality node:
| Feature name | Data value returned | Description | Add value examples for criteria list |
|---|---|---|---|
| docq_total_words (Word count) | integer | The total number of words. | docq_total_words >= 50: Filter documents where the total word count is greater than or equal to 50. |
| docq_mean_word_len (Mean word length) | double | The mean length of words. | docq_mean_word_len = 3: Filter documents where the average word is three characters long. |
| docq_symbol_to_word_ratio (Symbols to word) | double | The ratio of symbols to words. Examples of symbols are emojis, ellipsis (...), and hash (#). | docq_symbol_to_word_ratio >= 0.5: Filter documents that contain at least one symbol for every two words. This query helps detect if the document is an image file (as opposed to a text file). |
| docq_sentence_count (Sentence count) | integer | The number of sentences. | docq_sentence_count >= 1: Filter documents that contain at least one full sentence. |
| docq_lorem_ipsum_ratio (Lorem ipsum to text length) | double | The ratio of the number of occurrences of lorem ipsum to the text length. Lorem ipsum (sometimes known as lipsum) is placeholder text that is used in laying out prints, graphics, or web designs. | docq_lorem_ipsum_ratio < 0.05: Filter documents where the lorem ipsum ratio is less than 5%. This query helps detect if the document is a file that contains readable text (as opposed to an image file). |
| docq_curly_bracket_ratio (Curly bracket to text length) | double | The ratio of the number of occurrences of { or } curly brackets to the text length. | docq_curly_bracket_ratio >= 0.5: Filter documents where the curly bracket ratio is at least 50%. This query helps detect if the document is a file that contains code in text. |
| docq_contain_bad_word (Any curse words?) | boolean | Whether the text contains swear words. | NOT (docq_contain_bad_word = false): Filter documents that contain curse words. |
| docq_bullet_point_ratio (How many lines start with a bullet point?) | double | The ratio of lines that start with a bullet point. | docq_bullet_point_ratio IN (0.1, 0.3, 0.05): Filter documents where exactly 10%, 30%, or 5% of the lines start with a bullet point. |
| docq_ellipsis_line_ratio (How many lines end with an ellipsis?) | double | The ratio of lines that end with an ellipsis (...). | docq_ellipsis_line_ratio BETWEEN 0.5 AND 1: Filter documents where 50% to 100% (inclusive) of the lines end with an ellipsis. |
| docq_alphabet_word_ratio (How many words have at least one alphabetic character?) | double | The ratio of words that have at least one alphabetic character. | docq_alphabet_word_ratio = 0: Filter documents where no word contains an alphabetic character. This query helps detect if the document is a binary coded file. |
| docq_contain_common_en_words (Any English words?) | boolean | Whether the text contains common English words, such as the, and, to, that, of, with, be, and have. | docq_contain_common_en_words IS NOT NULL AND docq_contain_common_en_words = true: Filter documents that are not empty files and that contain common English words. |
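To make the metric definitions in the table more concrete, the following Python sketch approximates a few of them. The tokenization and sentence rules are assumptions made for illustration (whitespace-separated words, sentences ending in ".", "!" or "?", and "-", "*" or "•" as bullet markers); the node's actual rules are not documented here and may differ, and the function name document_quality is hypothetical.

```python
import re


def document_quality(text: str) -> dict:
    """Approximate a few docq_* metrics; the real node's rules may differ."""
    words = text.split()  # assumption: words are whitespace-separated tokens
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: a sentence is a run of text closed by '.', '!' or '?'.
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    n_words = len(words)
    n_lines = len(lines)
    return {
        "docq_total_words": n_words,
        "docq_mean_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "docq_sentence_count": len(sentences),
        "docq_bullet_point_ratio": (
            sum(line.startswith(("-", "*", "•")) for line in lines) / n_lines if n_lines else 0.0
        ),
        "docq_ellipsis_line_ratio": (
            sum(line.endswith(("...", "…")) for line in lines) / n_lines if n_lines else 0.0
        ),
        "docq_alphabet_word_ratio": (
            sum(any(c.isalpha() for c in w) for w in words) / n_words if n_words else 0.0
        ),
    }
```

For example, document_quality("Hello world. This is fine!") reports five words and two sentences, so the document would satisfy both docq_total_words >= 5 and docq_sentence_count >= 1.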
PII and HAP annotator
The PII and HAP annotator node identifies and annotates PII (personally identifiable information) and HAP (hate, abuse, and profanity) to maintain data privacy during model ingestion.
Select whether you want to redact PII only, HAP only, or both.
Select which types of PII you want to detect:
- Bank account number
- Credit card number
- Email address
- IP address
- Phone number
- Social security number
With the Redaction and HAP redaction toggles, specify whether you want to remove or mask any identified PII or HAP in the document.
If you want to redact the detected PII from the document, leave the Redaction toggle on. It is on by default. After the documents are processed by the annotator, each feature operator returns a 64-bit integer count value that specifies the number of occurrences of that PII type detected in the documents. You can query the returned count value with an Annotation filter node. Query examples are listed in the following table under the column labeled Add value examples for criteria list when Redaction is enabled.
If the Redaction toggle is off, the annotator node generates, after the documents are processed, a dictionary of all the detected PII, along with metadata for each detected PII feature. The dictionary is output in JavaScript Object Notation (JSON) format. For more information, see the Understanding how the PII annotator processes documents when redaction is disabled section.
In the Masking character field, specify the character that replaces the detected PII or HAP in the document. The masking character must be a single character. Each detected value is replaced with a string of similar length that is composed of the masking character. For example, if the masking character is set to X, the phone number 510-555-1234 is replaced with XXXXXXXXXXXX. If no character is specified, any detected PII or HAP is not masked, but is redacted and removed from the document.
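The masking rule can be sketched in a few lines of Python. This is a minimal illustration, not the node's implementation: the function name redact_spans and the representation of detections as (start, end) character offsets are assumptions, and the sketch replaces each span with exactly as many masking characters as the span is long.

```python
def redact_spans(text: str, spans: list[tuple[int, int]], mask_char: str = "") -> str:
    """Mask or remove detected spans; spans are hypothetical (start, end) offsets."""
    if len(mask_char) > 1:
        raise ValueError("masking character must be a single character")
    # Work from the last span to the first so that earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        # An empty masking character yields an empty replacement, which removes the span.
        replacement = mask_char * (end - start)
        text = text[:start] + replacement + text[end:]
    return text


masked = redact_spans("Call 510-555-1234 now", [(5, 17)], "X")
removed = redact_spans("Call 510-555-1234 now", [(5, 17)])
```

Here masked is "Call XXXXXXXXXXXX now", and removed keeps only the surrounding text because no masking character was given.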
In the HAP threshold field, specify a number between 0 and 1 that determines how much HAP content should be removed from the document. It represents a sensitivity level.
The following table lists the features that are added as operators to the PII annotator node:
| Feature name | Detection method | Description | Add value examples for criteria list when Redaction is enabled |
|---|---|---|---|
| pii_bank_account (Bank account) | Pattern match and context. | A bank account number of 8 to 17 digits. | pii_bank_account >= 1: Filter documents that contain at least one bank account number. |
| pii_credit_card (Credit card) | Pattern match and checksum. | A credit card number of 12 to 19 digits. | pii_credit_card = 3: Filter documents that contain exactly three credit card numbers. |
| pii_email_address (Email address) | Pattern match, context, and RFC 822 validation. | An email address identifies an email box to which email messages are delivered. | pii_email_address > 5: Filter documents that contain more than 5 email addresses. |
| pii_ip_address (IP address) | Pattern match, context, and checksum. | An Internet Protocol (IP) address (either IPv4 or IPv6). | pii_ip_address <= 8: Filter documents that contain at most 8 IP addresses. |
| pii_phone_number (Phone number) | Custom logic, pattern match, and context. | A telephone number. | pii_phone_number <> 0: Filter documents where the count of redacted phone numbers is not 0. |
| pii_ssn_details (Social security number) | Pattern match and context. | A social security number (SSN) with 9 digits. | NOT (pii_ssn_details = 0): Filter documents where the count of redacted social security numbers is not 0. |
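The table names "pattern match and checksum" as the detection method for credit cards without saying which checksum is used. As an illustration only, the following Python sketch assumes the Luhn algorithm, which is the check-digit scheme commonly used for payment card numbers, combined with the 12 to 19 digit pattern from the table. The names CARD_PATTERN, luhn_valid, and count_credit_cards are hypothetical.

```python
import re

CARD_PATTERN = re.compile(r"\b\d{12,19}\b")  # 12 to 19 digits, as listed in the table


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def count_credit_cards(text: str) -> int:
    """Count pattern matches that also pass the checksum (pattern match and checksum)."""
    return sum(luhn_valid(m.group()) for m in CARD_PATTERN.finditer(text))
```

A count produced this way plays the same role as the pii_credit_card value that a criterion such as pii_credit_card = 3 is compared against.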
Understanding how the PII annotator processes documents when redaction is disabled
When you switch the Redaction toggle off, the PII annotator keeps the data in the document unmasked. Instead of returning a 64-bit integer count value, the PII annotator stores a dictionary of all the PII features and collects metadata for each PII feature.
For example, the PII annotator is given these two document inputs:
- The first document contains the following text:
Your email is support@ibm.com! 5340904586541378 Only the next instance of email will be processed. test@ibm.com. Your SSN is 123-45-6789.
- The second document contains the following text:
Subject: Assistance 127.0.0.0 my_name@ibm.com with credit card update [213254000]
The output is a dictionary in a table format that is compatible with DuckDB SQL queries. For readability, this example shows the output in an equivalent JavaScript Object Notation (JSON) format:
[
  {
    "content": "Your email is support@ibm.com! 5340904586541378 Only the next instance of email will be processed. test@ibm.com. Your SSN is 123-45-6789.",
    "pii_bank_account": [
      {
        "detection": "BankAccountNumber.CreditCardNumber.Master",
        "end": 47,
        "score": 0.8,
        "start": 31,
        "text": "5340904586541378"
      }
    ],
    "pii_credit_card": null,
    "pii_email_address": [
      {
        "detection": "EmailAddress",
        "end": 29,
        "score": 0.8,
        "start": 14,
        "text": "support@ibm.com"
      },
      {
        "detection": "EmailAddress",
        "end": 111,
        "score": 0.8,
        "start": 99,
        "text": "test@ibm.com"
      }
    ],
    "pii_ip_address": [],
    "pii_phone_number": null,
    "pii_ssn_details": [
      {
        "detection": "NationalNumber.SocialSecurityNumber.US",
        "end": 136,
        "score": 0.8,
        "start": 125,
        "text": "123-45-6789"
      }
    ]
  },
  {
    "content": "Subject: Assistance 127.0.0.0 my_name@ibm.com with credit card update [213254000]",
    "pii_bank_account": [],
    "pii_credit_card": null,
    "pii_email_address": [
      {
        "detection": "EmailAddress",
        "end": 46,
        "score": 0.8,
        "start": 30,
        "text": "my_name@ibm.com"
      }
    ],
    "pii_ip_address": [
      {
        "detection": "IPAddress",
        "end": 29,
        "score": 0.8,
        "start": 20,
        "text": "127.0.0.0"
      }
    ],
    "pii_phone_number": null,
    "pii_ssn_details": []
  }
]
Add value examples for criteria list when Redaction is disabled, based on this output:
pii_ssn_details IS NOT NULL
Filter all documents that contain a social security number. The first document contains a social security number (123-45-6789), whereas the second document does not. As a result, the PII annotator node returns the output of the first document only, which is the first row of the JSON object.
pii_phone_number IS NULL
Filter all documents that do not contain a phone number. Neither the first nor the second document contains a phone number. As a result, the PII annotator node returns the output of both documents, which are the 2 rows of the JSON object.
array_length(pii_email_address) = 2
Filter all documents that contain exactly two email addresses. The first document contains two email addresses (support@ibm.com and test@ibm.com), whereas the second document contains only a single email address and not the two that the query requires. As a result, the PII annotator node returns the output of the first document only, which is the first row of the JSON object.
array_length(array_filter(pii_email_address, x -> x.score >= 0.8)) > 0
Filter all documents that contain at least one email address with a confidence score greater than or equal to 80%. The confidence score of a PII feature is a probability that indicates how reliable the detection of the target PII is. In this example, the node is 80% sure that support@ibm.com is an actual email address. The confidence score is also 80% for test@ibm.com from the first document and my_name@ibm.com from the second document. As a result, the PII annotator node returns the output of both documents, which are the 2 rows of the JSON object.
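To see how the criteria that use array functions and NULL checks select rows, the sample output can be emulated in plain Python. This sketch is an approximation for illustration only: it keeps just the fields that the criteria touch, the helper matching_rows is hypothetical, and the node itself evaluates SQL rather than Python.

```python
rows = [
    {  # first document (reduced to the relevant fields)
        "pii_email_address": [
            {"text": "support@ibm.com", "score": 0.8},
            {"text": "test@ibm.com", "score": 0.8},
        ],
        "pii_phone_number": None,
    },
    {  # second document (reduced to the relevant fields)
        "pii_email_address": [{"text": "my_name@ibm.com", "score": 0.8}],
        "pii_phone_number": None,
    },
]


def matching_rows(predicate) -> list[int]:
    """Return the 0-based indexes of the rows that satisfy the predicate."""
    return [i for i, row in enumerate(rows) if predicate(row)]


# pii_phone_number IS NULL
phone_is_null = matching_rows(lambda r: r["pii_phone_number"] is None)

# array_length(pii_email_address) = 2
two_emails = matching_rows(lambda r: len(r["pii_email_address"] or []) == 2)

# array_length(array_filter(pii_email_address, x -> x.score >= 0.8)) > 0
confident_email = matching_rows(
    lambda r: len([x for x in (r["pii_email_address"] or []) if x["score"] >= 0.8]) > 0
)

print(phone_is_null, two_emails, confident_email)  # [0, 1] [0] [0, 1]
```

Index 0 stands for the first document and index 1 for the second, so the printed lists match the rows that the explanations above describe.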
Redaction
Enter a word or a regex pattern to automatically redact matching text:
- Use a plain word for simple redactions, like: password.
- Use a regex for advanced patterns, for example: \d{3}-\d{2}-\d{4} for social security numbers.
Select a character to be used for redaction.
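The difference between a plain word and a regex pattern can be illustrated with Python's re module. This is a sketch under assumptions, not the node's implementation: the function name redact and the is_regex flag are hypothetical, and the sketch assumes that every matched character is replaced by the redaction character.

```python
import re


def redact(text: str, pattern: str, redaction_char: str = "*", is_regex: bool = False) -> str:
    """Replace every match of a plain word or regex with the redaction character."""
    # A plain word is escaped so that characters such as '.' are matched literally.
    regex = re.compile(pattern if is_regex else re.escape(pattern))
    return regex.sub(lambda m: redaction_char * len(m.group()), text)


word_result = redact("my password is secret", "password")
ssn_result = redact("SSN 123-45-6789", r"\d{3}-\d{2}-\d{4}", "#", is_regex=True)
```

In this sketch, word_result is "my ******** is secret" and ssn_result hides all 11 characters of the social security number.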
Annotation filter
The Annotation filter node adds metadata annotations to documents to guide downstream extraction and improve how the content is processed by the language model.
Criteria list
The criteria list filter specifies a list of SQL (Structured Query Language) conditions, written without the WHERE keyword, that are evaluated during the document processing pipeline.
Click Add value to add a criteria list filter entry, and then type the SQL condition without the WHERE keyword. For example:
lang_name='fr'
Filter documents in French.
pii_email_address IS NOT NULL AND pii_ssn_details IS NOT NULL
Filter documents that contain both email and social security details.
lang_name - potential values
The - potential values parameter is a handy way to get a reference link or a list of values for an operator.
To remove a filter entry for Criteria list, select the checkbox next to each query value that you want to remove from the pipeline. To remove all your filter entries instead, select the Values checkbox. Then click the Delete icon.
Logical operator
Select either the And or Or conditional operator for processing your criteria list of SQL queries.
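One way to picture the effect of the logical operator is to join the criteria list entries into a single condition. The following Python sketch does that; the function name combine_criteria and the idea that the entries are joined into one parenthesized SQL expression are assumptions for illustration, because this section does not describe how the node combines them internally.

```python
def combine_criteria(criteria: list[str], operator: str = "And") -> str:
    """Join criteria list entries with the selected logical operator (hypothetical)."""
    if operator not in ("And", "Or"):
        raise ValueError("operator must be 'And' or 'Or'")
    joiner = f" {operator.upper()} "
    # Parentheses keep each entry intact even if it contains its own AND or OR.
    return joiner.join(f"({entry})" for entry in criteria)


condition = combine_criteria(["lang_name='fr'", "pii_email_address IS NOT NULL"], "Or")
```

Here condition is "(lang_name='fr') OR (pii_email_address IS NOT NULL)", which keeps a document if either entry matches. With And, both entries would have to match.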
Features to drop
The features to drop filter specifies a list of feature names that you want to omit from the document processing pipeline.
Click Add value to add a features to drop filter entry, and then type a feature name from the Available features table. For example:
- lang_score
- docq_contain_common_en_words
To remove a filter entry for Features to drop, select the checkbox next to each feature name that you no longer want the document processing to omit from the pipeline. To remove all your filter entries instead, select the Values checkbox. Then click the Delete icon.
Data class assignment
This operator assigns data classes to each document.
Terms and classifications
This operator assigns business terms and classifications to each document.
Classify documents
This operator classifies documents to assign them to the most appropriate document class. It uses predefined document classes to classify text in your documents and identify whether the data in your document matches a certain key-value pair format for correct text extraction into fields in an entity table. You can select which document classes to use in a flow. For more information, see Predefined document classes.
Next node in the flow
Keep in mind that all of the Quality category nodes are optional in your flow.
The guideline to sequence the nodes within the set of Quality category nodes is:
- If you have a Document quality node, a Language annotator node, or both, and you want to filter documents based on those features, follow with an Annotation filter node.
- If you have a PII and HAP annotator node with Redaction enabled, the node redacts or masks your data and returns a count for all the redacted PII. You can optionally follow it with an Annotation filter node to add another document processing task to your flow.
Whether you add any Quality nodes or none at all, the flow must continue with the Transform data node and then the Generate output node to complete the flow.
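The sequencing guideline can be expressed as a small check over an ordered list of node names. This Python sketch is purely illustrative: the product does not expose such a function, the name check_flow is hypothetical, and the rule that an Annotation filter node needs an annotator node somewhere before it is an interpretation of the guideline above.

```python
ANNOTATORS = {"Language annotator", "Document quality", "PII and HAP annotator"}
REQUIRED_TAIL = ["Transform data", "Generate output"]


def check_flow(nodes: list[str]) -> list[str]:
    """Return a list of guideline violations for an ordered list of node names."""
    problems = []
    if nodes[-2:] != REQUIRED_TAIL:
        problems.append("flow must end with Transform data, then Generate output")
    if "Annotation filter" in nodes:
        first_filter = nodes.index("Annotation filter")
        # Assumption: the filter needs features from an earlier annotator node.
        if not ANNOTATORS & set(nodes[:first_filter]):
            problems.append("Annotation filter has no annotator node before it")
    return problems
```

For example, check_flow(["Language annotator", "Annotation filter", "Transform data", "Generate output"]) returns an empty list, and so does a flow with no Quality nodes that consists only of Transform data followed by Generate output.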