Requesting an ANALYZE run
- Runtime configuration file
- This file is either built by the dialog by using the supplied parameters, or is specified via an in-stream DD statement in the BLSJDPA JCL.
- analysis_config.json file
- This file provides additional detail on what to include or exclude while looking for sensitive data in memory dump records. It is located in the /configuration/ directory in the file system. To change it, either specify Y for the EDIT CONFIG FILE option on the ANALYZE IPCS panel, or edit it directly by using an editor that you are familiar with.
The runtime configuration is a JSON file that the dialog builds from the parameters that are supplied on the panel. If you use JCL to submit the job, it can instead be supplied as a file or as an in-stream data set. A sample runtime configuration file is shown after the parameter descriptions. The runtime configuration file parameters (hand-coded in the BLSJDPA JCL) are:
- "input_dataset"
- Specifies the location of the input memory dump. You must specify a data set name as follows:
"//’<dump-dataset-name>’".
If using the ANALYZE IPCS panel, this parameter is populated by the DATA SET NAME field.
- "thread_count"
- Specifies the number of worker threads to be created. The total number of threads is one more than this value (there is one monitor thread). Valid values are 1-8. If omitted, the default value is 4 for the JCL interface. If using the ANALYZE IPCS panel, this parameter is populated by the NUMBER OF THREADS field.
- "record_count"
- Estimated number of records in the input memory dump. If set to 0 or omitted, the Data Privacy for Diagnostics Analyzer counts the actual number of records. This value is used when multiple threads are requested to ensure that the records are split evenly across the requested number of threads. This parameter is not available in the IPCS panel interface.
- "output_dataset_prefix" or "output_dataset"
- Specifies either the prefix to use for each thread's output memory dump data, or a list of data sets. Either data set names or DD names can be specified on this parameter. For example, "output_dataset_prefix":"//'SYS1.DUMP.D190926.T132348'" indicates the prefix that is used by each thread. The files are either dynamically allocated or can be pre-allocated as SYS1.DUMP.D190926.T132348.F1, SYS1.DUMP.D190926.T132348.F2, SYS1.DUMP.D190926.T132348.F3, and so on, one per thread. When using the BLSJDPA JCL, you can also specify DD names as prefixes. For example, you can specify "output_dataset_prefix":"//DD:ANLZO" and provide DD statements for //ANLZOF1, //ANLZOF2, and so on, for each thread's output. Alternatively, you can use the "output_dataset" method of supplying a list of data sets. For example, you can specify "output_dataset":["//'SYS1.DUMP.D190926.T132348.F1'","//'SYS1.DUMP.D190926.T132348.F2'"]. If you use the dialog to initiate the job, you can also specify a pattern. The pattern can contain a single "%" character, which causes the dialog to generate data set names with the thread number substituted in that position. For example, if you specify 'SYS1.DUMP.D190926.T132348.F%' as the data set name pattern on the panel, the dialog generates "output_dataset":["//'SYS1.DUMP.D190926.T132348.F1'","//'SYS1.DUMP.D190926.T132348.F2'"] as the parameter. If using the ANALYZE IPCS panel, this parameter is populated by the TEMP DATA SET/PAT field.
- "redaction_string"
- String to use for redaction of pages that are analyzed by using detailed analysis. When a redaction_string is not specified and detailed analysis is specified, sensitive data is replaced with X. When longer strings are detected in the pages, the redaction string is used in a repeated fashion. If shorter strings are found, only a portion of the redaction string might be used. If using the ANALYZE IPCS panel, this parameter is populated by the REDACTION STRING field.
- "analysis_mode"
- Specifies the analysis mode for detecting sensitive data. Valid values are 1 for Page-Level Redaction and 2 for Token-Level Redaction. In Page-Level Redaction, the entire page is marked as sensitive as soon as the first sensitive token is identified in the page. In Token-Level Redaction, each sensitive token is identified independently and overlaid with the redaction_string. Note that when 1 is specified, some pages can be analyzed by using Token-Level Redaction. If using the ANALYZE IPCS panel, this parameter is populated by the ALLOW PAGE LEVEL field, with Y being equivalent to 1 (Page-Level Redaction) and N being equivalent to 2 (Token-Level Redaction).
- "log_sensitive_tokens"
- Specifies whether a sensitive token log should be generated for each file. When set to TRUE, a sensitive token log for each thread is generated in the /reports folder. Valid values are TRUE and FALSE. If using the ANALYZE IPCS panel, this parameter is populated by the SENSITIVE REPORT field, with Y being equivalent to TRUE and N being equivalent to FALSE. NOTE: When you specify TRUE, an additional file is generated on each ANALYZE function request, so the Data Privacy for Diagnostics Analyzer home directory fills up more quickly.
- "dpfd_home"
- Specifies the Data Privacy for Diagnostics Analyzer home directory. If using the ANALYZE IPCS panel, this parameter is populated by the DPfD HOME DIR field.
- "character_set"
- Specifies the character set that should be used for decoding the input memory dump. The default value is "Cp1047". Valid values are Cp1047 and US-ASCII. This parameter is not available in the IPCS panel interface.
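The following is a minimal sketch of a runtime configuration file that requests two worker threads, page-level redaction, and a sensitive token log. The input data set name, the redaction string value, and the /u/dpfduser/dpfd home directory path are hypothetical examples, and the exact quoting of numeric and TRUE/FALSE values here is illustrative; follow the conventions used in the runtime configuration that the dialog generates or that is supplied with the BLSJDPA sample JCL:
{
  "input_dataset": "//'SYS1.DUMP.INPUT01'",
  "thread_count": 2,
  "record_count": 0,
  "output_dataset_prefix": "//'SYS1.DUMP.D190926.T132348'",
  "redaction_string": "REDACTED",
  "analysis_mode": 1,
  "log_sensitive_tokens": "TRUE",
  "dpfd_home": "/u/dpfduser/dpfd",
  "character_set": "Cp1047"
}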
The analysis_config.json file:
As your experience with this ANALYZE processing matures, this file will likely become stable
until major changes in data occur in your environment. You will likely start with the default file
that is supplied with the product. As your usage evolves, you can decide to exclude or include
certain built-in identifiers, use custom identifiers or add dependent identifiers.
This file is in JSON format and contains keyword-value pairs and arrays that are used to specify parameters to the ANALYZE function.
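The overall shape of the file, showing only the top-level keys that are described below with empty placeholder values, is sketched here. See Figure 1 later in this section for a complete sample; the "..." placeholder stands for the printable_characters default string listed below:
{
  "built_in_identifiers_include": [],
  "built_in_identifiers_exclude": [],
  "custom_identifiers": [],
  "dependent_identifiers": [],
  "built_in_ns_identifiers_include": [],
  "built_in_ns_identifiers_exclude": [],
  "custom_ns_identifiers": [],
  "printable_characters": "..."
}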
Parameter Descriptions:
- "built_in_identifiers_include"
- Specifies the built-in identifiers that should be used to detect sensitive tokens. Values that can be supplied here are listed in the table below. If nothing is specified, no identifiers are included. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1. For example:
"built_in_identifiers_include" : [ "Identifier1", "Identifier2", ... "IdentifierN" ],
- "built_in_identifiers_exclude"
- Specifies the built-in identifiers that should not be used to detect sensitive tokens. Values that can be supplied here are listed in the table below. If nothing is specified, no identifiers are excluded. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
- "custom_identifiers"
- Specifies any custom identifiers that should be used to detect sensitive data. Custom identifiers can be ingested by using the INGEST function and can be in the form of dictionaries or patterns. An array of multiple identifiers can be specified, with the attributes of a single custom identifier enclosed in { } as described below and the identifiers themselves separated by commas, as shown in Figure 1. Each identifier has the following fields:
- "format"
- Valid value is "custom". "custom" indicates that a previously ingested dictionary or set of patterns is specified.
- "inputfilename"
- File name containing the identifier data. If the format is "custom", the Data Privacy for Diagnostics Analyzer looks for the file in /knowledgebase/ingested/.
- "entitytype"
- Specifies a name for the identifier. This name is used when the identifier is part of a dependent identifier. If the entity name was provided when the file was ingested (for "custom" format), this field can be skipped. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
- "description"
- Specifies a description of the identifier. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
- "dependent_identifiers"
- Specifies a set of identifiers that are sensitive only when they all occur in a page. Here, you specify an array in which multiple dependent identifiers can be specified, with the attributes of a single dependency enclosed in { } as described below and the dependencies themselves separated by commas, as shown in Figure 1. When an identifier is made part of a dependent_identifier, it stops being an independent identifier. Note that all of the identifiers that are specified in a dependent_identifier should be present in built_in_identifiers_include or in custom_identifiers; otherwise, the dependent identifier is never detected. Each dependent identifier has the following fields:
- "name"
- Specifies the name that is assigned to this dependent identifier.
- "identifiers"
- Specifies the set of identifiers belonging to this dependent identifier. These identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
- "built_in_ns_identifiers_include"
- Specifies the built-in identifiers that should be used to detect that a token is nonsensitive. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
- "built_in_ns_identifiers_exclude"
- Specifies the built-in identifiers that should not be used to detect that a token is nonsensitive. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
- "custom_ns_identifiers"
- Specifies any custom identifiers that should be used to identify tokens as nonsensitive data. Custom identifiers can be ingested by using the INGEST function and can be in the form of dictionaries or patterns. An array of multiple identifiers can be specified, with the attributes of a single custom identifier enclosed in { } as described below and the identifiers themselves separated by commas, as shown in Figure 1. Each identifier has the following fields:
- "format"
- Valid value is "custom". "custom" indicates that a previously ingested dictionary or set of patterns is specified.
- "inputfilename"
- File name containing the identifier data. The Data Privacy for Diagnostics Analyzer looks for the file in /knowledgebase/ingested/.
- "entitytype"
- Specifies a name for the identifier. This name is used when the identifier is part of a dependent identifier. If the entity name was provided when the file was ingested (for "custom" format), then this field can be skipped. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
- "description"
- Specifies a description of the identifier. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
- "printable_characters"
- Specifies the set of printable characters that are used for analysis of memory dumps. When analyzing the memory dump, only these characters are used to construct the tokens that are parsed. The default value is "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890`~!@#$%^&*()-=_+ {}[];:'\",<.>/?\\|\n "
Any incorrect information in analysis_config.json is ignored. Identifiers that are specified in the built_in_identifiers_exclude list take precedence over identifiers that are specified in the built_in_identifiers_include list.
Figure 1. Sample analysis_config.json file:
{
"built_in_identifiers_include": ["Month", "FullName", "Credit Card Type", "Credit Card Number", "Email",
"Zipcode",
"Day"],
"built_in_identifiers_exclude": ["Year", "Zipcode", "Date Time"],
"custom_identifiers": [
{
"inputfilename" : "acctnum.bin",
"entitytype" : "Account Number",
"description" : "List of account numbers",
"format" : "custom"
},
{
"inputfilename" : "policynum.bin",
"entitytype" : "PolicyNumber",
"description" : "List of policy numbers",
"format" : "custom"
}
],
"dependent_identifiers": [
{
"name": "Full Person",
"identifiers": [ "FullName", "Zipcode", "Email" ]
},
{
"name": "Card",
"identifiers": [ "Credit Card Type", "Credit Card Number"]
}
],
"built_in_ns_identifiers_include": ["ModuleName"],
"built_in_ns_identifiers_exclude": [],
"custom_ns_identifiers": [
{
"inputfilename":"branch.bin",
"entitytype" : "Branch Name",
"description" : "List of branch locations",
"format" : "custom"
},
{
"inputfilename" : "zone.bin”,
"entitytype" : "Zone",
"description" : "List of zones",
"format" : "custom"
}
]
}
The built_in_identifiers that are supplied with the Data Privacy for Diagnostics Analyzer are listed in the following table. By their very nature, these identifiers are not necessarily all-inclusive of the particular topic, and they cannot be changed over time should there be changes to the actual data set of said topic:
| Identifier (not case-sensitive) | Description |
| --- | --- |
| age | String patterns that are related to age. Examples: "10 years old", "4 months old", "dob: 1-2-1999". Note: These string patterns are considered sensitive, but just the number 10 (in the first example) is not considered sensitive by itself. |
| continent | Dictionary containing the names of continents. |
| country | Dictionary containing the names of countries. |
| county | Dictionary containing the names of all the counties in the US. In the United States of America, an administrative or political subdivision of a state is a county. |
| credit card type | Credit card identification dictionary. Cards that are detected are VISA, Mastercard, AMEX, Diners Club, Discover, and JCB. |
| credit card number | Credit card pattern identification. Cards patterns that are detected are VISA, Mastercard, AMEX, Diners Club, Discover, and JCB. |
| date time | Date and Time pattern identification. |
| day | Dictionary containing the names of the days of the week. |
| dependent | Dictionary containing the names of types of dependents, such as daughter, son, etc. |
| email | Email address pattern. |
| eu nin | National Identification Number patterns for various EU countries. |
| FullName | A first name and last name pair dictionary, the combination of which is from popular names in the US census. Note: This identifier detects a 2-word combination of first name and last name that is separated by a delimiter of a comma and/or spaces, in either order. This identifier will NOT detect a name that contains any middle name/initial, nor hyphenated names, nor names that contain apostrophes. |
| gender | Dictionary containing the genders Male and Female. |
| iban | International Bank Account Number (IBAN) pattern |
| icdv9 | International Classification of Diseases 9th Revision (ICDv9) identification dictionary. |
| icdv10 | International Classification of Diseases 10th Revision (ICDv10) identification dictionary. |
| imei | International Mobile Equipment Identity (IMEI) identification dictionary. Note: This identifier only detects 15-digit IMEI tokens. |
| imsi | International Mobile Subscriber Identity (IMSI) Identification dictionary. |
| in aadhaar card number | Aadhaar identification number for residents or passport holders of India. |
| in PAN card number | Permanent Account Number pattern issued by the Indian Income Tax Department. |
| international phone number | International phone number identification pattern. |
| ip address | IP address identification pattern. Supports both IPv4 and IPv6 addresses |
| latitude longitude | Latitude/longitude identification pattern. Supports GPS and DMS coordinate formats, for example: 12:30'23.256547S 12:30'23.256547E, N90.00.00 E180.00.00. |
| mac address | MAC Address Identification pattern. |
| marital status | Marital status identifier dictionary. |
| medical name | Medical Name identification pattern. Example: John Doe MD |
| medical record number | Medical Record Number identification pattern, for example, MRN: CLM-00000056055, Medical Record Number: 1234asds |
| month | Month Name identification dictionary. |
| occupation | Occupation identification dictionary. |
| PO box | Identifies post office box numbers, for example: P.O. BOX 334, POBOX 14321412. |
| raceOrEthnicity | Dictionary identification of ethnic groups |
| religion | Dictionary containing major religions. |
| street types | Street Type identification, for example, tokens containing "st." |
| uk nin | National Insurance Number pattern. |
| us address | Identifies US-centric address patterns like “800 Theatre Court Garden City, NY 11530”. This just checks the format but does not validate city and state/zip in the address. |
| us phone number | US-specific phone/fax/pager identifier pattern. |
| us ssn | US Social Security Number pattern |
| us states | US State Name identification |
| vehicle identification number | Vehicle identification number identification dictionary. Supports world manufacturer identification dictionary. |
| year | Year of birth identification. Any number between 0 and the current year. |
| zipcode | Valid US zip code identifier dictionary. |
The following identifiers are no longer valid as of OA61591: Animal, ATC, Hospital Name, Phone Number, US SWIFT Code.
The list of built_in_ns_identifiers that are supplied with the Data Privacy for Diagnostics Analyzer is:
The default analysis_config.json file includes the lists of included identifiers and excluded identifiers.
Identifier checking is done by using the following order:
- User feedback indicates that a token is sensitive.
- User feedback indicates that a token is nonsensitive.
- Module name check
- Nonsensitive checks (built-in + custom)
- Sensitive checks (built-in + custom)
If an identifier is added to both the _include list and the _exclude list for either sensitive or nonsensitive checks, it is treated as excluded.
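For example, in the sample analysis_config.json file in Figure 1, "Zipcode" appears in both built_in_identifiers_include and built_in_identifiers_exclude, so it is treated as excluded. The following minimal fragment (not a complete configuration file) shows the same situation:
{
  "built_in_identifiers_include": ["Month", "Zipcode"],
  "built_in_identifiers_exclude": ["Zipcode"]
}
In this fragment, Month is checked as a sensitive identifier and Zipcode is not.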
