
Requesting an ANALYZE run

ANALYZE is the function that looks for sensitive data in memory dump records and flags those records for redaction, or overlays that sensitive data with a redaction string if token-level redaction is requested. Regardless of how the job is initiated (through IPCS option 5.6 or through the BLSJDPA JCL in SYS1.SAMPLIB), two important configuration files are used:
  • Runtime configuration file
    • This file is either built by the dialog by using the supplied parameters, or is specified via an in-stream DD statement in the BLSJDPA JCL.
  • analysis_config.json file
    • This file provides additional detail on what to include or exclude while looking for sensitive data in memory dump records. It is located in the /configuration/ directory in the file system. To change it, you can either use the EDIT CONFIG FILE option Y in the ANALYZE IPCS panel, or edit the file directly with an editor that you are familiar with.

The runtime configuration is a JSON file that is built by the dialog from the parameters that are supplied on the panel, or that can be supplied as a file or an in-stream data set if you use JCL to submit the job. The runtime configuration file parameters (hand-coded in the BLSJDPA JCL) are described below; a sample runtime configuration follows the parameter descriptions.

"input_dataset"
Specifies the location of the input memory dump. You must specify a data set name as follows: "//'<dump-dataset-name>'".

If using the ANALYZE IPCS panel, this parameter is populated by the DATA SET NAME field.

"thread_count"
This specifies the number of worker threads to be created. The total number of threads is one more than this (there is one monitor thread). If omitted, the default value is 4 for the JCL interface. Valid values are 1-8. If using the ANALYZE IPCS panel, this parameter is populated by the NUMBER OF THREADS field.
"record_count"
Estimated number of records in the input memory dump. If set to 0 or omitted, the Data Privacy for Diagnostics Analyzer counts the actual number of records. This count is used when multiple threads are requested to ensure that the records are split evenly across the requested number of threads. This parameter is not available in the IPCS panel interface.
"output_dataset_prefix" or "output_dataset"
Specifies either the prefix to use for each thread's output memory dump data, or a list of data sets. Either data set names or DD names can be specified on this parameter. For example, "output_dataset_prefix":"//'SYS1.DUMP.D190926.T132348'" indicates the prefix that is used by each thread. Files are either dynamically allocated or can be pre-allocated as SYS1.DUMP.D190926.T132348.F1, SYS1.DUMP.D190926.T132348.F2, SYS1.DUMP.D190926.T132348.F3, and so on, one per thread. When using the BLSJDPA JCL, you can also specify DD names as prefixes. For example, you can specify "output_dataset_prefix":"//DD:ANLZO" and provide DD statements for //ANLZOF1, //ANLZOF2, and so on, for each thread's output. Alternatively, you can use the "output_dataset" method of supplying a list of data sets. For example, you can specify "output_dataset":["//'SYS1.DUMP.D190926.T132348.F1'","//'SYS1.DUMP.D190926.T132348.F2'"]. If you are using the dialog to initiate the job, you can also specify a pattern to be used. In this case, the pattern can contain a single "%" character, which causes the dialog to generate data set names with the thread number substituted at that position. For example, if you specify 'SYS1.DUMP.D190926.T132348.F%' as the data set name pattern on the panel, the dialog generates "output_dataset":["//'SYS1.DUMP.D190926.T132348.F1'","//'SYS1.DUMP.D190926.T132348.F2'"] as the parameter.

If using the ANALYZE IPCS panel, this parameter is populated by the TEMP DATA SET/PAT field.

"redaction_string"
The string to use for redaction on pages that are analyzed by using detailed analysis. When a redaction_string is not specified and detailed analysis is specified, sensitive data is replaced with X. When longer sensitive strings are detected in the pages, the redaction string is repeated to cover them. If shorter strings are found, only a portion of the redaction string might be used. If using the ANALYZE IPCS panel, this parameter is populated by the REDACTION STRING field.
"analysis_mode"
Specifies the analysis mode for detecting sensitive data. Valid values are 1 for Page-Level Redaction and 2 for Token-Level Redaction. In Page-Level Redaction, the entire page is marked as sensitive as soon as the first sensitive token is identified in the page. In Token-Level Redaction, each sensitive token is identified independently and overlaid with the redaction_string. Note that when 1 is specified, some pages can still be analyzed by using Token-Level Redaction. If using the ANALYZE IPCS panel, this parameter is populated by the ALLOW PAGE LEVEL field, with Y being equivalent to 1 (Page-Level Redaction) and N being equivalent to 2 (Token-Level Redaction).
"log_sensitive_tokens"
Specifies whether a sensitive token log for each file should be generated. When set to TRUE, a sensitive token log for each thread is generated in the /reports folder. Valid values are TRUE and FALSE. If using the ANALYZE IPCS panel, this parameter is populated by the SENSITIVE REPORT field, with Y being equivalent to TRUE and N being equivalent to FALSE. Note: When TRUE is specified, an additional file is generated on each ANALYZE function request, so the Data Privacy for Diagnostics Analyzer home directory fills up more quickly.
"dpfd_home"
Specifies the Data Privacy for Diagnostics Analyzer home directory. If using the ANALYZE IPCS panel, this parameter is populated by the DPfD HOME DIR field.
"character_set"
Specifies the character set that is used for decoding the input memory dump. The default value is "Cp1047". Valid values are Cp1047 and US-ASCII. This parameter is not available in the IPCS panel interface.
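
The following is a minimal sketch of a hand-coded runtime configuration of the kind that could be supplied as an in-stream data set in the BLSJDPA JCL, combining the parameters described above. The input placeholder, redaction string, and home directory are illustrative only, and the exact coding of numeric and TRUE/FALSE values should be verified against the sample that is supplied in SYS1.SAMPLIB:

{
        "input_dataset": "//'<dump-dataset-name>'",
        "thread_count": 2,
        "record_count": 0,
        "output_dataset_prefix": "//'SYS1.DUMP.D190926.T132348'",
        "redaction_string": "REDACTED",
        "analysis_mode": 2,
        "log_sensitive_tokens": "TRUE",
        "dpfd_home": "/u/dpfd/home",
        "character_set": "Cp1047"
}

The "output_dataset" list form or a DD name prefix, as described under "output_dataset_prefix" or "output_dataset", could be coded in place of the prefix shown here.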

The analysis_config.json file:

As your experience with ANALYZE processing matures, this file will likely become stable until major changes in data occur in your environment. You will likely start with the default file that is supplied with the product. As your usage evolves, you can decide to exclude or include certain built-in identifiers, use custom identifiers, or add dependent identifiers. This file is in JSON format, containing keyword-value pairs and arrays that specify parameters to the ANALYZE function.

Parameter Descriptions:

"built_in_identifiers_include"
This specifies the built-in identifiers that should be used to detect sensitive tokens. Values that can be supplied here are listed in the table below. If nothing is specified, no identifiers are included. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier:
"built_in_identifiers_include" : [
        "Identifier1",
        "Identifier2",
        ...
        "IdentifierN"
],
See Figure 1.
"built_in_identifiers_exclude"
This specifies the built-in identifiers that should not be used to detect sensitive tokens. Values that can be supplied here are listed in the table below. If nothing is specified, no identifiers are excluded. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
"custom_identifiers"
This specifies any custom identifiers that should be used to detect sensitive data. Custom identifiers can be ingested by using the INGEST function and can be in the form of dictionaries or patterns. An array of multiple identifiers can be specified, with the attributes of each custom identifier enclosed in { } as described below and the identifiers themselves separated by commas, as shown in Figure 1. Each identifier has the following fields:
"format"
Valid value is "custom". "custom" indicates that a previously ingested dictionary or set of patterns is specified.
"inputfilename"
File name containing the identifier data. If the format is "custom", the Data Privacy for Diagnostics Analyzer looks for the file in /knowledgebase/ingested/.
"entitytype"
Specifies a name for the identifier. This is used when it is part of a dependent identifier. If the entity name was provided when the file was ingested (for "custom" format), this field can be skipped. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
"description"
Specifies a description of the identifier. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
"dependent_identifiers"
This specifies sets of identifiers that are sensitive only when they all occur in a page. Here, you specify an array in which multiple dependent identifiers can be specified, with the attributes of each dependency enclosed in { } as described below and the dependencies themselves separated by commas, as shown in Figure 1. When an identifier is made part of a dependent identifier, it stops being an independent identifier. Note that all of the identifiers that are specified in a dependent identifier must be present in built_in_identifiers_include or in custom_identifiers; otherwise, the dependent identifier is never detected. Each dependent identifier has the following fields:
"name"
Specifies the name that is assigned to this dependent identifier.
"identifiers"
Specifies the set of identifiers that belong to this dependent identifier. These identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
"built_in_ns_identifiers_include"
This specifies the built-in identifiers that should be used to detect that a token is nonsensitive. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
"built_in_ns_identifiers_exclude"
This specifies the built-in identifiers that should not be used to detect that a token is nonsensitive. These built-in identifiers are specified in a comma-separated list with quotation marks around each identifier, as shown in Figure 1.
"custom_ns_identifiers"
This specifies any custom identifiers that should be used to identify tokens as nonsensitive data. Custom identifiers can be ingested by using the INGEST function and can be in the form of dictionaries or patterns. An array of multiple identifiers can be specified, with the attributes of each custom identifier enclosed in { } as described below and the identifiers themselves separated by commas, as shown in Figure 1. Each identifier has the following fields:
"format"
Valid value is "custom". "custom" indicates that a previously ingested dictionary or set of patterns is specified.
"inputfilename"
File name containing the identifier data. The Data Privacy for Diagnostics Analyzer looks for the file in /knowledgebase/ingested/.
"entitytype"
Specifies a name for the identifier. This is used when it is part of a dependent identifier. If the entity name was provided when the file was ingested (for "custom" format), this field can be skipped. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
"description"
Specifies a description of the identifier. If values are provided both during ingestion and in analysis_config.json, the value that is provided in analysis_config takes precedence. If no value is provided either during ingestion or in analysis_config, a custom value is chosen.
"printable_characters"
This field specifies the set of printable characters that is used for analysis of memory dumps. When the memory dump is analyzed, only these characters are used to construct the tokens that are parsed. The default value is "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890`~!@#$%^&*()-=_+ {}[];:'\",<.>/?\\|\n ". An illustrative override is shown after Figure 1.

Any incorrect information in analysis_config.json is ignored. Identifiers that are specified in the built_in_identifiers_exclude list take precedence over identifiers that are specified in the built_in_identifiers_include list.

Figure 1. analysis_config.json example
{
        "built_in_identifiers_include": ["Month", "FullName", "Credit Card Type", "Credit Card Number", "Email", "Zipcode", "Day"],
        "built_in_identifiers_exclude": ["Year", "Zipcode", "Date Time"],
        "custom_identifiers": [
                {
                        "inputfilename" : "acctnum.bin",
                        "entitytype" : "Account Number",
                        "description" : "List of account numbers",
                        "format" : "custom"
                },
                {
                        "inputfilename" : "policynum.bin",
                        "entitytype" : "PolicyNumber",
                        "description" : "List of policy numbers",
                        "format" : "custom"
                }
        ],
        "dependent_identifiers": [
                {
                        "name": "Full Person",
                        "identifiers": [ "FullName", "Zipcode", "Email" ]
                },
                {
                        "name": "Card",
                        "identifiers": [ "Credit Card Type", "Credit Card Number" ]
                }
        ],
        "built_in_ns_identifiers_include": ["ModuleName"],
        "built_in_ns_identifiers_exclude": [],
        "custom_ns_identifiers": [
                {
                        "inputfilename" : "branch.bin",
                        "entitytype" : "Branch Name",
                        "description" : "List of branch locations",
                        "format" : "custom"
                },
                {
                        "inputfilename" : "zone.bin",
                        "entitytype" : "Zone",
                        "description" : "List of zones",
                        "format" : "custom"
                }
        ]
}
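
Figure 1 does not show the printable_characters field. As a sketch only (the value below is an illustrative restricted character set, not the product default shown earlier), it could be coded as another top-level keyword in analysis_config.json:

{
        "printable_characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890.,:;- "
}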

The built_in_identifiers that are supplied with the Data Privacy for Diagnostics Analyzer are listed below. The identifier names are not case-sensitive. By their very nature, these identifiers are not necessarily all-inclusive of the particular topic, and they cannot be changed over time should there be changes to the actual data set for that topic.

age: String patterns that are related to age. Examples: "10 years old", "4 months old", "dob: 1-2-1999". Note: These string patterns are considered sensitive, but the number 10 by itself (in the first example) is not considered sensitive.
continent: Dictionary containing the names of continents.
country: Dictionary containing the names of countries.
county: Dictionary containing the names of all the counties in the US. In the United States of America, a county is an administrative or political subdivision of a state.
credit card type: Credit card identification dictionary. Cards that are detected are VISA, Mastercard, AMEX, Diners Club, Discover, and JCB.
credit card number: Credit card pattern identification. Card patterns that are detected are VISA, Mastercard, AMEX, Diners Club, Discover, and JCB.
date time: Date and time pattern identification.
day: Dictionary containing the names of the days of the week.
dependent: Dictionary containing the names of types of dependents, such as daughter, son, and so on.
email: Email address pattern.
eu nin: National Identification Number patterns for various EU countries.
FullName: A first name and last name pair dictionary, the combinations of which come from popular names in the US census. Note: This identifier detects a 2-word combination of first name and last name, in either order, separated by a comma and/or spaces. It does NOT detect a name that contains a middle name or initial, a hyphenated name, or a name that contains an apostrophe.
gender: Dictionary containing the genders Male and Female.
iban: International Bank Account Number (IBAN) pattern.
icdv9: International Classification of Diseases 9th Revision (ICDv9) identification dictionary.
icdv10: International Classification of Diseases 10th Revision (ICDv10) identification dictionary.
imei: International Mobile Equipment Identity (IMEI) identification dictionary. Note: This identifier only detects 15-digit IMEI tokens.
imsi: International Mobile Subscriber Identity (IMSI) identification dictionary.
in aadhaar card number: Aadhaar identification number for residents or passport holders of India.
in PAN card number: Permanent Account Number pattern issued by the Indian Income Tax Department.
international phone number: International phone number identification pattern.
ip address: IP address identification pattern. Supports both IPv4 and IPv6 addresses.
latitude longitude: Latitude/longitude identification pattern. Supports GPS and DMS coordinate formats, for example, 12:30'23.256547S 12:30'23.256547E, N90.00.00 E180.00.00.
mac address: MAC address identification pattern.
marital status: Marital status identifier dictionary.
medical name: Medical name identification pattern. Example: John Doe MD.
medical record number: Medical record number identification pattern, for example, MRN: CLM-00000056055, Medical Record Number: 1234asds.
month: Month name identification dictionary.
occupation: Occupation identification dictionary.
PO box: Identifies post office box numbers, for example, P.O. BOX 334, POBOX 14321412.
raceOrEthnicity: Dictionary identification of ethnic groups.
religion: Dictionary containing major religions.
street types: Street type identification, for example, tokens containing "st.".
uk nin: National Insurance Number pattern.
us address: Identifies US-centric address patterns like "800 Theatre Court Garden City, NY 11530". This checks only the format; it does not validate the city and state/zip in the address.
us phone number: US-specific phone/fax/pager identifier pattern.
us ssn: US Social Security Number pattern.
us states: US state name identification.
vehicle identification number: Vehicle identification number identification dictionary. Supports a world manufacturer identification dictionary.
year: Year of birth identification. Any number between 0 and the current year.
zipcode: Valid US zip code identifier dictionary.

Note: The following identifiers are no longer valid as of OA61591: Animal, ATC, Hospital Name, Phone Number, US SWIFT Code.

The list of built_in_ns_identifiers that are supplied with the Data Privacy for Diagnostics Analyzer is:

The default analysis_config.json file includes lists of included and excluded identifiers.

Identifier checking is done in the following order:

  1. User feedback indicates that a token is sensitive.
  2. User feedback indicates that a token is nonsensitive.
  3. Module name check.
  4. Nonsensitive checks (built-in + custom).
  5. Sensitive checks (built-in + custom).
Note: If an identifier is added to both the _include list and the _exclude list for either the sensitive or the nonsensitive checks, it is treated as excluded.
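
For example (a sketch based on the example in Figure 1), if "Zipcode" appears in both lists for the sensitive checks, the exclusion wins and the built-in Zipcode identifier is not used to detect sensitive tokens:

{
        "built_in_identifiers_include": ["Month", "FullName", "Zipcode"],
        "built_in_identifiers_exclude": ["Year", "Zipcode", "Date Time"]
}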