Now you will run the Extraction application and validate the results. For more information on out-of-the-box support for various log types, review the "A little preparation before getting started - The known log types" and "The unknown log types" sections in Part 1: Speeding up machine data analysis of this series.
Perform the following steps.
- Run the Extraction application using the parameters shown in
Figure 2. The Source directory is /GOMDADemo/input_batches, and
the output path is /GOMDADemo/output/extract_out.
Figure 2. Run the Extraction application on email data with generic logtype
Note: The Source directory should always point to the directory containing the batches, even if there is only a single batch under it. This allows the application to work on one or more batches at a time.
- Browse the contents of the output path. Follow the steps shown in
Figure 3 to view the CSV result as a sheet. Then, save the
workbook as email_generic.
Figure 3. View output of Extraction as sheet
Note: The output directory contains one directory per batch, named after its batchID. Each of these holds a CSV file, also named after the batchID, containing the top 2000 extraction results for that batch, which is the default setting.
You can change the setting in extract.config to extract all or none of the results. The default configuration is installed at /accelerators/MDA/extract_config/extract.config, but you can make your own copies and save them in other preferred locations.
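Given the output layout described in the note above (one directory per batchID, each holding a CSV named after the batchID), a quick way to confirm that Extraction produced results is to walk the output path and peek at each batch's CSV. This is a minimal sketch; the function name is hypothetical and only the directory layout comes from this tutorial.

```python
from pathlib import Path
import csv

def list_batch_results(output_dir):
    """Return {batchID: (header, row_count)} for each per-batch CSV.

    Assumes the layout described in the tutorial: output_dir contains
    one directory per batchID, each holding a CSV named after the
    batchID.
    """
    results = {}
    for batch_dir in sorted(Path(output_dir).iterdir()):
        if not batch_dir.is_dir():
            continue
        csv_path = batch_dir / (batch_dir.name + ".csv")
        if not csv_path.exists():
            continue
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])      # first row is the column headers
            rows = sum(1 for _ in reader)  # remaining rows are results
        results[batch_dir.name] = (header, rows)
    return results
```

Pointing this at /GOMDADemo/output/extract_out, for example, would list each batch alongside its column headers and result count, which is a fast sanity check before opening the sheet.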
- Validate the results of extraction: check that the primary
timestamp and record boundaries are identified correctly and that
the timestamp normalization is correct. All of these are driven
by the values provided in metadata.json.
Figure 4 shows the columns in the sheet resulting from batch_inbox.csv.
Figure 4. Validate output of Extraction
Note that the charset column is extracted by the generic name-value pair rule. Since there can be numerous name-value pairs in a record, only the first value is exported in the CSV file. In the next article in this series, you will learn how to visualize all of the extracted fields in a search user interface.
You will see similar output for leaf tag pairs when the data contains XML content. Only the first pair is exported in the CSV file, but you can use the search interface to look at all of the results.
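To make the "first pair only" behavior concrete, the sketch below extracts the first name=value pair from a record, the one that would land in the CSV column, while the search interface shows all pairs. The regular expression is illustrative, not the accelerator's actual rule.

```python
import re

# Illustrative pattern for name=value pairs, e.g. "charset=us-ascii".
NVP = re.compile(r"(\w+)=([^;,\s]+)")

def first_pair(record):
    """Return the first name=value pair found in a record, or None.

    Mirrors the CSV export behavior described in the tutorial: only
    the first of potentially many pairs is exported.
    """
    m = NVP.search(record)
    return (m.group(1), m.group(2)) if m else None
```

For a record containing `charset=us-ascii; format=flowed`, only `("charset", "us-ascii")` would appear in the CSV, even though both pairs are extracted.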
- If there are issues in validation, fix them now! When results are
not produced as expected, it is worthwhile to double-check the key
information that drives those results.
To fix incorrectly identified record boundaries or incorrect values in LogDateTime, double-check the primary timestamp format, represented as dateTimeFormat in metadata.json. Also check the regular expression preceding the timestamp, where applicable, represented as preTimestampRegex in metadata.json.
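Before re-running Extraction, you can sanity-check a candidate prefix regex and timestamp format against a raw log line. The sketch below approximates the record-boundary test: an optional regex preceding the timestamp, then the timestamp itself. Note that dateTimeFormat in metadata.json uses its own pattern syntax; this illustration substitutes Python strptime directives, and the function name is hypothetical.

```python
import re
from datetime import datetime

def check_record_start(line, pre_timestamp_regex, strptime_format):
    """Return the parsed timestamp if `line` begins a record, else None.

    Approximates the boundary check: the prefix regex must match at the
    start of the line, followed by a timestamp in the given format.
    (metadata.json's dateTimeFormat uses a different pattern syntax;
    strptime directives stand in for it here.)
    """
    m = re.match(pre_timestamp_regex, line)
    if not m:
        return None
    rest = line[m.end():]
    # Find the longest leading substring that parses as the timestamp.
    for end in range(len(rest), 0, -1):
        try:
            return datetime.strptime(rest[:end].strip(), strptime_format)
        except ValueError:
            continue
    return None
```

If this returns None for lines that should start records, the prefix regex or the timestamp format is the likely culprit.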
To fix incorrect values in LogDateTimeNormalized, in addition to the above, double-check the defaults for information missing from the primary timestamp, represented as missingDateTimeDefaults in metadata.json, where applicable.
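The role of missingDateTimeDefaults can be illustrated with a timestamp format that omits the year (as syslog-style timestamps do): normalization must fill the missing field from a supplied default. This is a sketch of the idea under that assumption; the function and the shape of the defaults dict are hypothetical.

```python
from datetime import datetime

def normalize_timestamp(raw, strptime_format, defaults):
    """Parse `raw`, filling fields the format omits from `defaults`.

    Illustrates the idea behind missingDateTimeDefaults: `defaults`
    is a hypothetical dict such as {"year": 2013}.
    """
    parsed = datetime.strptime(raw, strptime_format)
    # strptime substitutes 1900 for a missing year; replace it with
    # the supplied default so normalized values compare correctly.
    if "%Y" not in strptime_format and "year" in defaults:
        parsed = parsed.replace(year=defaults["year"])
    return parsed
```

Without the default, every record from a year-less log would normalize into the same (wrong) year, which is exactly the kind of LogDateTimeNormalized error this step catches.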
If the expected fields do not appear in the headers of the CSV file, double-check the selected log type, represented as logType in metadata.json.
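The checks above all come down to a handful of values in metadata.json, so a small script can flag the common mistakes in one pass. The key names (logType, dateTimeFormat) come from this tutorial; the function name and report format are illustrative.

```python
import json

def check_metadata(path, expected_log_type):
    """Read metadata.json and list fields worth double-checking when
    extraction results are wrong. Returns a list of issue strings
    (empty means nothing obvious to fix)."""
    with open(path) as f:
        meta = json.load(f)
    issues = []
    if meta.get("logType") != expected_log_type:
        issues.append("logType is %r, expected %r"
                      % (meta.get("logType"), expected_log_type))
    if "dateTimeFormat" not in meta:
        issues.append("missing dateTimeFormat")
    return issues
```

Running this against your metadata.json before re-running Extraction is cheaper than another full extraction pass over the batches.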
Notice that some interesting fields, such as To and From in the emails, were not extracted. This information is critical for analyzing emails.
Next, you will customize the Extraction application so these interesting fields for the email data can be extracted.