Data Audit Node Settings Tab
The Settings tab enables you to specify basic parameters for the audit.
Default. You can simply attach the node to your stream and click Run to generate an audit report for all fields based on default settings, as follows:
- If there are no Type node settings, all fields are included in the report.
- If there are Type settings (regardless of whether they are instantiated), all Input, Target, and Both fields are included in the display. If there is a single Target field, it is used as the Overlay field. If more than one Target field is specified, no default overlay is used.
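The default rules above can be sketched as a small Python function. This is an illustrative model only, not the product's implementation: the function name, the `roles` dictionary, and the `AuditDefaults` type are hypothetical, and returning no overlay when there are no Type settings is an assumption.

```python
from typing import Dict, List, NamedTuple, Optional


class AuditDefaults(NamedTuple):
    """Hypothetical container for the fields and overlay chosen by default."""
    fields: List[str]
    overlay: Optional[str]


def default_audit_fields(field_names: List[str],
                         roles: Optional[Dict[str, str]] = None) -> AuditDefaults:
    """Model the default selection rules described for the Settings tab.

    roles maps a field name to its role ("Input", "Target", "Both", ...).
    Pass None (or an empty dict) to model a stream with no Type settings.
    """
    if not roles:
        # No Type settings: every field is reported.
        # Assumption: no overlay is chosen in this case.
        return AuditDefaults(list(field_names), None)

    # Type settings present: keep only Input, Target, and Both fields.
    included = [f for f in field_names
                if roles.get(f) in {"Input", "Target", "Both"}]

    # A single Target field becomes the overlay; zero or several means none.
    targets = [f for f in field_names if roles.get(f) == "Target"]
    overlay = targets[0] if len(targets) == 1 else None
    return AuditDefaults(included, overlay)
```

For example, with roles of Input for age, None for bp, and Target for drug, the sketch reports age and drug and uses drug as the overlay.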
Use custom fields. Select this option to manually select fields. Use the field chooser button on the right to select fields individually or by type.
Overlay field. The overlay field is used in drawing the thumbnail graphs shown in the audit report. In the case of a continuous (numeric range) field, bivariate statistics (covariance and correlation) are also calculated. If a single Target field is present based on Type node settings, it is used as the default overlay field, as described above. Alternatively, you can select Use custom fields to specify an overlay.
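To make the two bivariate statistics concrete, the following Python sketch computes them for a pair of numeric fields. The function name is hypothetical, and the use of the sample covariance (dividing by n - 1) together with the Pearson correlation is an assumption; the text above does not state which estimators the node uses.

```python
import math
from typing import Sequence, Tuple


def covariance_and_correlation(xs: Sequence[float],
                               ys: Sequence[float]) -> Tuple[float, float]:
    """Return (sample covariance, Pearson correlation) for two numeric fields."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)

    covariance = sxy / (n - 1)  # assumption: sample (n - 1) estimator
    denom = math.sqrt(sxx * syy)
    # Correlation is undefined when either field is constant.
    correlation = sxy / denom if denom > 0 else float("nan")
    return covariance, correlation
```

Covariance depends on the units of both fields, while correlation is scaled to the range -1 to 1, which is why the two are typically read together.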
Display. Enables you to specify whether graphs are available in the output, and to choose the statistics displayed by default.
- Graphs. Displays a graph for each selected field: a distribution (bar) graph, histogram, or scatterplot, as appropriate for the data. Graphs are displayed as thumbnails in the initial report, but full-sized graphs and graph nodes can also be generated. See the topic Data Audit Output Browser for more information.
- Basic/Advanced statistics. Specifies the level of statistics displayed in the output by default. While this setting determines the initial display, all statistics are available in the output regardless of this setting. See the topic Display Statistics for more information.
Median and mode. Calculates the median and mode for all fields in the report. With large datasets, these statistics may increase processing time, since they take longer than others to compute. For the median only, the reported value may be based on a sample of 2000 records rather than the full dataset. This sampling is done on a per-field basis, and only where memory limits would otherwise be exceeded. When sampling is in effect, the result is labeled as such in the output (Sample Median rather than just Median). All statistics other than the median are always computed using the full dataset.
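The labeling behavior described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the node's actual algorithm: the function name, the `memory_limit_records` threshold (standing in for the unspecified memory limit), and the use of simple random sampling are all hypothetical. Only the sample size of 2000 and the two labels come from the text.

```python
import random
import statistics
from typing import List, Optional, Tuple

SAMPLE_SIZE = 2000  # sample size stated in the documentation


def audit_median(values: List[float],
                 memory_limit_records: int,
                 seed: Optional[int] = None) -> Tuple[str, float]:
    """Return (label, median) for one field.

    memory_limit_records is a hypothetical stand-in for the memory limit:
    fields with more values than this are summarized from a sample.
    """
    if len(values) <= memory_limit_records:
        # Full data fits: report the exact median.
        return "Median", statistics.median(values)

    # Assumption: a simple random sample without replacement.
    rng = random.Random(seed)
    sample = rng.sample(values, min(SAMPLE_SIZE, len(values)))
    return "Sample Median", statistics.median(sample)
```

Because the decision is made per field, one report can contain both Median and Sample Median entries.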
Empty or typeless fields. When used with instantiated data, typeless fields are not included in the audit report. To include typeless fields (including empty fields), select Clear All Values in any upstream Type nodes. This ensures that the data are not instantiated, so that all fields are included in the report. This may be useful, for example, if you want to obtain a complete list of all fields or generate a Filter node that will exclude those that are empty. See the topic Filtering Fields with Missing Data for more information.
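As a rough illustration of what excluding empty fields means, the Python sketch below lists the fields that hold no data in any record. The function name and the record layout (a list of dictionaries) are hypothetical, and treating None and whitespace-only strings as empty is an assumption rather than the product's definition.

```python
from typing import Any, Dict, List


def empty_fields(records: List[Dict[str, Any]],
                 field_names: List[str]) -> List[str]:
    """Return the fields whose value is blank in every record."""

    def is_blank(value: Any) -> bool:
        # Assumption: None and whitespace-only strings count as empty.
        return value is None or (isinstance(value, str) and value.strip() == "")

    return [name for name in field_names
            if all(is_blank(record.get(name)) for record in records)]
```

The returned names correspond to the fields a generated Filter node would drop in this simplified model.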