Sample node options

You can choose the Simple or Complex method as appropriate for your requirements.

Simple sampling options

The Simple method allows you to select a random percentage of records, select contiguous records, or select every nth record.

Mode. Select whether to pass (include) or discard (exclude) the sampled records:

  • Include sample. Includes selected records in the data stream and discards all others. For example, if you set the mode to Include sample and set the 1-in-n option to 5, then every fifth record will be included, yielding a dataset that is roughly one-fifth the original size. This is the default mode when sampling data, and the only mode available with the Complex method.
  • Discard sample. Excludes selected records and includes all others. For example, if you set the mode to Discard sample and set the 1-in-n option to 5, then every fifth record will be discarded. This mode is available only with the Simple method.
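The two modes can be illustrated with a minimal Python sketch of 1-in-n selection. This is not how the node is implemented; the function name one_in_n and the assumption that "every nth record" means positions n, 2n, 3n, and so on are illustrative only.

```python
def one_in_n(records, n, mode="include"):
    """Keep every nth record (include) or everything except every nth record (discard)."""
    if mode not in ("include", "discard"):
        raise ValueError("mode must be 'include' or 'discard'")
    kept = []
    for position, record in enumerate(records, start=1):
        is_nth = position % n == 0  # assumption: counting starts at the first record
        if is_nth == (mode == "include"):
            kept.append(record)
    return kept


records = list(range(1, 21))                 # 20 records numbered 1..20
included = one_in_n(records, 5, "include")   # records 5, 10, 15, 20
discarded = one_in_n(records, 5, "discard")  # the other 16 records
```

Together the two results partition the input: any record passed by Include sample is exactly a record dropped by Discard sample with the same settings.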

Sample. Select the method of sampling from the following options:

  • First. Select to use contiguous data sampling. For example, if the maximum sample size is set to 10000, then the first 10,000 records will be selected.
  • 1-in-n. Select to sample data by passing or discarding every nth record. For example, if n is set to 5, then every fifth record will be selected.
  • Random %. Select to sample a random percentage of the data. For example, if you set the percentage to 20, then approximately 20% of the data will either be passed to the data stream or discarded, depending on the mode selected. Enter the sampling percentage in the field provided. You can also specify a seed value using the Set random seed control.

    Use block level sampling (in-database only). This option is enabled only if you choose random percentage sampling when performing in-database mining on an Oracle or IBM Db2 database. In these circumstances, block-level sampling can be more efficient.

    Note: The same random sample settings do not return an exact number of rows on every run. Each input record has a probability of N/100 of being included in the sample (where N is the Random % you specify in the node), and these probabilities are independent, so the sample size varies around N% of the input rather than being exactly N%.
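The behavior described in the note can be reproduced with a short Python sketch in which every record is kept independently with probability N/100. The function name random_percent_sample and the use of Python's random module are assumptions for illustration; the node's own random number generator is not shown here.

```python
import random


def random_percent_sample(records, percent, seed=None):
    """Keep each record independently with probability percent/100."""
    rng = random.Random(seed)
    probability = percent / 100
    return [record for record in records if rng.random() < probability]


records = list(range(1000))

# Different seeds give sample sizes that scatter around 200 rather than exactly 200.
sizes = [len(random_percent_sample(records, 20, seed=s)) for s in range(5)]

# Reusing a seed on the same input order reproduces the same sample.
first_run = random_percent_sample(records, 20, seed=1)
second_run = random_percent_sample(records, 20, seed=1)
```

Because each record is an independent trial, the sample size follows a binomial distribution centered on N% of the input.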

Maximum sample size. Specifies the maximum number of records to include in the sample. This option is redundant and therefore disabled when First and Include sample are selected. Note also that when used in combination with the Random % option, this setting may prevent certain records from being selected. For example, if you have 10 million records in your dataset, and you select 50% of records with a maximum sample size of three million records, then 50% of the first six million records (approximately) will be selected, and the remaining four million records have no chance of being selected. To avoid this limitation, select the Complex sampling method, and request a random sample of three million records without specifying a cluster or stratify variable.
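The interaction between Random % and the maximum sample size can be sketched in Python as follows. This is an illustrative model that assumes sampling simply stops once the cap is reached; the function names are hypothetical, and the scaled-down numbers (10,000 records, cap of 3,000) stand in for the millions in the example above.

```python
import random


def random_percent_with_cap(records, percent, max_size, seed=None):
    """Random % sampling that stops reading once max_size records are kept."""
    rng = random.Random(seed)
    probability = percent / 100
    sample = []
    for record in records:
        if len(sample) >= max_size:
            break  # every later record has no chance of being selected
        if rng.random() < probability:
            sample.append(record)
    return sample


def fixed_size_sample(records, size, seed=None):
    """Draw exactly `size` records, each record equally likely to be chosen."""
    return random.Random(seed).sample(records, size)


records = list(range(10_000))
capped = random_percent_with_cap(records, 50, 3_000, seed=42)
last_reached = max(capped)  # lands near record 6,000, far short of 10,000
uniform = fixed_size_sample(records, 3_000, seed=42)
```

The capped sample never reaches the tail of the data, whereas the fixed-size draw, which is analogous to requesting a random sample of a given size with the Complex method, can select records from anywhere in the input.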

Complex sampling options

Complex sampling options allow for finer control of the sample, including clustered, stratified, and weighted samples, along with other options.

Cluster and stratify. Allows you to specify cluster, stratify, and input weight fields if needed. See the topic Cluster and Stratify Settings for more information.

Sample type.

  • Random. Selects clusters or records randomly within each stratum.
  • Systematic. Selects records at a fixed interval. This option works like the 1-in-n method, except that the position of the first record changes depending on a random seed. The value of n is determined automatically based on the sample size or proportion.
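Systematic selection can be sketched in Python as a fixed-interval pick with a seeded random starting position. The way the interval is derived here (input length divided by the requested sample size, rounded down) is an assumption for illustration; the text above says only that n is determined automatically.

```python
import random


def systematic_sample(records, sample_size, seed=None):
    """Pick records at a fixed interval, starting from a seeded random offset."""
    interval = len(records) // sample_size  # assumed rule for deriving n
    if interval < 1:
        raise ValueError("sample_size cannot exceed the number of records")
    rng = random.Random(seed)
    start = rng.randrange(interval)  # the first record's position depends on the seed
    return records[start::interval][:sample_size]


records = list(range(100))
sample = systematic_sample(records, 10, seed=3)
gaps = [later - earlier for earlier, later in zip(sample, sample[1:])]
```

With 100 records and a sample size of 10, every gap between consecutive selected records is 10, and only the starting offset changes from one seed to another.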

Sample units. You can select proportions or counts as the basic sample units.

Sample size. You can specify the sample size in several ways:

  • Fixed. Allows you to specify the overall size of the sample as a count or proportion.
  • Custom. Allows you to specify the sample size for each subgroup or stratum. This option is available only if a stratification field has been specified in the Cluster and Stratify subdialog box.
  • Variable. Allows you to pick a field that defines the sample size for each subgroup or stratum. This field should have the same value for each record within a particular stratum; for example, if the sample is stratified by county, then all records with county = Surrey must have the same value. The field must be numeric and its values must match the selected sample units. For proportions, values should be greater than 0 and less than 1; for counts, the minimum value is 1.
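The constraints on the Variable option can be made concrete with a Python sketch that reads the per-stratum sample size from a field and checks it against the rules above. The function name, the field names county and rate, and the rounding of proportions to whole records are all assumptions for illustration, not the node's actual behavior.

```python
import random
from collections import defaultdict


def sample_by_size_field(records, stratum_field, size_field, units, seed=None):
    """Stratified random sample whose size per stratum comes from a field."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[stratum_field]].append(record)

    sample = []
    for stratum, members in strata.items():
        sizes = {member[size_field] for member in members}
        if len(sizes) != 1:  # the field must be constant within a stratum
            raise ValueError(f"stratum {stratum!r} has inconsistent sizes")
        size = sizes.pop()
        if units == "proportions":
            if not 0 < size < 1:
                raise ValueError("proportions must be greater than 0 and less than 1")
            count = round(size * len(members))  # assumed rounding rule
        elif units == "counts":
            if size < 1:
                raise ValueError("counts must be at least 1")
            count = min(int(size), len(members))
        else:
            raise ValueError("units must be 'proportions' or 'counts'")
        sample.extend(rng.sample(members, count))
    return sample


records = (
    [{"county": "Surrey", "rate": 0.25, "id": i} for i in range(40)]
    + [{"county": "Kent", "rate": 0.5, "id": i} for i in range(40, 60)]
)
sample = sample_by_size_field(records, "county", "rate", "proportions", seed=7)
surrey_count = sum(1 for record in sample if record["county"] == "Surrey")
kent_count = sum(1 for record in sample if record["county"] == "Kent")
```

Here 25% of the 40 Surrey records and 50% of the 20 Kent records are drawn, giving 10 records from each county. A Surrey record carrying a different rate would trigger the consistency error.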

Minimum sample per stratum. Specifies a minimum number of records (or minimum number of clusters if a cluster field is specified).

Maximum sample per stratum. Specifies a maximum number of records or clusters. If you select this option without specifying a cluster or stratify field, a random or systematic sample of the specified size will be selected.

Set random seed. When sampling or partitioning records based on a random percentage, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value, or click the Generate button to automatically generate a random value. If this option is not selected, a different sample will be generated each time the node is executed.

Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. See the topic Sort Node for more information.
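The dependence on record order can be seen in a small Python sketch. It assumes a sampler that draws one random number per record in arrival order, which is a simplified model rather than the node's implementation; seeded_sample and the id field are hypothetical names.

```python
import random


def seeded_sample(records, percent, seed):
    """Keep each record with probability percent/100, drawing in arrival order."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < percent / 100]


rows = [{"id": i} for i in range(1, 101)]

# Simulate a database returning the same rows in a different order.
reordered = rows[:]
random.Random(0).shuffle(reordered)


def by_id(row):
    return row["id"]


# Sorting first fixes the order, so the same seed selects the same records.
stable_a = seeded_sample(sorted(rows, key=by_id), 20, seed=123)
stable_b = seeded_sample(sorted(reordered, key=by_id), 20, seed=123)

# Without sorting, the same seed picks the same positions,
# which will typically hold different records.
unstable_a = seeded_sample(rows, 20, seed=123)
unstable_b = seeded_sample(reordered, 20, seed=123)
```

In this model the seed fixes which positions in the stream are selected, not which records, so a stable sort on a unique key is what makes the result repeatable.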