Sample node

You can use Sample nodes to select a subset of records for analysis, or to specify a proportion of records to discard. Various sample types are supported, including stratified, clustered, and nonrandom (structured) samples.

Sampling can be used for several reasons:

  • To improve performance by estimating models on a subset of the data. Models that are estimated from a sample are often as accurate as models derived from the full data set. They can even be more accurate if the time saved lets you experiment with more methods than you would otherwise attempt.
  • To select groups of related records or transactions for analysis, such as selecting all the items in an online shopping cart (or market basket), or all the properties in a specific neighborhood.
  • To identify units or cases for random inspection in the interest of quality assurance, fraud prevention, or security.
Note: If you only need to partition your data into training and test samples for validation, use a Partition node instead. For more information, see Partition node.

Types of samples

Clustered samples. Sample groups or clusters rather than individual units. For example, suppose you have a data file with one record per student. If you cluster by school and the sample size is 50%, then 50% of schools are chosen and all students from each of the selected schools are picked. Students in the other schools are ignored. On average, you would expect about 50% of students to be picked, but because schools vary in size, the percentage might not be exact. Similarly, you could cluster shopping cart items by transaction ID to make sure that all items from selected transactions are maintained.
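
The school example can be sketched outside the product in a few lines of Python. This is only an illustration of the idea, not the node's implementation: the function name `cluster_sample`, the record layout, and the school sizes are all hypothetical.

```python
import random
from collections import defaultdict

def cluster_sample(records, cluster_key, fraction, seed=None):
    """Pick a fraction of clusters at random and keep every record in them."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[cluster_key]].append(rec)
    # Sort the cluster IDs so that a given seed always yields the same sample.
    ids = sorted(clusters)
    n_pick = round(len(ids) * fraction)
    chosen = set(rng.sample(ids, n_pick))
    # Keep all records from the chosen clusters; ignore the rest.
    return [rec for rec in records if rec[cluster_key] in chosen]

# Hypothetical data: four schools of unequal size, one record per student.
students = [
    {"student": f"s{i}", "school": school}
    for i, school in enumerate(["A"] * 10 + ["B"] * 30 + ["C"] * 20 + ["D"] * 40)
]
sample = cluster_sample(students, "school", 0.5, seed=1)
schools_picked = {rec["school"] for rec in sample}
print(len(schools_picked), "schools,", len(sample), "students")
```

Exactly half of the schools are selected, but the number of students depends on which schools were drawn, which is why the share of students is only about 50% on average.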

Stratified samples. Select samples independently within non-overlapping subgroups of the population, or strata. For example, you can ensure that men and women are sampled in equal proportions, or that every region or socioeconomic group within an urban population is represented. You can also specify a different sample size for each stratum (for example, if you think that one group is under-represented in the original data).
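
As a rough sketch of stratified sampling with a different sample size per stratum, the following Python draws independently within each group. The function `stratified_sample`, the `region` field, and the fractions are assumptions made for illustration and do not come from the product.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, fractions, seed=None):
    """Sample each stratum independently, using its own sampling fraction."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        n_pick = round(len(members) * fractions[name])
        sample.extend(rng.sample(members, n_pick))
    return sample

# Hypothetical population: 80 records in "north", 20 in "south".
people = [{"id": i, "region": "north" if i < 80 else "south"} for i in range(100)]

# Oversample the smaller region so that it is better represented.
sample = stratified_sample(people, "region", {"north": 0.25, "south": 0.5}, seed=7)
counts = {r: sum(1 for p in sample if p["region"] == r) for r in ("north", "south")}
print(counts)  # {'north': 20, 'south': 10}
```

Because each stratum is sampled separately, the number of records taken from each group is fixed in advance rather than left to chance.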

Systematic or 1-in-n sampling. When random selection is difficult to achieve, units can be sampled systematically (at a fixed interval) or sequentially.
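
A minimal Python sketch of fixed-interval (1-in-n) selection follows. Starting at a random offset within the first interval is an assumption of this example, as is the helper name `one_in_n`. The source text does not say how the starting point is chosen.

```python
import random

def one_in_n(records, n, seed=None):
    """Keep every n-th record, starting at a random offset in [0, n)."""
    rng = random.Random(seed)
    start = rng.randrange(n)
    return records[start::n]

rows = list(range(100))            # hypothetical record IDs 0..99
picked = one_in_n(rows, 10, seed=3)
print(len(picked))  # 10
```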

Sampling weights. Sampling weights are automatically computed while drawing a complex sample and roughly correspond to the "frequency" that each sampled unit represents in the original data. Therefore, the sum of the weights over the sample should estimate the size of the original data.
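
One common convention, assumed here and not confirmed as the node's exact computation, is to weight each sampled unit by the inverse of its selection probability. For a stratified design, that is the stratum's population count divided by its sample count. The sketch below uses hypothetical counts and a hypothetical helper, `stratum_weights`.

```python
from collections import Counter

def stratum_weights(population_strata, sample_strata):
    """Weight per stratum = population count / sample count."""
    pop_counts = Counter(population_strata)
    sample_counts = Counter(sample_strata)
    return {s: pop_counts[s] / sample_counts[s] for s in sample_counts}

# Hypothetical design: 20 of 80 "north" records and 10 of 20 "south" records.
population = ["north"] * 80 + ["south"] * 20
sample = ["north"] * 20 + ["south"] * 10

weights = stratum_weights(population, sample)
estimated_size = sum(weights[s] for s in sample)
print(weights)         # {'north': 4.0, 'south': 2.0}
print(estimated_size)  # 100.0
```

Each sampled "north" record stands for four records in the original data and each "south" record for two, so the weights sum to the original 100 records. In designs where group sizes vary, such as clustered samples, the sum is an estimate rather than an exact match.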

Sampling frame

A sampling frame defines the potential source of cases to be included in a sample or study. Sometimes it is feasible to identify every member of a population and include any one of them in a sample, for example, when sampling items that come off a production line. More often, you cannot access every possible case. For example, you cannot be sure who will vote in an election until after the election happens. In this case, you could use the electoral register as your sampling frame, even though some registered people won’t vote and some people might vote despite not being listed when you checked the register. Anybody not in the sampling frame has no prospect of being sampled. Whether your sampling frame is close enough in nature to the population you are trying to evaluate is a question that must be addressed for each real-life case.