Column statistics content model and pairwise statistics content model

The column statistics content model provides access to statistics that can be computed for each field (univariate statistics). The pairwise statistics content model provides access to statistics that can be computed between pairs of fields or values in a field.

The possible statistic measures are:

  • Count
  • UniqueCount
  • ValidCount
  • Mean
  • Sum
  • Min
  • Max
  • Range
  • Variance
  • StandardDeviation
  • StandardErrorOfMean
  • Skewness
  • SkewnessStandardError
  • Kurtosis
  • KurtosisStandardError
  • Median
  • Mode
  • Pearson
  • Covariance
  • TTest
  • FTest

Some statistics are appropriate only for column statistics, while others are appropriate only for pairwise statistics.

The following nodes produce these content models:

  • Statistics node produces column statistics and can produce pairwise statistics when correlation fields are specified.
  • Data Audit node produces column statistics and can produce pairwise statistics when an overlay field is specified.
  • Means node produces pairwise statistics when comparing pairs of fields or comparing a field's values with other field summaries.

Which content models and statistics are available depends on both the capabilities of the particular node and the settings within the node.
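Because availability depends on node settings, a script may want to check which of the two content models an output actually contains before using them. The following is a minimal sketch only. It assumes that getContentModel returns None for a container ID that was not generated, which matches the None checks in the example script in this section. The helper name available_stats_models and the FakeOutput class are hypothetical and exist only so that the block can run outside Modeler.

```python
def available_stats_models(output,
                           container_ids=("columnStatistics",
                                          "pairwiseStatistics")):
    """Return {container ID: content model} for the models that exist."""
    found = {}
    for container_id in container_ids:
        model = output.getContentModel(container_id)
        # Assumption: a container that was not generated yields None.
        if model is not None:
            found[container_id] = model
    return found


class FakeOutput(object):
    """Hypothetical stand-in for a node output (illustration only)."""

    def __init__(self, models):
        self._models = models  # {container ID: content model}

    def getContentModel(self, container_id):
        return self._models.get(container_id)


# An output that only carries column statistics.
found = available_stats_models(FakeOutput({"columnStatistics": object()}))
```

In a Modeler script, the output object passed in would be one of the items that the node's run method appends to the results list.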

ColumnStatsContentModel API

Table 1. ColumnStatsContentModel API
Return | Method | Description
List<StatisticType> | getAvailableStatistics() | Returns the available statistics in this model. Not all fields will necessarily have values for all statistics.
List<String> | getAvailableColumns() | Returns the column names for which statistics were computed.
Number | getStatistic(String column, StatisticType statistic) | Returns the statistic values associated with the column.
void | reset() | Flushes any internal storage associated with this content model.
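The methods in Table 1 can be combined to gather every available value into a dictionary. The following is a minimal sketch, not part of the Modeler API. It assumes that getStatistic returns None when a column has no value for a statistic, which this section does not confirm. The helper name collect_column_stats and the FakeColumnStats class are hypothetical, the stand-in uses plain strings in place of StatisticType values, and the numbers are arbitrary.

```python
def collect_column_stats(cm):
    """Return {column: {statistic name: value}} for a column statistics model.

    cm is any object that exposes getAvailableColumns(),
    getAvailableStatistics() and getStatistic(column, statistic).
    """
    collected = {}
    for column in cm.getAvailableColumns():
        values = {}
        for statistic in cm.getAvailableStatistics():
            value = cm.getStatistic(column, statistic)
            # Assumption: a statistic with no value is reported as None.
            if value is not None:
                values[str(statistic)] = value
        collected[column] = values
    return collected


class FakeColumnStats(object):
    """Hypothetical stand-in for a ColumnStatsContentModel."""

    def __init__(self, data):
        self._data = data  # {column: {statistic: value}}

    def getAvailableColumns(self):
        return sorted(self._data)

    def getAvailableStatistics(self):
        names = set()
        for values in self._data.values():
            names.update(values)
        return sorted(names)

    def getStatistic(self, column, statistic):
        return self._data[column].get(statistic)


# Arbitrary illustrative values; "Na" deliberately has no "Max".
fake = FakeColumnStats({"Age": {"Mean": 1.5, "Max": 3}, "Na": {"Mean": 0.5}})
column_summary = collect_column_stats(fake)
```

Skipping None values keeps the result compact when some fields lack a particular statistic.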

PairwiseStatsContentModel API

Table 2. PairwiseStatsContentModel API
Return | Method | Description
List<StatisticType> | getAvailableStatistics() | Returns the available statistics in this model. Not all fields will necessarily have values for all statistics.
List<String> | getAvailablePrimaryColumns() | Returns the primary column names for which statistics were computed.
List<Object> | getAvailablePrimaryValues() | Returns the values of the primary column for which statistics were computed.
List<String> | getAvailableSecondaryColumns() | Returns the secondary column names for which statistics were computed.
Number | getStatistic(String primaryColumn, String secondaryColumn, StatisticType statistic) | Returns the statistic values associated with the columns.
Number | getStatistic(String primaryColumn, Object primaryValue, String secondaryColumn, StatisticType statistic) | Returns the statistic values associated with the primary column value and the secondary column.
void | reset() | Flushes any internal storage associated with this content model.
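A single pairwise statistic can be read for every combination of primary and secondary column by looping over the two column lists. The following is a minimal sketch under the same caveats as any illustration: it assumes that the three-argument getStatistic returns None for a pair that has no value, and the names collect_pairwise_stats and FakePairwiseStats are hypothetical. The stand-in uses the string "Pearson" where a Modeler script would pass StatisticType.Pearson, and the numbers are arbitrary.

```python
def collect_pairwise_stats(cm, statistic):
    """Return {(primary, secondary): value} for one pairwise statistic."""
    collected = {}
    for primary in cm.getAvailablePrimaryColumns():
        for secondary in cm.getAvailableSecondaryColumns():
            value = cm.getStatistic(primary, secondary, statistic)
            # Assumption: a pair without a value is reported as None.
            if value is not None:
                collected[(primary, secondary)] = value
    return collected


class FakePairwiseStats(object):
    """Hypothetical stand-in for a PairwiseStatsContentModel."""

    def __init__(self, data):
        self._data = data  # {(primary, secondary, statistic): value}

    def getAvailablePrimaryColumns(self):
        return sorted(set(key[0] for key in self._data))

    def getAvailableSecondaryColumns(self):
        return sorted(set(key[1] for key in self._data))

    def getStatistic(self, primary, secondary, statistic):
        return self._data.get((primary, secondary, statistic))


# Arbitrary illustrative values.
fake = FakePairwiseStats({
    ("Age", "Na", "Pearson"): 0.25,
    ("Age", "K", "Pearson"): -0.5,
    ("Na", "K", "Covariance"): 0.1,
})
pearson = collect_pairwise_stats(fake, "Pearson")
```

The four-argument form of getStatistic, which takes a primary value as well, could be looped over in the same way by using getAvailablePrimaryValues.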

Nodes and outputs

This table lists nodes that build outputs which include this type of content model.

Table 3. Nodes and outputs
Node name | Output name | Container ID | Notes
"means" (Means node) | "means" | "columnStatistics" |
"means" (Means node) | "means" | "pairwiseStatistics" |
"dataaudit" (Data Audit node) | "means" | "columnStatistics" |
"statistics" (Statistics node) | "statistics" | "columnStatistics" | Only generated when specific fields are examined.
"statistics" (Statistics node) | "statistics" | "pairwiseStatistics" | Only generated when fields are correlated.

Example script

from modeler.api import StatisticType
stream = modeler.script.stream()

# Set up the input data
varfile = stream.createAt("variablefile", "File", 96, 96)
varfile.setPropertyValue("full_filename", "$CLEO/DEMOS/DRUG1n")

# Now create the statistics node. This can produce both
# column statistics and pairwise statistics
statisticsnode = stream.createAt("statistics", "Stats", 192, 96)
statisticsnode.setPropertyValue("examine", ["Age", "Na", "K"])
statisticsnode.setPropertyValue("correlate", ["Age", "Na", "K"])
stream.link(varfile, statisticsnode)

# Run the node. Any outputs it builds are appended to the results list.
results = []
statisticsnode.run(results)
statsoutput = results[0]

# Column statistics: print the first available statistic for the first column
statscm = statsoutput.getContentModel("columnStatistics")
if statscm is not None:
    cols = statscm.getAvailableColumns()
    stats = statscm.getAvailableStatistics()
    print "Column stats:", cols[0], str(stats[0]), " = ", statscm.getStatistic(cols[0], stats[0])

# Pairwise statistics: print the Pearson correlation for the first pair of columns
statscm = statsoutput.getContentModel("pairwiseStatistics")
if statscm is not None:
    pcols = statscm.getAvailablePrimaryColumns()
    scols = statscm.getAvailableSecondaryColumns()
    stats = statscm.getAvailableStatistics()
    corr = statscm.getStatistic(pcols[0], scols[0], StatisticType.Pearson)
    print "Pairwise stats:", pcols[0], scols[0], " Pearson = ", corr