Wrappers for built-in analytics

The following sections describe the supported wrappers for different Netezza Analytics functions, their main arguments and parameters. The sections also provide simple examples of how to use these functions.

For a full list of options and parameters for each of these functions, use the R build-in help system. For example, to invoke the help page for nzGlm(), you can use the following command:

> help(nzGlm)

As most of the functions that are provided by the NZA package, are wrappers around Netezza Analytics functions, you can also take a look at the Netezza Analytics in-database analytics developer's guide and the Netezza Analytics in-database analytics reference guide for details. For example, if you are, interested in details about the link parameter for the nzGlm() function, the section on GLM in the Netezza Analytics in-database analytics developer's guide gives you a detailed description of individual link functions that are available.

Decision trees

The Netezza Analytics package provides a set of sophisticated algorithms. A simple example of a decision tree algorithm is presented below. In the subsequent sections, examples of other statistical and data mining algorithms are shown.

Assume that the adult data set is stored in the database MM in the table ADULT. In the code snippet below, a connection to the database is created, then a pointer to the table with data is made. Next, the decision tree model is fitted and finally, the fitted model is downloaded to R. All steps are transparent to the R user.

The first step is to make a connection to the database.

# loads necessary
packages library(nza)
# connects to the mm database nzConnect("user",
"password", "TT4-R040", "mm")
# creates a pointer to the data set stored in the table
adult nzadult = nz.data.frame("adult")

The nzadult R object is a pointer to the table ADULT on the Netezza system. It is an object of the class nz.data.frame with overloaded functions like print(), [,] and others. As described in The Netezza R library , it corresponds to a standard data.frame object, but is stored remotely in the database.

The R functions that are wrappers on Netezza analytic routines can take this pointer as an argument. This example uses the function nzDecTree() from the Netezza Analytics Library for R package that builds the classification tree. The nzDecTree() function is an R wrapper that prepares an SQL query that remotely calls the Netezza Analytics DECTREE stored procedure. The procedure runs remotely and the final model is returned to R. Next, the model is converted to an object of the tree class that has a similar structure to the objects that are created with the R tree() function.

Consider the following example of building a decision tree for predicting the variable income, based on the variables age, sex and hours per week. The ID column from data set ADULT is specified in the id parameter.

# build a tree using built-in analytics
adultTree = nzDecTree(INCOME~AGE+SEX+HOURS_PER_WEEK, nzadult, id="ID")

The function output is stored in the database while the function nzDecTree() transforms it by default to an R object of the class tree , which is specified by the package of the same name. Therefore, overloaded functions such as print(), plot() or predict() work with this object. It is possible to print or visualize the tree in R, even if it was fitted for a large data set with millions of rows.

# plot and print the
tree plot(adultTree)
print(adultTree)
#node), split, n, deviance, yval, (yprob)
# * denotes terminal node
#
# 1) root 32561 0 small ( 0.24081 0.75919 )
# 2) AGE < 27 8031 0 small ( 0.03213 0.96787 ) *
# 3) AGE > 27 24530 0 small ( 0.30913 0.69087 )
# 6) SEX=Female 7366 0 small ( 0.15096 0.84904 ) *
# 7) SEX < >Female 17164 0 small ( 0.37701 0.62299 )
# 14) HOURS_PER_WEEK < 41 10334 0 small ( 0.29988 0.70012 ) *
# 15) HOURS_PER_WEEK > 41 6830 0 small ( 0.49370 0.50630 )
# 30) AGE < 35 1925 0 small ( 0.34649 0.65351 ) *
# 31) AGE > 35 4905 0 large ( 0.55148 0.44852 ) *

The fitted model can be applied to another data set. If the data set is stored in a database table, massive data transfer can be avoided by using the overloaded function predict() to do classification inside Netezza.

# use the previously created tree for prediction of the same
data adultPred = predict(adultTree, nzadult)
# download the prediction results into a data.frame
head(as.data.frame(adultPred))
# ID CLASS
#1 1 small
#2 2 small
#3 3 small
#4 4 small
#5 5 small
#6 6 small

As an alternative to the returned object of class tree, nzDecTree might also return an object of class rpart, as specified in the package of the same name. This object also supports overloaded functions of rpart such as print(), plot() or predict().

adultRpart = nzDecTree(INCOME~AGE+SEX+HOURS_PER_WEEK, nzadult,
id="ID", format="rpart")
plot(adultRpart)
print(adultRpart)
#n= 32561
#
#node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 32561 NA small (0.240809.... 0.759190....)
# 2) AGE< 27 8031 NA small (0.032125.... 0.967874....) *
# 3) AGE>=27 24530 NA small (0.309131.... 0.690868....)
# 6) SEX=Female 7366 NA small (0.150963.... 0.849036....) *
# 7) SEX=<other> 17164 NA small (0.377010.... 0.622989....)
# 14) HOURS_PER_WEEK< 41 10334 NA small (0.299883.... 0.700116....) *
# 15) HOURS_PER_WEEK>=41 6830 NA small (0.493704.... 0.506295....)
# 30) AGE< 35 1925 NA small (0.346493.... 0.653506....) *
# 31) AGE>=35 4905 NA large (0.551478.... 0.448521....) *

Regression trees

The example below demonstrates regression trees. The basic idea is the same as for decision trees. In this code snippet the WEATHERR data set is used. The variable of interest is grade, a continuous variable. The mean values of the variable grade are stored in each of the corresponding leaves of the regression tree.

First, a connection to the database and a pointer to the Netezza table are created.

# loads necessary
packages library(nza)
# connect to the nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to the WEATHERR table is created
weatherr = nz.data.frame("WEATHERR")

To build a regression tree remotely, the nzRegTree() function is used. It calls the Netezza Analytics stored procedure REGTREE. Only the final model, that is, the parameters of the fitted tree, is downloaded to R and transformed into an object of class tree, as specified by the package of the same name. This example builds a regression tree for predicting the variables grade based on all other variables within the data set. The ID column from data set WEATHERR is specified in the id parameter.

wTree = nzRegTree(GRADE~., data=weatherr, id="INSTANCE",
minimprove=0.1, minsplit=2, maxdepth=4)

Overloaded functions such as print() or plot() can be used to visualize the fitted tree.

# plot and print the
tree plot(wTree)
print(wTree)
#node), split, n, deviance, yval
# * denotes terminal node
#
# 1) root 22 NA 2.636
# 2) OUTLOOK=sun 8 NA 3.875
# 4) TEMPERATURE < 72 6 NA 4.500
# 8) TEMPERATURE < 52 3 NA 4.000 *
# 9) TEMPERATURE > 52 3 NA 5.000 *
# 5) TEMPERATURE > 72 2 NA 2.000 *
# 3) OUTLOOK < >sun 14 NA 1.929
# 6) OUTLOOK=cloudy 6 NA 2.833
# 12) TEMPERATURE < 12 1 NA 2.000 *
# 13) TEMPERATURE > 12 5 NA 3.000 *
# 7) OUTLOOK < >cloudy 8 NA 1.250
# 14) HUMIDITY=low 2 NA 2.000 *
# 15) HUMIDITY < >low 6 NA 1.000 *

This pre-built model can be applied to another data set. If the data set is stored in the database, the overloaded predict() function can be used to apply the regression tree inside the Netezza system. The predict() function calls the PREDICT_REGTREE stored procedure.

# make prediction using the model wTree on table
weatherr wPred = predict(wTree, weatherr, id="INSTANCE")
# wPred is a nz.data.frame and can easily be
examined head(wPred)
# ID CLASS
#1 2 2
#2 6 1
#3 10 2
#4 14 1
#5 18 1
#6 22 1

As an alternative to the returned object of class tree, nzRegTree might also return an object of class rpart, as specified in the package of the same name. The functionality is equal to the rpart features in nzDecTree():

adultRpart = nzDecTree(INCOME~AGE+SEX+HOURS_PER_WEEK, nzadult,
id="ID", format="rpart")
plot(adultRpart)
print(adultRpart)
#n= 32561
#
#node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 32561 NA small (0.240809.... 0.759190....)
# 2) AGE< 27 8031 NA small (0.032125.... 0.967874....) *
# 3) AGE>=27 24530 NA small (0.309131.... 0.690868....)
# 6) SEX=Female 7366 NA small (0.150963.... 0.849036....) *
# 7) SEX=<other> 17164 NA small (0.377010.... 0.622989....)
# 14) HOURS_PER_WEEK< 41 10334 NA small (0.299883.... 0.700116....) *
# 15) HOURS_PER_WEEK>=41 6830 NA small (0.493704.... 0.506295....)
# 30) AGE< 35 1925 NA small (0.346493.... 0.653506....) *
# 31) AGE>=35 4905 NA large (0.551478.... 0.448521....) *

One-way and two-way ANOVA

Classic statistical functions such as ANOVA are available in Netezza Analytics. This section covers the one-way and two-way ANOVA.

In this sample, the WEATHERR data set is used. The first instance is excluded so that there is the same number of observations for each parameter value in the variable HUMIDITY.

First, establish a connection to the database and create a pointer to the Netezza WEATHERR table. Also, exclude the first row of the table.

# loads necessary
packages library(nza)
# connect to the nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to weatherr table is created
weatherr = nz.data.frame("weatherr")
# select all rows form the table whose INSTANCE is bigger than
one weatherr = weatherr[weatherr$INSTANCE>1, ]

The HUMIDITY and OUTLOOK columns in this table correspond to grouping variables. The column TEMPERATURE is a continuous variable. ANOVA is used to verify whether the mean value of the TEMPERATURE variable varies for different subgroups.

The nzAnova() function remotely executes the ANOVA algorithm. The function takes a formula object as the first argument.

If there is one variable on the right side of the formulas, the function calls the ANOVA_CRD_TEST algorithm of Netezza Analytics. If there are two variables on the right hand side, the function calls the ANOVA_RBD_TEST algorithm of Netezza Analytics. The function then transforms their results into an object of class summary.aov.

This example demonstrates the one-way ANOVA algorithm, where HUMIDITY is chosen as treatment variable.

nzAnova(TEMPERATURE~HUMIDITY, weatherr)
# Df Sum Sq Mean Sq F value Pr(>F)
#HUMIDITY 2 275 137.3 0.265 0.77
#Residuals 18 9309 517.1

This example demonstrates the two-way ANOVA algorithm, where HUMIDITY is chosen as treatment variable and OUTLOOK is chosen as block variable.

nzAnova(TEMPERATURE~HUMIDITY+OUTLOOK, weatherr)
HUMIDITY 2 275 137.3 0.337 0.7189
OUTLOOK 3 3206 1068.7 2.627 0.0884 .
Residuals 15 6102 406.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

K-means

Clustering methods are supported in the Netezza Analytics package.

The following k-means clustering example uses the IRIS data set. The data set contains data regarding three different Iris species, that is setosa, versicolor, and virginica. Thus, there are three clusters expected in the IRIS data set, each related to a different species.

First, establish a connection to the database and create a pointer to the Netezza IRIS table.

# loads necessary
packages library(nza)
# connect to nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to IRIS table is created
nziris = nz.data.frame("iris")

The nzKMeans() function does clustering remotely by calling the stored procedure KMEANS. Cluster statistics, sizes, and mean values are the only data that is downloaded to R and transformed into a object of class kmeans, as described in the stats package.

The k-means algorithm splits the data into k clusters. The following example refers to k=3 and k=10.

# K-Means for 2 clusters and euclidean
distance t3 = nzKMeans(nziris, k=3, id="ID")
# K-Means for 10 clusters and L1 distance
# also download item-cluster assignments for every item after computation t10
= nzKMeans(nziris, k=10, distance="manhattan", id="id", getLabels=T)

To show clusters and their mean values, you can use the overloaded print() function.

print(t3)
#KMeans clustering with 3 clusters of sizes 50, 51, 49
#
#Cluster means:
# CLASS PETALLENGTH PETALWIDTH SEPALLENGTH SEPALWIDTH
#1 setosa 1.464000 0.244000 5.006000 3.418000
#2 versicolor 4.264706 1.333333 5.915686 2.764706
#3 virginica 5.573469 2.032653 6.622449 2.983673
#
#Clustering vector:
#SELECT "ID","CLUSTER_ID","DISTANCE" FROM ADMIN.IRIS_MODEL59040
#
#Within cluster sum of squares by cluster:
#[1] 15.24040 32.93373 39.15551
#
#Available components:
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
# get used distance
metric t10$distance
#[1] "manhattan"

The returned object is implemented as a list and therefore, common list operations might be used to retrieve single components of the result.

names(t3)
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
t3$centers
# CLASS PETALLENGTH PETALWIDTH SEPALLENGTH SEPALWIDTH
#1 setosa 1.464000 0.244000 5.006000 3.418000
#2 versicolor 4.264706 1.333333 5.915686 2.764706
#3 virginica 5.573469 2.032653 6.622449 2.983673
# nzKMeans' parameter getLabels was unset when computing t3
# hence t3$cluster is of class nz.data.frame
head(t3$cluster)
# ID CLUSTER_ID DISTANCE
#1 4 1 0.5188372
#2 8 1 0.0599333
#3 12 1 0.2513802
#4 16 1 1.2130919
#5 20 1 0.3989887
#6 24 1 0.3794628
# nzKMeans' parameter getLabels was TRUE when computing t10
# hence t3$cluster is of class data.frame
head(t10$cluster)
#123456
#146615

The fitted model can be applied to another data set. If the data set is stored in the database, you can use the overloaded predict() function to do classification on the Netezza system. The function returns the distance to the closest cluster and its identifier.

res = predict(t3, nziris, id="ID")
head(res)
# ID CLUSTER_ID DISTANCE
#1 2 1 0.4381689
#2 6 1 0.6838070
#3 10 1 0.3665951
#4 14 1 0.9090611
#5 18 1 0.1509702
#6 22 1 0.3376270

The fitted model can be plotted by calling the overloaded plot() function. This function will either download a sample of points by their DISTANCE, or download all points to produce a matrix of scatterplots like the built-in function pairs() of R does.

# plot KMeans' result as matrix of
scatterplots plot(t10)

The function nzKMeans() also supports a raw output format. This format only downloads all tables that are created by the KMEANS algorithm of Netezza Analytics and stores them in a list of data.frames.

r = nzKMeans(nziris, k=3, id="ID", format="raw")
names(r)
#[1] "clusters" "columns" "column.statistics"
#[4] "model" "centroids" "modelname"

TwoStep

TwoStep is an alternative clustering algorithm to the k-means algorithm. Its main advantages are that it can determine the number of clusters automatically and that it can handle a mixture of categorical and continuous fields in a statistically sound way.

In the following clustering example, the IRIS data set is used.

First, a connection to the database is established and a pointer to a Netezza table is created.

# loads necessary
packages library(nza)
# connect to nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to IRIS table is created
nziris = nz.data.frame("iris")

Now the nzTwoStep function is used to cluster the data.

# TwoStep model for IRIS
t2 = nzTwoStep(nziris, id="id")

To show clusters and their mean values, you can use the overloaded print() function.

print(t2)
#TwoStep clustering with 3 clusters of sizes 50, 50, 50
#
#Cluster means:
# CLASS PETALLENGTH PETALWIDTH SEPALLENGTH SEPALWIDTH
#1 setosa 1.464 0.244 5.006 3.418
#2 virginica 5.552 2.026 6.588 2.974
#3 versicolor 4.260 1.326 5.936 2.770
#
#Clustering vector:
#SELECT " ID "," CLUSTER_ID "," DISTANCE " FROM ADMIN.IRIS_MODEL74356
#
#Within cluster sum of squares by cluster:
#[1] 0.3713140 0.5413910 0.4277027
#
#Available components:
#[1] "cluster" "centers" "withinss" "size" "distance" "model"

The result object provides information about cluster centers, cluster sizes, and sums of squares within each cluster. The distance attribute stores the information about which distance metric is used.

As in nzKMeans(), the output of nzTwoStep() is stored in a list. Therefore, all common list operations are available.

names(t2)
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
t2$size
50 50
#[1] 50
t2$distance
#[1] "loglikelihood"

res = predict(t2, data, "ID")
head(res)
# ID CLUSTER_ID DISTANCE
#1 2 1 0.34278577
#2 6 1 0.54303152
#3 10 1 0.24827305
#4 14 1 0.67509943
#5 18 1 0.09122145
#6 22 1 0.21751088

The fitted model can be plotted by calling the overloaded plot() function. This function either downloads a sample of points by their DISTANCE, or downloads all points to produce a matrix of scatterplots like the built-in function pairs() of R does.

# plot TwoStep's result as matrix of
scatterplots plot(t2)

The function nzTwoStep() also supports a raw output format. This format only downloads all tables that are created by the KMEANS algorithm of Netezza Analytics and store them in a list of data.frames.

r = nzKMeans(nziris, k=3, id="ID", format="raw")
names(r)
#[1] "clusters" "columns" "column.statistics"
#[4] "discrete.statistics" "model" "numeric.statistics"
#[7] "modelname" "cluster"