Wrappers for built-in analytics
The following sections describe the supported wrappers for different Netezza Analytics functions, their main arguments and parameters. The sections also provide simple examples of how to use these functions.
> help(nzGlm)
As most of the functions that are provided by the NZA package, are wrappers around Netezza Analytics functions, you can also take a look at the Netezza Analytics in-database analytics developer's guide and the Netezza Analytics in-database analytics reference guide for details. For example, if you are, interested in details about the link parameter for the nzGlm() function, the section on GLM in the Netezza Analytics in-database analytics developer's guide gives you a detailed description of individual link functions that are available.
Decision trees
The Netezza Analytics package provides a set of sophisticated algorithms. A simple example of a decision tree algorithm is presented below. In the subsequent sections, examples of other statistical and data mining algorithms are shown.
Assume that the adult data set is stored in the database MM in the table ADULT. In the code snippet below, a connection to the database is created, then a pointer to the table with data is made. Next, the decision tree model is fitted and finally, the fitted model is downloaded to R. All steps are transparent to the R user.
# loads necessary
packages library(nza)
# connects to the mm database nzConnect("user",
"password", "TT4-R040", "mm")
# creates a pointer to the data set stored in the table
adult nzadult = nz.data.frame("adult")
The nzadult
R object is a pointer
to the table ADULT
on the Netezza system. It is an object of the class
nz.data.frame
with overloaded functions like print(), [,] and others. As described
in The Netezza R library , it corresponds to a standard data.frame
object, but is
stored remotely in the database.The R functions that are wrappers on Netezza analytic routines can take this pointer as an
argument. This example uses the function nzDecTree() from the Netezza Analytics
Library for R package that builds the classification tree. The nzDecTree()
function is an R wrapper that prepares an SQL query that remotely calls the Netezza Analytics
DECTREE
stored procedure. The procedure runs remotely and the final model is
returned to R. Next, the model is converted to an object of the tree class that has a similar
structure to the objects that are created with the R tree() function.
ID
column from data set
ADULT
is specified in the id
parameter.# build a tree using built-in analytics
adultTree = nzDecTree(INCOME~AGE+SEX+HOURS_PER_WEEK, nzadult, id="ID")
The function
output is stored in the database while the function nzDecTree() transforms it
by default to an R object of the class tree , which is specified by the package of the same name.
Therefore, overloaded functions such as print(), plot() or
predict() work with this object. It is possible to print or visualize the tree
in R, even if it was fitted for a large data set with millions of
rows.# plot and print the
tree plot(adultTree)
print(adultTree)
#node), split, n, deviance, yval, (yprob)
# * denotes terminal node
#
# 1) root 32561 0 small ( 0.24081 0.75919 )
# 2) AGE < 27 8031 0 small ( 0.03213 0.96787 ) *
# 3) AGE > 27 24530 0 small ( 0.30913 0.69087 )
# 6) SEX=Female 7366 0 small ( 0.15096 0.84904 ) *
# 7) SEX < >Female 17164 0 small ( 0.37701 0.62299 )
# 14) HOURS_PER_WEEK < 41 10334 0 small ( 0.29988 0.70012 ) *
# 15) HOURS_PER_WEEK > 41 6830 0 small ( 0.49370 0.50630 )
# 30) AGE < 35 1925 0 small ( 0.34649 0.65351 ) *
# 31) AGE > 35 4905 0 large ( 0.55148 0.44852 ) *
The fitted model can be applied to
another data set. If the data set is stored in a database table, massive data transfer can be
avoided by using the overloaded function predict() to do classification inside
Netezza.# use the previously created tree for prediction of the same
data adultPred = predict(adultTree, nzadult)
# download the prediction results into a data.frame
head(as.data.frame(adultPred))
# ID CLASS
#1 1 small
#2 2 small
#3 3 small
#4 4 small
#5 5 small
#6 6 small
nzDecTree
might also
return an object of class rpart
, as specified in the package of the same name. This
object also supports overloaded functions of rpart
such as
print(), plot() or
predict().adultRpart = nzDecTree(INCOME~AGE+SEX+HOURS_PER_WEEK, nzadult,
id="ID", format="rpart")
plot(adultRpart)
print(adultRpart)
#n= 32561
#
#node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 32561 NA small (0.240809.... 0.759190....)
# 2) AGE< 27 8031 NA small (0.032125.... 0.967874....) *
# 3) AGE>=27 24530 NA small (0.309131.... 0.690868....)
# 6) SEX=Female 7366 NA small (0.150963.... 0.849036....) *
# 7) SEX=<other> 17164 NA small (0.377010.... 0.622989....)
# 14) HOURS_PER_WEEK< 41 10334 NA small (0.299883.... 0.700116....) *
# 15) HOURS_PER_WEEK>=41 6830 NA small (0.493704.... 0.506295....)
# 30) AGE< 35 1925 NA small (0.346493.... 0.653506....) *
# 31) AGE>=35 4905 NA large (0.551478.... 0.448521....) *
Regression trees
The example below demonstrates regression trees. The basic idea is the same as for decision
trees. In this code snippet the WEATHERR
data set is used. The variable of interest
is grade, a continuous variable. The mean values of the variable grade are stored in each of the
corresponding leaves of the regression tree.
# loads necessary
packages library(nza)
# connect to the nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to the WEATHERR table is created
weatherr = nz.data.frame("WEATHERR")
To build a regression tree remotely, the
nzRegTree() function is used. It calls the Netezza Analytics stored procedure
REGTREE. Only the final model, that is, the parameters of the fitted tree, is
downloaded to R and transformed into an object of class tree, as specified by the package of the
same name. This example builds a regression tree for predicting the variables grade based on all
other variables within the data set. The ID
column from data set
WEATHERR
is specified in the id
parameter.wTree = nzRegTree(GRADE~., data=weatherr, id="INSTANCE",
minimprove=0.1, minsplit=2, maxdepth=4)
Overloaded functions such as
print() or plot() can be used to visualize the fitted
tree.# plot and print the
tree plot(wTree)
print(wTree)
#node), split, n, deviance, yval
# * denotes terminal node
#
# 1) root 22 NA 2.636
# 2) OUTLOOK=sun 8 NA 3.875
# 4) TEMPERATURE < 72 6 NA 4.500
# 8) TEMPERATURE < 52 3 NA 4.000 *
# 9) TEMPERATURE > 52 3 NA 5.000 *
# 5) TEMPERATURE > 72 2 NA 2.000 *
# 3) OUTLOOK < >sun 14 NA 1.929
# 6) OUTLOOK=cloudy 6 NA 2.833
# 12) TEMPERATURE < 12 1 NA 2.000 *
# 13) TEMPERATURE > 12 5 NA 3.000 *
# 7) OUTLOOK < >cloudy 8 NA 1.250
# 14) HUMIDITY=low 2 NA 2.000 *
# 15) HUMIDITY < >low 6 NA 1.000 *
This pre-built model can be applied to another data
set. If the data set is stored in the database, the overloaded predict()
function can be used to apply the regression tree inside the Netezza system. The
predict() function calls the PREDICT_REGTREE stored
procedure.# make prediction using the model wTree on table
weatherr wPred = predict(wTree, weatherr, id="INSTANCE")
# wPred is a nz.data.frame and can easily be
examined head(wPred)
# ID CLASS
#1 2 2
#2 6 1
#3 10 2
#4 14 1
#5 18 1
#6 22 1
nzRegTree
might also
return an object of class rpart
, as specified in the package of the same name. The
functionality is equal to the rpart
features in
nzDecTree():adultRpart = nzDecTree(INCOME~AGE+SEX+HOURS_PER_WEEK, nzadult,
id="ID", format="rpart")
plot(adultRpart)
print(adultRpart)
#n= 32561
#
#node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 32561 NA small (0.240809.... 0.759190....)
# 2) AGE< 27 8031 NA small (0.032125.... 0.967874....) *
# 3) AGE>=27 24530 NA small (0.309131.... 0.690868....)
# 6) SEX=Female 7366 NA small (0.150963.... 0.849036....) *
# 7) SEX=<other> 17164 NA small (0.377010.... 0.622989....)
# 14) HOURS_PER_WEEK< 41 10334 NA small (0.299883.... 0.700116....) *
# 15) HOURS_PER_WEEK>=41 6830 NA small (0.493704.... 0.506295....)
# 30) AGE< 35 1925 NA small (0.346493.... 0.653506....) *
# 31) AGE>=35 4905 NA large (0.551478.... 0.448521....) *
One-way and two-way ANOVA
Classic statistical functions such as ANOVA are available in Netezza Analytics. This section covers the one-way and two-way ANOVA.
In this sample, the WEATHERR
data set is used. The first instance is excluded so
that there is the same number of observations for each parameter value in the variable
HUMIDITY
.
WEATHERR
table. Also, exclude the first row of the
table.# loads necessary
packages library(nza)
# connect to the nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to weatherr table is created
weatherr = nz.data.frame("weatherr")
# select all rows form the table whose INSTANCE is bigger than
one weatherr = weatherr[weatherr$INSTANCE>1, ]
The HUMIDITY
and
OUTLOOK
columns in this table correspond to grouping variables. The column
TEMPERATURE
is a continuous variable. ANOVA is used to verify whether the mean
value of the TEMPERATURE
variable varies for different subgroups.The nzAnova() function remotely executes the ANOVA algorithm. The function takes a formula object as the first argument.
If there is one variable on the right side of the formulas, the function calls the
ANOVA_CRD_TEST algorithm of Netezza Analytics. If there are two variables on the
right hand side, the function calls the ANOVA_RBD_TEST algorithm of Netezza
Analytics. The function then transforms their results into an object of class
summary.aov
.
HUMIDITY
is chosen
as treatment variable.nzAnova(TEMPERATURE~HUMIDITY, weatherr)
# Df Sum Sq Mean Sq F value Pr(>F)
#HUMIDITY 2 275 137.3 0.265 0.77
#Residuals 18 9309 517.1
This example demonstrates the two-way ANOVA algorithm, where
HUMIDITY
is chosen as treatment variable and OUTLOOK
is chosen as
block variable.nzAnova(TEMPERATURE~HUMIDITY+OUTLOOK, weatherr)
HUMIDITY 2 275 137.3 0.337 0.7189
OUTLOOK 3 3206 1068.7 2.627 0.0884 .
Residuals 15 6102 406.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
K-means
Clustering methods are supported in the Netezza Analytics package.
The following k-means
clustering example uses the IRIS
data
set. The data set contains data regarding three different Iris species, that is
setosa
, versicolor
, and virginica
. Thus, there
are three clusters expected in the IRIS
data set, each related to a different
species.
IRIS
table.# loads necessary
packages library(nza)
# connect to nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to IRIS table is created
nziris = nz.data.frame("iris")
The nzKMeans() function does clustering remotely by calling the stored
procedure KMEANS. Cluster statistics, sizes, and mean values are the only data
that is downloaded to R and transformed into a object of class kmeans
, as described
in the stats package.
# K-Means for 2 clusters and euclidean
distance t3 = nzKMeans(nziris, k=3, id="ID")
# K-Means for 10 clusters and L1 distance
# also download item-cluster assignments for every item after computation t10
= nzKMeans(nziris, k=10, distance="manhattan", id="id", getLabels=T)
print(t3)
#KMeans clustering with 3 clusters of sizes 50, 51, 49
#
#Cluster means:
# CLASS PETALLENGTH PETALWIDTH SEPALLENGTH SEPALWIDTH
#1 setosa 1.464000 0.244000 5.006000 3.418000
#2 versicolor 4.264706 1.333333 5.915686 2.764706
#3 virginica 5.573469 2.032653 6.622449 2.983673
#
#Clustering vector:
#SELECT "ID","CLUSTER_ID","DISTANCE" FROM ADMIN.IRIS_MODEL59040
#
#Within cluster sum of squares by cluster:
#[1] 15.24040 32.93373 39.15551
#
#Available components:
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
# get used distance
metric t10$distance
#[1] "manhattan"
The returned object is implemented as a list and therefore, common list
operations might be used to retrieve single components of the result.names(t3)
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
t3$centers
# CLASS PETALLENGTH PETALWIDTH SEPALLENGTH SEPALWIDTH
#1 setosa 1.464000 0.244000 5.006000 3.418000
#2 versicolor 4.264706 1.333333 5.915686 2.764706
#3 virginica 5.573469 2.032653 6.622449 2.983673
# nzKMeans' parameter getLabels was unset when computing t3
# hence t3$cluster is of class nz.data.frame
head(t3$cluster)
# ID CLUSTER_ID DISTANCE
#1 4 1 0.5188372
#2 8 1 0.0599333
#3 12 1 0.2513802
#4 16 1 1.2130919
#5 20 1 0.3989887
#6 24 1 0.3794628
# nzKMeans' parameter getLabels was TRUE when computing t10
# hence t3$cluster is of class data.frame
head(t10$cluster)
#123456
#146615
res = predict(t3, nziris, id="ID")
head(res)
# ID CLUSTER_ID DISTANCE
#1 2 1 0.4381689
#2 6 1 0.6838070
#3 10 1 0.3665951
#4 14 1 0.9090611
#5 18 1 0.1509702
#6 22 1 0.3376270
The fitted model can be plotted by calling the overloaded
plot() function. This function will either download a sample of points by their
DISTANCE
, or download all points to produce a matrix of scatterplots like the
built-in function pairs() of R
does.# plot KMeans' result as matrix of
scatterplots plot(t10)
data.frames
.r = nzKMeans(nziris, k=3, id="ID", format="raw")
names(r)
#[1] "clusters" "columns" "column.statistics"
#[4] "model" "centroids" "modelname"
TwoStep
TwoStep is an alternative clustering algorithm to the k-means algorithm. Its main advantages are that it can determine the number of clusters automatically and that it can handle a mixture of categorical and continuous fields in a statistically sound way.
In the following clustering example, the IRIS
data set is used.
# loads necessary
packages library(nza)
# connect to nza database
nzConnect("user", "password", "TT4-R040", "nza")
# a pointer to IRIS table is created
nziris = nz.data.frame("iris")
nzTwoStep
function is used to cluster the
data.# TwoStep model for IRIS
t2 = nzTwoStep(nziris, id="id")
print(t2)
#TwoStep clustering with 3 clusters of sizes 50, 50, 50
#
#Cluster means:
# CLASS PETALLENGTH PETALWIDTH SEPALLENGTH SEPALWIDTH
#1 setosa 1.464 0.244 5.006 3.418
#2 virginica 5.552 2.026 6.588 2.974
#3 versicolor 4.260 1.326 5.936 2.770
#
#Clustering vector:
#SELECT " ID "," CLUSTER_ID "," DISTANCE " FROM ADMIN.IRIS_MODEL74356
#
#Within cluster sum of squares by cluster:
#[1] 0.3713140 0.5413910 0.4277027
#
#Available components:
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
The result object provides
information about cluster centers, cluster sizes, and sums of squares within each cluster. The
distance attribute stores the information about which distance metric is used.names(t2)
#[1] "cluster" "centers" "withinss" "size" "distance" "model"
t2$size
50 50
#[1] 50
t2$distance
#[1] "loglikelihood"
The fitted model can be applied to another data set. If the data set
is stored in the database, you can use the overloaded predict() function to do
classification on the Netezza system. The function returns the distance to the closest cluster and
its identifier.res = predict(t2, data, "ID")
head(res)
# ID CLUSTER_ID DISTANCE
#1 2 1 0.34278577
#2 6 1 0.54303152
#3 10 1 0.24827305
#4 14 1 0.67509943
#5 18 1 0.09122145
#6 22 1 0.21751088
DISTANCE
, or downloads
all points to produce a matrix of scatterplots like the built-in function
pairs() of R does.# plot TwoStep's result as matrix of
scatterplots plot(t2)
The function nzTwoStep() also supports a raw
output format. This format only downloads all tables that are created by the
KMEANS algorithm of Netezza Analytics and store them in a list of
data.frames
.r = nzKMeans(nziris, k=3, id="ID", format="raw")
names(r)
#[1] "clusters" "columns" "column.statistics"
#[4] "discrete.statistics" "model" "numeric.statistics"
#[7] "modelname" "cluster"