Running user-defined functions

User-defined functions can run either on each row or on each group of rows defined by a grouping column. The first case is covered by nzApply(), the second by nzTApply(). Two more flexible functions, nzRun() and nzRunHost(), allow users to iterate through the data manually.

nzApply

The nzApply() function applies a user-provided function to each row of a given distributed data frame (nz.data.frame). For each processed row, it expects at most one result row (a vector or list), which is inserted into the output nz.data.frame.

data(iris)
# Remove any leftover table from a previous run
if (nzExistTable('iris')) nzDeleteTable('iris')
d <- as.nz.data.frame(iris)
# Square root of the first value in each row
f <- function(x) { return(sqrt(x[[1]])) }
if (nzExistTable('apply_output')) nzDeleteTable('apply_output')
r <- nzApply(d[,1], NULL, f, output.name='apply_output',
             output.signature=list(SQUAREROOT=NZ.DOUBLE))
head(r)
# SQUAREROOT
#1 2.645751
#2 2.626785
#3 2.366432
#4 2.366432
#5 2.366432
#6 2.366432
# nzApply() is also available as an overloaded apply() method;
# the following call returns the same result
nzDeleteTable('apply_output')
r <- apply(d[,1], NULL, f, output.name='apply_output',
           output.signature=list(SQUAREROOT=NZ.DOUBLE))

When you apply a function to a table, be careful with data types. Either select only the columns whose types match what the function expects, or cast the columns to the required type inside the function:

f <- function(x) { return(sqrt(as.numeric(x[[1]]))) }
if (nzExistTable('apply_output')) nzDeleteTable('apply_output')
r <- nzApply(d, NULL, f, output.name='apply_output',
output.signature=list(SQUAREROOT=NZ.DOUBLE))
head(r)
# SQUAREROOT
#1 2.258318
#2 2.213594
#3 2.213594
#4 2.258318
#5 2.258318
#6 2.258318
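
The effect of mixed column types can be reproduced locally, without a Netezza connection. The sketch below is only an analogy: it uses base R's apply(), which coerces a data frame containing a non-numeric column to a character matrix, so every row reaches the function as character strings. It does not show how nzApply() delivers rows on the server; it only illustrates why an explicit cast, or a numeric-only column subset, makes the function robust.

```r
data(iris)

# Without a cast: iris has a factor column (Species), so apply() passes
# each row to the function as character strings and sqrt() fails.
f_nocast <- function(x) sqrt(x[[1]])
res_nocast <- try(apply(iris, 1, f_nocast), silent = TRUE)
failed <- inherits(res_nocast, "try-error")

# With a cast: the first field is converted back to numeric first.
f_cast <- function(x) sqrt(as.numeric(x[[1]]))
res_cast <- apply(iris, 1, f_cast)

# Restricting the input to a numeric column also avoids the problem.
res_subset <- apply(iris[, 1, drop = FALSE], 1, f_nocast)
```

Both remedies give the same square roots; which one to prefer depends on whether the function needs access to the other columns.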

nzTApply

The nzTApply() function applies a user-provided function to each subset (group of rows) of a given distributed data frame (nz.data.frame). The subsets are determined by a specified index column. The results of applying the function are put into a data frame. The example below uses the same nz.data.frame as the nzApply() example, which contains the iris data set.

print(d)
#SELECT Sepal_Length,Sepal_Width,Petal_Length,Petal_Width, Species FROM nziris
# the following lines do the same - compute the mean value
# in every group
nzTApply(d, d[,5], mean)
nzTApply(d, 'Species', mean)
nzTApply(d, 5, mean)
#   Sepal_Length Sepal_Width Petal_Length Petal_Width Species    Species
# 1        6.588       2.974        5.552       2.026     nan  virginica
# 2        5.006       3.428        1.462       0.246     nan     setosa
# 3        5.936       2.770        4.260       1.326     nan versicolor
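
For readers without access to a Netezza system, the grouping behaviour can be approximated locally in base R. This is a rough sketch, not the nzTApply() implementation: it splits the local iris data frame by its fifth column and uses colMeans() on the four numeric columns as a stand-in for mean, which is an assumption made here so that the example runs on a plain data.frame.

```r
data(iris)

# Split the numeric columns by the grouping column (column 5, Species).
groups <- split(iris[, 1:4], iris[[5]])

# Apply the function to each group; colMeans() returns one mean per
# numeric column, so each group contributes one result row.
group_means <- t(sapply(groups, colMeans))
print(group_means)
```

The per-group means agree with the values in the output above, although the row order and column names differ because the local data set uses Sepal.Length rather than Sepal_Length.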

Details

The output of these functions depends on whether output.name and output.signature are specified. For nzApply(), an object of class data.frame is returned. The object has the same number of columns as the sequences that are returned from fun. If output.name is not provided, no table is created. For nzTApply(), if output.name is provided, output.signature must also be specified. The output.signature parameter can be used to avoid receiving a sparse table and to set the desired output column types; if the parameter is provided, fun must return values that can be cast to these types.

If the fun function causes errors, debugger mode can be used to investigate the conditions under which they occur. When debugger.mode=TRUE, the result table is not stored in the Netezza system. Instead, a diagnostic test is called for every group, and the environment of the first group that causes an error is transported to the local R client and opened in the R debugger.

Consider the following R code:

nziris <- nz.data.frame('iris')
# cov(0) deliberately raises an error for groups whose minimum is below 4.5
FUN5 <- function(x) {
  if (min(x[,1]) < 4.5) cov(0) else min(x[,1])
}
nzTApply(nziris, 5, FUN5, debugger.mode=TRUE)

While in debug mode, the function nzTApply() returns a summary for group processing. This summary is presented in a table with the following columns:

The first column contains the outcome or error description.
The second column contains the type of outcome (try-error in case of error).
The third column contains the group name for which the given result is returned.

In this example, there are three groups, one of which produces an error.

Found 1 error
  values                                        type      group
1 101                                           integer   virginica
2 supply both 'x' and 'y' or a matrix-like 'x'  try-error setosa
3 51                                            integer   versicolor

Then, for the first group that caused an error, a dumped environment is downloaded from the remote SPU to the R client and opened in the R debugger.
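
The per-group diagnostic pass can be imitated locally to see which groups would fail before running the function on the Netezza system. The following is a minimal sketch under stated assumptions: it uses base R's split() and try() on the local iris data set, and the diagnostics data frame is built here for illustration only. It is not the code that nzTApply() runs, and its values and types will not match the summary above, which comes from the server.

```r
data(iris)

# Same test function as in the example above: cov(0) raises an error on purpose.
FUN5 <- function(x) {
  if (min(x[, 1]) < 4.5) cov(0) else min(x[, 1])
}

groups <- split(iris, iris$Species)

# Run the function on every group, trapping errors instead of stopping.
results <- lapply(groups, function(g) try(FUN5(g), silent = TRUE))

# One row per group: outcome or error message, outcome type, group name.
diagnostics <- data.frame(
  values = vapply(results, function(r) {
    if (inherits(r, "try-error")) {
      conditionMessage(attr(r, "condition"))
    } else {
      format(r)
    }
  }, character(1)),
  type = vapply(results, function(r) class(r)[1], character(1)),
  group = names(results),
  row.names = NULL,
  stringsAsFactors = FALSE
)

n_errors <- sum(diagnostics$type == "try-error")
cat("Found", n_errors, "error\n")
print(diagnostics)
```

Only the setosa group has a minimum first-column value below 4.5, so it is the only group that reaches cov(0) and fails.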

The two functions have the following signatures:

nzApply(X, MARGIN, FUN, output.name = NULL, output.signature = NULL,
        clear.existing = FALSE, ...)
nzTApply(X, INDEX, FUN = NULL, output.name = NULL, output.signature = NULL,
         clear.existing = FALSE, debugger.mode = FALSE, ..., simplify = TRUE)

Where:

X

Specifies the input data frame.

MARGIN

Currently not used but the argument is required; NULL must be passed.

FUN

Specifies the user-defined function.

FUN can return a scalar value or a row. It receives a subset of the input data in the form of a data.frame with column names in lowercase.

output.name

Specifies the name of the output table created on the Netezza system.

output.signature

Denotes the data types for output table columns. If not provided, a generic (sparse) table is created.

clear.existing

If TRUE, delete the output table if it currently exists.

debugger.mode

If TRUE, nzTApply() works in debugger mode.

...

These arguments are passed to FUN.

simplify

Not used, included for compatibility.

INDEX

The value used to index the data set. INDEX may be supplied as one of the following items:

A character string that matches the name of one of the columns of X.
An integer not greater than the number of columns of X.
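
The two accepted forms of INDEX can be checked with a small helper before calling nzTApply(). The function resolve_index() below is hypothetical and not part of the package; it is a sketch of the two rules listed above, tested here on a local data.frame, and it assumes that a column number must also be at least 1.

```r
# Hypothetical helper, not part of the package: maps INDEX to a column position.
resolve_index <- function(X, INDEX) {
  if (is.character(INDEX) && length(INDEX) == 1) {
    # A character string must name one of the columns of X.
    pos <- match(INDEX, names(X))
    if (is.na(pos)) stop("INDEX '", INDEX, "' is not a column of X")
    return(pos)
  }
  if (is.numeric(INDEX) && length(INDEX) == 1 && INDEX == as.integer(INDEX)) {
    # An integer must not exceed the number of columns of X.
    if (INDEX < 1 || INDEX > ncol(X)) {
      stop("INDEX must be between 1 and ", ncol(X))
    }
    return(as.integer(INDEX))
  }
  stop("INDEX must be a column name or a column number")
}

data(iris)
by_name <- resolve_index(iris, "Species")
by_number <- resolve_index(iris, 5)
out_of_range <- try(resolve_index(iris, 6), silent = TRUE)
```

Both valid calls resolve to the same column position, which mirrors the equivalent nzTApply(d, 'Species', mean) and nzTApply(d, 5, mean) calls shown earlier.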