Managing data with the Netezza library for R
This section describes Netezza R library data types and some basic functions that allow data manipulation.
nz.data.frame
The most important and frequent construct is the object of the class
nz.data.frame. The function nz.data.frame() creates a pointer
to a table on the Netezza system. This pointer can later be used to run data transformations with
nzApply, or nzRun, or data mining algorithms. It does not
store any data in local memory but rather provides metadata that can be used to determine the
correct table subset (columns, or rows, or both) where the user code should run. It is the standard
output of the majority of data manipulation functions in the Netezza R
library.nzConnect("user", "password", "TT4-R040", "mm")
# create a reference to the table
adult nzadult = nz.data.frame("adult")
#Show the reference
nzadult
#SELECT
#ID,AGE,WORKCLASS,FNLWGT,EDUCATION,EDUCATION_NUM,MARITAL_STATUS,OCCUPATION,RELA#T
IONSHIP,RACE,SEX,CAPITAL_GAIN,CAPITAL_LOSS,HOURS_PER_WEEK,INCOME FROM ADULTThe
nz.data.frame class implements a number of methods for extracting a subset
of its data, gathering meta-info similar to data.frame, and working with parallel
data-processing algorithms. Netezza can be configured to work with schemas. Tables with schemas can
be referenced in the same way, using the schema name in addition to the table
name.# create a reference to the table adult
nzadult = nz.data.frame("a.adult")There is currently no support for cross-database access.
Both schemas and tables can be case-sensitive. In this case, they need to be put into to double
quotes. The following is an example for accessing table "Adult" (case-sensitive) in schema A (not
case-sensitive).
# create a reference to the table Adult
nzadult = nz.data.frame('A."Adult"')Columns in database tables are always treated as case-sensitive in R. Column names that are not defined as case-sensitive are transformed to default-case.
[,], $ and dim
A subset of columns, or rows, or both can be specified by using the [,]
operator.
A limitation is that rows cannot be referenced by their numbers because there is no continuous
row numbering on the Netezza system. Instead, you must specify value-based conditions, such
as
nzdf2 <- nzadult[nzadult$ID>20,]The result of each selection can be partially checked by using the dim()
function because at this stage, only metadata has been transferred to R. This function returns the
number of rows and columns.
# there are 15 columns and 32561 rows in the "adult"
table dim(nzadult)
#[1] 32561 15
# selecting columns 5,6 and
7 t1 <- nzadult[,5:7]
#SELECT EDUCATION,EDUCATION_NUM,MARITAL_STATUS FROM ADULT
# t1 has only 3 columns
dim(t1)
#[1] 32561 3
# selecting columns by their
names (t2 <- nzadult$AGE)
#SELECT AGE FROM ADULT
# t2 has only 1
column dim(t2)
#[1] 32561 1
# selecting rows satisfying condition AGE>30 and EDUCATIONNUM=10
# from the subset of columns number 1,2,5,7 and 16
t3 <- nzadult[nzadult[,2]>30 & nzadult[,6] == 10,c(1,2,6,7,15)]
t3
#SELECT ID,AGE,EDUCATION_NUM,MARITAL_STATUS,INCOME FROM ADULT WHERE ( AGE >
#'30' ) AND ( EDUCATION_NUM = '10' )
# there are fewer rows and fewer columns than in the initial
table dim(t3)
#[1] 4226 5head, tail
To get a sample of the data, you can use the head() and
tail() functions. The functions pull the specified data from the start or end
of the data set.
head(t3,4)
# ID AGE EDUCATION_NUM MARITAL_STATUS INCOME
#1 28 54 10 Married-civ-spouse large
#2 56 43 10 Married-civ-spouse large
#3 92 37 10 Divorced small
#4 140 49 10 Married-civ-spouse large
tail(nzadult[,1:4])
# ID AGE WORKCLASS FNLWGT
#32556 32538 30 Private 345898
#32557 32542 41 Private 202822
#32558 32546 39 Local-gov 111499
#32559 32550 43 State-gov 255835
#32560 32554 32 Private 116138
#32561 32558 40 Private 154374as.data.frame
To look at the complete data set, it must be downloaded from the Netezza system by using
as.data.frame. Because adult is a large data set, in the following example one of
the data frames created previously, (t3), is used
instead.reg_df <- as.data.frame(t3)
head(reg_df)
# ID AGE EDUCATION_NUM MARITAL_STATUS INCOME
#1 28 54 10 Married-civ-spouse large
#2 56 43 10 Married-civ-spouse large
#3 92 37 10 Divorced small
#4 140 49 10 Married-civ-spouse large
#5 204 42 10 Never-married small
#6 264 59 10 Married-civ-spouse small
class(reg_df)
#[1] "data.frame"as.nz.data.frame
Another useful data manipulation function is as.nz.data.frame. It creates an
nz.data.frame object from a different R object. Then, a Netezza system table is
created, and the passed data is inserted into this table. The created object points to the newly
created system table.
This example shows how an
nz.data.frame object can be created from another R
object, in this case from a data.frame iris.data(iris)
if (nzExistTable("nziris")) nzDeleteTable("nziris")
d = as.nz.data.frame(iris, "nziris")
d
#SELECT Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species FROM nziris
class(d)
#[1] "nz.data.frame"
#attr(,"package")
#[1] "nzr"The iris data set is now stored in a Netezza system table
NZIRIS. If the second argument is not specified, the table name is randomly
generated.Details
The function signatures with default arguments
are:
nz.data.frame(table1, case.sensitive = NULL)
as.data.frame(x1, row.names=NULL, optional=FALSE,
max.rows=NULL, order.by=TRUE, ...)
as.nz.data.frame(x2, table2 = NULL, distributeon = NULL, fast=TRUE) Where:- table1
- Specifies the name of a table available on the Netezza system in the currently-used database.
- row.names
- Not used, included for compatibility.
- optional
- Not used, included for compatibility.
- x1
- Specifies object to be coerced to
data.frame. - max.rows
- Optional. Specifies the maximum number of rows to be transferred to the client.
- order.by
- Optional. Denotes whether ordering should be used.
- x2
- Specifies object to be coerced to nz.data.frame.
- table2
- Optional. Specifies the table name; if not provided, the function selects a name.
- distributeon
- Optional. Specifies the column name; data distribution on the Netezza system is based on this column.
- fast
- Optional. if set to
FALSE, when creating a table, multiple inserts are performed. This option requires the data to be stored locally in a temporary file.