Netezza Analytics Library for R

The Netezza Analytics Library for R package is a standard CRAN-style R package. This section reviews the basic functions for running in-database analytics directly from the R client.

System Prerequisites and Installation

To use the Netezza Analytics Library for R package, R must be available on the client machine and Netezza Analytics must be installed and registered on the Netezza system.

Introduction

The R environment offers a large number of functions for data analysis, model validation, model visualization, and data preprocessing. However, in the base R installation outside of the Netezza environment, the following bottlenecks might occur when processing large data sets:
Memory limit
In the base 32-bit R installation, users are limited to 4 GB or 2 GB of RAM, depending on the operating system.
Processing speed
In the base installation, R runs in a single thread. As a result, even on a multicore machine, time-consuming steps do not run at full speed. Although libraries that enable parallel computation exist, they require sophisticated configuration.
Method of accessing large data sets
In databases that are larger than several terabytes, the data sets are stored across a set of virtualized disks. Importing such a data set into R in chunks and processing it step by step is not optimal. In most cases, it is much faster to run the analytic routines close to the data instead of bringing the data to the R client for analysis.
This section describes how to use Netezza Analytics to analyze large data sets in R. Two approaches are available:
  • Netezza Analytics contains several built-in analytic routines for statistical and data mining algorithms. Because these algorithms are registered and executable from the database, they are fast and work close to the data. The results from these procedures, such as fitted models, model predictors, and so on, are then downloaded from the database to R. Then, the outcomes are transformed into R classes and made accessible in R for subsequent steps, such as processing or visualization.
  • Netezza Analytics contains routines for computing data aggregates in the database. These aggregates, which are usually much smaller than the data they stem from, can be computed in the database and then downloaded to R, where the rest of the computation is done. For many algorithms, precomputing sufficient statistics in the database, transferring them to R, and performing the remaining computation there greatly increases efficiency.
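The sufficient-statistics approach can be illustrated with a minimal sketch in base R. It does not use the Netezza Analytics API: the data are simulated locally, and the hypothetical helper chunk_stats stands in for the in-database aggregation step. Each chunk (think of one portion of a distributed table) returns only the small matrices X'X and X'y for a linear regression, and the final solve is done on the client.

```r
# Simulated data; in a real deployment the table would reside in the database.
set.seed(1)
n <- 10000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n, sd = 0.1)

# Hypothetical stand-in for the in-database step (an assumption, not a
# Netezza function): each chunk returns only the aggregates X'X and X'y.
chunk_stats <- function(chunk) {
  X <- cbind(1, chunk$x1, chunk$x2)   # design matrix with intercept column
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
}

# Split the rows into 10 chunks and aggregate each one independently.
chunks <- split(d, rep(1:10, length.out = n))
stats <- lapply(chunks, chunk_stats)

# "Download" step: only a 3x3 and a 3x1 matrix per chunk reach the client.
xtx <- Reduce(`+`, lapply(stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(stats, `[[`, "xty"))

# Remaining computation in R: solve the normal equations for the coefficients.
beta <- solve(xtx, xty)
```

The client works only with the summed 3x3 and 3x1 matrices, regardless of how many rows the table has, and the resulting coefficients agree with those from lm(y ~ x1 + x2, data = d) up to numerical precision.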