In-database analytics using R

To process data, most native R functions require that the data first be extracted from a database into working memory. Such a function is called an in-application function. A different type of function, called an in-database function, operates directly on data in a database, without requiring the data to be extracted. Consequently, you can use an in-database function to analyze large amounts of data that would be impractical or impossible to extract.
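
The contrast can be sketched with the ibmdbR package, which is named later in this topic. The following is a minimal sketch under stated assumptions, not a definitive recipe: the helper compare_row_counts is hypothetical, the data source name and table name are placeholders that the caller supplies, and the code assumes that ibmdbR is installed and that an ODBC data source is configured.

```r
# Hypothetical helper: counts the rows of a table in two ways.
# Assumptions: the ibmdbR package is installed, 'dsn' names a configured
# ODBC data source, and 'table' is the name of an existing table or view.
compare_row_counts <- function(dsn, table, uid = "", pwd = "") {
  library(ibmdbR)

  con <- idaConnect(dsn, uid, pwd)
  on.exit(idaClose(con), add = TRUE)  # close the connection when done
  idaInit(con)

  # Reference to the table; no rows are transferred at this point.
  idf <- ida.data.frame(table)

  # In-database: the row count is computed by the database.
  in_database <- dim(idf)[1]

  # In-application: every row is first extracted into working memory.
  local_copy <- as.data.frame(idf)
  in_application <- nrow(local_copy)

  c(in_database = in_database, in_application = in_application)
}
```

Both counts should agree, but only the second requires the whole table to fit in memory. A call might look like compare_row_counts("MYDSN", "MYTABLE"), where both names are placeholders.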

In-database functions can use performance-enhancing features of the underlying database management system, such as columnar technology. Using in-database functions also avoids security issues that are associated with extracting data and ensures that the data that is being analyzed is as current as possible. Some in-database functions additionally use lazy loading to load only those parts of the data that are actually required, to further increase efficiency.

In-application and in-database functions are equally easy to use. An in-application function operates on an R construct called a data frame, which is a container that holds, in memory, a copy of the data to be processed. An in-database function operates on a similar construct called an IDA data frame. An IDA data frame does not hold any data directly. Instead, it holds a reference to a table or view in the database or to a selection of rows and columns within that table or view. When a function or method is applied to an IDA data frame, it is usually not run in the application, but is translated into an SQL query. The query is then run against the database, and the result is translated into an R object.
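
The idea of a reference that is translated into SQL can be illustrated with a toy model in base R. This is only a sketch of the concept and is not how ibmdbR is implemented: the names table_ref, select_cols, filter_rows, and to_sql are hypothetical, as are the SALES table and its columns, and no database is contacted.

```r
# Toy model of a lazy table reference (hypothetical; not ibmdbR internals).

# Create a reference to a table or view. No data is read.
table_ref <- function(table) {
  structure(list(table = table, cols = NULL, where = NULL),
            class = "table_ref")
}

# Narrow the reference to a selection of columns.
select_cols <- function(ref, cols) {
  ref$cols <- cols
  ref
}

# Narrow the reference to the rows that match an SQL predicate.
filter_rows <- function(ref, predicate) {
  ref$where <- c(ref$where, predicate)
  ref
}

# Translate the reference into the SQL text that would be sent to the
# database. An optional aggregate replaces the column list.
to_sql <- function(ref, aggregate = NULL) {
  cols <- if (is.null(ref$cols)) {
    "*"
  } else {
    paste0('"', ref$cols, '"', collapse = ", ")
  }
  target <- if (is.null(aggregate)) cols else aggregate
  sql <- paste0("SELECT ", target, ' FROM "', ref$table, '"')
  if (!is.null(ref$where)) {
    sql <- paste0(sql, " WHERE ",
                  paste0("(", ref$where, ")", collapse = " AND "))
  }
  sql
}

ref <- table_ref("SALES")
ref <- select_cols(ref, c("REGION", "AMOUNT"))
ref <- filter_rows(ref, '"AMOUNT" > 100')

to_sql(ref)
# "SELECT \"REGION\", \"AMOUNT\" FROM \"SALES\" WHERE (\"AMOUNT\" > 100)"

to_sql(ref, aggregate = "COUNT(*)")
# "SELECT COUNT(*) FROM \"SALES\" WHERE (\"AMOUNT\" > 100)"
```

The reference object stays small regardless of the size of the table, because it records only the table name and the selection. Work is deferred until a function needs a result, at which point a single query is generated.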

In-database analytics using SQL

Many of the in-database analytic functions that are provided by the ibmdbR package are also available as stored procedures that can be invoked by an SQL CALL statement. For example, both the idaNaiveBayes R function and the IDAX.NAIVEBAYES stored procedure can be used to create a naive Bayes model for predictive analysis. For more information, see Analytic stored procedures.
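
The two routes can be sketched side by side in R. This is a hedged illustration, not a reference: the helpers train_naive_bayes and nb_call_sql are hypothetical, all table, column, and model names are placeholders, and the parameter names in the CALL string (model, intable, id, target) are an assumption that should be checked against the stored procedure documentation. The first helper assumes that ibmdbR is installed and that an ODBC data source is configured.

```r
# Route 1: the R function. Hypothetical helper around idaNaiveBayes.
# Assumptions: ibmdbR is installed, 'dsn' names a configured ODBC data
# source, and 'table' contains the id, target, and predictor columns.
train_naive_bayes <- function(dsn, table, target, predictors, id,
                              uid = "", pwd = "") {
  library(ibmdbR)

  con <- idaConnect(dsn, uid, pwd)
  on.exit(idaClose(con), add = TRUE)  # close the connection when done
  idaInit(con)

  idf <- ida.data.frame(table)

  # Build a formula of the form target ~ predictor1 + predictor2 + ...
  form <- reformulate(predictors, response = target)

  # The model is built inside the database, not in the R session.
  idaNaiveBayes(form, idf, id = id)
}

# Route 2: the stored procedure. This helper only builds the SQL text;
# the parameter names in the string are assumed, not verified here.
nb_call_sql <- function(model, intable, id, target) {
  sprintf("CALL IDAX.NAIVEBAYES('model=%s, intable=%s, id=%s, target=%s')",
          model, intable, id, target)
}

nb_call_sql("MY_MODEL", "MY_TABLE", "ID", "MY_TARGET")
# "CALL IDAX.NAIVEBAYES('model=MY_MODEL, intable=MY_TABLE, id=ID, target=MY_TARGET')"
```

The generated statement can be run from any SQL client that is connected to the database, which makes the stored procedure route usable without an R session.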