Best practices for accessing large data
Use the following best practices to access large data.
Access relational databases
When you load data from tables that contain more than a million rows, you might be accessing several gigabytes of data. As you transform the data and create multiple versions of it, no amount of memory you reserve will be adequate. For better access, consider filtering the data and loading only the subset that is relevant to your data science:
- You can load and process, say, a million rows at a time. The following Python examples show how to perform batch access from an RDBMS.

  Batch reading: To make better use of memory, iterate over the cursor as follows instead of calling fetchall():

  ```python
  curs.execute('select * from people')
  for row in curs:
      print(row)
  ```

  Or batch process many rows at a time:

  ```python
  def ResultIterator(cursor, arraysize=1000):
      'iterator using fetchmany and consumes less memory'
      while True:
          results = cursor.fetchmany(arraysize)
          if not results:
              break
          for result in results:
              yield result

  # conn is a DB-API database connection
  curs = conn.cursor()
  curs.execute('select * from HUGE_TABLE')
  for result in ResultIterator(curs):
      doDatascienceOnResult(result)
  ```

  Processing a row at a time:

  ```python
  row = curs.fetchone()
  while row:
      print(row)
      row = curs.fetchone()
  ```

  This is a much slower method, but better in terms of memory consumption.
  You can limit your read to a specific number of rows by using an SQL query that is specific to your RDBMS. For example, you can append the following clause to your query:
  Db2:

  ```sql
  FETCH FIRST 100 ROWS ONLY
  ```

  Netezza in R:

  ```r
  dbTableOrQuery = paste('(SELECT * FROM', dbTableOrQuery, 'LIMIT 100) AS TAB')
  ```

  SQL Server:

  ```sql
  SELECT TOP 3 * FROM EMP
  ```

- Avoid storing relational data as large CSV or unstructured files as much as possible. This reduces the memory requirement and increases the speed of data access.
- Use Spark SQL and DataFrames to provide faster data access compared to other database JDBC clients.
- If you have data in a relational database, the RDBMS is much better at filtering out the subset of data that you want to perform data science on. Consider letting the RDBMS do as much of the processing as possible, and extract only the subset that you need to load into Watson Studio Local, as shown in the sketch after this list.
- Define stored procedures that perform most of the filtering and processing in your relational database itself. Where possible, define one complex stored procedure instead of calling multiple procedures. This can speed up your data science by focusing only on highly relevant data.
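The following is a minimal sketch of letting the database do the filtering and loading only the result set into a Spark DataFrame with Spark SQL. The JDBC URL, driver class, credentials, and query are placeholder assumptions; substitute the values for your own database.

```python
# Minimal sketch: let the RDBMS filter the data and load only the subset
# into a Spark DataFrame. The JDBC URL, driver class, credentials, and
# query below are placeholders for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filtered-load").getOrCreate()

# The database runs this query; Spark receives only the filtered subset.
pushdown_query = "(SELECT ID, NAME, SALARY FROM EMP WHERE HIRE_YEAR >= 2015) AS EMP_SUBSET"

df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://dbhost:50000/SAMPLE")   # placeholder Db2 URL
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", pushdown_query)
      .option("user", "dbuser")
      .option("password", "dbpassword")
      .load())

print(df.count())
```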
Data access permissions
- When adding a new user to a project, ensure that the user has the correct and minimum required access to the RDBMS (and nothing more).
- Ensure that data that privileged users load into Watson Studio Local or Hadoop by using privileged access to the RDBMS is not unintentionally shared.
- Periodically check for any data that is being unintentionally shared with collaborators.
Access Spark and Hadoop
As a best practice, keep processing close to your data. You might have a large amount of data in HDFS or in Hive tables. In that case, it is best to use the Spark that runs on the Hadoop cluster, through Livy. This practice takes advantage of data locality and avoids data transfer, which speeds up your processing. Because you are using the dedicated Spark instance on the Hadoop cluster, performance might improve as well.
All runtime environments in Watson Studio Local (Jupyter, Zeppelin, and RStudio) support accessing Hadoop Spark through Livy. For more details, refer to the remote Spark documentation. By using Livy and remote Spark, you can load large data into a Spark DataFrame from HDFS files or Hive tables, as in the sketch that follows.
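As a rough sketch of the idea, the following uses the generic Livy REST API directly to run PySpark code next to the data on the Hadoop cluster. The Livy endpoint and the sales Hive table are placeholder assumptions; the remote Spark documentation describes the supported way to connect from Watson Studio Local runtimes.

```python
# Sketch only: submit PySpark code to the Hadoop cluster's Spark through the
# Livy REST API so that the heavy work runs next to the data. The endpoint
# and the Hive table name are placeholders.
import json
import time
import requests

livy_url = "http://hadoop-edge-node:8998"        # placeholder Livy endpoint
headers = {"Content-Type": "application/json"}

# 1. Start a PySpark session on the cluster.
session = requests.post(livy_url + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
session_url = "{0}/sessions/{1}".format(livy_url, session["id"])

# 2. Wait until the session is ready.
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(5)

# 3. Aggregate a Hive table on the cluster and return only the small result.
code = "spark.sql('SELECT region, COUNT(*) AS n FROM sales GROUP BY region').show()"
statement = requests.post(session_url + "/statements",
                          data=json.dumps({"code": code}),
                          headers=headers).json()
statement_url = "{0}/statements/{1}".format(session_url, statement["id"])

# 4. Poll for completion and print the captured output.
while True:
    result = requests.get(statement_url, headers=headers).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(2)
```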
If the data is local to the Watson Studio Local cluster and is not huge, use the Watson Studio Local Spark instance instead.
Performance tuning
Generally, Watson Studio Local reads from and writes to relational databases by using JDBC and the JayDebeApi and RJDBC/SparkR modules. Using database vendor-specific clients such as cx_Oracle and pymssql can provide performance comparable to Spark SQL. Watson Studio Local writes database-specific data frames by using SQLAlchemy and database vendor-specific adapters such as ibm_db_sa. Watson Studio Local does not support ODBC.
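As an illustrative sketch of these two paths, the following reads through JDBC with JayDebeApi and writes a data frame back through SQLAlchemy with the ibm_db_sa adapter. The JDBC URL, driver JAR path, credentials, and table names are placeholder assumptions.

```python
# Sketch only: read through JDBC (JayDebeApi) and write back through
# SQLAlchemy with the ibm_db_sa adapter. Connection details, the driver
# JAR path, and table names are placeholders.
import jaydebeapi
import pandas as pd
from sqlalchemy import create_engine

# Read a filtered subset over JDBC into a pandas DataFrame.
conn = jaydebeapi.connect(
    "com.ibm.db2.jcc.DB2Driver",
    "jdbc:db2://dbhost:50000/SAMPLE",            # placeholder JDBC URL
    ["dbuser", "dbpassword"],
    "/path/to/db2jcc4.jar")                      # placeholder driver JAR
scores = pd.read_sql("SELECT ID, SCORE FROM RESULTS WHERE SCORE > 0.5", conn)
conn.close()

# Write the data frame back by using SQLAlchemy and ibm_db_sa.
engine = create_engine("ibm_db_sa://dbuser:dbpassword@dbhost:50000/SAMPLE")
scores.to_sql("RESULTS_SUBSET", engine, if_exists="replace", index=False)
```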
Tune Zeppelin notebook and interpreter settings
Option A: Set Zeppelin interpreter property and increase memory
To increase the memory sizes, complete the following steps:
- Go to the Zeppelin interpreters by clicking the gear icon at the top right of the Zeppelin notebook window.
- Click on the Interpreter link to open the Interpreter settings window.
- Search for spark and select edit in the upper right.
- Set spark.executor.memory to 12g. You might have to increase this value further and test.
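To confirm that the new setting is in effect after the interpreter restarts, you can check it from a notebook paragraph. A small check, assuming the spark interpreter group and its usual sc context:

```python
%pyspark
# Print the executor memory that the current Spark context is using.
print(sc.getConf().get("spark.executor.memory", "not set"))
```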
Option B: Update the image using image management to set properties
In the /scripts/zeppelin-env.sh file in the image, add the following lines (settings that you make in the interpreter override these):
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=12g"
export SPARK_SUBMIT_OPTIONS="--driver-java-options -Xmx20g"
You might also have to set ZEPPELIN_MEM and ZEPPELIN_INTP_MEM to something larger, for example:
export ZEPPELIN_MEM="-Xmx4g"
export ZEPPELIN_INTP_MEM="-Xms1024m -Xmx4g -XX:MaxPermSize=4g"
Stop and start the runtime so that the new runtime is created from the updated image.