
Big SQL LOAD: Loading data into a partitioned table - Hadoop Dev

Technical Blog Post



Partitioning a table on one or more columns organizes the data so that queries with predicates that reference the partitioning columns perform better, because only the matching partitions need to be read. Let's take a look at how Big SQL's LOAD HADOOP statement can be used to load data into a partitioned table.

Consider a partitioned Hadoop table defined as follows:

CREATE HADOOP TABLE employees (
    empno INTEGER,
    lastname VARCHAR(15),
    dept INTEGER
)
PARTITIONED BY (province VARCHAR(20), city VARCHAR(15))
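With this layout, a query whose predicates reference the partitioning columns can be satisfied by reading only the matching partition directories. The sketch below illustrates the idea; the actual pruning behavior depends on the Big SQL version and optimizer:

```sql
-- Only the province=Ontario/city=Toronto partition directory needs
-- to be scanned, because both predicates reference partitioning columns.
SELECT empno, lastname, dept
FROM employees
WHERE province = 'Ontario'
  AND city = 'Toronto'
```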

Data can be loaded into this table using either dynamic or static partitioning.

Dynamic Partitioning

This method does not require the values for partition columns to be known prior to loading data into a table using the LOAD HADOOP statement. The values are obtained from the input data. This can be very useful when dealing with large amounts of data where the full set of partition column values is not known and would be time-consuming to determine.

For example, let’s say we have input file /my_hdfs/data/employee_data2.csv which contains the following data:

(*** empno, lastname, province, city, dept ***)
100,Plotz,Quebec,Montreal,850
110,Ngan,British Columbia,Vancouver,750
120,Naughton,Ontario,Toronto,650
130,Yamaguchi,Ontario,Toronto,750
140,Fraye,Ontario,Toronto,650
150,Williams,Ontario,Ottawa,750
160,Molinare,British Columbia,Vancouver,550
170,Kermisch,Quebec,Montreal,650
180,Abrahams,British Columbia,Vancouver,850
190,Sneider,Ontario,Ottawa,650
200,Sanders,Ontario,Toronto,550

As each row of the input file is read, it is loaded into a 2-part partition corresponding to the values for the partition columns, province and city:

LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
WITH SOURCE PROPERTIES ('field.indexes'='1,2,5,3,4')
INTO TABLE employees

The partition column values must be the last values specified for each record in the source file. In this case, we have reordered the source fields using ‘field.indexes’ so that they match the column order of the target table with the partition columns specified last.
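To spell out the mapping used above: 'field.indexes' lists 1-based source field positions in target-column order. If a source file used a delimiter other than the default comma, a 'field.delimiter' source property could also be specified; the '|' delimiter and the variant statement below are illustrative only, not taken from the original example:

```sql
-- 'field.indexes'='1,2,5,3,4' maps 1-based source fields to target columns:
--   source field 1 (empno)    -> empno
--   source field 2 (lastname) -> lastname
--   source field 5 (dept)     -> dept
--   source field 3 (province) -> province  (partition column, second to last)
--   source field 4 (city)     -> city      (partition column, last)
-- Hypothetical variant for a '|'-delimited source file:
LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
WITH SOURCE PROPERTIES ('field.delimiter'='|', 'field.indexes'='1,2,5,3,4')
INTO TABLE employees
```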

To show all of the partitions defined for the table after the LOAD HADOOP statement has been executed, the SYSHADOOP.HCAT_TABLEPARTS view can be queried:

SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) AS partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname='EMPLOYEES'

PARTITION
--------------------------------------------------
province=British Columbia/city=Vancouver
province=Ontario/city=Ottawa
province=Ontario/city=Toronto
province=Quebec/city=Montreal

  4 record(s) selected.

As well, the table can be queried to verify that the values for province and city remain as specified in the input file:

SELECT empno, lastname, dept, province, city FROM employees ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Quebec               Montreal
        110 Ngan                    750 British Columbia     Vancouver
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Ottawa
        160 Molinare                550 British Columbia     Vancouver
        170 Kermisch                650 Quebec               Montreal
        180 Abrahams                850 British Columbia     Vancouver
        190 Sneider                 650 Ontario              Ottawa
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.
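Row counts per partition can also be checked directly with a standard aggregate query over the partitioning columns:

```sql
-- Count the loaded rows in each 2-part partition.
SELECT province, city, COUNT(*) AS rows_loaded
FROM employees
GROUP BY province, city
ORDER BY province, city
```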

The HDFS directories that are created as the result of partitioning are as follows:

hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees

Found 4 items
drwxrwxrwx   - bigsql hadoop          0 2016-07-25 07:34 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwxrwx---   - bigsql hadoop          0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=British Columbia
drwxrwx---   - bigsql hadoop          0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=Ontario
drwxrwx---   - bigsql hadoop          0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=Quebec

Static Partitioning

This method requires that the values for the partition columns be known ahead of time; they are then specified in the PARTITION clause of the LOAD HADOOP statement.

For example, let’s say we have input file /my_hdfs/data/employee_data1.csv and it contains the data below:

(*** empno, lastname, dept ***)
100,Plotz,850
110,Ngan,750
120,Naughton,650
130,Yamaguchi,750
140,Fraye,650
150,Williams,750
160,Molinare,550
170,Kermisch,650
180,Abrahams,850
190,Sneider,650
200,Sanders,550

Note that values for the partition columns are not included above.

By executing the statement below, all of the data will be loaded into a 2-part partition (province='Ontario', city='Toronto'). Assume that no partitions exist yet for the table.

LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data1.csv'
INTO TABLE employees
PARTITION (province = 'Ontario', city = 'Toronto')
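Each statement's PARTITION clause names a single static partition, so loading several partitions this way means one LOAD HADOOP statement per partition. As a sketch, a second input file (the file name here is hypothetical) could be loaded into a different partition:

```sql
-- Hypothetical second input file; each static LOAD targets one partition.
LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data_mtl.csv'
INTO TABLE employees
PARTITION (province = 'Quebec', city = 'Montreal')
```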

This can be verified by querying the SYSHADOOP.HCAT_TABLEPARTS view to show all of the partitions defined for the table:

SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) AS partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname='EMPLOYEES'

PARTITION
--------------------------------------------------
province=Ontario/city=Toronto

  1 record(s) selected.

As well, the table can be queried to verify that the values for province and city are set to the values specified in the PARTITION clause of the LOAD HADOOP statement:

SELECT empno, lastname, dept, province, city FROM employees ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Ontario              Toronto
        110 Ngan                    750 Ontario              Toronto
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Toronto
        160 Molinare                550 Ontario              Toronto
        170 Kermisch                650 Ontario              Toronto
        180 Abrahams                850 Ontario              Toronto
        190 Sneider                 650 Ontario              Toronto
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.

The HDFS directories that are created as the result of partitioning are as follows:

hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees

Found 2 items
drwxrwxrwx   - bigsql hadoop          0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwx------   - bigsql hadoop          0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/province=Ontario

In practice, when using static partitioning, the partition column values should not be contained in the source file. If the data does include values for the partition columns, province and city, those values are overridden by the values specified in the PARTITION clause of the LOAD HADOOP statement.

For example, let’s say input file /my_hdfs/data/employee_data2.csv contains the following data:

(*** empno, lastname, province, city, dept ***)
100,Plotz,Quebec,Montreal,850
110,Ngan,British Columbia,Vancouver,750
120,Naughton,Ontario,Toronto,650
130,Yamaguchi,Ontario,Toronto,750
140,Fraye,Ontario,Toronto,650
150,Williams,Ontario,Ottawa,750
160,Molinare,British Columbia,Vancouver,550
170,Kermisch,Quebec,Montreal,650
180,Abrahams,British Columbia,Vancouver,850
190,Sneider,Ontario,Ottawa,650
200,Sanders,Ontario,Toronto,550

Regardless of the values indicated above for the partition columns, all of the data will be loaded into the same 2-part partition (province='Ontario', city='Toronto') with the statement below. Again, assume that no partitions exist yet for the table.

LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
WITH SOURCE PROPERTIES ('field.indexes'='1,2,5,3,4')
INTO TABLE employees
PARTITION (province = 'Ontario', city = 'Toronto')

The SYSHADOOP.HCAT_TABLEPARTS view can be queried to verify that only one 2-part partition is defined for the table:

SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) AS partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname='EMPLOYEES'

PARTITION
--------------------------------------------------
province=Ontario/city=Toronto

  1 record(s) selected.

As well, the table can be queried to verify that the values for province and city are set to the values specified in the PARTITION clause of the LOAD HADOOP statement:

SELECT empno, lastname, dept, province, city FROM employees ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Ontario              Toronto
        110 Ngan                    750 Ontario              Toronto
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Toronto
        160 Molinare                550 Ontario              Toronto
        170 Kermisch                650 Ontario              Toronto
        180 Abrahams                850 Ontario              Toronto
        190 Sneider                 650 Ontario              Toronto
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.

The HDFS directories that are created as the result of partitioning are as follows:

hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees

Found 2 items
drwxrwxrwx   - bigsql hadoop          0 2016-07-25 08:30 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwx------   - bigsql hadoop          0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/province=Ontario
