This post gives an overview of the approaches that can be used to ingest data into a table created with the partition by column expression feature. For an overview of the feature itself, see this post. There are several options for ingesting data into such tables. To demonstrate them, consider the following example:
CREATE HADOOP TABLE INVENTORY (
    trans_id int,
    product varchar(50),
    trans_ts timestamp
);

CREATE HADOOP TABLE INVENTORY_A (
    trans_id int,
    product varchar(50),
    trans_ts timestamp
)
PARTITIONED BY (
    INT(cast(trans_ts as date))/10000 AS year_part,
    INT(cast(trans_ts as date))/100 AS month_part,
    INT(cast(trans_ts as date)) AS day_part
);
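As a quick sanity check of how these partitioning expressions behave, you can evaluate them directly against the source data. The query below is only an illustrative sketch, assuming the INVENTORY table already holds the sample rows shown later in this post; for a trans_ts on 2016-06-13 it should return 2016, 201606 and 20160613 (the division is integer division, and casting the date to INT yields its yyyymmdd form).

-- Illustrative check of the partitioning expressions (not part of the ingest itself)
SELECT trans_ts,
       INT(cast(trans_ts as date)) / 10000 AS year_part,   -- e.g. 2016
       INT(cast(trans_ts as date)) / 100   AS month_part,  -- e.g. 201606
       INT(cast(trans_ts as date))         AS day_part     -- e.g. 20160613
FROM INVENTORY;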
Approach 1: INSERT…SELECT

Use INSERT…SELECT to populate the new table with data from an existing table. This option is useful if you do not want to make changes to the underlying data. Big SQL automatically generates the partition column values based on the existing data, because it knows the relationship between each partitioning column and the column it is derived from. Suppose the data already resides in the INVENTORY table created above. You can use the following INSERT…SELECT statement to populate the INVENTORY_A table:
INSERT INTO INVENTORY_A (trans_id, product, trans_ts)
    SELECT * FROM INVENTORY;

A few of the rows from both tables are shown below. Note that the year_part, month_part and day_part column values are generated automatically with this approach.
select * from inventory limit 6;

+----------+--------------+-------------------------+
| TRANS_ID | PRODUCT      | TRANS_TS                |
+----------+--------------+-------------------------+
| 1004     | Lip Balm     | 2016-06-13 15:31:00.000 |
| 1003     | Sandals      | 2016-06-13 15:30:00.000 |
| 1002     | Water Bottle | 2016-06-13 17:30:00.000 |
| 1001     | Sun Screen   | 2016-06-13 16:30:00.000 |
| 1000     | Hat          | 2016-06-13 15:30:00.000 |
| 1005     | Beach Towel  | 2016-06-15 15:30:00.000 |
+----------+--------------+-------------------------+

select * from inventory_a limit 6;

+----------+--------------+-------------------------+-----------+------------+----------+
| TRANS_ID | PRODUCT      | TRANS_TS                | YEAR_PART | MONTH_PART | DAY_PART |
+----------+--------------+-------------------------+-----------+------------+----------+
| 1004     | Lip Balm     | 2016-06-13 15:31:00.000 | 2016      | 201606     | 20160613 |
| 1003     | Sandals      | 2016-06-13 15:30:00.000 | 2016      | 201606     | 20160613 |
| 1002     | Water Bottle | 2016-06-13 17:30:00.000 | 2016      | 201606     | 20160613 |
| 1001     | Sun Screen   | 2016-06-13 16:30:00.000 | 2016      | 201606     | 20160613 |
| 1000     | Hat          | 2016-06-13 15:30:00.000 | 2016      | 201606     | 20160613 |
| 1005     | Beach Towel  | 2016-06-15 15:30:00.000 | 2016      | 201606     | 20160615 |
+----------+--------------+-------------------------+-----------+------------+----------+
Approach 2: INSERT with DEFAULT option

Use INSERT with the DEFAULT option to generate the partition column values automatically when you are only inserting a few rows. This approach is not effective for ingesting large amounts of data, but it is convenient when inserting a handful of rows for testing purposes. Each INSERT statement will generate a new partition directory, and you can end up with many HDFS directories and small files, which do not perform well for large amounts of data (see the directory-layout sketch after the example below).
INSERT INTO INVENTORY_A VALUES (1006, 'Bug Spray', '2016-06-17', DEFAULT, DEFAULT, DEFAULT);

SELECT * FROM INVENTORY_A WHERE DAY_PART=20160617;

+----------+--------------+-------------------------+-----------+------------+----------+
| TRANS_ID | PRODUCT      | TRANS_TS                | YEAR_PART | MONTH_PART | DAY_PART |
+----------+--------------+-------------------------+-----------+------------+----------+
| 1006     | Bug Spray    | 2016-06-17 00:00:00.000 | 2016      | 201606     | 20160617 |
+----------+--------------+-------------------------+-----------+------------+----------+
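To see why many small INSERTs become a problem, it helps to picture how the data for a partitioned Hadoop table is laid out on HDFS: each combination of partition values maps to its own subdirectory, and every INSERT adds at least one small file to it. The sketch below is illustrative only; the warehouse path is an assumption and will vary with your configuration.

/apps/hive/warehouse/inventory_a/           <-- assumed warehouse location
    year_part=2016/
        month_part=201606/
            day_part=20160613/  <one or more small data files>
            day_part=20160615/  <one or more small data files>
            day_part=20160617/  <one or more small data files>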
Approach 3: LOAD HADOOP

If you can make changes to the underlying data, use the LOAD HADOOP statement to populate the table. Generate the partitioning column values ahead of time, as shown in the snippet of the summer.dat file below, and then use the LOAD HADOOP command to populate the table:
1000|Hat|2016-06-13-15.30.37|2016|201606|20160613
1001|Sun Screen|2016-06-13-16.30.37|2016|201606|20160613
1002|Water Bottle|2016-06-13-17.30.37|2016|201606|20160613
1003|Sandals|2016-06-13-15.30.37|2016|201606|20160613
1004|Lip Balm|2016-06-13-15.31.37|2016|201606|20160613
1005|Beach Towel|2016-06-15-15.30.37|2016|201606|20160615

load hadoop using file url '/user/hadoop/inventoryl/summer.dat'
    with source properties (
        'field.delimiter'='|',
        'ignore.extra.fields'='true',
        'date.time.format'='yyyy-MM-dd-HH.mm'
    )
    into table inventory_a overwrite
    with load properties ('num.map.tasks'='256');

select * from inventory_a limit 6;

+----------+--------------+-------------------------+-----------+------------+----------+
| TRANS_ID | PRODUCT      | TRANS_TS                | YEAR_PART | MONTH_PART | DAY_PART |
+----------+--------------+-------------------------+-----------+------------+----------+
| 1004     | Lip Balm     | 2016-06-13 15:31:00.000 | 2016      | 201606     | 20160613 |
| 1003     | Sandals      | 2016-06-13 15:30:00.000 | 2016      | 201606     | 20160613 |
| 1002     | Water Bottle | 2016-06-13 17:30:00.000 | 2016      | 201606     | 20160613 |
| 1001     | Sun Screen   | 2016-06-13 16:30:00.000 | 2016      | 201606     | 20160613 |
| 1000     | Hat          | 2016-06-13 15:30:00.000 | 2016      | 201606     | 20160613 |
| 1005     | Beach Towel  | 2016-06-15 15:30:00.000 | 2016      | 201606     | 20160615 |
+----------+--------------+-------------------------+-----------+------------+----------+

Note that if the file to be loaded does not contain all the columns of the table (unlike summer.dat above, which includes the partitioning values), NULL values will be inserted for the partitioning columns, because LOAD does not automatically generate those values. Avoid LOAD if the partitioning values cannot be generated ahead of time.
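After a load like this, a quick way to confirm that the partitioning columns were populated (and to spot NULLs left behind by a file that was missing them) is to aggregate over the partition columns. This is just an illustrative check, not part of the load itself:

-- Count rows per partition; NULL partition values indicate the file lacked those columns
SELECT year_part, month_part, day_part, COUNT(*) AS row_count
FROM inventory_a
GROUP BY year_part, month_part, day_part;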
Approach 4: ALTER TABLE… ADD PARTITION
You can use the Hive or Big SQL ALTER TABLE… ADD PARTITION command to add entire partition directories if the data has already been generated and stored on HDFS. The partitioning values have to be predetermined, because both Big SQL and Hive require explicit partitioning values when adding partitions. For example, suppose the partition directories for 20170101 and 20170102 have already been generated with data in them; you can use the following command to add these partitions:
ALTER TABLE inventory_a
    ADD PARTITION (year_part='2017', month_part='201701', day_part='20170101') location 'path_to_20170101'
    ADD PARTITION (year_part='2017', month_part='201701', day_part='20170102') location 'path_to_20170102'
    ...
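Once the partitions are added, one way to verify them is Hive's SHOW PARTITIONS command. The sketch below assumes you are running it from the Hive CLI or Beeline against the same table; queries from Big SQL, such as the ones shown in the earlier approaches, will then see the data in the newly added partitions.

-- Hive shell: list the partitions registered for the table
SHOW PARTITIONS inventory_a;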