LOAD HADOOP statement

Extracts data from external data sources and loads that data into a Hadoop or HBase table. Use this statement to load data from relational databases or delimited files that are in an external or local file system.

Authorization

You must have INSERT privileges granted to LOAD into a table with the APPEND option. You must have INSERT and DELETE privileges granted to LOAD into a table with the OVERWRITE option. The instance owner and the table creator already have the appropriate privileges to INSERT and DELETE.
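
For example (a minimal sketch; the schema, table, and user names are hypothetical), a user with the required authority could grant the privileges that are needed to LOAD with the OVERWRITE option:

GRANT INSERT, DELETE ON myschema.SALES TO USER loaduser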

Syntax

LOAD HADOOP (1) using-connection from-source into-target
    [WITH LOAD PROPERTIES (attribute=value, ...)]
using-connection
USING {JDBC (2) (3) CONNECTION | FILE} URL source-url
    [WITH {PARAMETERS | CONNECTION PROPERTIES} (parameter-name=parameter-value, ...)]
from-source
FROM {SQL QUERY select-statement | TABLE table-name columns-where}
    [SPLIT COLUMN column-name] [WITH SOURCE PROPERTIES (attribute=value, ...)]
columns-where
[COLUMNS (column-name, ...)] [WHERE search-condition]
into-target
INTO [HIVE | HBASE] TABLE table-name
    [PARTITION (partition-column-name=partition-column-value, ...)]
    [APPEND | OVERWRITE (4)]
    [WITH TARGET TABLE PROPERTIES (attribute=value, ...)]
Notes:
  • 1 The HADOOP keyword is required unless the COMPATIBILITY_MODE global variable is enabled.
  • 2 The JDBC driver of the source database must be installed and Big SQL must be restarted before loading data. Alternatively, you can provide the JDBC driver JAR file location (see Adding new JDBC drivers). The JDBC JAR files are distributed to the data nodes when the Hadoop job is run.
  • 3 Each data node must have connectivity to the data source server. To verify, ping the data source server from each of the cluster data nodes.
  • 4 All property key names and values need to be single quoted. For example, PROPERTIES('field.delimiter'=',')

Description

By using the LOAD HADOOP USING command, you can import data from external data sources into target Big SQL tables. You can LOAD into HBase tables that are created by using the Big SQL CREATE HBASE TABLE syntax, or into Hadoop tables that are defined with the Hive HBase storage handler.

JDBC CONNECTION source-url

Defines the connection as a JDBC connection and includes the connection information that is required to connect to your data source. The source-url specifies the type of remote data source, the host name or IP address of the source, the port, and the name of the database to which you want to connect. Refer to the specific database documentation for examples of JDBC connection information.

Each data node must have connectivity to the data source. Try to ping the data source server from each node of the cluster.

The following example shows a DB2® connection:
'jdbc:db2://hostname.company.com:50000/data_base_name'
Surround the value for URL with single quotes.

The URL can refer to any database that supports a JDBC connection. The type of connector is based on the source-url prefix. For example, 'jdbc:oracle:thin:@abc.com:port:...' implies that this connection is an Oracle connection.

Before you run the LOAD HADOOP USING command with a JDBC CONNECTION source-url, ensure that the system administrator has installed the JDBC driver JAR files into Sqoop, or that you provide the JDBC driver JAR file location (see Adding new JDBC drivers). For more information about setting up and using Sqoop, see Importing and exporting DB2 data by using Sqoop. See also WITH PARAMETERS 'driver'=.

The following list contains the currently supported databases.
  • DB2
  • Netezza®
  • Teradata
  • MS-SQL
  • Oracle
  • Informix®
Restriction:

You cannot load from a database source into tables that are defined with ARRAY or ROW data types.

FILE source-url

Defines a path to the source delimited file. The input path can be a single file or a directory that contains the files. The file system scheme of the URI that is provided, such as FTP or SFTP, must reference one of the file systems configured in the Hadoop configuration. If the scheme and authority section of the URI is omitted, the default file system that is defined in the Hadoop configuration is used.

Access to the source files is controlled by the HDFS file permissions, and access can also be granted explicitly. If the files are not readable by the user, then the LOAD fails. You must have one of the following authorizations to access the source files:
  • DATAACCESS authority.
  • Privileges on the files and the directory in HDFS (and not Ranger):
    • READ privileges on the directory and all files in the directory, including files in any subdirectories that might be recursively specified in the FILE URL clause.
    • EXECUTE privileges on all the directories and subdirectories in the specified FILE URL clause.

When the source-url path refers to a directory, all of the directories and files in that directory are accessed recursively. The scheme of the source-url determines the type of access to the files. The following types of file access can be used:

FTP

Defines the source-url as an FTP path to access your source data.

The authentication information is specified as part of the source-url.
LOAD HADOOP USING FILE URL  
'ftp://biadmin:password@myserver.abc.com:22/home/biadmin/mydir/abc.csv'
...
SFTP

Defines the source-url as an SFTP path to access your source data.

You can specify the authentication information as part of the source-url.
LOAD HADOOP USING FILE URL 
'sftp://bigsql:password@myserver.abc.com:22/home/bigsql/mydir/abc.csv'
...
Note: SFTP source is supported for Big SQL on Hortonworks Data Platform 2.6.2 and above.
HDFS

Defines the source-url as an HDFS path to access your local or remote source data.

If you omit the hdfs:// URI, the LOAD HADOOP USING command assumes that it is an HDFS scheme:
LOAD HADOOP USING FILE URL 
'/user/bigsql/mydir/abc.csv'
You can access files on a cluster that does not require authentication.
LOAD HADOOP USING FILE URL 
'hdfs://myserver.abc.com:9000/user/bigsql/mydir/sample.csv'
...
WITH PARAMETERS connection-parameters

Specifies parameters that are used to connect to your data source. An alternate syntax is WITH CONNECTION PROPERTIES connection-parameters.

Each parameter is a parameter name and parameter value pairing, as shown in the following example:
connection-parameter-name = connection-parameter-value
The key must be surrounded with single quotes. Surround string values with single quotes. No quotes are needed for numerical values.
The following are valid parameters:
'credential.file'='fullpath'
Specifies the full absolute distributed file system (DFS) path and the name of the file that contains the credentials. This property can be used for JDBC database sources only.

The bigsql user ID and the user that runs LOAD both need to have READ access to this file.

You can use a credential file instead of the user and password parameters. The contents specify the following required name=value pairs:
JDBC credential file

user=myname
password=mypass
For example,

LOAD HADOOP USING ... WITH PARAMETERS 
 ( 'credential.file'=
   '/tmp/credstore/sample.properties')
'credential.file.passphrase'='passphrase'
Specifies the passphrase to decrypt the credential file. This property must be used with the 'credential.file' property. It can be used for JDBC database sources only.
Use the credential file and passphrase with these steps:
  1. Create a local properties file that contains the user and password properties.
  2. Run the utility file-encryption.sh to encrypt the file.
  3. Upload the encrypted file to HDFS.
For example,

/usr/ibmpacks/current/bigsql/bigsql/libexec/file-encryption.sh 
  -encrypt
  -textfile sample.properties 
  -encryptedfile sample_encrypted.properties  
  -passphrase testit 

hadoop fs -put sample_encrypted.properties /tmp/credstore

LOAD HADOOP USING ... WITH PARAMETERS (
 'credential.file'=
    '/tmp/credstore/sample_encrypted.properties',
    'credential.file.passphrase'='testit') 
'user'='username'
Specifies a user ID to connect to the data source as shown here for example: 'user'='john'.
'password'='password'
Specifies a user password to connect to the data source as shown here for example: 'password'='passwd123'. The password property can be encoded.
'driver'='com.somedb.jdbc.somedbDriver'
Optionally specify the JDBC driver class to use. When you specify the JDBC driver class, the Sqoop generic connection manager is used. When you do not specify a JDBC driver class, a specific Apache Sqoop connection manager is selected by default. The selection is based on the connection URL database type. For example, to use the Informix database, the 'driver' value might be 'com.informix.jdbc.IfxDriver'.
Note: The Informix database must have the transaction logging enabled to use the LOAD HADOOP USING command with the Sqoop connection manager.
You must also copy the JDBC driver JAR file for the specific database into the /usr/hdp/4.x.0.0/sqoop/lib directory.
You do not need to specify the JDBC driver class for the following types of databases:
  • DB2
  • Teradata
  • Netezza
  • Oracle
  • MS SQL Server
'driver.location'='sourcepath'
Optionally specify the path to a JDBC driver JAR file that is needed for the LOAD HADOOP USING statement. The path must be a directory. For more information, see Adding new JDBC drivers.
'any.jdbc.prop1'='value'
You can add as many JDBC connection properties as you need.
For example,
 
LOAD HADOOP USING ... WITH PARAMETERS
 ('user'='admin' , 'password'='abc',
   'currentQueryOptimization'='true' ) 
FROM SQL QUERY select-statement
Specifies the SQL query that is used to get data from the data source. The columns in the select clause must match both the names and the types of the columns in the target table definition. The query must have a WHERE clause that contains a $CONDITIONS marker. The $CONDITIONS marker is replaced when the LOAD command is run with a condition expression that is unique for each map task. In this case, the SPLIT COLUMN must be specified:
FROM SQL QUERY 'SELECT cust.id, cust.name, ords.itemname 
               FROM customer cust, orders ords 
               WHERE cust.id = ords.custid 
               AND $CONDITIONS' SPLIT COLUMN ords.custid INTO TABLE orders_q 
Surround the value for QUERY with single quotes. If a single quote is needed inside the value then use two single quotes.
… FROM SQL QUERY 'SELECT ID, NAME, 
   STATE, CITY FROM CITIES 
   WHERE CITY=''San Jose'' AND $CONDITIONS'
              ...
If the SQL QUERY clause is specified, the query must include a $CONDITIONS marker.
Example queries:
Q-1: 'select * from t where c1 = 1 AND $CONDITIONS'
Q-2: 'select * from t where $CONDITIONS'
FROM TABLE table-name COLUMNS (column-list) WHERE search-condition

Instead of using the SQL QUERY clause, you can specify TABLE, COLUMNS, and the WHERE clause separately. In this case, the LOAD command generates the SQL query. The table name and the column names are case sensitive. Type them in the same case that the database is expecting.

IBM® Big SQL uses <table name>, <columns>, and <search-condition> to generate a SQL statement such as the following code:
SELECT columns FROM table-name WHERE search-condition
The columns list must match the order of the target table columns.
table-name
Specifies the name of the source table in the exact same case as defined in the source database.

The table name can be either a 1- or 2-part name.

column-list
Specifies the subset of columns, in the same case as the source database, that are selected from the source data. If this property is not specified when a table-name is specified, all columns in the table are imported. The column names must be separated by commas:
...COLUMNS(id, name, city)...
search-condition
... WHERE 'id > 99' ...
Surround the value for WHERE with single quotes. If a single quote is needed inside the value then use two single quotes.
… WHERE ' CITY=''San Jose'' ' …
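The following statement is a minimal sketch that combines the TABLE, COLUMNS, WHERE, and SPLIT COLUMN clauses; the connection URL, credentials, and table names are hypothetical:

LOAD HADOOP USING JDBC CONNECTION URL
  'jdbc:db2://myhost:50000/SAMPLE'
  WITH PARAMETERS ('user'='myuser', 'password'='mypassword')
FROM TABLE CUSTOMERS
  COLUMNS (ID, NAME, CITY)
  WHERE 'CITY=''San Jose'''
  SPLIT COLUMN ID
INTO TABLE test.customers_import APPEND;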
SPLIT COLUMN column-name
This option specifies the column to use to distribute data across all tasks that are involved in the data loading process. It is not required if the number of mappers is set to 1. This option can be omitted when using the FROM TABLE clause if the source table has a primary key.
...SPLIT COLUMN customers.id... 
This property is valid for database tables only. The split column must not be of type BIGINT.

Specifying this option helps to ensure that load operations are done in parallel. By default, the primary key column is used for splitting the data. Use the split column to select a range of values to attempt to uniformly distribute data across mappers. However, be aware that selecting a split column with few unique values might restrict the LOAD operation to using as many tasks as there are unique values in the column. The performance and efficiency of the split operation are influenced by the variety of the values.

SPLIT COLUMN is required when the number of mappers is greater than one. If SQL QUERY is specified, the split column is one of the columns that are specified in the select column list. If SPLIT COLUMN is not specified, a primary key column is used. If there is no primary key, an error is reported and the LOAD fails.

SPLIT COLUMN is ignored when source data is extracted from DB2 DPF (data-partitioning-feature) or Netezza, and you specify TABLE-COLUMN-WHERE. For DB2 DPF and Netezza, the data is split by the database partitions instead of by value. However, for DB2 z/OS® DPF, the data is split by value, not by database partition. In all the other cases where the number of mappers is greater than one, you must specify SPLIT COLUMN.

WITH SOURCE PROPERTIES (attribute=value)

These properties describe how to interpret the data in the source delimited files. Specify all properties with single quotes. For example: PROPERTIES('field.delimiter'=',')

These properties are valid for loading from files only.

The following properties are supported:
'file.charset'='<character_set_encoding_type>'
The standard Java™ character set encodings are supported. The default is 'UTF-8'.
...'file.charset'='UTF-16'...
'date.time.format' = '<format_type1>'
You can repeat this property to define all of the accepted date-time formats for your input file. For example,
 ...WITH SOURCE PROPERTIES(
 'field.delimiter'=',',
 'date.time.format'='yyyy-MM-dd-HH.mm.ss.S',
 'date.time.format'='yyyy-MM-dd HH.mm.ss.S', 
 'date.time.format'='yyyy-MM-dd') 
 INTO ...
These formats are attempted in order for each field that corresponds to a target TIMESTAMP column. Use the following guidelines:
  • Do not mix ambiguous patterns. For example, mixing the date patterns 'dd-MM-yy' and 'yy-MM-dd' is ambiguous, because date strings like 02-02-02 can be interpreted with either pattern. If you specify a mix of ambiguous patterns, you might see incorrect values.
  • The first pattern that is applicable is used.
  • Specify the longest patterns first in the order.

    One reason to specify the longest pattern first is that if the data is longer than the pattern, the LOAD processing might ignore information beyond the pattern.

  • Specify the most specific formats first. For example, specify a pattern that starts with a four-digit year (yyyy) before a pattern that starts with a two-digit month (MM). Otherwise, when the two-digit pattern is parsed, it might be misinterpreted as a short two-digit year. The SimpleDateFormat pads year values so that if you include both of these formats, specify them in the following order:
    ...'date.time.format'='yyyy-MM-dd',
        'date.time.format'='MM-dd-yyyy' ...

For more information about the formats, see Java simple date/time formats.

'field.delimiter'='char'
Defines the character that you use to separate the source fields. The default character is a comma (,).
The following examples show ways of defining the field.delimiter character:
char
'field.delimiter'='|'
octal number
'field.delimiter'='\001'
escaped char
'field.delimiter'='\t'
'quote.char'='char'
Defines the character that encloses a field that might contain delimiter characters. The default character is the double quotation mark (").

You can use this property to quote the input data string that maps to the ARRAY or ROW data type column. Do not use this property to quote ARRAY or ROW data type field values.
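
For example (a sketch; the file path and table name are hypothetical), an input record such as 1,"San Jose, CA" contains the field delimiter inside a quoted field and can be read as follows:

LOAD HADOOP USING FILE URL '/tmp/data/cities.csv'
  WITH SOURCE PROPERTIES (
    'field.delimiter'=',',
    'quote.char'='"')
INTO TABLE cities APPEND;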

'escape.char'='char'
Specify an escape character that you use in your source data to make the quote character part of a string data sequence.

You can use this property with ARRAY or ROW data types. If field values contain the quote character or the field delimiter character as part of a string data sequence, then you must escape these values.

'skip.lines.count'='int'
Defines the number of lines from the top of each source file to skip before reading the data.
'field.indexes' = '<1>,...,<N>'
Use this property with a comma separated list of field indexes, to specify the fields that are to be included in the LOAD. Each input field is designated by an index number, beginning with one. The field indexes are mapped to the target columns by position.
You can use this property to select a subset of the fields and reorder the fields to match the column order of the target table.
...'field.indexes' = '1, 5, 3, 6'...
'allow.multiline.record'='true/false'
Determines whether newline characters can appear within a field. Fields that contain newline characters must be enclosed in quote characters. When you use this option, if an end quote is not found, the entire file is read as a single field. The value can be either true or false. The default value is false.

When the value is set to true, the source files are not split, even if they are larger than the block size. Each file is processed by one map task.
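
The following sketch (the file path and table name are hypothetical) reads quoted fields that span multiple lines:

LOAD HADOOP USING FILE URL '/tmp/data/notes.csv'
  WITH SOURCE PROPERTIES (
    'allow.multiline.record'='true',
    'quote.char'='"')
INTO TABLE notes APPEND;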

'ignore.extra.fields'='true/false'
Defines whether a file can contain more fields than target columns. The value can be either true or false. The default is false. If this property is set to false, any record with more fields than target columns is rejected.

Do not use this property for the elements of ARRAY or ROW data types. If there is additional data, it is discarded for ROW and associative ARRAY data types. For ordinary ARRAY data types, you can specify the array.overflow property to handle extra items in the input.

'replace.with.null'='<string>'
If your source file contains field values that map to non-string data type target columns that you want to change to null, use this parameter to specify a character string to be replaced by a null value during the LOAD process. For example, if you specify ...WITH SOURCE PROPERTIES ('replace.with.null'='NA')..., then any field value of 'NA' that is found in the source file mapping to a non-string type column is changed to null in the target table.

This property applies to all non-string type columns, such as INTEGER, SMALLINT, TINYINT, DOUBLE, FLOAT, DATE, and TIMESTAMP. This property applies to loading from files.

'replace.string.with.null'='<string>'
If your source file contains field values that map to string data type target columns that you want to change to null, use this parameter to specify a character string to be replaced by a null value during the LOAD process. For example, if you specify ...WITH SOURCE PROPERTIES ('replace.string.with.null'='NA')..., then any field value of 'NA' that is found in the source file mapping to a string type column is changed to null in the target table. When this property is specified, empty input fields and empty strings are considered empty strings; only an input string that matches the specified value is considered a null value. If this property is not specified, then empty fields and empty quoted strings are considered null values.
This property applies to all string type columns, such as CHAR, CHARACTER, VARCHAR, and STRING. This property applies to loading from files. If the source file contains a specific string to represent null values for all data type columns, such as the string 'NULL', then specify ...WITH SOURCE PROPERTIES ('replace.with.null'='NULL', 'replace.string.with.null'='NULL').... For example, assume that 'replace.string.with.null'='NA':
Input text         Big SQL table    Comments
"1","NA","Mike"    1,NULL,Mike      Recognized as a Big SQL NULL value
"2",,"John"        2,,John          Recognized as an empty string
"3","","Smith"     3,,Smith         Recognized as an empty string
'field.binary.encoding'
Specifies the encoding of a binary field in the input file. The valid values for this property are base64 or hex. The default is base64.
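For example, if the binary fields in the input file are hexadecimal encoded:
...'field.binary.encoding'='hex'...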
'collection.items.delimiter'='<char>'
Defines the character that you use to separate the collection item fields of an ARRAY or ROW data type. This property can be repeated to specify delimiters for nested levels.
LOAD supports as many levels of nesting as are supported by the CREATE TABLE statement. The order of the characters that are defined must match the nested levels of the ARRAY or ROW data type column. The default values are '\u0002', '\u0003', and '\u0004'.
Simple ARRAY and ROW data types
The default delimiter = '\u0002'.
Associative ARRAY
The default delimiter of map items = '\u0002'.
The default delimiter of map keys = '\u0003'.
Nested ordinary ARRAY and ROW data types
In an ARRAY of ROWs, the default delimiter of ARRAY items = '\u0002', and the default delimiter of the ROW fields = '\u0003'.
In an associative ARRAY of ROWs, the default delimiter of ARRAY items = '\u0002', the default delimiter of ARRAY keys = '\u0003', and the default delimiter of ARRAY values (the ROW fields) = '\u0004'.
'map.keys.delimiter'='<char>'
Defines the character that you use to separate the map key and the value. This delimiter is an alias for the second-level delimiter for collection items of an ARRAY or ROW data type. The default value is '\u0003'.
INTO TABLE table-name
The table-name parameter specifies the target table into which your source data is loaded. You can optionally qualify the table-name with the schema (schema-name.table-name). If a schema is not specified as part of the table-name, the current schema is used. An error is returned if the table-name does not exist. Create the target table before you run the LOAD HADOOP USING statement.

The table type can be any Hadoop table. For example, you can load data into an HBase table that is defined by using a Hive HBase storage handler.

PARTITION (partitions)
A partition is a list of paired column names and values. A partition column name must match a partition name in the target table. You specify a partition column value as you would a predicate specification. The following examples illustrate a partition-value:
For an integer
age = 25
For a variable character
name = 'John Doe'
Tables are partitioned by using the CREATE and ALTER table statements. PARTITION is only applicable for a Hadoop target table type, which is partitioned.
    • If you specify none or only some of the target table partitions as LOAD partitions, Big SQL LOAD discovers the missing partition names, assigns values from the source data, and creates the missing partitions.
For large partitioned tables, you might see better performance by using an INSERT INTO...SELECT statement. Create an external table that provides a definition for the input data first.

You cannot use this parameter when you load data from a Teradata data source that uses the Teradata Connector for Hadoop (TDCH).

When an INSERT or LOAD operation creates new partitions, if the table is explicitly cached, then the new partitions will also get cached in the same pool on the next LOAD or INSERT operation on the table. For more information about caching, see HDFS caching.

APPEND or OVERWRITE
Specifies whether to APPEND source data to the data in the target table or to OVERWRITE the data in the target table with the source data. The rules for APPEND and OVERWRITE depend on several factors:
  • When loading into HBase tables, if the row with the key already exists, it is overwritten with new data. If the row does not exist, the new row is added to the table. For HBase tables, the OVERWRITE option does not apply. An invalid option error is issued.
  • For Hadoop tables that are not partitioned, you can specify APPEND or OVERWRITE.
  • For Hadoop tables that are partitioned, and all partition values are specified for all of the partition keys, then the specified partition must be empty.
  • OVERWRITE is not supported for Hadoop tables that load from a Teradata data source by using the Teradata Connector for Hadoop ('tdch.enable'='true'). If you do use the OVERWRITE parameter, you see the following error message: The "OVERWRITE" option is incompatible with the "tdch.enable" option.
  • If you specify none, or do not specify partition values for all of the partition keys, you can specify APPEND or OVERWRITE. For tables with partitions in default locations (under the table path), Big SQL LOAD discovers missing partition names and values and then creates the required partitions. However, a Hadoop table can have partitions in different locations than the table location. In this case, when partitions have non-default locations (not under the table path), do not specify the OVERWRITE clause.
WITH TARGET TABLE PROPERTIES (attribute=value)
These properties can apply to loading from all sources. The key must be surrounded with single quotes. Surround string values with single quotes. No quotes are needed for numerical values.
Properties that are common to all table types
'column.names'='col1,...,colN'

This target table property applies to all types of tables including HBase tables.

This property defines the list of target columns to be loaded. When this property is set, the source and target columns are matched by position. The number of source columns selected to import should match the number of target columns that are specified here. This option can be used to specify a subset of the columns in the target table. Omitted columns are set to null. This property can apply to files or tables.

This option can be used with the 'field.indexes' option to map specific fields in a file to specific columns in the target table.

...'column.names'='NAME, DEPT, ID'...

You cannot use this property when you load data from a Teradata data source that uses the Teradata Connector for Hadoop (TDCH).

For files, you can specify ARRAY or ROW data type columns as target columns. But, you cannot specify the individual elements of the ARRAY or ROW data type column.
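
For example, the following sketch (the file path, field positions, and column names are hypothetical) loads the first, third, and second input fields into the NAME, DEPT, and ID columns of the target table:

LOAD HADOOP USING FILE URL '/tmp/data/staff_subset.csv'
  WITH SOURCE PROPERTIES ('field.indexes'='1,3,2')
INTO TABLE staff_subset APPEND
  WITH TARGET TABLE PROPERTIES ('column.names'='NAME, DEPT, ID');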

'array.overflow'= 'keep|reject|truncate'
Use this property to specify the LOAD actions if the input has more items than the target ARRAY data type length. The value can be keep, reject, or truncate.
keep
If you specify 'array.overflow'='keep', then all of the items are kept.
reject
If you specify 'array.overflow'='reject', then the row is rejected if the input contains more items than the ARRAY length.
truncate
If you specify 'array.overflow'='truncate', extra items in the input are discarded. These items are not loaded into the table.
'reject.nulls'='true|false'

By using this property, you can reject records that have a value of NULL if the target column does not allow NULL values. The value can be true or false. The default is true. If you specify 'reject.nulls'='true', then the row is rejected. Disable the property by setting 'reject.nulls'='false'.

This applies to NULL values, not values that contain the text "NULL".

'decimal.overflow'='reject|round_up|round_down'
By using this property, you can enforce the precision and scale for DECIMAL target data types. The value can be reject, round_up, or round_down. The default is reject.
reject
If you specify 'decimal.overflow'='reject', then the row is rejected if the precision or scale does not match the target.
round_up | round_down
If you specify 'decimal.overflow'='round_up' or 'decimal.overflow'='round_down', the input value is rounded up or down to match the target precision and scale.

After rounding, if the value still does not fit the target precision and scale, then the record is rejected.

You can use this property for ARRAY and ROW fields. For example, if a ROW type field is defined as decimal, and the input value does not fit into the defined scale, the row is rejected, if you specify 'decimal.overflow'='reject'.

Properties that are specific to Hadoop tables
'drop.delimiter.chars'='true'|'false'
Indicates that the characters \n, \r, and \001 are to be dropped from string data.

This property is applicable only for tables that use the TEXTFILE storage format. The LOAD HADOOP USING statement fails if you use this property with any storage format other than TEXTFILE.

You cannot use this property when you load data from a Teradata data source that uses the Teradata Connector for Hadoop (TDCH).

You can use this property with ARRAY or ROW data types.

'replace.delimiter.chars'
Indicates that the characters \n, \r, and \001 in the string data are to be replaced by the specified character. This property and drop.delimiter.chars are mutually exclusive.
Warning: If you do not specify either property, and if the data includes \n, \r or \001 as delimiters, data can become unreadable as part of the LOAD.

This property is applicable only for tables that use the TEXTFILE storage format. The LOAD HADOOP USING statement fails if you use this property with any storage format other than TEXTFILE.

You cannot use this property when you load data from a Teradata data source that uses the Teradata Connector for Hadoop (TDCH).

You can use this property with ARRAY or ROW data types.
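
A minimal sketch (the file path, table name, and replacement character are hypothetical) that replaces any \n, \r, or \001 characters in the string data with a space:

LOAD HADOOP USING FILE URL '/tmp/data/comments.csv'
INTO TABLE comments_text APPEND
  WITH TARGET TABLE PROPERTIES ('replace.delimiter.chars'=' ');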

Properties that are specific to HBase tables
hbase.load.method
The possible values are put or bulkload. The default value is put. This property can apply to files or tables.

When bulkload is used, Big SQL outputs the data in the internal data format of HBase. It then directly loads the generated files. If you import into an empty table and the amount of data is large, consider pre-splitting the data for performance. A temporary directory is created under the bigsql user directory in the distributed file system (DFS). You can change this directory by using the hbase.bulkload.workdir property.

If the put option is used, then individual records are inserted into the HBase table. With this option you can also specify the hbase.disable.wal parameter.

hbase.disable.wal
This parameter determines whether write-ahead logging for HBase writes and puts is disabled. By disabling write-ahead logging, you increase the performance of write operations, but you can incur data loss if the region servers fail. Your client application is responsible for ensuring data consistency when you use this option. The value is either true or false. The default is false.
This parameter is applicable only if hbase.load.method=put.
hbase.bulkload.workdir
This optional property specifies a temporary directory for creating hfiles during the load into HBase tables using the bulkload option. If this property is not used, a temporary directory is created under the bigsql user directory.
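For example (the work directory path shown is hypothetical):
...WITH TARGET TABLE PROPERTIES ('hbase.load.method'='bulkload', 'hbase.bulkload.workdir'='/tmp/bulkload_work')...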
hbase.timestamp
This property can be used to specify the version for all the source rows. The specified value must be a long integer. By default, the system uses the current server time in milliseconds.
WITH LOAD PROPERTIES properties
The key must be surrounded with single quotes. Surround string values with single quotes. No quotes are needed for numerical values.
'rejected.records.dir'
This property defines a target location for records that are not imported or loaded into the target table. The directory can be defined in the HDFS file system only. The default is that no directory is defined. This property can apply to files or tables.

An example of a reason that records might be rejected is because the data types do not match between the source and the target.

When the LOAD completes and your source data contains records that cannot be loaded, an informational message is issued that specifies the number of records that were rejected. If the directory is specified, then the rejected record files are created in that directory.

If this parameter is not used, then the rejected records are discarded. The map task logs contain error information, including the failing input records, for the first 100 rejected records.
Note: In Hadoop, a record is defined by the input format in a data source.

You cannot use this parameter when you load data from a Teradata data source that uses the Teradata Connector for Hadoop (TDCH).

For more details about rejected records, see Access restrictions to rejected records.

'num.map.tasks'

The maximum number of map tasks to create. If not specified, the default is 4. If the source files are local to the cluster, then the default number of map tasks is 80% of the cluster map task capacity.

By using this parameter, you can control the maximum number of map tasks that are created during the LOAD. You can improve the performance by increasing the number of map tasks that are assigned to the LOAD.

'num.reduce.tasks'

The number of reduce tasks that are used to perform the LOAD operation. If not specified, the default is the same as the number of map tasks that are specified or the default number of map tasks. For a file source that requires reducer tasks, and no value for num.reduce.tasks is specified, then 80% of the cluster reducer task capacity is used.

You use this property to influence the LOAD operation when dynamically loading into a partitioned table. If num.reduce.tasks is set to zero (0), the LOAD is performed by using map tasks only.

'max.rejected.records'
Maximum number of rejected records that are allowed before the load process ends with an error. The default is 100.

You cannot use this parameter when you load data from a Teradata data source that uses the Teradata Connector for Hadoop (TDCH).

'tdch.enable' = 'true' or 'false'
If you want to use the Teradata Connector for Hadoop (TDCH) when you use LOAD, then you must use the tdch.enable LOAD option. The default is 'false'.
'tdch.enable'='true'
The Teradata Connector for Hadoop (TDCH) supports the following output formats:
  • TEXTFILE
  • RCFILE
  • SEQUENCEFILE
  • PARQUET, as long as the LOAD properties include the 'tdch.hcat'='true' parameter, and the Big SQL table is created by using the STORED AS PARQUETFILE parameter.

To use the Teradata Connector for Hadoop (TDCH), you must follow the installation and setup instructions for the TDCH Sqoop wrapper.

'tdch.hcat' = 'true' or 'false'
If you enabled TDCH, then by using the tdch.hcat parameter, you can use HCatalog when you import the data. When you use HCatalog with TDCH, you must load into an empty Big SQL table, or the LOAD fails. The default is 'false'.

You must set 'tdch.enable'='true' to use 'tdch.hcat' = 'true'.
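
The following sketch (the connection URL, credentials, and table names are hypothetical, and the target is assumed to be an empty table that was created by using STORED AS PARQUETFILE) loads from Teradata by using TDCH with HCatalog:

LOAD HADOOP USING JDBC CONNECTION URL
  'jdbc:teradata://myhost/database=GOSALES'
  WITH PARAMETERS ('user'='myuser', 'password'='mypassword')
FROM TABLE COUNTRY
  SPLIT COLUMN SALESCOUNTRYCODE
INTO TABLE country_parquet APPEND
WITH LOAD PROPERTIES ('tdch.enable'='true', 'tdch.hcat'='true');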

Usage notes

  1. For a guide to diagnosing and correcting typical LOAD HADOOP problems, see How to diagnose and correct LOAD HADOOP problems.
  2. If you are loading data into an HBase table in an HDP environment, and HDFS NameNode HA is enabled, you must add the NameNode alias to the /etc/hosts file on all nodes in the cluster.
  3. The SQLCA count is not set by LOAD, so the rows-affected value that is listed in the SQLCA always indicates 0.
  4. When the LOAD completes successfully, you receive an informational message (SQL5108) with the number of rows loaded in the target, the number of input records, the number of rejected records, and the number of lines skipped. The message contains the Hadoop JOB ID for reference. You can use the JOB ID to find information about the LOAD Hadoop job, and then refer to the map reduce tasks to find information about the rejected records, and additional details about the job in the bigsql.log file. You can use the log entry identifier [BSL-x-xxxxxxxxx] to refer to the load failure in the bigsql.log.
  5. When the LOAD completes with errors, you receive a message that indicates the problem. If the error occurred after the Hadoop job started, the JOB ID is shown. You can use the JOB ID to retrieve the map task logs with additional information.

    If the maximum number of rejected records is exceeded, then the Hadoop job is terminated and no data is loaded. This condition is checked periodically, so it is possible that the job completes before it can be terminated. In this case, the actual number of rejected records might be more than the maximum.

    If the number of rejected records exceeds the maximum, then an SQL error message (SQL5109N) is issued.

  6. When you use the LOAD HADOOP USING statement, the number of columns that are specified in the source table must be less than or equal to the number of columns in the target table. In general, the columns in the target table should exactly match the columns in the source relational table, such that they have the same name and they are in the same order.
  7. If 'num.map.tasks' is greater than one, and if you specify the SQL QUERY clause, then you must also specify SPLIT COLUMN. If 'num.map.tasks' is more than 1 and if you specify the TABLE clause, then you must also specify either a SPLIT COLUMN or the table must have a primary key. Split columns and primary keys are used by Big SQL to parallelize the load operation.
  8. When you specify table names and column names, the case (upper, lower, mixed) of each character of the name must match the corresponding character with which the name is specified in the source database.
  9. The use of single or double quotations:
    Properties key-value pairs
    The key needs to be surrounded with single quotes. For string values, surround with single quotes. For numerical values, no quotes are needed.
    'num.map.tasks' = 20
    'date.time.format'='dd-MM-yy' 
    Identifiers for table or column names
    Quotations are optional for table and column names. Each part of the name can be surrounded with double quotes.
    String values (URL, QUERY, WHERE)
    Use single quotes surrounding the value for the URL, QUERY and WHERE. If a single quote is needed inside the value, then use two single quotes.
    
    ... WHERE ' CITY=''San Jose''' ...
    
    
    … SQL QUERY 'SELECT ID, NAME, STATE, CITY FROM CITIES WHERE
                  CITY=''San Jose'' AND $CONDITIONS' ...
  10. Interpreting data values
    BOOLEAN
    When you load data from a file, the following case insensitive Boolean string values are interpreted as true:
    • "1"
    • "yes"
    • "true"
    • "on"
    Any other value is interpreted as false.
    DATE and TIMESTAMP
    When you load data from a file, the date.time.format pattern can be used to interpret the format of the data.
  11. When you load source files that are in a compression format, the files cannot be split. If the file is not compressed and is on the cluster, then the file can be split by block size. When data is loaded from a remote file system, the source files are not split even if they are larger than the block size. The source files are distributed across the map tasks by file size to evenly distribute the work.
  12. Target table data is compressed based on the following Hadoop compression properties:
    • mapreduce.output.fileoutputformat.compress
    • mapreduce.output.fileoutputformat.compress.codec
      Set these properties before you run the LOAD HADOOP USING command by using the following Big SQL command:
      SET HADOOP PROPERTY mapreduce.output.fileoutputformat.compress='true';
      SET HADOOP PROPERTY mapreduce.output.fileoutputformat.compress.codec='com.ibm.biginsights.compress.CmxCodex';
  13. When you use the Teradata Connector for Hadoop ('tdch.enable'='true'), be aware that certain Teradata native data types are mapped differently during the LOAD. Create the Big SQL table with column data types that are supported by the Teradata Connector for Hadoop. For example, if the Teradata data source is a DECIMAL, DATE, or ARRAY, the target Big SQL table must be a VARCHAR or STRING data type.
    Table 1. How Teradata native data types are mapped to Big SQL by using TDCH
    Teradata native data type    Big SQL data type
    DECIMAL                      VARCHAR or STRING
    DATE                         VARCHAR or STRING
    ARRAY                        VARCHAR or STRING or ARRAY
    STRUCT                       STRUCT or STRING
    INTEGER                      INTEGER
    BIGINT                       BIGINT
    SMALLINT                     SMALLINT
    TINYINT                      TINYINT
    FLOAT                        FLOAT
    DOUBLE                       DOUBLE
    TIMESTAMP                    TIMESTAMP
    TIME                         STRING
    VARCHAR                      STRING
    CHAR                         STRING
    BOOLEAN                      BOOLEAN
    Note: If the Teradata table has a VARCHAR column where the data is in the correct format, it can be imported to a Hive STRUCT and MAP column.
  14. After a LOAD completes successfully, the ANALYZE command runs automatically on the table. With the auto analyze feature turned on, which is the default, the following command is run:
    ANALYZE TABLE schema.table COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM(10);
    If you want to modify the auto analyze after LOAD feature, then set the configuration property biginsights.stats.auto.analyze.post.load to ONCE, NEVER, or ALWAYS in the bigsql-conf.xml file, or set a session property by using SET HADOOP PROPERTY. For more information about the auto analyze feature of ANALYZE, see Enabling or disabling analyze after load.

Restrictions

  • Restriction: If only the /apps/hbase/data directory is encrypted, INSERT .. SELECT or bulk LOAD INTO HBASE tables does not work.
  • The LOAD HADOOP USING statement populates Hadoop and HBase tables only. It cannot populate native database tables.
  • If global impersonation is enabled, a user <username> can run LOAD HADOOP successfully only if the following conditions are true:
    • The HDFS directory, /user/<username>/, must exist with READ, WRITE, and EXECUTE permissions for the user.
    • <username> must exist on each node of the cluster.
  • <schemaName>, <tableName>, and <colName> must be specified in the exact case as defined in the database.
  • You cannot issue concurrent LOAD HADOOP USING statements to load a single table. Running multiple LOAD HADOOP USING statements in parallel against one table results in an error.
  • When you use the Teradata Connector for Hadoop (TDCH) ('tdch.enable'='true'), the following restrictions apply:
    • You cannot LOAD into HBase tables from a Teradata data source by using the Teradata Connector for Hadoop (TDCH). The BigInsights® Teradata connector is used instead.
    • You cannot limit the number of rejected records, nor write them to a file. When there are bad records, the LOAD operation fails.
    • You cannot LOAD into a static partition.
    • You cannot OVERWRITE previous data. Data is always appended.
    • When you load a partitioned table dynamically, the source columns must be returned in the same order as the target Hive table. Use the COLUMNS (…) parameter to order the source columns to match the partitioned Hive table, which means the partitions are the last columns.
  • A LOAD HADOOP operation does not automatically generate values for the partition-expression column when a table is created with the PARTITIONED BY partition-expression clause.

Adding new JDBC drivers

You can give the load utility access to external JDBC drivers in one of the following ways:
  • By copying the JDBC driver to the /usr/hdp/4.x.0.0/sqoop/lib directory and restarting the Big SQL service in Ambari.
  • By setting the bigsql.load.jdbc.jars parameter in the bigsql-conf.xml file. For example:
    <property>
      <name>bigsql.load.jdbc.jars</name>
      <value>/tmp/jdbcdrivers</value>
      <description>Directory to store JDBC driver JAR files for LOAD</description>
    </property>
  • By setting the driver.location parameter on the LOAD HADOOP statement. For example:
    LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://hostname.company.com:50000/data_base_name'
      WITH PARAMETERS (
        'user'='myname',
        'password'='mypass',
        'driver.location'='/tmp/myjdbcjars'
      )
      FROM TABLE myDB2Table
      SPLIT COLUMN COLA INTO TABLE myBITable APPEND

Mapping source JDBC type to the target Big SQL type

The following table shows the mapping from the source JDBC type to the target Big SQL type. The mapping from the native data type to the JDBC type is JDBC driver dependent.

This mapping applies to database sources. For a delimited source file, all of the fields are in text and are converted to the target data type. The record is rejected if the conversion cannot be done.
Table 2. Mapping from JDBC types to Big SQL types
Source JDBC type                                  Big SQL data type
INTEGER                                           INTEGER
VARCHAR, LONGVARCHAR, NVARCHAR, LONGNVARCHAR      VARCHAR
CHAR, NCHAR                                       CHAR
NUMERIC, TIME                                     STRING
DECIMAL                                           DECIMAL
BOOLEAN, BIT                                      BOOLEAN
TINYINT                                           TINYINT
SMALLINT                                          SMALLINT
BIGINT                                            BIGINT
FLOAT, DOUBLE                                     DOUBLE
REAL                                              FLOAT
TIMESTAMP, DATE                                   TIMESTAMP
BINARY                                            BINARY
VARBINARY                                         VARBINARY
BLOB, CLOB, LONGVARBINARY, ROWID, REF, DATALINK,  not supported
   database-specific JDBC types (such as
   INTERVAL, SERIAL), ARRAY, STRUCT, MAP

Access restrictions to rejected records

Some details about accessing rejected records:

When you run LOAD with rejected.records.dir specified, the rejected records folder and file are created with the following restricted access permissions:
  • If the rejected records folder already exists before you run LOAD, both you and the bigsql user must have READ-WRITE-EXECUTE access to the folder; otherwise, LOAD fails. If you can access the folder, the rejected records file is created with the same permissions as those of the existing folder.
    • With impersonation, the ACL entries on the rejected records file always set exclusive READ-WRITE-EXECUTE access to you only (i.e. the user initiating the LOAD).
    • Without impersonation (disabled), the ACL entries on the rejected records file set exclusive READ-WRITE-EXECUTE access to bigsql user only. The ACL entries on the rejected records folder remain unchanged.
  • If the rejected records folder does not exist:
    • Without impersonation, when you run LOAD it creates the rejected records folder (specified in rejected.records.dir) with exclusive READ-WRITE-EXECUTE permissions for the bigsql user only. No other user has access to that file or folder. If you require access, you must create the folder manually (outside of LOAD) and set RWX permissions for both yourself and the bigsql user (for example, by setting group access) for the folder. The new rejected records file always inherits the folder permissions and is created with RWX permissions for the bigsql user only.
    • With impersonation, when you run LOAD, it creates the rejected records folder with READ-WRITE-EXECUTE permissions exclusively for you. No other user has access to the folder and file. The new rejected records file always inherits permissions from the folder. With impersonation, the rejected records are accessible to you after LOAD runs.

    With impersonation, the ACL entries on the rejected records folder and file are always set with exclusive READ-WRITE-EXECUTE access to you only (that is, the user who initiates the LOAD). Without impersonation (disabled), the ACL entries on the rejected records folder and file are set with exclusive READ-WRITE-EXECUTE access to the bigsql user only.

When you run LOAD and specify rejected.records.dir, the rejected records (column values) are not written into the map-reduce logs. The rejected records are only written into the logs when rejected.records.dir is not specified.

A common practice is to create a rejected records folder with RWX permissions for both yourself and the group, and to have yourself and the bigsql user in the same group.
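
For example, the following commands are a sketch of that practice, assuming a hypothetical shared group named bigsqlgrp and a hypothetical directory /user/loads/reject_records:

hadoop fs -mkdir -p /user/loads/reject_records
hadoop fs -chgrp bigsqlgrp /user/loads/reject_records
hadoop fs -chmod 770 /user/loads/reject_records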

Examples

The following examples demonstrate some of the LOAD parameters:
  1. Loading from a DB2 table.
    This example shows the SQL QUERY clause.
    
    CREATE HADOOP TABLE STAFF_Q ( 
      ID SMALLINT, NAME VARCHAR(9), 
      DEPT SMALLINT, JOB VARCHAR(5), 
      YEARS SMALLINT, SALARY STRING, COMM STRING);
    
     
    LOAD HADOOP USING 
      JDBC CONNECTION URL 
        'jdbc:db2://myhost:50000/SAMPLE'
      WITH PARAMETERS (
         'user' = 'myuser','password'='mypassword')
      FROM SQL QUERY 
         'SELECT * FROM STAFF WHERE YEARS > 5 
            AND $CONDITIONS'
            SPLIT COLUMN ID INTO TABLE STAFF_Q  
      APPEND
    ;
  2. Loading from a DB2 table into a partitioned Big SQL table.
    This example shows the SPLIT COLUMN clause. It uses target table properties and multiple load properties.
    
    CREATE HADOOP TABLE staff ( 
      id SMALLINT , name VARCHAR (9), job VARCHAR(5)) 
      partitioned by (years SMALLINT ); 
    LOAD HADOOP USING 
      JDBC CONNECTION URL 
        'jdbc:db2://myhost:50000/SAMPLE'
      
      FROM TABLE STAFF 
       COLUMNS(ID,NAME,JOB,YEARS)
          WHERE 'YEARS is not null' 
          SPLIT COLUMN ID INTO TABLE staff APPEND
      WITH TARGET TABLE PROPERTIES (
        'drop.delimiter.chars'='true')
      WITH LOAD PROPERTIES (
        'max.rejected.records'=50, 
        'num.map.tasks' = 2)
    ;
  3. Load from a file that is on the distributed file system in the directory /tmp/data/:
    
    CREATE HADOOP TABLE STAFF_F ( 
      ID SMALLINT, NAME VARCHAR(9), 
      DEPT SMALLINT, JOB VARCHAR(5),
      YEARS SMALLINT, SALARY VARCHAR(20), 
      COMM VARCHAR(20))
    ;
    
    LOAD HADOOP USING 
      FILE URL '/tmp/data/staff.csv' 
      WITH SOURCE PROPERTIES('field.delimiter'=':') 
      INTO TABLE TEST.STAFF_F APPEND
    ;
    If no protocol is used in the URL value, the path is assumed to be a distributed file system (DFS) file.
  4. Loading from a Teradata table.
    This example shows how to load a table from Teradata. The example assumes that table country_info exists.
    LOAD HADOOP USING JDBC CONNECTION 
     URL 
    'jdbc:teradata://myhost/database=GOSALES' 
     WITH PARAMETERS (
       'user'='myuser', 'password'='mypass')
    FROM 
     SQL QUERY 
       'SELECT * FROM COUNTRY 
        WHERE SALESCOUNTRYCODE > 6 AND $CONDITIONS' 
     SPLIT COLUMN SALESCOUNTRYCODE  
    INTO TABLE 
     country_info  APPEND;
  5. Load from the contents of a directory.
    This example shows how to load the contents of files in a directory that is part of your local file system into a Big SQL table:
    LOAD HADOOP USING FILE URL 
    'sftp://biadmin:biadmin@myserver.abc.com:22/home/biadmin/mydir'
    WITH SOURCE PROPERTIES (
      'field.delimiter'=' ', 
      'file.charset'='UTF-8', 
      'skip.lines.count'='1')
    INTO TABLE staff_table 
    PARTITION ( years='10') APPEND
    ;
  6. Load from a database source that does not have the installed drivers.
    In this example, you must explicitly specify the driver name.
    
    LOAD HADOOP USING JDBC CONNECTION URL
      'jdbc:informix-sqli://myhost:1550/TESTDB:INFORMIXSERVER=infxservername;DELIMIDENT=y;'        
    WITH PARAMETERS (
       'user'='dbuser','password'='dbpw',
       'driver'='com.informix.jdbc.IfxDriver'
      )  
    FROM SQL QUERY
        'select * from a_table 
         WHERE $CONDITIONS' SPLIT COLUMN  a_key INTO 
         TABLE a_table_infx APPEND;
  7. Load from a file and include a path for the rejected records.
    LOAD HADOOP USING FILE 
    URL 'sftp://user1:password1@host1/path/to/file.data'
    WITH SOURCE PROPERTIES 
      ( 
      'date.time.format' = 'yyyy-MM-dd HH:mm:ss', 
      'date.time.format' = 'yyyy-MM-dd',
      'field.delimiter'=',', 'skip.lines.count'=1)
    INTO TABLE base_staff1 
    APPEND
    WITH TARGET TABLE PROPERTIES
       ('column.names' = 'c1,c2')
    WITH LOAD PROPERTIES 
       ('num.map.tasks' = 1, 
        'max.rejected.records'=500,
        'rejected.records.dir' = '/temp/reject_records')
    ;
  8. Load from a file on the distributed file system and insert a character string in place of the null value.
    LOAD HADOOP USING FILE URL 
      '/tmp/input/staff_na.csv'
    WITH SOURCE PROPERTIES (
      'replace.with.null'='NA', 
      'ignore.extra.fields'='true')
    INTO TABLE staff_file OVERWRITE
    WITH LOAD PROPERTIES ('num.map.tasks' = 1);
  9. Granting privileges to LOAD.
    If you do not have database administrative authority, you cannot load data into a table. If you created the table, you automatically have the proper authority. If you are the DBADMIN, you can grant authority to other users by using the following command:
    
    GRANT INSERT, SELECT, DELETE ON myschema.SALES TO USER user1
  10. Dynamically loading into a partitioned table.
    This example adjusts both the number of map tasks and the number of reduce tasks.
    
    LOAD HADOOP USING FILE 
      URL '/tmp/lineitem/' 
      WITH SOURCE PROPERTIES (
       'field.delimiter'='|', 
       'ignore.extra.fields'='true',
       'field.indexes'='1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,11'
       )
    INTO TABLE lineitem_parquet
       APPEND 
       WITH LOAD PROPERTIES (
       'num.map.tasks'='20', 
       'num.reduce.tasks' = '10'
        )
    ;
  11. Load from a directory into an HBASE table by using the PUT method.

    In this example, you use the PUT load method, which is compatible with the property to disable write-ahead logging. You specify the column.names property to indicate where the data is loaded in the target table. By using multiple date.time.format source properties, LOAD can recognize different date format values in the input.

    
    LOAD HADOOP USING FILE URL  '/input/orderdata' 
      WITH SOURCE PROPERTIES   (
        'date.time.format' = 'yyyy-MM-dd hh:mm:ss',
        'date.time.format' = 'yy-MM-dd'
      )
    INTO TABLE staff 
     WITH TARGET TABLE PROPERTIES  (
      'hbase.load.method' = 'put',
      'hbase.disable.wal' = 'true',
      'column.names' = 'id,name,purchasedate'
      )
     WITH LOAD PROPERTIES  (
       'num.map.tasks' = 10, 
       'max.rejected.records'=500,
       'rejected.records.dir' = '/tmp/reject_records'
      )
    ;
  12. Load from a Teradata table into an HBase table by using target table properties and load properties.
    This example shows how to load a table from Teradata. The statement specifies a list of table properties and load properties in HBase.
    
    CREATE HBASE TABLE country_hbase (
    salescountrycode INTEGER, country VARCHAR(50),currencyname VARCHAR(50))  
    COLUMN MAPPING (
     KEY MAPPED BY (salescountrycode),
     cf:country MAPPED BY (country),
     cf:currencyname MAPPED BY (currencyname)
     );
      
    LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:teradata://myhost/database=GOSALES' 
       WITH PARAMETERS ('user'='myuser','password'='mypassword')
     FROM TABLE "GOSALES"."COUNTRY" COLUMNS (SALESCOUNTRYCODE, COUNTRY, CURRENCYNAME) 
        WHERE 'SALESCOUNTRYCODE > 9' SPLIT COLUMN SALESCOUNTRYCODE
     INTO TABLE country_hbase 
    WITH TARGET TABLE PROPERTIES(
      'hbase.timestamp' = 1366221230799, 
      'hbase.load.method'='bulkload'
      )
    WITH LOAD PROPERTIES (
      'num.map.tasks' = 10,
      'num.reduce.tasks'=5
    )
    ;
  13. Load HBase data into a Hive table that was defined with the HBase storage handler:
    This example begins by creating a table with the HBase storage handler and the hbase.columns.mapping parameters:
    
    CREATE HADOOP TABLE staff_hbase ( 
      id SMALLINT, name VARCHAR(100), dept SMALLINT, 
      job STRING, years SMALLINT, salary STRING, comm STRING
    ) 
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping' = ':key,info:name,info:dept,role:job,role:years,money:salary,money:comm');
    Then, you can load the HBase data by using the hbase.load.method parameter:
    
    LOAD HADOOP USING FILE URL '/testdata/staff'
     WITH SOURCE PROPERTIES (
      'field.delimiter' = ':'
      ) 
    INTO TABLE staff_hbase 
      WITH TARGET TABLE PROPERTIES (
       'hbase.load.method' = 'bulkload'
       )
    ;
  14. Load data from a Teradata data source by using the Teradata Connector for Hadoop (TDCH):
    
    LOAD HADOOP USING JDBC CONNECTION URL
      'jdbc:teradata://153.64.73.10/database=GOSALES'
       WITH PARAMETERS  (
         'user'='gosales',
         'password'='gosales'
         )
     FROM TABLE COUNTRY 
       COLUMNS (SALESCOUNTRYCODE, COUNTRY, ISOTWOLETTERCODE,
           ISOTHREEDIGITCODE, CURRENCYNAME,  EUROINUSESINCE,  ISOTHREELETTERCODE
         )
     WHERE 'ISOTHREELETTERCODE IS NOT NULL'
        SPLIT COLUMN SALESCOUNTRYCODE  INTO TABLE  tdcountry_p_3iso 
     APPEND  
     WITH LOAD PROPERTIES
        ('tdch.enable'='true')
    ;
  15. Load from a file and include a control for matching the precision and scale of DECIMAL values:
    LOAD HADOOP USING FILE 
    URL '/tmp/file.data'
    WITH SOURCE PROPERTIES 
      ( 
      'date.time.format' = 'yyyy-MM-dd HH:mm:ss', 
      'date.time.format' = 'yyyy-MM-dd',
      'field.delimiter'=',', 'skip.lines.count'=1)
    INTO TABLE base_staff1 
    APPEND
    WITH TARGET TABLE PROPERTIES
       ('column.names' = 'c1,c2', 
        'decimal.overflow'='round_up')
    WITH LOAD PROPERTIES 
       ('num.map.tasks' = 1, 
        'max.rejected.records'=500,
        'rejected.records.dir' = '/temp/reject_records')
    ;
  16. Load data into a table that contains an ARRAY and a ROW data type.
    Define the table with three columns. Column id is an INTEGER. Column c1 is an array of INTEGERs. Column c2 is a ROW type that contains an INTEGER data type and a VARCHAR data type. The LOCATION keyword defines a table with pre-existing data.
    
    CREATE HADOOP TABLE t1(
      id INT, c1 INT ARRAY[4], c2 ROW(f1 int, f2 VARCHAR(8))
      ) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    COLLECTION ITEMS TERMINATED BY '|' 
    LOCATION '/users/user1/tables/t1'
    ;
    The data is in file table1.csv, and contains the following contents:
    
    0,0|1|1|2,25|Data0025 
    1,1|1|2|3,30|Data0030 
    2,2|3|5|8,50|Data0050
    Run the LOAD statement to load the data into the appropriate columns:
    
    LOAD HADOOP USING FILE URL
      'sftp://biadmin:biadmin@myserver.com:22//home/user1/data/table1.csv'
    WITH SOURCE PROPERTIES (
      'field.delimiter' = ',',
      'file.charset' ='UTF-8'
    )
    INTO HIVE TABLE t1
    APPEND
    WITH LOAD PROPERTIES (
      'num.map.tasks' = 1
    )
    ;
    Get the results from the id column, the fourth entry in the array from the column c1, and the field called f2 from the column c2.
    SELECT id, c1[4], c2.f2 FROM t1
  17. Load a table by using the "round_up" property to match the target precision and scale.
    
    CREATE HADOOP TABLE hvalidate_data_tunc_roundup_nn (
      DEC52_COL DECIMAL(5,2), 
      VAR3_COL VARCHAR(3), 
      CHAR1NN_COL VARCHAR(1) NOT NULL, 
      DEC100_COL DECIMAL
    );
    
    LOAD HADOOP USING FILE 
      URL '/tmp/data/validate_data.txt' 
      INTO TABLE hvalidate_data_tunc_roundup_nn 
    APPEND 
    WITH TARGET TABLE PROPERTIES(
      'decimal.overflow'='round_up', 
      'reject.nulls'='true'
    )
    ;
    
  18. Load from a delimited data file into a table that is defined with a ROW data type. The fields of the ROW are delimited by a colon (:).
    Create the table:
    
    CREATE HADOOP TABLE customers( 
       name VARCHAR(10), 
       address ROW(
          street VARCHAR(30), 
          city VARCHAR(20),
          zip INT
          )
      )
    ;   
    The data to input looks like the following text:
    
    John Doe,  "555, Bailey Avenue:San Jose:95148"
    Daniel Smith , "1001 First Avenue:New York:10022"
    Then, load the data:
    
    LOAD HADOOP USING FILE URL '/testdata/customers.csv'  
    WITH SOURCE PROPERTIES (
     'collection.items.delimiter' = ':' 
    )
    INTO TABLE customers;    
  19. Load from a delimited data file into a table defined using an ARRAY data type.
    Create the three-column table. The third column is an ARRAY of VARCHAR data.
    
    CREATE HADOOP TABLE product_usage( 
      customer_name VARCHAR(100),
      product_name VARCHAR(100),
      versions VARCHAR(4) ARRAY[3]);
    The following is the data that will be loaded:
    
    Aerospace systems, BigInsights,2.1 | 3.0 | 3.1
    Advanced Technologies, DB2,10.1 | 10.2 | 9.1
    Load the data:
    
    LOAD HADOOP USING FILE URL '/testdata/product_usage.csv'
    WITH SOURCE PROPERTIES (
     'collection.items.delimiter' = '|' 
    )
    INTO TABLE product_usage
    WITH TARGET TABLE PROPERTIES (
     'array.overflow' = 'keep' 
    )
    ;    
  20. Load from a delimited data file into a table. The table is defined with ROW data type nested inside of an associative ARRAY data type.
    Create the table with two columns. The first column is a VARCHAR(100), and the second columns is an ARRAY of ROWs.
    
    CREATE HADOOP TABLE social_data (
     personname VARCHAR(100), 
     FRIENDS ROW(CITY STRING, ZIP INT, EMAIL STRING) ARRAY[VARCHAR(20)]);
    The following text is the data that is loaded.
    
    John, Mike:New York#10001#mike@example.com| Smith: Denver#80123#smith@yahoo.com
    David, Jennifer: San Francisco#94102#jenn@example.com 
    The personname value in the first row is John. The following flow describes the data in terms of the SOURCE PROPERTIES that will be used in the LOAD statement:
    field.delimiter
    The columns or fields in the data are separated by a comma for the purposes of the LOAD statement.
    collection.items.delimiter
    The second column contains the FRIENDS of John. This second column is a collection of friends that includes a collection of information about the friends. Each item or friend is separated by the '|' symbol for the purposes of the LOAD statement.
    map.keys.delimiter
    Each item or friend begins with a key, because this table defines the collection as an associative ARRAY. The data shows that John has 2 friends: Mike and Smith. The keys are separated from the remainder of the information by a colon (:).
    collection.items.delimiter
    The second collection contains items within it that provide information about the friends. The items in the second collection, defined as CITY, ZIP, and EMAIL, are separated from each other by a pound sign (#).
    Load the data. Notice the order of the SOURCE PROPERTIES, where the first property is the outermost reference, and the last property is the innermost reference.
    
    LOAD HADOOP USING FILE URL '/testdata/social_data.csv'
      WITH SOURCE PROPERTIES(
     'field.delimiter' = ',' , 
     'collection.items.delimiter' = '|' ,
     'map.keys.delimiter' = ':' , 
     'collection.items.delimiter' = '#' 
    )
    INTO TABLE social_data
    ;
    Remember that you cannot select a whole ARRAY or ROW column, but you can select results in several different ways:
    1. Return the elements of the ROW:
      
      SELECT personname, FRIENDS[1].CITY, FRIENDS[1].ZIP, FRIENDS[1].EMAIL FROM social_data;
    2. Return all of the elements in the ARRAY:
      
      SELECT personname, FRIEND.CITY, FRIEND.ZIP, FRIEND.EMAIL FROM social_data, UNNEST(FRIENDS) AS F(FRIEND);
      
    3. Return elements based on predicates:
      
      SELECT personname, FRIENDS[1].CITY, FRIENDS[1].ZIP, FRIENDS[1].EMAIL FROM social_data 
        WHERE FRIENDS[1].CITY = 'New York';
  21. CMX compression is supported on Big SQL LOAD. The following example uses CMX on the LOAD after setting the compression properties:
    
    SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress'='true';
    SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress.type'='BLOCK';
    SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress.codec'='org.apache.hadoop.io.compress.CmxCodex';
    
    LOAD HADOOP USING FILE URL
     'tmp/example.csv' 
      WITH SOURCE PROPERTIES ('replace.with.null'='NULL')
      INTO HIVE TABLE t1
      APPEND
    ;
  22. Change the auto analyze feature, so that after a successful LOAD statement, ANALYZE is never run:
    SET HADOOP PROPERTY biginsights.stats.auto.analyze.post.load=NEVER;