Hive User-defined functions
The Db2 Big SQL environment in Hadoop includes the Hive user-defined functions package. This set of functions is an optional package that you can install to use some of the Hive open source user-defined functions in your Db2 Big SQL queries. You can also use the package as an example to implement and use other functions that might not be included in the package. The Hive user-defined functions are part of the HIVE schema.
Installing the Hive user-defined functions package
Thes Hive user-defined functions provide LANGUAGE SQL (inline) and LANGUAGE C implementations for
some common Hive built-in functions.
Note: The functions and procedures are provided
"AS-IS" without warranty of any kind.
Do the following steps to extract and install the package:
- In the Linux command line, log in as the Db2 Big SQL administrator, and provide the Db2 Big SQL
administrator password:
sudo -u bigsql bash
- In the Linux command line, change the directory to the location of the Hive
UDFs:
cd $BIGSQL_HOME/hive-compat
- Extract the $BIGSQL_HOME/hive-compat/bigsql-hiveudf-*.tar.gz file.
tar -zxvf $BIGSQL_HOME/hive-compat/bigsql-hiveudf-*.tar.gz
- Change the directory to the extracted bigsql-hiveudf
directory.
cd bigsql-hiveudf
- Run the following script as user bigsql, or the Db2 Big SQL administrator, to
install the Hive UDFs in your Db2 Big SQL environment:
sh ./install-hiveudfs.sh
List of available Hive user-defined functions
The following Hive built-in functions are supported in Db2 Big SQL:
Function Name | Arguments | Return type |
---|---|---|
datediff |
|
int |
date_add |
|
varchar(50) |
date_sub |
|
varchar(50) |
decode |
|
varchar(50) |
e | no arguments | double |
encode |
|
varbinary(50) |
format_number |
|
varchar(50) |
from_unixtime |
|
varchar(50) |
from_utc_timestamp | This function is available as a built-in function in the SYSIBM schema. For more information about this function, see FROM_UTC_TIMESTAMP scalar function. | varchar(50) |
get_json_object |
|
varchar(32672) |
log |
|
double |
log2 |
|
double |
parse_url |
|
varchar(32672) |
pi | no arguments | double |
to_utc_timestamp | This function is available as a built-in function in the SYSIBM schema. For more information about this function, see TO_UTC_TIMESTAMP scalar function. | varchar(50) |
unix_timestamp |
|
bigint |
Examples
- datediff
- The result of the function is the number of days from start_date to
end_date as an integer. In this example, the column
introduction_date is a VARCHAR(15). The query returns those rows where the
number of days is greater than
30.
SELECT * FROM PRODUCT WHERE HIVE.DATEDIFF('2014-06-1',introduction_date) > 30;
- date_add
- The result of the function adds a number of days to start_date. In this
example, the column introduction_date is a VARCHAR(15). The function returns
the adjusted
date.
SELECT HIVE.DATE_ADD(introduction_date,1),introduction_date FROM PRODUCT;
- date_sub
- The result of the function subtracts a number of days from start_date. In
this example, the column introduction_date is a VARCHAR(15). The function
returns the adjusted
date.
SELECT HIVE.DATE_SUB(introduction_date,1),introduction_date FROM PRODUCT;
- decode
- This function decodes the first argument into a STRING by using one of the following character sets:
- 'US_ASCII'
- 'ISO-8859-1'
- 'UTF-8'
- 'UTF-16BE'
- 'UTF-16LE'
- 'UTF-16'
The output:SELECT hive.decode(bx'616263','UTF-8') FROM SYSIBM.SYSDUMMY1;
'abc'
The output:SELECT hive.decode(bx'006100620063','UTF-16BE') FROM SYSIBM.SYSDUMMY1;
'abc'
- e
- This function returns the constant that is the base of the natural logorithm.
SELECT number, power(HIVE.e(), hive.log(HIVE.e(), number)) FROM numberTable;
- encode
- This function encodes the first argument into a BINARY by using one of the following character sets:
- 'US_ASCII'
- 'ISO-8859-1'
- 'UTF-8'
- 'UTF-16BE'
- 'UTF-16LE'
- 'UTF-16'
The output:SELECT HIVE.ENCODE('Jürgen','UTF-8') FROM sysibm.sysdummy1;
'0x4AC3BC7267656E'
The output:SELECT HIVE.ENCODE('Jürgen','UTF-16BE') FROM sysibm.sysdummy1;
'0x004A00FC007200670065006E'
- format_number
- This function formats the number in the first expression to a format like '#,###,###.##',
rounded to the number of decimal places in the second expression. It returns the result as a STRING.
If the second expression is 0, the result has no decimal point or fractional part.
The output:SELECT hive.format_number(12332.123456, 4), hive.format_number(12332.1,4), hive.format_number(12332.2,0) FROM sysibm.sysdummy1;
'12,332.1235' | '12,332.1000' | '12,332'
- from_unixtime
- This function converts the number of seconds from unix epoch (1970-01-01
00:00:00 UTC) to a STRING that represents the TIMESTAMP of that moment in the current system time
zone in the format of "1970-01-01 00:00:00". The following example returns the current date
including the time.
SELECT HIVE.FROM_UNIXTIME(HIVE.UNIX_TIMESTAMP()) FROM SYSIBM.SYSDUMMY1;
- get_json_object
- This function extracts the JSON object from a JSON string (the first expression) that is based
on the JSON path specified in the second expression. It returns a JSON string of the extracted JSON
object. The result is NULL if the input JSON string is invalid. Because of Hive column name
restrictions, the only valid characters in the JSON path are the characters
[0-9a-z_]. Also, the keys cannot start with numbers. In the following example, the
function reads the json_table.json from the json_table and
extracts the fields Foo, Bar, Quuz.QuuxId, and Quux.QuuxName from each JSON
object.
The output is:SELECT HIVE.GET_JSON_OBJECT(json_table.json, '$.Foo') as foo, HIVE.GET_JSON_OBJECT(json_table.json, '$.Bar') as bar, HIVE.GET_JSON_OBJECT(json_table.json, '$.Quux.QuuxId') as qid, HIVE.GET_JSON_OBJECT(json_table.json, '$.Quux.QuuxName') as qname FROM json_table;
foo bar qid qname ABC 20090101100000 1234 Sam
- log
- This function returns the base logarithm of the argument in the second expression.
The output looks like the following:SELECT HIVE.LOG(3, 66) FROM sysibm.sysdummy1 ;
+3.81358809221559E+000
- log2
- This function returns the base-2 logarithm of the
argument.
The output looks like the following:SELECT HIVE.LOG2(44) FROM sysibm.sysdummy1 ;
+5.45943161863730E+000
- pi
- This function returns the constant pi that is used in calculating the area of
circle.
SELECT (hive.pi()*radius*radius) from radiiTable;
- parse_url
- This function returns the specified part from the URL. The valid values for the second
expression, partToExtract, include the following values:
- HOST
- PATH
- QUERY
- REF
- PROTOCOL
- AUTHORITY
- FILE
- USERINFO
The following example extracts the value of a particular key in the QUERY, which is anything after the question mark, and returns the value 'v1'.SELECT HIVE.PARSE_URL('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') FROM sysibm.sysdummy1;
SELECT HIVE.PARSE_URL('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') FROM sysibm.sysdummy1;
- unix_timestamp
- This function converts the date to the specified date format and returns the number of seconds
between the specified date and Unix epoch. If it fails, then it returns 0. The following example
returns the value
1237532400
SELECT hive.unix_timestamp('2009-03-20', 'yyyy-MM-dd') FROM SYSIBM.SYSDUMMY1;