Expressions in Pipeline and Stage Properties
- Spark SQL query language
  Spark SQL is the relational query language used with Spark. Because processing for Transformer pipelines occurs on a Spark cluster, you must use Spark SQL for all expressions that manipulate pipeline data.
- StreamSets expression language
  The StreamSets expression language is based on the JSP 2.0 expression language. If you use Data Collector or Control Hub, you are probably familiar with the StreamSets expression language.
Referencing Fields in Spark SQL Expressions
To reference a first-level field in a record in a Spark SQL expression, specify the field name. Transformer does not perform case-sensitive evaluation of field names within a pipeline.
For example, to deduplicate data based on an ID field, you configure a Deduplicate processor to deduplicate based on fields. Then, you can specify ID, Id, iD, or id as the field to use.
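The case-insensitive matching described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Transformer's implementation: the `resolve_field` helper and the sample record are hypothetical names introduced here.

```python
def resolve_field(record, name):
    """Return the value of a first-level field, ignoring case in the field name.

    Hypothetical helper that models the lookup rule; it is not a Transformer API.
    """
    matches = [key for key in record if key.lower() == name.lower()]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one field matching {name!r}, found {matches}")
    return record[matches[0]]


# Sample record (made-up data) with a first-level ID field.
record = {"ID": 1001, "name": "Ana"}

# Every spelling from the example resolves to the same field.
results = [resolve_field(record, spelling) for spelling in ("ID", "Id", "iD", "id")]
```

The sketch raises an error when two field names differ only by case, since a case-insensitive lookup cannot tell them apart.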
To reference a nested field, use dot notation (.) to specify the path to the field, as follows:
<top level>.<next level>.<next level>.<field to use>
For example, customer.transactions.2019.
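As a rough illustration of how a dot-notation path walks down through nested levels, the following Python sketch resolves a path against a nested dictionary. The `resolve_path` helper and the record contents are assumptions made for this example, and the sketch matches names exactly for brevity.

```python
def resolve_path(record, path):
    """Follow a dot-separated path through nested dictionaries.

    Hypothetical helper that models dot notation; it is not a Transformer API.
    """
    value = record
    for part in path.split("."):
        value = value[part]  # descend one level per path segment
    return value


# Made-up record: customer is the top level, 2019 is the field to use.
record = {"customer": {"transactions": {"2019": 42}}}

total = resolve_path(record, "customer.transactions.2019")
```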
To reference an item in a List field, use bracket notation ([#]) to indicate the position in the list. Use 0 to indicate the first item in the list, 1 to indicate the second, and so on.
For example, to reference the second item in an appt_date List field, enter appt_date[1].
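The zero-based indexing can be checked with a small Python model. This is a sketch under stated assumptions: `resolve_item` is a hypothetical helper, the dates are made-up sample values, and only the simple `field[#]` form is handled.

```python
import re


def resolve_item(record, reference):
    """Resolve a reference of the form field[#] against a record.

    Hypothetical helper that models bracket notation; it is not a Transformer API.
    """
    match = re.fullmatch(r"(\w+)\[(\d+)\]", reference)
    if match is None:
        raise ValueError(f"not a bracket reference: {reference!r}")
    field, index = match.group(1), int(match.group(2))
    return record[field][index]  # index 0 is the first item


# Made-up record with an appt_date List field.
record = {"appt_date": ["2019-03-01", "2019-04-15", "2019-06-30"]}

first = resolve_item(record, "appt_date[0]")
second = resolve_item(record, "appt_date[1]")
```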