Schema Generator
The Schema Generator processor generates a schema based on the structure of a record and writes the schema into a record header attribute. The Schema Generator processor generates Avro and Parquet schemas at this time.
Use the Schema Generator processor to generate a basic schema when the schema is unknown. For example, you might use the processor in a pipeline to generate the latest version of the Avro schema before writing records to destination systems.
When you configure a Schema Generator processor, you can specify the namespace and description for the schema. You can specify whether schema fields should allow nulls and whether schema fields should default to null. You can specify default values for most Avro primitive types, and you can allow the processor to use a larger data type for types without a direct equivalent.
You can specify the names of the precision and scale attributes for decimal values. You can also configure a default precision and scale for decimal fields that lack that information or that have an invalid precision or scale.
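The fallback for decimal fields can be pictured with a short Python sketch. This is an illustration only, not the processor's implementation: the function name, the attribute names, and the choice to fall back on both values together when either one is unusable are assumptions. The validity check follows the Avro decimal rules (precision greater than zero, scale between zero and the precision).

```python
def resolve_decimal_params(field_attrs, precision_attr="precision", scale_attr="scale",
                           default_precision=10, default_scale=2):
    """Return (precision, scale) for a decimal field, using defaults when needed.

    field_attrs is a hypothetical dict of field attributes; the attribute
    names and default values are placeholders for the configured properties.
    """
    def parse(value):
        # Missing or non-numeric attribute values are treated as absent.
        try:
            return int(value)
        except (TypeError, ValueError):
            return None

    precision = parse(field_attrs.get(precision_attr))
    scale = parse(field_attrs.get(scale_attr))

    # Avro decimals require precision > 0 and 0 <= scale <= precision.
    valid = (
        precision is not None
        and scale is not None
        and precision > 0
        and 0 <= scale <= precision
    )
    if valid:
        return precision, scale
    # Assumption: fall back to both defaults when either value is unusable.
    return default_precision, default_scale
```

For example, a field that carries precision 5 and scale 9 fails the check, so the sketch returns the configured defaults instead.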
When appropriate, you can configure the Schema Generator to cache a number of schemas, and to apply the schemas to records based on the expression defined in the Cache Key Expression property.
Using Schema Header Attributes
The Schema Generator processor writes Avro schemas to an avroSchema record header attribute and Parquet schemas to a parquetSchema record header attribute by default. Any destination that writes Avro data can use the schema in the avroSchema header attribute and any destination that writes Parquet data can use the schema in the parquetSchema header attribute. All Avro-processing origins also write the Avro schema of incoming records to the avroSchema header attribute.
When processing Avro or Parquet data, one logical workflow is to add the Schema Generator immediately before the destination in a pipeline. This allows the processor to generate a new schema before writing the data to destination systems.
If you want to retain an earlier version of the schema, you might use an Expression Evaluator processor before the Schema Generator to move the existing schema in the schema header attribute to a different header attribute, such as avroSchema_previous.
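As a rough model of that step, the sketch below treats the record header as a plain Python dict and copies the existing schema to a second attribute before a new schema overwrites the first. The function name and the dict representation are assumptions for illustration. In a pipeline, this work is done through the Expression Evaluator configuration, not custom code.

```python
def preserve_previous_schema(header, source="avroSchema", target="avroSchema_previous"):
    """Return a copy of the header with the current schema saved under target.

    The header is modeled as a dict of attribute names to string values.
    If no schema is present yet, the header is returned unchanged.
    """
    updated = dict(header)
    if source in updated:
        updated[target] = updated[source]
    return updated


header = {"avroSchema": '{"type":"record","name":"Old","fields":[]}'}
header = preserve_previous_schema(header)
# A later schema-generation step can now overwrite header["avroSchema"]
# while header["avroSchema_previous"] keeps the earlier version.
```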
Generated Schemas
The Schema Generator can generate schemas with the following information:
- Avro schemas
The Avro schema that the Schema Generator creates includes the following information:
- Schema type set to record.
- Schema name based on the Schema Name property.
- Namespace based on the Namespace property, when configured.
- Schema description in the doc field based on the Doc property, when configured.
- A map of field names with related attributes based on the record schema and related properties defined in the stage, such as whether fields can include null values.
For example, the following Avro schema is generated when you set the Schema Name property to MyAvroSchema and omit the optional Namespace and Doc properties:
{"type":"record","name":"MyAvroSchema","namespace":"","doc":"","fields":[{"name":"name","type":["null","string"],"default":null},{"name":"id","type":["null","int"],"default":null},{"name":"instock","type":["null","boolean"],"default":null},{"name":"cost","type":["null",{"type":"bytes","logicalType":"decimal","precision":10,"scale":2}],"default":null}]}
The record described by this schema includes the following fields:
- name: A string field.
- id: An integer field.
- instock: A boolean field.
- cost: A decimal field.
The processor is configured to allow nulls in schema fields and to use null as the default value.
- Parquet schemas
The Parquet schema that the Schema Generator creates includes the following information:
- Schema name based on the Schema Name property.
- Namespace based on the Namespace property, when configured.
For example, the following Parquet schema is generated when you set the Schema Name property to exampleSchemaName and the Namespace property to exampleNamespace:
message exampleNamespace.exampleSchemaName { optional binary name (UTF8); optional int32 id; optional boolean instock; optional binary cost (DECIMAL(10,2)); }
The record described by this schema includes the following fields:
- name: A string field.
- id: An integer field.
- instock: A boolean field.
- cost: A decimal field.
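To make the structure of a generated Avro schema concrete, here is a simplified Python sketch that derives a schema of the same shape from a sample record. It is not the processor's algorithm. The type mapping, the function names, and the fixed decimal precision and scale are assumptions chosen to mirror the Avro example above.

```python
import json
from decimal import Decimal


def avro_type_for(value, precision=10, scale=2):
    """Map a Python value to an Avro type (hypothetical, simplified mapping)."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, Decimal):
        return {"type": "bytes", "logicalType": "decimal",
                "precision": precision, "scale": scale}
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def build_avro_schema(record, name, namespace="", doc="", nullable=True):
    """Build an Avro record schema (as a dict) from the fields of one record."""
    fields = []
    for field_name, value in record.items():
        field_type = avro_type_for(value)
        if nullable:
            # "null" is listed first so that a null default is valid for the union.
            fields.append({"name": field_name,
                           "type": ["null", field_type],
                           "default": None})
        else:
            fields.append({"name": field_name, "type": field_type})
    return {"type": "record", "name": name, "namespace": namespace,
            "doc": doc, "fields": fields}


record = {"name": "widget", "id": 7, "instock": True, "cost": Decimal("19.99")}
schema = build_avro_schema(record, "MyAvroSchema")
print(json.dumps(schema, separators=(",", ":")))
```

The sketch places "null" first in each union because Avro requires a field's default value to match the first type in the union. A real implementation would also need to handle nested maps and lists, and integers that exceed the 32-bit range.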
Caching Schemas
You can configure the Schema Generator to cache a number of schemas, and to apply the schemas to records based on the expression defined in the Cache Key Expression property.
Caching schemas can improve performance when a set of records can logically use the exact same schema, and when the records include a value that can be used to determine the schema to use.
For example, say your pipeline uses the JDBC Multitable Consumer to read from multiple database tables. The origin writes the names of the tables used to generate each record to a jdbc.tables record header attribute. Let's assume that all data in each record comes from a single table.
To use the schema associated with each record, you can configure the Cache Key Expression property as follows: ${record:attribute('jdbc.tables')}
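The effect of a cache key can be sketched in Python as a lookup table keyed by the value of the jdbc.tables attribute. The class, its size limit, and its behavior when full are assumptions made for illustration. The processor's actual cache and eviction policy are not described here.

```python
class SchemaCache:
    """Cache generated schemas by the value of a record header attribute."""

    def __init__(self, generate, key_attribute="jdbc.tables", max_size=50):
        self._generate = generate          # function: record -> schema
        self._key_attribute = key_attribute
        self._max_size = max_size
        self._schemas = {}
        self.misses = 0                    # number of times a schema was generated

    def schema_for(self, record, header):
        key = header.get(self._key_attribute)
        if key in self._schemas:
            return self._schemas[key]
        self.misses += 1
        schema = self._generate(record)
        # Assumption: once the cache is full, new schemas are not stored.
        if len(self._schemas) < self._max_size:
            self._schemas[key] = schema
        return schema


def generate_schema(record):
    # Stand-in generator that only captures the sorted field names.
    return {"fields": sorted(record)}


cache = SchemaCache(generate_schema)
a = cache.schema_for({"id": 1, "name": "x"}, {"jdbc.tables": "customers"})
b = cache.schema_for({"id": 2, "name": "y"}, {"jdbc.tables": "customers"})
c = cache.schema_for({"sku": "A1"}, {"jdbc.tables": "products"})
```

Here the two records from the customers table share one generated schema, and the record from the products table triggers a second generation. The approach is only safe when every record with the same key really has the same structure.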