Flight data requests
You can specify access to a data asset by using metadata objects called data requests. A data request is the representation of the generic open source pyarrow.flight.FlightDescriptor
. It is a metadata document that describes a request
to retrieve or generate a data set. The data request can contain credentials or connection properties, or it can point to a connected data asset ID or connection ID.
In notebooks, you can use generated code to load data from a data source. The generated code that is added to a notebook cell is a method to help you work with data from a data source.
The generated code uses Flight service based on Apache Arrow Flight to communicate with a file, database connection, or connected data asset (data accessible through a connection) when loading data into a data structure.
By using a flight data request, you can specify the data source and interactive properties for read or write requests.
The following sections provide more details about the syntax and properties in Flight data requests for 's Flight service.
Data request syntax and properties
To find out which interaction properties are supported by which connector, see Data and AI Common Core API. You can also call
the v2/datasources_types
endpoint to Data and AI Common Core API list defined types of data source.
A data request has the following structure. Actual instances of data requests typically do not include all of the listed properties and might use other interaction properties that are specific to a certain data source.
Example of a request to /v2/datasource_types
:
curl --request GET --url 'https://<host:port>/v2/datasource_types?connection_properties=true&interaction_properties=true' --header 'Accept: application/json' --header 'Authorization: Bearer ${TOKEN}'
Example of a request to /v2/connections
:
curl --request GET --url 'https://<host:port>/v2/connections?entity.name=<your_connection_name>' --header 'Accept: application/json' --header 'Authorization: Bearer ${TOKEN}'
Data request object syntax
The following data request shows the syntax in Python notation. The description is also valid for R.
data_request = {
# The Flight Service expects assets to be identified by their GUID.
# "asset_id" is the ID of connected data asset or a connection asset.
"asset_id": "ASSET_GUID",
"project_id|space_id|catalog_id": "GUID",
# itc_utils.flight_service adds support for using asset names
# use either one of ...
"data_name" : "name of a plain data asset",
"connection_name" : "name of a connection asset",
"connected_data_name" : "name of a 'connected data' asset",
# by default the assets will be looked up in the current project
# itc_utils will set "asset_id" from the provided name.
# "asset_id" and "connection_properties" are mutually exclusive
"connection_properties": { # optional
# not needed in most cases
},
"interaction_properties": {
# commonly used properties, all optional
"file_name" : "file in a connected data source",
"schema_name" : "schema of a table in a connected database",
"table_name" : "name of a table in a connected database",
"row_limit" : NROWS,
"infer_schema" : false|true,
"invalid_data_handling" : "fail|row|column",
...
},
"fields": [...] # optional
"num_partitions": N, # optional
"batch_size": NROWS, # optional
"commit_frequency" : NROWS # optional
}
Supported properties, values, and descriptions
Property | Value | Description |
---|---|---|
asset_id |
The globally unique identifier (GUID) of a data asset, connection asset, or connected data asset in the project, catalog, or deployment space. | |
batch_size |
A positive integer less than or equal to 100000. The default is 10000. | Flight reads rows from databases in chunks, internally managed in objects of type RecordBatch . The batch_size sets the number of rows to read from the source per chunk. Setting a low batch size can decrease performance
by reading from the source too often. |
commit_frequency |
A valid value is greater than 100 but less than 100000. The default is 100. | Automatically commits the database transaction after writing n rows. Only used when writing data to an SQL database. |
connection_properties |
The connection_properties are not used in most cases, especially not in notebooks, where using connection_properties can increase the risk of leaking credentials. You would use connection_properties if you haven't added any project assets, but you want to access data from or write data to an external data source. If the data is in an external data source, specify the properties of the connection in a connection asset and refer to
this connection in the data request. |
|
interaction_properties |
The interaction_properties depend on the data source type. For a list of supported properties per data source, see [Data and AI Common Core API Datasource types] list defined types of data source. |
|
interaction_properties.row_limit |
The value can be a positive number. | If the row_limit property is not specified, all rows will be read. This property limits the number of rows read from the source. Generally, you can start with a limit when exploring data. Otherwise, the runtime might run out of memory.
Setting row_limit usually causes any value for num_partitions to be ignored, defaulting to 1 instead. |
interaction_properties.infer_schema |
A Boolean value to request that the Flight service infer a type. The default is False . |
Some data sources, such as CSV files, do not provide data types for columns. All columns will be treated as strings by default. You can pass "infer_schema": true and the Flight service will read the file and make
a best guess as to the field formats. Flight service reads only the first 1000 rows, so it might infer incorrectly if a field does not have a good mix of the available properties in those first 1000 rows. For example, by reading 0 or
1 in the first 1000 rows, Flight might conclude that a column is Boolean when it must be integer because later rows have values greater than 1. |
interaction_properties.invalid_data_handling |
One of fail or row or column . The default is fail . |
Defines how to handle values that are not valid, for example fail the job, null the column, or drop the row. |
num_partitions |
A positive integer. The default is 1. | Some data sources can provide access to data by using multiple endpoints. This technique can be used to read multiple streams in parallel threads. The value of num_partitions defines how many endpoints the Flight service is
expected to provide to the client. The value of num_partitions is ignored if the data source does not support multiple endpoints. The value might also get ignored if it conflicts with other attributes such as row_limit . |
project_id |
The GUID of your project. | |
fields |
fields are written as a list of dictionaries, which must include, at a minimum, the name and type properties. For more information on fields , see fields property. |
Used when reading data, it is not used when writing data. |
fields
property
The fields
property enumerates the fields, also referred to as columns, to be selected from a data source. When reading from a CSV or delimited file, this list of fields defines the names and types of the fields in the file. The
layout of the values in the file must exactly match the list of fields. Each field must specify both the name and the data type.
Supported values for the type
attribute in fields
are (names are not case-sensitive): ARRAY, BIGINT, BINARY, BIT, BLOB, BOOLEAN, CHAR, CLOB, DATALINK, DATE, DECIMAL, DISTINCT, DOUBLE, FLOAT, INTEGER, JAVA_OBJECT, LONGNVARCHAR,
LONGVARBINARY, LONGVARCHAR, NCHAR, NCLOB, NULL, NUMERIC, NVARCHAR, OTHER, REAL, REF, REF_CURSOR, ROWID, SMALLINT, SQLXML, STRUCT, TIME, TIME_WITH_TIMEZONE, TIMESTAMP, TIMESTAMP_WITH_TIMEZONE, TINYINT, VARBINARY, VARCHAR, VECTOR
The values
for the type
attribute in fields
are the names that are used by the Flight service. When data is read, these types are mapped to the data types used in open source Arrow Flight. For more information, see Data Types and In-Memory Data Model. For example, TINYINT gets mapped to int8()
and BIGINT to int64()
.
However, with some data sources, for example when reading from a Parquet file or from a table in an external SQL database, the list of fields can specify a subset of the columns to be read from the source.
Flight service won't read other columns. By default, all columns of the table would be retrieved. Each field must specify both the name and the data type. The type can override the type of the column in the external table. For example, a column
that is stored as VARCHAR(10)
can be mapped to bigint
in Flight. The fields
property is not used when writing to a database; instead, Flight service uses the schema that is provided in do_put(...)
.
The following example shows how to use the fields
property in Python notation. However, the description is also valid for R.
```json {: .codeblock}
"fields": [{"name":"a","type":{"type":"bigint"}},
{"name":"b","type":{"type":"varchar"}}
]
```
- : The value of the type attribute is another dictionary with another type attribute.
-
If the
fields
property is provided in the data request, theinfer_schema
property is implicitly set to false because thefields
property already defines the data types. - :
-
If you use pandas DataFrames and want to check the mapping between data types in Flight and pandas, see Type differences.
-
The
fields
property specifies the name and data type of a column and more details. The following example shows the structure of thefields
template in Python notation. However, the description is also valid for R.{ "name": "fieldname", "type": { "type": "typename", "length": L , # integer, depends on type "scale": S, # integer depends on type, only numeric and decimal "signed": true or false # for numeric types } }
-
The combination of
length
andscale
defines the precision and scale of a numeric or decimal data type. -
The property
signed
influences the type and range of values that are considered to be valid. -
{"type":"TINYINT","signed":true}
is mapped to the typeint8()
in Flight. Valid values are-128 <= signed tinyint <= 127
.
-
{"type":"TINYINT","signed":False}
is mapped to the typeuint8()
in Flight. Valid values are0 <= unsigned tinyint <= 255
. |
Learn more
- Examples of Flight service data requests with Python
- Examples of Flight service data requests with R
Parent topic: Accessing data sources with Flight service