Datasets
With Databand, you can track datasets for your runs.
Using dataset_op_logger
Using the Databand dataset_op_logger context manager, you can log numerous attributes about the datasets that are processed by your Python code. For example, you can log the dataset shape, schema, column-level stats, and whether the dataset operation was successful. To track datasets for your run:
- Write the code that connects to the Databand environment to enable tracking (see Python).
- Add the code snippets shown in the following examples to your code.
As a result, Databand tracks these operations and sends the associated metadata to the Data interactions tab for the runs.
from dbnd import dataset_op_logger
#### Read example
with dataset_op_logger(source_path, "read") as logger:
    df = read(source_path, ...)
    logger.set(data=df)
#### Write example
df = Dataframe(data)
with dataset_op_logger(target_path, "write") as logger:
    write(df, ...)
    logger.set(data=df)
A basic code example
import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():
        op_path = "/path/to/value.csv"
        with dataset_op_logger(op_path=op_path, op_type="read") as logger:
            df = pd.read_csv(op_path)
            logger.set(data=df)
To log a dataset operation, provide the following:
- The path for your dataset. Provide the full URI for your file, table, API, or other dataset (see Dataset paths).
- The type of operation. Specify whether the operation is a read or a write. Make sure that you wrap only operation-related code with the dataset_op_logger context manager (see Dataset operation context).
- The dataframe that contains the records that are read or written in your operation. Provide this information so that Databand has visibility into your operation and metadata can be collected.
- Optional parameters. You can also provide more parameters to limit or enhance the metadata that is logged in your operation (see Optional parameters).
Dataset paths
For Databand to track your datasets from different sources, you must provide URI paths in a consistent manner. For example, a file might be written by one task and then read downstream by a different task; consistency helps to make sure that this dataset is identified as the same dataset across tasks (see the sketch after the format examples). The following sections show examples of correct URI formats for:
- Standard file systems
- Project storage
- Data warehouse
URIs are case-sensitive.
For standard file systems or object storage, provide a fully qualified URI. Look at the following examples:
- file://data/local_data.csv
- s3://bucket/key/dataset.csv
- gs://bucket/key/dataset.csv
- wasb://containername@accountname.blob.core.windows.net/dataset_path
For data warehouses, provide the hostname of your warehouse, the region of the warehouse, or both, along with the path to your table. Look at the following examples:
- bigquery://region/project/dataset/table
- snowflake://name.region.cloud/database/schema/table
- redshift://host/database/schema/table
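As a sketch of the consistency requirement, the following uses one fully qualified URI in both a producer task and a downstream consumer task. The path and the read()/write() helpers are placeholders, as in the earlier examples, not Databand APIs.
from dbnd import dataset_op_logger

# Hypothetical URI; the producer and the consumer must use the identical string.
DATASET_URI = "s3://bucket/key/dataset.csv"

# Producer task: write the dataset and log the operation against the shared URI.
with dataset_op_logger(DATASET_URI, "write") as logger:
    write(df, ...)
    logger.set(data=df)

# Downstream consumer task: read the same dataset and log against the same URI,
# so Databand resolves both operations to a single dataset.
with dataset_op_logger(DATASET_URI, "read") as logger:
    df = read(DATASET_URI)
    logger.set(data=df)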
Dataset operation context
Wrap only operation-related code with the dataset_op_logger context manager. Place anything that isn't related to reading or writing your dataset outside the context; otherwise, unrelated errors might flag your dataset operation as failing. Look at the following examples:
- Correct example of dataset operation tracking: in the correct sketch that follows, unrelated_func() is placed outside the context, and the read is reported as successful.
- Incorrect example of dataset operation tracking: in the incorrect sketch that follows, unrelated_func() is placed inside the context. The read itself is successful, but if unrelated_func() raises an exception, a failed read operation is reported to Databand.
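A minimal sketch of both cases, assuming a placeholder read_from() helper and the unrelated_func() mentioned above:
from dbnd import dataset_op_logger

# Correct: only the read and the logging call are inside the context manager.
with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    df = read_from("s3://path/to/file.csv")
    logger.set(data=df)

unrelated_func()  # runs outside the context; its errors do not affect the logged read

# Incorrect: unrelated_func() is inside the context manager.
with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    df = read_from("s3://path/to/file.csv")
    logger.set(data=df)
    unrelated_func()  # if this raises, the read operation is reported to Databand as failed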
Supported dataset types
dataset_op_logger supports logging of Pandas and PySpark dataframes. It also supports a list of dictionaries.
When you fetch data from an external API, the data often comes in the following form:
data = [
    {
        "Name": "Name 1",
        "ID": 1,
        "Information": "Some information"
    },
    {
        "Name": "Name 2",
        "ID": 2,
        "Information": "Other information"
    },
    ...
]
When you provide this list of dictionaries as the data argument to dataset_op_logger, you can report its schema and volume:
from dbnd import dataset_op_logger

with dataset_op_logger("http://some/api/response.json", "read") as logger:
    logger.set(data=data)
Volume is determined by calculating the length of this list. In this example, the volume is 2.
Schema is determined by flattening the dictionary. In this example, the schema is:
- Name: str
- ID: int
- Information: str
Optional parameters
In addition to the dataset path and operation type, dataset_op_logger also accepts certain optional parameters that can limit or enhance the metadata that is logged in your operation.
Not all parameters are supported for every type of dataset object. For more information about which parameters are supported by each type of object, see the Logged metadata section in Dataset logging.
The following list shows the optional parameters, their default values, and descriptions; a usage sketch follows the list:
- with_schema (True) - Extracts the schema of the dataset so that you can view the column names and data types in Databand.
- with_preview (False) - Extracts a preview of your dataset so that it can be displayed in Databand. The number of records in the preview depends on the size of the data, but it generally amounts to 10-20 preview records.
- with_stats (True) - Calculates column-level stats on your dataset, including numerous statistical measures such as distinct and null counts, averages, standard deviations, mins and maxes, and quartiles. To enable column-level stats, the with_schema parameter cannot be set to False.
- with_histograms (False) - Generates bar graphs that show the distribution of values in each column.
- with_partition (False) - If the file paths of your datasets are partitioned, you can use with_partition=True to make sure that the same dataset across partitions resolves to a single dataset in Databand. For example, s3://bucket/key/date=20220415/file.csv and s3://bucket/key/date=20220416/file.csv are interpreted as two distinct datasets by default in Databand. Enabling the with_partition parameter ignores the partitions when it parses the dataset name. As a result, you can easily track trends and set alerts across runs.
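The following usage sketch combines several of these keyword arguments with the pandas read from the basic example; the path and the particular parameter choices are illustrative only:
import pandas as pd

from dbnd import dataset_op_logger

op_path = "s3://bucket/key/date=20220415/file.csv"
with dataset_op_logger(
    op_path,
    "read",
    with_preview=True,    # also capture a small record preview
    with_stats=False,     # skip column-level stats for this operation
    with_partition=True,  # resolve partitioned paths to a single dataset
) as logger:
    df = pd.read_csv(op_path)
    logger.set(data=df)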
The with_stats and with_histograms parameters increase the runtime of your pipeline because every value in every column must be profiled. To make sure that the performance tradeoff is acceptable, test these parameters in a development environment against datasets that are similar in size to the datasets in your production environment.
LogDataRequest parameter
With the LogDataRequest parameter, you can limit which columns are included when the system calculates statistics with with_stats or histograms with with_histograms. Because statistics and histograms are calculated at runtime and prevent extra tasks from running until those calculations are completed, LogDataRequest helps ensure that you spend time and compute resources only on the columns that are relevant to your use case. For example, there is probably little value in calculating statistics and histograms on your primary and foreign key fields, so you can use LogDataRequest to exclude those fields specifically.
The LogDataRequest parameter has the following attributes:
- include_columns - A list of column names to include.
- exclude_columns - A list of column names to exclude.
- include_all_boolean, include_all_numeric, include_all_string - Flags that include all Boolean, numeric, or string columns, respectively.
The following example uses LogDataRequest for the with_histograms option:
from dbnd import log_dataset_op, LogDataRequest

log_dataset_op(
    "customers_data",
    data,
    with_histograms=LogDataRequest(
        include_all_numeric=True,
        exclude_columns=["name", "phone"],
    ),
)
Alternatively, you can use the following helper methods (see the sketch after this list):
- LogDataRequest.ALL()
- LogDataRequest.ALL_STRING()
- LogDataRequest.ALL_NUMERIC()
- LogDataRequest.ALL_BOOLEAN()
- LogDataRequest.NONE()
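For example, a sketch based on the log_dataset_op call above might use helper methods instead of an explicit LogDataRequest:
from dbnd import log_dataset_op, LogDataRequest

# Sketch: calculate statistics for numeric columns only, and skip histograms entirely.
log_dataset_op(
    "customers_data",
    data,
    with_stats=LogDataRequest.ALL_NUMERIC(),
    with_histograms=LogDataRequest.NONE(),
)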
Providing user-defined meta information for dataset logging
With dataset_op_logger, you can also add user-defined meta information for your dataframe. In such a case, use set_metadata instead of calling set_data. You can create a ValueMeta object by using the ValueMeta.basic function, as in the following example:
from dbnd import dataset_op_logger
from dbnd._core.tracking.schemas.column_stats import ColumnStatsArgs
# Assumption: the ValueMeta import path may differ between dbnd versions.
from targets.value_meta import ValueMeta

with dataset_op_logger(
    op_path="s3://path/to/file.csv",
    op_type="read"
) as logger:
    columns = [
        ColumnStatsArgs(column_name='col1', column_type='type_t'),
        ColumnStatsArgs('col2', 'type_t', mean_value=53.0),
        ColumnStatsArgs('col3', 'type_t', min_value=23.33),
    ]
    meta = ValueMeta.basic(columns, records_count=55)
    logger.set_metadata(meta)