Datasets
With Databand, you can track datasets for your runs.
Using dataset_op_logger
Using the Databand dataset_op_logger context manager, you can log numerous attributes about the datasets that are processed by your Python code. For example, you can log the dataset shape, schema, column-level stats, and whether the dataset operation was successful. To track datasets for your run:
- Write the code that connects to your Databand environment to enable tracking. For details, see Python.
- Add the dataset logging snippets to your code (see the following examples).
As a result, Databand tracks these operations and sends the associated metadata to the Data interactions tab for the runs.
The following code snippets show read and write examples:
from dbnd import dataset_op_logger

# Read example
with dataset_op_logger(source_path, "read") as logger:
    df = read(path, ...)
    logger.set(data=df)

# Write example
df = Dataframe(data)
with dataset_op_logger(target_path, "write") as logger:
    write(df, ...)
    logger.set(data=df)
A basic code example
The following code demonstrates how to use Databand's tracking system to log metadata about an operation performed on a dataset, in this case a read operation.
import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():
        op_path = "/path/to/value.csv"

        with dataset_op_logger(op_path=op_path, op_type="read") as logger:
            df = pd.read_csv(op_path, ...)
            logger.set(data=df)
As you can see from the example, to report a dataset operation to Databand, you need to provide the following information in your code:
- The path of your dataset. Provide the full URI for your file, table, API, or other dataset. See Dataset paths.
- The type of operation. Specify whether the operation is a read or a write. Make sure that you wrap only operation-related code with the dataset_op_logger context manager. See Dataset operation context.
- The dataframe that contains the records that are read or written in your operation. Provide this information so that Databand has visibility into your operation and can collect metadata.
- Optionally, additional parameters that limit or enhance the metadata that is logged in your operation. See Optional parameters.
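For completeness, the following sketch shows the same pattern for a write operation inside dbnd_tracking. The output path and the sample dataframe are placeholders for illustration; in production you would provide the full URI of your target dataset.
import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():
        # Placeholder output path; use the full URI of your own target dataset.
        op_path = "output.csv"
        df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

        # Wrap only the write itself with the context manager.
        with dataset_op_logger(op_path=op_path, op_type="write") as logger:
            df.to_csv(op_path, index=False)
            logger.set(data=df)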
Dataset paths
In order for Databand to track your datasets from different sources, you must provide URI paths in a consistent manner. For example, a file might be written by one task and then read downstream by a different task. Consistency helps to make sure that this dataset is identified as the same dataset across tasks. The following sections show examples of correct URI formats for:
- Standard file systems or object storage
- Data warehouses
URIs are case-sensitive.
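As a minimal sketch of this consistency requirement, the following hypothetical producer and consumer tasks log the same URI, so Databand resolves both operations to one dataset. The bucket, task names, and S3 I/O (which requires s3fs with pandas) are assumptions for illustration.
import pandas as pd

from dbnd import dataset_op_logger

# The same fully qualified URI is used by both tasks.
DATASET_URI = "s3://bucket/key/dataset.csv"

def producer_task(df: pd.DataFrame) -> None:
    # Upstream task: logs a write operation against the shared URI.
    with dataset_op_logger(DATASET_URI, "write") as logger:
        df.to_csv(DATASET_URI, index=False)  # writing to S3 requires s3fs
        logger.set(data=df)

def consumer_task() -> pd.DataFrame:
    # Downstream task: logs a read operation against the exact same URI,
    # so both operations are identified as one dataset.
    with dataset_op_logger(DATASET_URI, "read") as logger:
        df = pd.read_csv(DATASET_URI)  # reading from S3 requires s3fs
        logger.set(data=df)
    return df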
Standard file systems or object storage
For standard file systems or object storage, provide a fully qualified URI. Look at the following examples:
file://data/local_data.csv
s3://bucket/key/dataset.csv
gs://bucket/key/dataset.csv
wasb://containername@accountname.blob.core.windows.net/dataset_path
Data warehouse
For data warehouses, provide the hostname of your warehouse, the region of the warehouse, or both, along with the path to your table. Look at the following examples:
bigquery://region/project/dataset/table
snowflake://name.region.cloud/database/schema/table
redshift://host/database/schema/table
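As a hedged illustration, the following sketch logs a read from a Snowflake table using a URI in the format shown above. The account, region, table, and get_snowflake_connection() helper are hypothetical; substitute your own connection and table details.
import pandas as pd

from dbnd import dataset_op_logger

# Hypothetical table URI: <name>.<region>.<cloud>/<database>/<schema>/<table>
TABLE_URI = "snowflake://acme.eu-west-1.aws/ANALYTICS/PUBLIC/ORDERS"

with dataset_op_logger(TABLE_URI, "read") as logger:
    # get_snowflake_connection() stands in for your own DB-API connection factory.
    df = pd.read_sql("SELECT * FROM ANALYTICS.PUBLIC.ORDERS", con=get_snowflake_connection())
    logger.set(data=df)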
Dataset operation context
Wrap only operation-related code with the dataset_op_logger context manager. Place anything that isn't related to reading or writing your dataset outside the context; otherwise, unrelated errors might flag your dataset operation as failed. Look at the following examples:
Correct example of dataset operation tracking
In the following example, unrelated_func() is placed outside the context, so the read operation is reported as successful.
from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)

unrelated_func()
Incorrect example of dataset operation tracking
In the following example, unrelated_func() is placed inside the context. The read is successful, but if unrelated_func() raises an exception, a failed read operation is reported to Databand.
from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    unrelated_func()
Supported dataset types
dataset_op_logger supports logging of Pandas and PySpark dataframes. It also supports lists of dictionaries.
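For example, a PySpark dataframe can be passed to logger.set in the same way as a Pandas dataframe. The following is a minimal sketch that assumes a local Spark session and an example S3 path:
from pyspark.sql import SparkSession

from dbnd import dataset_op_logger

spark = SparkSession.builder.appName("dataset-logging-example").getOrCreate()

with dataset_op_logger("s3://bucket/key/dataset.csv", "read") as logger:
    # Read the CSV into a PySpark dataframe and log it against the same URI.
    df = spark.read.csv("s3://bucket/key/dataset.csv", header=True, inferSchema=True)
    logger.set(data=df)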
List of dictionaries
When you fetch data from an external API, the data often comes in the following form:
data = [
    {
        "Name": "Name 1",
        "ID": 1,
        "Information": "Some information"
    },
    {
        "Name": "Name 2",
        "ID": 2,
        "Information": "Other information"
    },
    ...
]
When you provide this list of dictionaries as the data argument to dataset_op_logger, you can report its schema and volume:
from dbnd import dataset_op_logger

with dataset_op_logger("http://some/api/response.json", "read") as logger:
    logger.set(data=data)
Volume is determined by calculating the length of the list. In this example, the volume is 2.
Schema is determined by flattening the dictionary. In this example, the schema is:
Name: str
ID: int
Information: str
Optional parameters
In addition to the dataset path and operation type, dataset_op_logger also accepts optional parameters that can limit or enhance the metadata that is logged in your operation.
Not all parameters are supported for every type of dataset object. For more information about which parameters are supported by each type of object, see the Logged metadata section in Dataset logging.
If you provide user-defined metadata, the optional parameters are skipped when you run the script.
The following list shows the optional parameters, their default values, and descriptions:
- with_schema (True): Extracts the schema of the dataset so that you can view the column names and data types in Databand.
- with_preview (False): Extracts a preview of your dataset so that it can be displayed in Databand. The number of records in the preview depends on the size of the data, but it generally amounts to 10-20 preview records.
- with_stats (True): Calculates column-level stats on your dataset, including distinct and null counts, averages, standard deviations, mins and maxes, and quartiles. To enable column-level stats, the with_schema parameter cannot be set to False.
- with_histograms (False): Generates bar graphs that show the distribution of values in each column.
- with_partition (False): If the file paths of your datasets are partitioned, you can use with_partition=True to make sure that the same dataset across partitions resolves to a single dataset in Databand. For example, s3://bucket/key/date=20220415/file.csv and s3://bucket/key/date=20220416/file.csv are interpreted as two distinct datasets by default in Databand. Enabling the with_partition parameter ignores the partitions when it parses the dataset name. As a result, you can easily track trends and set alerts across runs.
The with_stats and with_histograms parameters increase the processing time of your pipeline because every value in every column must be profiled. To make sure that the performance tradeoff is acceptable, test these parameters in a development environment against datasets that are similar in size to the datasets in your production environment.
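The following sketch combines several of these parameters in one read operation. The partitioned S3 path is the example from the list above, and the Pandas read (which requires s3fs for S3 paths) is illustrative:
import pandas as pd

from dbnd import dataset_op_logger

op_path = "s3://bucket/key/date=20220415/file.csv"

with dataset_op_logger(
    op_path,
    "read",
    with_preview=True,    # capture a small sample of records for display in Databand
    with_stats=True,      # column-level statistics (requires with_schema, which is True by default)
    with_partition=True,  # resolve partitioned paths to a single dataset name
) as logger:
    df = pd.read_csv(op_path)  # reading from S3 requires s3fs
    logger.set(data=df)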
LogDataRequest parameter
With the LogDataRequest parameter, you can limit which columns are included when the system calculates statistics with with_stats or histograms with with_histograms. Because statistics and histograms are calculated at runtime and prevent other tasks from running until the calculations are complete, LogDataRequest helps ensure that you spend time and compute resources only on the columns that are relevant to your use case.
For example, there is probably little value in calculating statistics and histograms on your primary and foreign key fields, so you can use LogDataRequest to exclude those fields specifically.
The LogDataRequest has the following attributes:
- include_columns: A list of column names to include.
- exclude_columns: A list of column names to exclude.
- include_all_boolean, include_all_numeric, include_all_string: Flags that include all Boolean, numeric, or string columns, respectively.
The following example shows how to use LogDataRequest for the with_histograms option:
from dbnd import log_dataset_op, LogDataRequest

log_dataset_op(
    "customers_data",
    data,
    with_histograms=LogDataRequest(
        include_all_numeric=True,
        exclude_columns=["name", "phone"],
    ),
)
Alternatively, you can use the following helper methods:
LogDataRequest.ALL()
LogDataRequest.ALL_STRING()
LogDataRequest.ALL_NUMERIC()
LogDataRequest.ALL_BOOLEAN()
LogDataRequest.NONE()
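For example, the helper methods can be passed wherever a LogDataRequest is accepted. The following sketch limits stats to numeric columns and skips histograms entirely; the dataset name and dataframe are placeholders for illustration:
import pandas as pd

from dbnd import log_dataset_op, LogDataRequest

data = pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0], "name": ["a", "b"]})

log_dataset_op(
    "customers_data",
    data,
    with_stats=LogDataRequest.ALL_NUMERIC(),  # profile only numeric columns
    with_histograms=LogDataRequest.NONE(),    # skip histogram generation
)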
Providing user-defined meta information for dataset logging
With dataset_op_logger, you can also add user-defined meta information for your dataframe. In that case, call set_metadata instead of set. You can create a ValueMeta object by using the ValueMeta.basic function, as in the following example:
from dbnd import dataset_op_logger
from dbnd._core.tracking.schemas.column_stats import ColumnStatsArgs
# ValueMeta must also be imported from the dbnd package;
# its module path can vary between dbnd versions.

with dataset_op_logger(
    op_path="s3://path/to/file.csv",
    op_type="read"
) as logger:
    columns = [
        ColumnStatsArgs(column_name='col1', column_type='type_t'),
        ColumnStatsArgs('col2', 'type_t', mean_value=53.0),
        ColumnStatsArgs('col3', 'type_t', min_value=23.33),
    ]
    meta = ValueMeta.basic(columns, records_count=55)
    logger.set_metadata(meta)