Datasets
With Databand, you can track datasets for your runs.
Using dataset_op_logger
With the Databand dataset_op_logger context manager, you can log numerous attributes about the datasets that your Python code processes: for example, the dataset shape, schema, column-level stats, and whether the dataset operation was successful. To track datasets for your run:
- Connect to the Databand environment to enable tracking.
- Add the relevant code snippets to your code (see the following examples). Databand then tracks these operations and sends the associated metadata to the Data Interactions tab.
Code snippet examples:
from dbnd import dataset_op_logger

Read example
with dataset_op_logger(source_path, "read") as logger:
    df = read(path, ...)
    logger.set(data=df)

Write example
df = Dataframe(data)
with dataset_op_logger(target_path, "write") as logger:
    write(df, ...)
    logger.set(data=df)
To report a dataset operation to Databand, you need to provide the following information:
- The path of your dataset. Provide the full URI for your file, table, API, or other dataset.
- The type of operation. Specify whether the operation is a read or a write.
- The dataframe that contains the records that are read or written in your operation. Provide this information so that Databand has visibility into your operation and can collect metadata.
A basic code example
The following code demonstrates how to use Databand's tracking system to log metadata about the operations performed on datasets.
import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():
        op_path = "/path/to/value.csv"
        with dataset_op_logger(op_path=op_path, op_type="read") as logger:
            df = pd.read_csv(op_path, ...)
            logger.set(data=df)
Dataset paths
For Databand to track your datasets from different sources, you must provide URI paths in a consistent manner. For example, a file might be written by one task and then read downstream by a different task. Consistency helps to make sure that this dataset is identified as the same dataset across tasks (see the sketch at the end of this section). The following sections show examples of correct URI formats for:
- Standard file systems or object storage
- Data warehouses
URIs are case-sensitive.
Standard file systems or object storage
For standard file systems or object storage, provide a fully qualified URI. Look at the following examples:
file://data/local_data.csv
s3://bucket/key/dataset.csv
gs://bucket/key/dataset.csv
wasb://containername@accountname.blob.core.windows.net/dataset_path
Data warehouse
For data warehouses, provide the hostname of your warehouse, the region of the warehouse, or both, along with the path to your table. Look at the following examples:
bigquery://region/project/dataset/table
snowflake://name.region.cloud/database/schema/table
redshift://host/database/schema/table
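To illustrate the point about consistency, the following sketch shows one URI shared by a writing task and a downstream reading task, so that both operations resolve to the same dataset in Databand. The bucket, file path, and task functions are hypothetical, and reading or writing s3:// paths with pandas assumes that s3fs is installed.
import pandas as pd
from dbnd import dataset_op_logger

# Hypothetical URI; both tasks use the exact same string.
DATASET_URI = "s3://my-bucket/daily/customers.csv"

def producer_task(df):
    # Upstream task writes the dataset and logs the write operation.
    with dataset_op_logger(DATASET_URI, "write") as logger:
        df.to_csv(DATASET_URI, index=False)
        logger.set(data=df)

def consumer_task():
    # Downstream task reads the same URI, so Databand links the
    # read operation to the dataset that was written above.
    with dataset_op_logger(DATASET_URI, "read") as logger:
        df = pd.read_csv(DATASET_URI)
        logger.set(data=df)
    return df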
Dataset operation context
Wrap only operation-related code with the dataset_op_logger context manager. Place anything that isn't related to reading or writing your dataset outside the context; otherwise, unrelated errors might flag your dataset operation as failing. Look at the following examples:
A correct example of dataset operation tracking
from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)

# Read is successful
unrelated_func()
An incorrect example of dataset operation tracking
from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    # Read is successful
    unrelated_func()
    # If unrelated_func raises an exception, a failed read operation is reported to Databand.
Supported dataset types
Databand supports the following dataset types:
Built-in support
By default, dataset_op_logger supports logging of Pandas and PySpark dataframes.
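The following is a minimal sketch of logging a PySpark dataframe read; the Spark session setup and file path are assumptions for illustration only.
from dbnd import dataset_op_logger
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-logging-example").getOrCreate()

# Hypothetical path; a PySpark dataframe can be passed to logger.set()
# in the same way as a Pandas dataframe.
path = "s3://bucket/key/dataset.csv"

with dataset_op_logger(path, "read") as logger:
    df = spark.read.csv(path, header=True, inferSchema=True)
    logger.set(data=df)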
List of dictionaries
When you fetch data from an external API, the data often comes in the following form:
data = [
    {
        "Name": "Name 1",
        "ID": 1,
        "Information": "Some information"
    },
    {
        "Name": "Name 2",
        "ID": 2,
        "Information": "Other information"
    },
    ...
]
When you provide this list of dictionaries as the data argument to dataset_op_logger, you can report its schema and volume:
from dbnd import dataset_op_logger

with dataset_op_logger("http://some/api/response.json", "read") as logger:
    logger.set(data=data)
Volume is determined by calculating the length of the list. In this example, the volume is 2.
Schema is determined by flattening the dictionary. In this example, the schema is: Name: str, ID: int, Information: str.
Optional parameters
In addition to the dataset path and operation type, dataset_op_logger also accepts optional parameters that limit or enhance the metadata that is logged for your operation.
Not all parameters are supported for every type of dataset object. For more information about which parameters are supported by each type of object, see the Logged metadata section in Dataset logging.
If you provide user-defined metadata, the optional parameters are skipped when you run the script.
The following list shows the optional parameters, their default values, and descriptions:
- with_schema (True): Extracts the schema of the dataset so that you can view the column names and data types in Databand.
- with_preview (False): Extracts a preview of your dataset so that it can be displayed in Databand. The number of records in the preview depends on the size of the data, but it generally amounts to 10-20 preview records.
- with_stats (True): Calculates column-level stats on your dataset, including numerous statistical measures such as distinct and null counts, averages, standard deviations, mins and maxes, and quartiles. To enable column-level stats, with_schema cannot be set to False.
- with_histograms (False): Generates bar graphs that show the distribution of values in each column.
- with_partition (False): If the file paths of your datasets are partitioned, you can use with_partition=True to make sure that the same dataset across partitions resolves to a single dataset in Databand. For example, s3://bucket/key/date=20220415/file.csv and s3://bucket/key/date=20220416/file.csv are interpreted as two distinct datasets by default. Enabling the with_partition parameter ignores the partitions when Databand parses the dataset name. As a result, you can easily track trends and set alerts across runs.
The with_stats and with_histograms parameters increase the run time of your pipeline because every value in every column must be profiled. To make sure that the performance tradeoff is acceptable, test these parameters in a development environment against datasets that are a similar size to the datasets in your production environment.
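The following sketch shows how these optional parameters might be combined in a single call. The file path is hypothetical, and the parameter values are chosen only to illustrate overriding the defaults.
import pandas as pd
from dbnd import dataset_op_logger

# Hypothetical partitioned path; with_partition=True resolves the
# date=... partitions to a single dataset in Databand.
op_path = "s3://bucket/key/date=20220415/file.csv"

with dataset_op_logger(
    op_path,
    "read",
    with_schema=True,       # default: log column names and types
    with_preview=True,      # non-default: include a small record preview
    with_stats=True,        # default: calculate column-level statistics
    with_histograms=False,  # default: skip histograms to limit run time
    with_partition=True,    # non-default: ignore partition segments in the name
) as logger:
    df = pd.read_csv(op_path)
    logger.set(data=df)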
LogDataRequest parameter
The LogDataRequest parameter controls which columns a histogram is calculated for. For example, you can specify that only Boolean columns are calculated, or set the include_all_string=True attribute to enable a histogram for all string columns.
The LogDataRequest has the following attributes:
- include_columns: A list of column names to include.
- exclude_columns: A list of column names to exclude.
- include_all_boolean, include_all_numeric, include_all_string: Flags that include all Boolean, numeric, or string columns, respectively.
The LogDataRequest is a valid parameter for the with_stats and with_histograms options only.
The following example shows how to use LogDataRequest with the with_histograms option:
from dbnd import log_dataset_op, LogDataRequest

log_dataset_op(
    "customers_data",
    data,
    with_histograms=LogDataRequest(
        include_all_numeric=True,
        exclude_columns=["name", "phone"],
    ),
)
Alternatively, you can use the following helper methods:
LogDataRequest.ALL()
LogDataRequest.ALL_STRING()
LogDataRequest.ALL_NUMERIC()
LogDataRequest.ALL_BOOLEAN()
LogDataRequest.NONE()
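As a short illustration, a helper method can be passed in the same way as a hand-built LogDataRequest; the dataset name and data variable are reused from the previous example.
from dbnd import log_dataset_op, LogDataRequest

# Calculate histograms for all string columns only.
log_dataset_op(
    "customers_data",
    data,
    with_histograms=LogDataRequest.ALL_STRING(),
)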
Providing user-defined meta information for dataset logging
With dataset_op_logger, you can also add user-defined meta information for your dataframe. In such a case:
- You don't need to call set_data.
- Use set_metadata instead. You can create a ValueMeta object by using the ValueMeta.basic function, as in the following example:
from dbnd import dataset_op_logger
from dbnd._core.tracking.schemas.column_stats import ColumnStatsArgs
# ValueMeta must also be imported; the module path depends on your dbnd version.

with dataset_op_logger(
    op_path="s3://path/to/file.csv",
    op_type="read"
) as logger:
    columns = [
        ColumnStatsArgs(column_name='col1', column_type='type_t'),
        ColumnStatsArgs('col2', 'type_t', mean_value=53.0),
        ColumnStatsArgs('col3', 'type_t', min_value=23.33),
    ]
    meta = ValueMeta.basic(columns, records_count=55)
    logger.set_metadata(meta)