Datasets

With Databand, you can track datasets for your runs.

Using dataset_op_logger

Using the Databand dataset_op_logger context manager, you can log numerous attributes about the datasets that are processed by your Python code. For example, you can log the dataset shape, schema, column-level stats, and whether the dataset operation was successful. To track datasets for your run:

  1. Connect your code to the Databand environment to enable tracking; see Python.
  2. Add the logging snippets to your code, as shown in the following examples.

As a result, Databand tracks these operations and sends the associated metadata to the Data interactions tab for the runs.

The following snippets show the pattern for read and write operations:

import pandas as pd

from dbnd import dataset_op_logger

# Read example
with dataset_op_logger(source_path, "read") as logger:
    df = read(source_path, ...)
    logger.set(data=df)

# Write example
df = pd.DataFrame(data)  # pandas is used here as an example; any supported dataset object works

with dataset_op_logger(target_path, "write") as logger:
    write(df, ...)
    logger.set(data=df)

A basic code example

The following code demonstrates how to use Databand's tracking to log metadata about a read operation on a dataset:

import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():

        op_path = "/path/to/value.csv"
        with dataset_op_logger(op_path=op_path, op_type="read") as logger:
            df = pd.read_csv(op_path)
            logger.set(data=df)

As you can see from the example, to report a dataset operation to Databand, you need to provide the following information in the code (a corresponding write example follows this list):

  • The path for your dataset. Provide the full URI for your file, table, API, or other dataset; see Dataset paths.
  • The type of operation. Specify whether the operation is a read or a write, and make sure that you wrap only operation-related code with the dataset_op_logger context manager; see Dataset operation context.
  • The dataframe that contains the records that are read or written in your operation. Provide this information so that Databand has visibility into your operation and can collect metadata.
  • Optionally, more parameters that limit or enhance the metadata that is logged for your operation; see Optional parameters.
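
The write case follows the same pattern. The following sketch assumes pandas and a hypothetical output path:

import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():

        op_path = "/path/to/output.csv"
        df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

        with dataset_op_logger(op_path=op_path, op_type="write") as logger:
            df.to_csv(op_path, index=False)
            logger.set(data=df)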

Dataset paths

In order for Databand to track your datasets from different sources, you must provide URI paths in a consistent manner. For example, a file might be written by one task and then read downstream by a different task. Consistency helps to make sure that this dataset is identified as the same dataset across tasks. The following sections show examples of correct URI formats for:

  • Standard file systems
  • Object storage
  • Data warehouse

URIs are case-sensitive.

Standard file systems or object storage

For standard file systems or object storage, provide a fully qualified URI. Look at the following examples:

  • file://data/local_data.csv
  • s3://bucket/key/dataset.csv
  • gs://bucket/key/dataset.csv
  • wasb://containername@accountname.blob.core.windows.net/dataset_path
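
For example, if one task writes an object to S3 and a downstream task reads it later, both tasks should log the identical URI so that Databand resolves the operations to the same dataset. The following sketch assumes pandas with S3 support (for example, s3fs) and a hypothetical bucket and key:

import pandas as pd

from dbnd import dataset_op_logger

uri = "s3://bucket/key/dataset.csv"

# Producing task: write the dataset and log the operation
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
with dataset_op_logger(uri, "write") as logger:
    df.to_csv(uri, index=False)
    logger.set(data=df)

# Consuming task: read the same dataset later, logging the same URI
with dataset_op_logger(uri, "read") as logger:
    df = pd.read_csv(uri)
    logger.set(data=df)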

Data warehouse

For data warehouses, provide the hostname of your warehouse, the region of the warehouse, or both, along with the path to your table. Look at the following examples:

  • bigquery://region/project/dataset/table
  • snowflake://name.region.cloud/database/schema/table
  • redshift://host/database/schema/table

Dataset operation context

Wrap only operation-related code with the dataset_op_logger context manager. Place anything that isn't related to reading or writing your dataset outside the context; otherwise, unrelated errors might flag your dataset operation as failing. Look at the following examples:

Correct example of dataset operation tracking

In the following example, unrelated_func() is placed outside the context, and the read is reported as successful.

from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)

unrelated_func()

Incorrect example of dataset operation tracking

In the following example, unrelated_func() is placed inside the context. The read itself is successful, but if unrelated_func() raises an exception, a failed read operation is reported to Databand.

from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    unrelated_func()

Supported dataset types

dataset_op_logger supports logging of Pandas and PySpark dataframes. It also supports a list of dictionaries.
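
Logging a PySpark dataframe works the same way as logging a pandas dataframe. The following sketch assumes a local SparkSession and a hypothetical Parquet path:

from pyspark.sql import SparkSession

from dbnd import dataset_op_logger

spark = SparkSession.builder.getOrCreate()

path = "s3://bucket/key/events.parquet"
with dataset_op_logger(path, "read") as logger:
    df = spark.read.parquet(path)
    logger.set(data=df)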

List of dictionaries

When you fetch data from an external API, the data often comes in the following form:

data = [
  {
    "Name": "Name 1",
    "ID": 1,
    "Information": "Some information"
  },
  {
    "Name": "Name 2",
    "ID": 2,
    "Information": "Other information"
  },
  ...
]

When you provide this list of dictionaries as the data argument to dataset_op_logger, you can report its schema and volume:

from dbnd import dataset_op_logger

with dataset_op_logger("http://some/api/response.json", "read"):
    logger.set(data=data)

Volume is determined by calculating the length of this list. In this example, the volume is 2.

Schema is determined by flattening the dictionary. In this example, the schema is:

  • Name: str
  • ID: int
  • Information: str
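
Putting this together, a fetch-and-log flow might look like the following sketch, which assumes the requests library and a hypothetical API endpoint:

import requests

from dbnd import dataset_op_logger

url = "http://some/api/response.json"
with dataset_op_logger(url, "read") as logger:
    # The response body is a list of dictionaries, as in the example above
    data = requests.get(url).json()
    logger.set(data=data)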

Optional parameters

In addition to the dataset path and operation type, dataset_op_logger also accepts certain optional parameters that can limit or enhance the metadata that is logged in your operation.
Not all parameters are supported for every type of dataset object. For more information about which parameters are supported by each type of object, see the Logged metadata section in Dataset logging.

If you provide user-defined metadata, the optional parameters are skipped when you run the script.

The following list shows the optional parameters, their default values, and descriptions:

with_schema (True)

Extracts the schema of the dataset so that you can view the column names and data types in Databand.

with_preview (False)

Extracts a preview of your dataset so that it can be displayed in Databand. The number of records in the preview depends on the size of the data, but it generally amounts to 10-20 preview records.

with_stats (True)

Calculates column-level stats on your dataset. The stats include measures such as distinct and null counts, averages, standard deviations, minimums and maximums, and quartiles. Column-level stats require the schema, so with_schema must not be set to False.

with_histograms (False)

Generates bar graphs that show the distribution of values in each column.

with_partition (False)

If the file paths of your datasets are partitioned, you can use with_partition=True to make sure that the same dataset across partitions resolves to a single dataset in Databand. For example, s3://bucket/key/date=20220415/file.csv and s3://bucket/key/date=20220416/file.csv are interpreted as two distinct datasets by default. Enabling the with_partition parameter makes Databand ignore the partition segments when it parses the dataset name. As a result, you can easily track trends and set alerts across runs.

The with_stats and with_histograms parameters increase the execution time of your pipeline because every value in every column must be profiled. To make sure that the performance tradeoff is acceptable, test these parameters in a development environment against datasets that are similar in size to your production datasets.
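
For example, the following sketch combines several of these parameters; the partitioned path is hypothetical:

import pandas as pd

from dbnd import dataset_op_logger

op_path = "s3://bucket/key/date=20220415/file.csv"
with dataset_op_logger(
    op_path,
    "read",
    with_preview=True,      # log a small preview of the records
    with_stats=True,        # requires the schema, which is extracted by default
    with_histograms=False,  # keep histograms off to limit the overhead
    with_partition=True,    # resolve partitioned paths to a single dataset
) as logger:
    df = pd.read_csv(op_path)
    logger.set(data=df)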

LogDataRequest parameter

With the LogDataRequest parameter, you can limit which columns are included when the system calculates statistics with with_stats or histograms with with_histograms. Because statistics and histograms are calculated at runtime and block subsequent tasks until the calculations are complete, LogDataRequest helps ensure that you spend time and compute resources only on the columns that are relevant to your use case. For example, there is probably little value in calculating statistics and histograms on your primary and foreign key fields, so you can use LogDataRequest to exclude those fields specifically.

The LogDataRequest has the following attributes:

include_columns

A list of column names to include.

exclude_columns

A list of column names to exclude.

include_all_boolean, include_all_numeric, include_all_string

Flags that include all Boolean, numeric, or string columns, respectively.

The following example shows how to use LogDataRequest for the with_histograms option:

from dbnd import log_dataset_op, LogDataRequest

log_dataset_op("customers_data",
               data,
               with_histograms=LogDataRequest(
                   include_all_numeric=True,
                   exclude_columns=["name", "phone"])
               )

Alternatively, you can use the following helper methods:

  • LogDataRequest.ALL()
  • LogDataRequest.ALL_STRING()
  • LogDataRequest.ALL_NUMERIC()
  • LogDataRequest.ALL_BOOLEAN()
  • LogDataRequest.NONE()
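
For example, the following sketch mirrors the call above but calculates column-level stats for all numeric columns only (data is the same illustrative variable as in the previous example):

from dbnd import log_dataset_op, LogDataRequest

log_dataset_op("customers_data",
               data,
               with_stats=LogDataRequest.ALL_NUMERIC())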

Providing user-defined meta information for dataset logging

With dataset_op_logger, you can also provide user-defined meta information for your dataframe. In such a case, call set_metadata instead of set. You can create a ValueMeta object by using the ValueMeta.basic function, as in the following example:

from dbnd import dataset_op_logger
from dbnd._core.tracking.schemas.column_stats import ColumnStatsArgs
# ValueMeta must also be imported; its module path can differ between dbnd versions

with dataset_op_logger(
        op_path="s3://path/to/file.csv",
        op_type="read"
) as logger:
    # Describe each column: name, type, and any available statistics
    columns = [ColumnStatsArgs(column_name='col1', column_type='type_t'),
               ColumnStatsArgs('col2', 'type_t', mean_value=53.0),
               ColumnStatsArgs('col3', 'type_t', min_value=23.33)]

    # Attach the user-defined metadata to the logged operation
    meta = ValueMeta.basic(columns, records_count=55)
    logger.set_metadata(meta)