Datasets

With Databand, you can track datasets for your runs.

Using dataset_op_logger

Using the Databand dataset_op_logger context manager, you can log numerous attributes about the datasets that your Python code processes. For example, you can log the dataset shape, schema, column-level stats, and whether the dataset operation was successful. To track datasets for your run:

  1. Connect to the Databand environment to enable tracking.
  2. Add the relevant code snippets to your code (see the following examples). Databand then tracks these operations and sends the associated metadata to the Data Interactions tab.

Code snippet examples:

from dbnd import dataset_op_logger

#### Read example
with dataset_op_logger(source_path, "read") as logger:
    df = read(path, ...)
    logger.set(data=df)

#### Write example
df = Dataframe(data)

with dataset_op_logger(target_path, "write") as logger:
    write(df, ...)
    logger.set(data=df)

To report a dataset operation to Databand, you need to provide the following information:

  • The path for your dataset. Provide the full URI for your file, table, API, or other dataset.

  • The type of operation. Specify whether the operation is a read or a write.

  • The dataframe that contains the records that are read or written in your operation. Provide this information so that Databand has visibility into your operation and metadata can be collected.

A basic code example

The following code demonstrates how to use Databand's tracking system to log metadata about the operations performed on datasets.

import pandas as pd

from dbnd import dbnd_tracking, dataset_op_logger

if __name__ == '__main__':
    with dbnd_tracking():

        op_path = "/path/to/value.csv"
        with dataset_op_logger(op_path=op_path, op_type="read") as logger:
            df = pd.read_csv(op_path, ...)
            logger.set(data=df)

Dataset paths

For Databand to track your datasets from different sources, you must provide URI paths in a consistent manner. For example, a file might be written by one task and then read downstream by a different task; consistent URIs make sure that it is identified as the same dataset across tasks. The following sections show examples of correct URI formats for:

  • Standard file systems or object storage
  • Data warehouses

URIs are case-sensitive.

Standard file systems or object storage

For standard file systems or object storage, provide a fully qualified URI. Look at the following examples:

  • file://data/local_data.csv
  • s3://bucket/key/dataset.csv
  • gs://bucket/key/dataset.csv
  • wasb://containername@accountname.blob.core.windows.net/dataset_path
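
For example, if one task writes a dataset and a downstream task reads it, logging both operations with the same fully qualified URI lets Databand identify them as a single dataset. The following sketch assumes pandas dataframes and a placeholder S3 path; reading and writing s3:// paths directly with pandas also requires the s3fs package:

import pandas as pd

from dbnd import dataset_op_logger

# Use the same fully qualified URI in both tasks so that Databand resolves
# both operations to a single dataset. The bucket and key are placeholders.
DATASET_URI = "s3://bucket/key/dataset.csv"

def producer_task(df):
    # Upstream task writes the dataset.
    with dataset_op_logger(DATASET_URI, "write") as logger:
        df.to_csv(DATASET_URI, index=False)
        logger.set(data=df)

def consumer_task():
    # Downstream task reads the same dataset by the same URI.
    with dataset_op_logger(DATASET_URI, "read") as logger:
        df = pd.read_csv(DATASET_URI)
        logger.set(data=df)
    return df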

Data warehouse

For data warehouses, provide the hostname of your warehouse, the region of the warehouse, or both, along with the path to your table. Look at the following examples:

  • bigquery://region/project/dataset/table
  • snowflake://name.region.cloud/database/schema/table
  • redshift://host/database/schema/table

Dataset operation context

Wrap only operation-related code with the dataset_op_logger context manager. Place anything that isn't related to reading or writing your dataset outside the context, otherwise unrelated errors might flag your dataset operation as failing. Look at the following examples:

A correct example of dataset operation tracking

from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    # Read is successful

unrelated_func()

An incorrect example of dataset operation tracking

from dbnd import dataset_op_logger

with dataset_op_logger("s3://path/to/file.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    # Read is successful
    unrelated_func()
    # If unrelated_func raises an exception, a failed read operation is reported to Databand.

Supported dataset types

Databand supports the following dataset types:

Built-in support

By default, dataset_op_logger supports logging of Pandas and PySpark dataframes.
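
For example, a PySpark dataframe can be passed to logger.set in the same way as a pandas dataframe. A minimal sketch, assuming an existing Spark session and a placeholder Parquet path:

from dbnd import dataset_op_logger
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Log a read operation on a Parquet dataset as a PySpark dataframe.
# The path is a placeholder.
with dataset_op_logger("s3://bucket/key/events.parquet", "read") as logger:
    df = spark.read.parquet("s3://bucket/key/events.parquet")
    logger.set(data=df)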

List of dictionaries

When you fetch data from an external API, the data often comes in the following form:

data = [
  {
    "Name": "Name 1",
    "ID": 1,
    "Information": "Some information"
  },
  {
    "Name": "Name 2",
    "ID": 2,
    "Information": "Other information"
  },
  ...
]

When you provide this list of dictionaries as the data argument to dataset_op_logger, you can report its schema and volume:

from dbnd import dataset_op_logger

with dataset_op_logger("http://some/api/response.json", "read"):
    logger.set(data=data)

Volume is determined by calculating the length of this list. In this example, the volume is 2.

Schema is determined by flattening the dictionary. In this example, the schema is: Name: str, ID: int, Information: str.

Optional parameters

In addition to the dataset path and operation type, dataset_op_logger also accepts certain optional parameters that limit or enhance the metadata that is logged for your operation.

Not all parameters are supported for every type of dataset object. For more information about which parameters are supported by each type of object, see the Logged metadata section in Dataset logging.

If you provide user-defined metadata, the optional parameters are skipped when you run the script.

The following list shows the optional parameters, their default values, and descriptions:

  • with_schema (True): Extracts the schema of the dataset so that you can view the column names and data types in Databand.

  • with_preview (False): Extracts a preview of your dataset so that it can be displayed in Databand. The number of records in the preview depends on the size of the data, but it generally amounts to 10-20 preview records.

  • with_stats (True): Calculates column-level stats on your dataset. The stats include numerous statistical measures such as distinct and null counts, averages, standard deviations, mins and maxes, and quartiles. To enable column-level stats, with_schema cannot be set to False.

  • with_histograms (False): Generates bar graphs that show the distribution of values in each column.

  • with_partition (False): If the file paths of your datasets are partitioned, you can use with_partition=True to make sure that the same dataset across partitions resolves to a single dataset in Databand.
    For example, s3://bucket/key/date=20220415/file.csv and s3://bucket/key/date=20220416/file.csv are interpreted as two distinct datasets by default in Databand. With the with_partition parameter enabled, Databand ignores the partition segments when it parses the dataset name. As a result, you can easily track trends and set alerts across runs.

The with_stats and with_histograms parameters can increase the runtime of your pipeline because every value in every column must be profiled. To make sure that the performance tradeoff is acceptable, test the parameters in a development environment against datasets that are a similar size to the datasets in your production environment.
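
For example, you can combine several of these parameters in a single call. The following sketch reads one partition of a dataset, captures a preview, skips column-level stats, and resolves the partitioned path to a single dataset; the path and dataframe are placeholders:

import pandas as pd

from dbnd import dataset_op_logger

op_path = "s3://bucket/key/date=20220415/file.csv"

with dataset_op_logger(
    op_path,
    "read",
    with_preview=True,     # capture a small preview of the records
    with_stats=False,      # skip column-level profiling
    with_partition=True,   # resolve partitioned paths to one dataset
) as logger:
    df = pd.read_csv(op_path)
    logger.set(data=df)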

LogDataRequest parameter

The LogDataRequest parameter controls which columns histograms are calculated for. For example, you can calculate histograms for Boolean columns only, or set include_all_string=True to enable histograms for all string columns.

The LogDataRequest has the following attributes:

  • include_columns: A list of column names to include
  • exclude_columns: A list of column names to exclude
  • include_all_boolean, include_all_numeric, include_all_string: Flags that include all Boolean, numeric, or string columns, respectively.

The LogDataRequest is a valid parameter for the with_stats and with_histograms options only.

The following example shows how to use LogDataRequest for the with_histograms option:

from dbnd import log_dataset_op, LogDataRequest

log_dataset_op("customers_data",
               data,
               with_histograms=LogDataRequest(
                   include_all_numeric=True,
                   exclude_columns=["name", "phone"])
               )

Alternatively, you can use the following helper methods:

  • LogDataRequest.ALL()
  • LogDataRequest.ALL_STRING()
  • LogDataRequest.ALL_NUMERIC()
  • LogDataRequest.ALL_BOOLEAN()
  • LogDataRequest.NONE()
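
For example, the helper methods can be passed anywhere a LogDataRequest is accepted. A short sketch, reusing the placeholder dataset name and data object from the previous example, that calculates stats for numeric columns only and disables histograms:

from dbnd import log_dataset_op, LogDataRequest

# "customers_data" and data are the same placeholders as in the previous example.
log_dataset_op("customers_data",
               data,
               with_stats=LogDataRequest.ALL_NUMERIC(),
               with_histograms=LogDataRequest.NONE())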

Providing user-defined meta information for dataset logging

With dataset_op_logger, you can also provide user-defined meta information for your dataframe. In that case:

  • You don't need to call logger.set(data=...).
  • Call logger.set_metadata instead. You can create a ValueMeta object by using the ValueMeta.basic function, as in the following example:

from dbnd import dataset_op_logger
from dbnd._core.tracking.schemas.column_stats import ColumnStatsArgs
# ValueMeta must also be imported from your dbnd installation.

with dataset_op_logger(
        op_path="s3://path/to/file.csv",
        op_type="read"
) as logger:
    columns = [ColumnStatsArgs(column_name='col1', column_type='type_t'),
               ColumnStatsArgs('col2', 'type_t', mean_value=53.0),
               ColumnStatsArgs('col3', 'type_t', min_value=23.33)]

    meta = ValueMeta.basic(columns, records_count=55)
    logger.set_metadata(meta)
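
With this approach, Databand reports the column-level stats and record count that you supply rather than computing them by profiling the dataframe itself.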