Tracking Redshift
Overview
Databand allows you to log your dataset operations when using Python to call SQL commands on Redshift. Wrapping your Redshift cursor's execution with Databand's RedshiftTracker
context manager will catch the cursor's result and
extract Dataset Logging operations from it. Currently, only COPY queries are supported.
Requirements
This guide assumes that your Redshift is configured to accept inbound connections. We will be using psycopg2
to connect to Redshift. Currently, DBND only supports psycopg2
connections as the connection parameter.
Make sure that the dbnd-redshift
package is installed (via pip install databand[redshift]
, for example). See more info at Installing DBND.
Integrating RedshiftTracker
Assume the following code is what you are currently using to copy files into a Redshift table:
import psycopg2
SQL_QUERY = """
COPY DB.PUBLIC.TABLE
FROM s3://path/to/some/file/file.csv
iam_role '<role>' csv;
"""
with psycopg2.connect(
host=REDSHIFT_HOST,
port=REDSHIFT_PORT,
database=REDSHIFT_DB,
user=REDSHIFT_USER,
password=REDSHIFT_PASSWORD
) as con:
c = con.cursor()
c.execute(SQL_QUERY)
To log the results of your query with Databand, wrap the execution of your cursor with the RedshiftTracker
context manager:
from dbnd_redshift import RedshiftTracker, RedshiftTrackerConfig
with RedshiftTracker(
conf=RedshiftTrackerConfig(with_preview=True, with_stats=True, with_schema=True)
):
c.execute(SQL_QUERY)
The above will capture the results of the executed query. Only one query execution should be provided for each RedshiftTracker
context.
Optional Parameters
By default, RedshiftTracker
will capture the paths of your file and Redshift table, the column and row counts of the data being copied, and the schema of the data being copied. In addition to these metrics, additional metadata can
be captured by passing the optional parameters below to RedshiftTrackerConfig
as part of your RedshiftTracker
integration:
with_preview=True
: Display a sample of your data in Databand (approximately 10-20 records).with_stats=True
: Calculate the following column-level stats for the data being copied:- Null count and percentage
- Non-null count
- Distinct count
- Mean
- Standard deviation
- Min and max
- 25%/50%/75% quartiles
with_percentiles=False
: This will disable calculation of the 25%/50%/75% quartiles whenwith_stats=True
. This can help speed up the calculation of column-level statistics on large datasets.with_partition=True
: If your file path includes partitioning such as/date=20220415/
, you can use this parameter to ignore partitions in the parsing of your dataset names. This will help ensure that datasets across runs that have different partitioning will still be evaluated as the same dataset for the sake of trends and alerts.
When using with_stats=True
or with_preview=True
, RedshiftTracker
creates a temporal table in your database using the copied records, and this is used to extract stats and/or a preview of the data. This
results in additional queries being sent to your Redshift cluster.
Our benchmarks for the additional overhead from using these parameters are as follows:
- CPU usage is increased by approximately 2.5x when using
with_stats=True
- Total processing time is increased by approximately 2.5-3.5X when using
with_stats=True
For example, a query copying 75 million records took 2m54s as a baseline, and that same query took 9m24s when using with_stats=True
.
Additional considerations:
- When using
with_stats=True
, execution of subsequent tasks will be blocked until the column-level stats have been calculated and extracted asRedshiftTracker
's queries are executed synchronously with the COPY query. - The queries initiated by
RedshiftTracker
do not support queue prioritization, so they will be given the same priority as any other queries that may be queued.
Viewing Redshift Operations in Databand
Operations logged by RedshiftTracker
will result in two datasets: the read from the file and the write to the Redshift table. Metrics related to these datasets such as the row counts, schemas, and column-level stats can be viewed
on both the Data Interactions tab of your pipeline run as well as the Datasets page in your Databand environment.
Data Interactions Tab
Datasets Page
Limitations
- Nested queries are not supported by
RedshiftTracker
(e.g. COPY (SELECT * FROM TABLE) table FROM...). - As part of column-level stats collection, the mean value for a column is calculated using the DOUBLE PRECISION data type.