Python SDK configuration

The Python Databand SDK uses its own configuration system that you can use to set and update your configuration in the way that best suits your needs.

The Databand SDK configuration system consists of the following sources:

  • Environment variables
  • Configuration files
  • Code
  • External configs, such as an Airflow connection

Environment variables

You can set any system parameter by using environment variables. For example, you can override the databand_url parameter under the core section by setting a value for the DBND__CORE__DATABAND_URL environment variable:
export DBND__CORE__DATABAND_URL="https://yourdataband-service.databand.ai"

Similarly, you can override any environment variable for other configuration parameters by using the DBND__<SECTION>__<KEY> format.
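For example, to set the databand_access_token parameter from the [core] section (described later in this topic), you could use:
export DBND__CORE__DATABAND_ACCESS_TOKEN="<your-access-token>"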

Configuration files in Databand

Databand loads configuration information sequentially from the following configuration files:

Table 1. List of configuration files, their location, and loading priority
File loading priority | File location | File description
1 | $DBND_LIB/databand-core.cfg | Provides the default core configuration of the system that cannot be changed.
2 | $DBND_SYSTEM/databand-system.cfg | Provides middle-layer configuration. Use this file to configure project infrastructure.
3 | $DBND_HOME/project.cfg | Provides a project configuration. Use this file to configure user-facing parts of the project.
4 | $USER_HOME/.dbnd/databand.cfg | Provides system user configuration.

You can also keep configuration files in a custom location. Use the DBND__CONF__FILE environment variable to point Databand to a custom configuration file.

Configuration values in files that are loaded later override values in files that are loaded earlier. For example, suppose config key A is specified in $DBND_SYSTEM/databand-system.cfg. If A is also specified in $DBND_HOME/project.cfg, Databand uses the value from $DBND_HOME/project.cfg, because that file is loaded later. For more information, see ConfigParser.read.
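For example, a sketch with a hypothetical some_section and some_key, where both files define the same key:

# $DBND_SYSTEM/databand-system.cfg
[some_section]
some_key = system_value

# $DBND_HOME/project.cfg (loaded later, so this value is used)
[some_section]
some_key = project_value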

Environment variables in configuration files

You can use $DBND_HOME, $DBND_LIB, or $DBND_SYSTEM in your configuration file, or any other environment variable, as shown in the following example:

[core]
databand_url="${YOUR_ENV_VARIABLE}"

Changing the configuration for a specific section of the code

You can use the config context manager to set up the configuration in the code:

from dbnd import config

with config({"section": {"key": "value"}}):
    pass

You can also load configuration from a file:

from dbnd import config
from dbnd._core.configuration.config_readers import read_from_config_file

with config(read_from_config_file("/path/to/config.cfg")):
    pass
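The override applies only while the with block is active. The following minimal sketch (with hypothetical section and key names) reads the value back with get, assuming the imported config object is the same dbnd_config object described later in this topic:

from dbnd import config

# Hypothetical section and key names, used for illustration only.
with config({"some_section": {"some_key": "some_value"}}):
    # Inside the block, the override is visible through config.get().
    print(config.get("some_section", "some_key"))  # prints "some_value"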

Python SDK advanced configuration

With Python Databand SDK, you can also perform more advanced configuration:

  • Passing list parameters in a .cfg file
  • Passing a dictionary parameter in a .cfg file
  • Using configuration files for different use cases, such as files with default variables overridden in production or test
  • Using multiple extra configuration files
  • Controlling the output type of dbnd_config

Passing list parameters in a .cfg file

To pass list parameters in a .cfg file, use the following syntax:

[some_section]
list_param = [1, 2, 3, "str1"]

Passing a dictionary parameter in a .cfg file

To pass dictionary parameters in a .cfg file, use the following syntax:

[some_section]
dict_param = {"key1": "value1", 255: 3}

Using configuration files for different use cases, such as files with default variables overridden in production or test

You can also use extra configuration files for different use cases, for example files where some default variables are overridden in production or test. To specify such a file, set the DBND_CONFIG environment variable:

export DBND_CONFIG=<extra_file_path>

Using multiple extra configuration files

With the --conf command-line option or the DBND__DATABAND__CONF environment variable, you can add multiple extra configuration files as a comma-separated list of file paths.
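For example, with hypothetical file paths:

export DBND__DATABAND__CONF="/path/to/first.cfg,/path/to/second.cfg"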

Controlling the output type of dbnd_config

dbnd_config is a dict-like object that stores only the mapping from keys to values.

To control the output type of dbnd_config.get("section", "key"), you can use getboolean, getint, or getfloat for the corresponding primitive types.
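A minimal sketch, using hypothetical section and key names, and assuming the dbnd_config object is the config object imported from dbnd in the earlier examples:

from dbnd import config

# Hypothetical section and key names, used for illustration only.
value = config.get("some_section", "some_key")          # returned without type conversion
flag = config.getboolean("some_section", "some_flag")   # parsed as bool
count = config.getint("some_section", "some_count")     # parsed as int
ratio = config.getfloat("some_section", "some_ratio")   # parsed as float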

Changing the configuration to control the tracking store behavior when errors occur

The Python Databand SDK uses the tracking system to report the state of runs and tasks to the Databand web server. Errors can occur while this information is reported, and these errors can result in invalid run states in the Databand web application.

The tracking system can use several tracking stores, and each store reports the information to a different location, for example:

  • Web tracking store - reports to the Databand web server.
  • Console tracking store - writes the events to the console.

To control the behavior of the tracking system when errors occur, use the following configurations under the core section:

[core]
remove_failed_store=true
tracker_raise_on_error=true

remove_failed_store
Removes a tracking store if it fails multiple times. Default value = false.
max_tracking_store_retries
Defines the maximum number of retries allowed for a single tracking store call if it fails. Default value = 2.
tracker_raise_on_error
Stops the run with an error if a critical error occurs in the tracking system, such as a failure to connect to the web server. Default value = true.

Changing the configuration to control logging of parameters within decorated functions

The logging configuration options help you save computational resources and protect against reporting of sensitive data, such as full data previews.

Logging data processes and full data quality reports in Databand can be resource-intensive. However, explicitly turning off all calculations for log_value_size, log_value_schema, log_value_stats, log_value_preview, log_value_preview_max_len, log_value_meta results in valuable metrics not being tracked at all. To help you better manage logging performance and visibility needs, it's now possible to selectively calculate and log metadata through a zero-computational-cost approach.

With the value_reporting_strategy configuration, you can decide whether certain information is logged. value_reporting_strategy changes nothing in your code, but acts as a guard (or fuse) before the value calculation code is executed:

[tracking]
value_reporting_strategy=SMART

The following options are available:

ALL
No restrictions on logging. All the log_value_ options are on: log_value_size, log_value_schema, log_value_stats, log_value_preview, log_value_preview_max_len, log_value_meta.
SMART
Restricts calculations for lazy evaluation types. For types like Spark, values are calculated only when they are needed. Even if you have log_value_preview set to True, Spark previews are not logged when the SMART strategy is on.
NONE
No logging of anything expensive or potentially problematic. This option can be useful if you have values that contain private or sensitive information and you don't want them to be logged.

Most users can benefit from using the SMART option for logging.
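If you prefer environment variables, the same setting can be expressed with the DBND__<SECTION>__<KEY> format described earlier in this topic:

export DBND__TRACKING__VALUE_REPORTING_STRATEGY=SMART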

The list of available [tracking] configuration parameters

You can add the following parameters to the [tracking] configuration; an example configuration follows this list:

project
Set the project to which the run is assigned. If you don't set this value, the default project is used. The tracking server selects a project with is_default == True.
databand_external_url
Set a tracker URL to be used for tracking from external systems.
log_value_size
Calculate and log the value's size. Enabling this parameter causes a full scan on nonindexable distributed memory objects.
log_value_schema
Calculate and log the value's schema.
log_value_stats
Calculate and log the value's stats. This parameter is expensive to calculate, so it can be better to use log_stats on the parameter level.
log_value_preview
Calculate and log the value's preview. This parameter can be expensive to calculate on Spark.
log_value_preview_max_len
Set the max size of the value's preview to be saved at the service. The max value of this parameter is 50000.
log_value_meta
Calculate and log the value's meta.
log_histograms
Enable calculation and tracking of histograms. This parameter can be expensive.
value_reporting_strategy
Set the strategy used for the reporting of values. You have multiple strategy options, each with different limitations on potentially expensive calculations for value_meta. ALL removes all limitations. SMART limits lazy evaluation types. NONE, which is the default value, limits everything.
track_source_code
Enable tracking of function, module, and file source code.
auto_disable_slow_size
Enable automatically disabling slow previews for Spark DataFrame with text formats.
flatten_operator_fields
Control which of the operator's fields are flattened when tracked.
capture_tracking_log
Enable log-capturing for tracking tasks.
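For example, a sketch of a [tracking] section with illustrative values (adjust them to your own needs):

[tracking]
project=my_project
track_source_code=True
log_value_preview=True
log_value_preview_max_len=10000
log_histograms=False
value_reporting_strategy=SMART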

[core] configuration section parameter reference

You can add the following [core] parameters to the tracking context by passing them through the conf parameter of the dbnd_tracking function, as shown in the following sketch. This is not recommended for production usage.
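A minimal sketch with placeholder URL and token values:

from dbnd import dbnd_tracking

# Placeholder values; [core] parameters are passed through the conf argument.
with dbnd_tracking(
    conf={
        "core": {
            "databand_url": "https://yourdataband-service.databand.ai",
            "databand_access_token": "<your-access-token>",
        }
    }
):
    pass  # your tracked code runs here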

databand_url
Set the tracker URL to be used for creating links in the console logs.
databand_access_token
Set the personal access token used to connect to the Databand web server.
extra_default_headers
Specify extra headers to be used as defaults for databand_api_client.
tracker
Set the tracking stores to be used.
tracker_api
Set the tracker channels to be used by the 'api' store.
debug_webserver
Enable collecting the web server's logs for each API call on the local machine. This must be supported by the web server.
silence_tracking_mode
Enable silencing the console when in tracking mode.
tracker_raise_on_error
Enable raising an error when tracking data fails.
remove_failed_store
Enable removal of a tracking store if it fails.
max_tracking_store_retries
Set the maximum number of retries allowed for a single tracking store call if it fails.
client_session_timeout
Set the number of minutes after which the API client's session is re-created.
client_max_retry
Set the maximum number of retries on a failed connection for the api client.
client_retry_sleep
Set the amount of sleep time in between retries of the API client.
user_configs
Set the config used for creating tasks from the user code.
user_init
Runs in every dbnd process with the system configuration in place. This is called in DatabandContext after the SDK enters its initialization steps.
user_driver_init
Runs in a driver after configuration initialization. This is called from DatabandContext when the Python runtime is entering a new context.
user_code_on_fork
Runs in a subprocess, in parallel, Kubernetes, or external modes.
plugins
Specify which plug-ins to load on Databand context creation.
allow_vendored_package
Enable adding the dbnd/_vendor_package module to your system path.
fix_env_on_osx
Enable adding no_proxy=* to environment variables, fixing issues with multiprocessing on OSX.
environments
Set a list of enabled environments.
dbnd_user
Set which user to connect to the Databand web server. This parameter is deprecated.
dbnd_password
Set what password needs to be used to connect to the Databand web server. This parameter is deprecated.
tracker_url
Set the tracker URL to be used for creating links in console logs. This parameter is deprecated.