Python SDK configuration
The Python Databand SDK uses its own configuration system that lets you set and update your configuration in the way that best suits your needs.
The configuration system draws on the following sources:
- Environment variables
- Configuration files
- Code
- External configs, such as an Airflow connection
Environment variables
For example, you can override the databand_url parameter under the core section by setting a value for the DBND__CORE__DATABAND_URL environment variable:

```
export DBND__CORE__DATABAND_URL="https://yourdataband-service.databand.ai"
```

Similarly, you can override any other configuration parameter with an environment variable in the DBND__<SECTION>__<KEY> format.
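The DBND__<SECTION>__<KEY> naming convention can be sketched in plain Python. This is only an illustration of how a variable name maps to a section and key; parse_dbnd_env_var is a hypothetical helper, not part of the SDK:

```python
import os

def parse_dbnd_env_var(name: str) -> tuple:
    """Split a DBND__<SECTION>__<KEY> variable name into (section, key)."""
    prefix, section, key = name.split("__", 2)
    assert prefix == "DBND"
    return section.lower(), key.lower()

# Illustrative override, as if set with `export` in the shell:
os.environ["DBND__CORE__DATABAND_URL"] = "https://yourdataband-service.databand.ai"

overrides = {
    parse_dbnd_env_var(name): value
    for name, value in os.environ.items()
    if name.startswith("DBND__")
}
print(overrides[("core", "databand_url")])
```

Every DBND__-prefixed variable resolves to one (section, key) pair, so the environment layer can be merged over file-based configuration.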
Configuration files in Databand
Databand loads configuration information sequentially from the following configuration files:
| File loading priority | File location | File description |
|---|---|---|
| 1 | $DBND_LIB/databand-core.cfg | Provides the default core configuration of the system that cannot be changed. |
| 2 | $DBND_SYSTEM/databand-system.cfg | Provides middle layer configuration. Use this file to configure project infrastructure. |
| 3 | $DBND_HOME/project.cfg | Provides a project configuration. Use for configuring user-facing parts of the project. |
| 4 | $USER_HOME/.dbnd/databand.cfg | Provides system user configuration. |
You can also load configuration files from a custom location. Use the DBND__CONF__FILE environment variable to point to such a file.
Configuration values that are specified in later-loaded files override values that are specified in earlier-loaded files. For example, suppose that you specify config key A in $DBND_SYSTEM/databand-system.cfg. If the configuration of A is also specified in $DBND_HOME/project.cfg, then Databand uses the value from the file that is loaded later, in this case $DBND_HOME/project.cfg. For more information, see ConfigParser.read.
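The layering follows the standard ConfigParser.read semantics, which you can verify with plain Python (the file names here are stand-ins for the layered Databand files):

```python
import configparser
import os
import tempfile

# Two layered files: the later one overrides shared keys, like
# $DBND_SYSTEM/databand-system.cfg followed by $DBND_HOME/project.cfg.
base = tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False)
base.write("[core]\na = from-system\nb = only-in-system\n")
base.close()

project = tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False)
project.write("[core]\na = from-project\n")
project.close()

config = configparser.ConfigParser()
# read() processes files in order; later files win for duplicate keys.
config.read([base.name, project.name])

print(config["core"]["a"])  # from-project
print(config["core"]["b"])  # only-in-system

os.unlink(base.name)
os.unlink(project.name)
```

Keys that appear only in an earlier layer remain visible; only the overlapping keys are replaced.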
Environment variables in configuration files
You can use $DBND_HOME, $DBND_LIB, $DBND_SYSTEM, or any other environment variable in your configuration file, as shown in the following example:

```
[core]
databand_url="${YOUR_ENV_VARIABLE}"
```
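The SDK expands these references when it loads the file. Outside the SDK, you can emulate the expansion semantics with os.path.expandvars (a sketch of the behavior, not Databand's actual loader):

```python
import configparser
import os

os.environ["YOUR_ENV_VARIABLE"] = "https://yourdataband-service.databand.ai"

raw = """[core]
databand_url="${YOUR_ENV_VARIABLE}"
"""

# Substitute ${VAR} references before handing the text to ConfigParser.
expanded = os.path.expandvars(raw)

config = configparser.ConfigParser()
config.read_string(expanded)
# The ${...} reference is resolved; the surrounding quotes stay as written.
print(config["core"]["databand_url"])
```

Note that ConfigParser keeps the literal quotes from the file, so the stored value includes them unless you strip them.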
Changing the configuration for a specific section of the code
To change configuration within a limited scope, use the config context manager:

```python
from dbnd import config

with config({"section": {"key": "value"}}):
    pass
```

You can also load the values from a configuration file:

```python
from dbnd import config
from dbnd._core.configuration.config_readers import read_from_config_file

with config(read_from_config_file("/path/to/config.cfg")):
    pass
```

Python SDK advanced configuration
With the Python Databand SDK, you can also perform more advanced configuration:
- Passing list parameters in a .cfg file
- Passing a dictionary parameter in a .cfg file
- Using configuration files for different use cases, such as files with default variables overridden in production or test
- Using multiple extra configuration files
- Controlling the output type of dbnd_config
Passing list parameters in a .cfg file
To pass list parameters in a .cfg file, use the following syntax:
```
[some_section]
list_param = [1, 2, 3, "str1"]
```
Passing a dictionary parameter in a .cfg file
To pass dictionary parameters in a .cfg file, use the following syntax:
```
[some_section]
dict_param = {"key1": "value1", 255: 3}
```
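Assuming these values are parsed as Python literals (an assumption for illustration; the SDK's own parser may differ in details), you can check that the syntax is valid with ast.literal_eval:

```python
import ast
import configparser

config = configparser.ConfigParser()
config.read_string("""
[some_section]
list_param = [1, 2, 3, "str1"]
dict_param = {"key1": "value1", 255: 3}
""")

# ConfigParser returns raw strings; literal_eval turns them into
# Python lists and dicts without executing arbitrary code.
list_param = ast.literal_eval(config["some_section"]["list_param"])
dict_param = ast.literal_eval(config["some_section"]["dict_param"])
print(list_param)   # [1, 2, 3, 'str1']
print(dict_param)   # {'key1': 'value1', 255: 3}
```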
Using configuration files for different use cases
You can also use extra configuration files, for example files for different use cases or files where some default variables are overridden in production or test. To specify such a file, set an environment variable:

```
export DBND_CONFIG=<extra_file_path>
```
Using multiple extra configuration files
With the --conf option and the DBND__DATABAND__CONF environment variable, you can add multiple files as a comma-separated list.
Controlling the output type of dbnd_config
The dbnd_config object is a dict-like object that stores only the mapping from section and key to value.
To control the output type of dbnd_config.get("section", "key"), you can use getboolean, getint, or getfloat for values that can be parsed as those types.
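These getters mirror the standard library's configparser interface, which behaves as follows (a plain-Python illustration, not dbnd_config itself):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[some_section]
flag = true
count = 42
ratio = 0.5
""")

# get() always returns a string; the typed getters convert it.
print(config.get("some_section", "flag"))          # 'true'
print(config.getboolean("some_section", "flag"))   # True
print(config.getint("some_section", "count"))      # 42
print(config.getfloat("some_section", "ratio"))    # 0.5
```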
Changing the configuration to control the tracking store behavior when errors occur
The Python Databand SDK uses the tracking system to report the state of runs and tasks to the Databand web server. Errors can occur while important information is being reported, and these errors can leave the runs that you see in the Databand webapp in invalid states.
The tracking system uses several tracking stores, each of which reports information to a different location, for example:
- Web tracking store - reports to the Databand web server.
- Console tracking store - writes the events to the console.
To control the behavior of the tracking system when errors occur, use the
following configurations under the core section:
```
[core]
remove_failed_store=true
tracker_raise_on_error=true
```

- remove_failed_store - Removes a tracking store if multiple failures occur. Default value = false.
- max_tracking_store_retries - Defines the maximum number of retries allowed for a single tracking store call if it fails. Default value = 2.
- tracker_raise_on_error - Stops the run with an error if a critical tracking error occurs, such as failing to connect to the web server. Default value = true.
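The intended semantics of these three settings can be sketched as a retry loop. This is a hypothetical illustration of what the parameters mean, not the SDK's implementation; the report function, the stores list, and the constant values are all assumptions:

```python
MAX_TRACKING_STORE_RETRIES = 2   # max_tracking_store_retries
REMOVE_FAILED_STORE = True       # remove_failed_store
TRACKER_RAISE_ON_ERROR = False   # tracker_raise_on_error

def report(stores, event):
    """Send one event to every tracking store, honoring the error settings."""
    for store in list(stores):
        for attempt in range(MAX_TRACKING_STORE_RETRIES):
            try:
                store(event)
                break
            except Exception:
                if attempt + 1 == MAX_TRACKING_STORE_RETRIES:
                    if REMOVE_FAILED_STORE:
                        stores.remove(store)   # stop using the broken store
                    if TRACKER_RAISE_ON_ERROR:
                        raise

def broken_store(event):
    raise ConnectionError("web server unreachable")

seen = []
stores = [seen.append, broken_store]
report(stores, "run-started")
print(seen)            # ['run-started']
print(len(stores))     # 1 - the failing store was removed
```

With tracker_raise_on_error enabled instead, the final failure would propagate and stop the run with an error.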
Changing the configuration to control logging of parameters within decorated functions
The logging configuration options help you save computational resources and protect against reporting of sensitive data, such as full data previews.
Logging data processes and full data quality reports in Databand can be resource-intensive. However, explicitly turning off all calculations for log_value_size, log_value_schema, log_value_stats, log_value_preview, log_value_preview_max_len, and log_value_meta results in valuable metrics not being tracked at all. To help you balance logging performance against visibility needs, you can selectively calculate and log metadata through a zero-computational-cost approach.
With the value_reporting_strategy configuration, you can decide whether certain information is logged. value_reporting_strategy changes nothing in your code, but acts as a guard (or fuse) before the value calculation code runs:

```
[tracking]
value_reporting_strategy=SMART
```
The following options are available:
- ALL - No restrictions on logging. All the log_value_ options are on: log_value_size, log_value_schema, log_value_stats, log_value_preview, log_value_preview_max_len, log_value_meta.
- SMART - Restricts lazy evaluation types. For types like Spark, values are calculated only when they are needed. Even if log_value_preview is set to True, when the SMART strategy is on, Spark previews are not logged.
- NONE - No logging of anything expensive or potentially problematic. This option can be useful if some of your values constitute private and sensitive information that you don't want logged.
Most users can benefit from using the SMART option for logging.
The list of available [tracking] configuration parameters
You can add the following parameters to the [tracking] configuration:
- project - Sets the project to which the run is assigned. If you don't set this value, the default project is used; the tracking server selects the project with is_default == True.
- databand_external_url - Sets a tracker URL to be used for tracking from external systems.
- log_value_size - Calculates and logs the value's size. Enabling this parameter causes a full scan on non-indexable distributed memory objects.
- log_value_schema - Calculates and logs the value's schema.
- log_value_stats - Calculates and logs the value's stats. This parameter is expensive to calculate, so it can be better to use log_stats on the parameter level.
- log_value_preview - Calculates and logs the value's preview. This parameter can be expensive to calculate on Spark.
- log_value_preview_max_len - Sets the maximum size of the value's preview to be saved at the service. The maximum value of this parameter is 50000.
- log_value_meta - Calculates and logs the value's meta.
- log_histograms - Enables calculation and tracking of histograms. This parameter can be expensive.
- value_reporting_strategy - Sets the strategy used for reporting values. You have multiple strategy options, each with different limitations on potentially expensive calculations for value_meta. ALL removes all limitations. SMART limits lazy evaluation types. NONE, which is the default value, limits everything.
- track_source_code - Enables tracking of function, module, and file source code.
- auto_disable_slow_size - Enables automatically disabling slow previews for Spark DataFrames with text formats.
- flatten_operator_fields - Controls which of the operator's fields are flattened when tracked.
- capture_tracking_log - Enables log capturing for tracking tasks.
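Putting a few of these parameters together, a [tracking] section might look like the following sketch. The project name and the specific on/off choices are illustrative values only:

```
[tracking]
project=my-project
log_value_schema=true
log_value_preview=false
value_reporting_strategy=SMART
```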
[core] configuration section parameter reference
You can add the following parameters to the tracking context by passing configuration through the conf parameter of the dbnd_tracking function. This is not recommended for production usage.
- databand_url - Sets the tracker URL to be used for creating links in the console logs.
- databand_access_token - Sets the personal access token used to connect to the Databand web server.
- extra_default_headers - Specifies extra headers to be used as defaults for databand_api_client.
- tracker - Sets the tracking stores to be used.
- tracker_api - Sets the tracker channels to be used by the 'api' store.
- debug_webserver - Enables collecting the web server's logs for each API call on the local machine. This needs to be supported by the web server.
- silence_tracking_mode - Enables silencing the console when in tracking mode.
- tracker_raise_on_error - Enables raising an error when tracking data fails.
- remove_failed_store - Enables removal of a tracking store if it fails.
- max_tracking_store_retries - Sets the maximum number of retries allowed for a single tracking store call if it fails.
- client_session_timeout - Sets the number of minutes after which the API client's session is re-created.
- client_max_retry - Sets the maximum number of retries on a failed connection for the API client.
- client_retry_sleep - Sets the amount of sleep time between retries of the API client.
- user_configs - Sets the config used for creating tasks from the user code.
- user_init - Runs in every dbnd process with the system configuration in place. This is called in DatabandContext after the SDK enters its initialization steps.
- user_driver_init - Runs in a driver after configuration initialization. This is called from DatabandContext when the Python runtime enters a new context.
- user_code_on_fork - Runs in a subprocess, in parallel, Kubernetes, or external modes.
- plugins - Specifies which plug-ins to load on Databand context creation.
- allow_vendored_package - Enables adding the dbnd/_vendor_package module to your system path.
- fix_env_on_osx - Enables adding no_proxy=* to environment variables, fixing issues with multiprocessing on OSX.
- environments - Sets a list of enabled environments.
- dbnd_user - Sets which user connects to the Databand web server. This parameter is deprecated.
- dbnd_password - Sets the password used to connect to the Databand web server. This parameter is deprecated.
- tracker_url - Sets the tracker URL to be used for creating links in console logs. This parameter is deprecated.
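The introduction above mentions passing these values through the conf parameter of dbnd_tracking. Building the nested section/key dict needs no SDK, so the shape can be shown directly; the commented-out call is a sketch, and the token value is a placeholder:

```python
# Configuration to pass through dbnd_tracking(conf=...); the dict mirrors
# the [section] -> key structure of the .cfg files described above.
conf = {
    "core": {
        "databand_url": "https://yourdataband-service.databand.ai",
        "databand_access_token": "<your-access-token>",  # placeholder
    }
}

# With the SDK installed, it would be used along these lines (not verified here):
# from dbnd import dbnd_tracking
# with dbnd_tracking(conf=conf):
#     ...  # tracked code

print(conf["core"]["databand_url"])
```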