Distance Correlation: Criteria

The Criteria sub-dialog provides control parameters for normalization and other key options for customizing the distance correlation procedure.

Normalization
Select the option to apply one normalization type to the dataset. Normalization standardizes variable scales, but it can inadvertently distort highly nonlinear relationships in the data, which are crucial for accurately assessing distance correlation. It is therefore most appropriate when the relationships of interest are approximately linear.
  • Min-Max normalization scales each feature to a fixed range, usually [0, 1]. This transformation is useful when your dataset contains variables with different units or ranges: it allows a fair comparison among variables, ensuring that no single variable dominates the distance calculations because of its scale.
  • Z-Score normalization centers the data around the mean with a standard deviation of 1. This approach is most useful for approximately normally distributed data. It places each variable on a standardized scale, which eases interpretation when you compare different variables, as all variables are treated equally relative to their distribution. Note that the mean and standard deviation are themselves sensitive to outliers; for data with extreme values, consider Robust-Scaling.
  • Robust-Scaling uses the median and interquartile range (IQR) to scale the data, making it resilient to outliers. The influence of outliers is minimized, allowing for more reliable distance correlation calculations when extreme values are present. This technique is beneficial when the dataset contains anomalies that could skew results.
  • Log-Transform is designed to reduce skewness in the data, particularly for positively skewed distributions. This can help in stabilizing variance across the dataset. Log transformation compresses large values and expands smaller values, helping to normalize the distribution. This can improve the accuracy of distance correlation analysis, especially when you deal with highly skewed datasets.
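The four normalization options above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the software's internal implementation; the function names are hypothetical.

```python
import numpy as np

def min_max(x):
    # Scale to [0, 1]: (x - min) / (max - min)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    # Center at mean 0 with standard deviation 1
    return (x - x.mean()) / x.std()

def robust_scale(x):
    # Center at the median, scale by the interquartile range (IQR)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

def log_transform(x):
    # log1p compresses large positive values and reduces right skew
    return np.log1p(x)

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
# The extreme value 100 dominates min_max scaling,
# while robust_scale limits its influence via the median and IQR.
scaled_mm = min_max(x)
scaled_rs = robust_scale(x)
```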
Confidence Interval Percentage
Specify the confidence interval percentage to be used in statistical analysis.

The confidence interval percentage indicates the level of certainty associated with statistical estimates. A higher percentage reflects greater confidence in the interval's accuracy but results in a wider interval, while a lower percentage yields a narrower interval with less certainty.
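The confidence-level/width trade-off can be seen in a simple normal-approximation interval for a sample mean. This is a generic sketch (the dialog's internal computation is not specified here), using only the Python standard library:

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(sample, confidence=0.95):
    # Normal-approximation CI for the mean: mean +/- z * s / sqrt(n),
    # where z is the two-sided quantile for the chosen confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * stdev(sample) / math.sqrt(len(sample))
    m = mean(sample)
    return m - half, m + half

data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
lo90, hi90 = mean_ci(data, 0.90)
lo99, hi99 = mean_ci(data, 0.99)
# The 99% interval is wider than the 90% interval for the same data.
```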

Permutations of Distance Correlation

In distance correlation, permutation testing is used to assess the statistical significance of the observed distance correlation between two variables. The goal is to evaluate whether the observed correlation is significantly different from what could occur by chance. One of the datasets (or both) is permuted randomly, breaking any existing association between the variables. The distance correlation is then recalculated for each permutation, generating a distribution of distance correlation values under the null hypothesis of no association.

Select Set Custom Seed to enable permutation testing and to specify the number of iterations and the significance level. You can optionally set a seed for the random number generator used in permutation testing by entering a value in Seed. The seed must be zero or a positive integer. The default value is 2000000.

Set the Maximum Iterations. The value must be a positive integer; negative values are not allowed. The default value is 100.

By comparing the observed distance correlation with the distribution of distance correlations from permuted datasets, a p-value is computed. The p-value represents the proportion of permuted distance correlations that are greater than or equal to the observed distance correlation. A significance level is used as a threshold for determining statistical significance. If the p-value is below the significance level, the observed correlation is considered statistically significant. Set the value of Significance Level between 0 and 1. The default value is 0.05.
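The permutation procedure described above can be sketched in NumPy. This is an illustrative implementation of the standard distance correlation statistic (Székely's double-centered distance matrices) and a permutation p-value, not the software's own code; the function names and sample data are hypothetical.

```python
import numpy as np

def _centered_dist(x):
    # Pairwise distance matrix, double-centered so row, column,
    # and grand means are removed
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(x, y):
    # dCor = dCov / sqrt(dVar_x * dVar_y), each from centered matrices
    a, b = _centered_dist(x), _centered_dist(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def permutation_pvalue(x, y, n_perm=100, seed=2000000):
    # Permute y to break any association, then count how often the
    # permuted statistic is >= the observed one (null distribution)
    rng = np.random.default_rng(seed)
    observed = distance_correlation(x, y)
    count = sum(
        distance_correlation(x, rng.permutation(y)) >= observed
        for _ in range(n_perm)
    )
    # +1 correction keeps the p-value strictly positive
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = x**2 + 0.1 * rng.normal(size=60)   # nonlinear dependence
obs, p = permutation_pvalue(x, y)
```

A small p-value here (below the chosen significance level, e.g. 0.05) indicates that the observed distance correlation is unlikely to arise by chance.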