evaluate_data

caf.brain.ml.evaluate_data(data, output_path, categorical_features, numerical_features, target, classification_prediction, weight=None, allow_transformations=True, is_time_series=False)

Data analysis for structured tabular numerical data.

Where applicable, the following tests are performed:

Multicollinearity (VIF)
Heteroscedasticity (Breusch-Pagan & White Test)
Linearity (Correlation coefficient)
Normality (Shapiro-Wilk)
Autocorrelation (Durbin-Watson)
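
To make two of the listed statistics concrete, the following is a minimal pure-Python sketch of the variance inflation factor (restricted to the two-predictor case) and the Durbin-Watson statistic. It is not the library's implementation, which is not shown here; the helper names `pearson_r`, `vif_two_predictors` and `durbin_watson` and the sample values are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2). With only two predictors, the R^2 from
    regressing one on the other equals their squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares.
    Values near 2 suggest no first-order autocorrelation."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical sample values, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]  # almost a multiple of x1
alternating_residuals = [0.5, -0.5, 0.5, -0.5, 0.5]

print("VIF:", vif_two_predictors(x1, x2))
print("Durbin-Watson:", durbin_watson(alternating_residuals))
```

A large VIF indicates that the two predictors are nearly collinear, while a Durbin-Watson value well above 2 (as with the alternating residuals here) points to negative autocorrelation.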

If transformations are permitted (allow_transformations=True), log and scaling transformations are applied to numerical features, but only where they are needed to fix issues that the tests reveal.
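
As a rough illustration of what a log transformation followed by scaling does to a skewed numerical feature, here is a minimal sketch. The exact transformations and the conditions that trigger them are not specified above, so the use of `log1p`, the choice of z-score standardisation, and the helper names `log_transform` and `standard_scale` are all assumptions.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x). This sketch only accepts non-negative values."""
    if min(values) < 0:
        raise ValueError("log_transform sketch expects non-negative values")
    return [math.log1p(v) for v in values]


def standard_scale(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical right-skewed feature, for illustration only.
income = [20_000.0, 25_000.0, 30_000.0, 40_000.0, 250_000.0]
scaled = standard_scale(log_transform(income))
print(scaled)
```

The log step compresses the long right tail before scaling, so the single large value no longer dominates the spread of the scaled feature.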

Parameters:

data (DataFrame) – pandas DataFrame containing your data, in a structured or semi-structured tabular format. The data should not yet be scaled or encoded.
output_path (Path | str) – Path to output location.
target (str) – Column in your data that is the target variable (Y, dependent variable), what you want to predict.
weight (str | None) – Name of the column to use as weights. Optional.
categorical_features (list[str] | None) – List of column names (strings) that are categorical variables.
numerical_features (list[str] | None) – List of column names (strings) that are continuous variables.
classification_prediction (tuple[int, ...] | None) – Tuple of integers that correspond to values in the target column: the value(s) to predict in a classification problem.
allow_transformations (bool) – Whether to apply transformations.
is_time_series (bool) – If True, the data must be a time series, and time-series characteristics are taken into account during execution.

Returns:

train_transformed – training data post transformation.
test_transformed – test data post transformation.

Return type:

tuple[DataFrame, DataFrame]
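
A call might be assembled as in the sketch below, which follows the signature documented above. The column names, the output directory and the wrapper `run_evaluation` are hypothetical; the import is deferred into the wrapper so that the argument set can be inspected without the package installed.

```python
from pathlib import Path

# Hypothetical column names and output location, for illustration only.
eval_kwargs = dict(
    output_path=Path("evaluation_output"),
    categorical_features=["region"],
    numerical_features=["income", "age"],
    target="defaulted",
    classification_prediction=(1,),  # target value(s) to predict
    weight=None,
    allow_transformations=True,
    is_time_series=False,
)


def run_evaluation(data):
    """Call evaluate_data on an unscaled, unencoded pandas DataFrame."""
    # Deferred import: requires the caf.brain package to be installed.
    from caf.brain.ml import evaluate_data

    train_transformed, test_transformed = evaluate_data(data, **eval_kwargs)
    return train_transformed, test_transformed
```

The return value is unpacked into the two DataFrames in the order documented under Returns.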