evaluate_data

caf.brain.ml.evaluate_data(data, output_path, categorical_features, numerical_features, target, classification_prediction, weight=None, allow_transformations=True, is_time_series=False)

Data analysis for structured tabular numerical data.

Where applicable, the following tests are performed:
  • Multicollinearity (VIF)

  • Heteroscedasticity (Breusch-Pagan and White tests)

  • Linearity (Correlation coefficient)

  • Normality (Shapiro-Wilk)

  • Autocorrelation (Durbin-Watson)
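
As an illustration of what two of these diagnostics measure, the sketch below computes the Durbin-Watson statistic and a two-predictor VIF in plain Python. The helper names (durbin_watson, pearson_r, vif_two_predictors) are hypothetical and are not part of caf.brain; how evaluate_data computes these tests internally is not documented here, so treat this as a minimal sketch of the standard formulas rather than the library's implementation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when there are exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical residuals that flip sign at every step (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0]
dw = durbin_watson(alternating)

# Two hypothetical predictors with zero correlation.
vif = vif_two_predictors([-1.0, 0.0, 1.0], [1.0, -2.0, 1.0])
```

A Durbin-Watson value near 2 suggests no first-order autocorrelation, while values towards 0 or 4 suggest positive or negative autocorrelation respectively. A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate stronger multicollinearity.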

Data transformations (log and scaling) are applied to numerical features only when they are permitted (see allow_transformations) and the tests reveal issues that the transformations are needed to fix.
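
To make the two kinds of transformation concrete, here is a minimal sketch of a log transform followed by scaling, written in plain Python. The documentation does not say which log variant or scaling method evaluate_data uses, so log(1 + x) and standardization to zero mean and unit variance are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to each value; assumes all values are non-negative."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


# Hypothetical right-skewed numerical feature.
skewed = [1.0, 10.0, 100.0, 1000.0]

# The log transform compresses the long right tail; scaling then centres the result.
scaled = standardize(log_transform(skewed))
```

Applying the log first and scaling second keeps the scaled values centred on the transformed distribution rather than on the original skewed one.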

Parameters:
  • data (DataFrame) – pandas DataFrame containing your data, in a structured or semi-structured tabular format. The data should not yet be scaled or encoded.

  • output_path (Path | str) – Path to output location.

  • target (str) – Column in your data that is the target variable (Y, dependent variable), what you want to predict.

  • weight (str | None) – Optional name of the column to be used as weights.

  • categorical_features (list[str] | None) – List of column names (strings) that are categorical variables.

  • numerical_features (list[str] | None) – List of column names (strings) that are continuous variables.

  • classification_prediction (tuple[int, ...] | None) – Tuple of integers corresponding to values in the target column: the value(s) to predict in a classification problem.

  • allow_transformations (bool) – Whether the log and scaling transformations may be applied to numerical features.

  • is_time_series (bool) – Set to True only if the data is a time series. Time-series characteristics are then taken into consideration during function execution.

Returns:

  • train_transformed – training data post transformation.

  • test_transformed – test data post transformation.

Return type:

tuple[DataFrame, DataFrame]
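
A call might look like the following sketch. The argument names and the two-element return value come from the signature and return type above; the column names, the sample records, the output directory, and the run_evaluation wrapper are hypothetical and exist only for illustration. The imports are done inside the wrapper so the snippet can be read and loaded without pandas or caf.brain installed.

```python
# Hypothetical raw data: not yet scaled or encoded, as the data parameter requires.
records = {
    "region": ["north", "south", "north", "east"],
    "income": [32000.0, 54000.0, 41000.0, 78000.0],
    "age": [25, 41, 33, 52],
    "purchased": [0, 1, 0, 1],
}

# Keyword arguments matching the documented signature; the values are illustrative.
call_kwargs = dict(
    output_path="diagnostics_output",      # hypothetical output location
    categorical_features=["region"],
    numerical_features=["income", "age"],
    target="purchased",
    classification_prediction=(1,),        # tuple of target value(s) to predict
    weight=None,
    allow_transformations=True,
    is_time_series=False,
)


def run_evaluation(raw_records, kwargs):
    """Build a DataFrame and call evaluate_data, returning both transformed frames."""
    import pandas as pd
    from caf.brain.ml import evaluate_data  # import path taken from the signature above

    data = pd.DataFrame(raw_records)
    train_transformed, test_transformed = evaluate_data(data, **kwargs)
    return train_transformed, test_transformed
```

Calling run_evaluation(records, call_kwargs) would then yield the transformed training and test DataFrames as a tuple, in that order.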