transform_data

caf.brain.ml.transform_data(data, output_path, categorical_features, numerical_features, target, custom_index=None, weight=None, sample_size_encode=None, select_encode_values=None, encode_values_to_drop=None)

Encode and/or scale data, where applicable, for machine learning modelling.

Parameters:
  • data (DataFrame) – pandas DataFrame containing your data, in a structured or semi-structured tabular format.

  • output_path (Path | str) – Path to output location.

  • categorical_features (list[str] | None) – List of column names (strings) that are categorical variables.

  • numerical_features (list[str] | None) – List of column names (strings) that are continuous variables.

  • target (str) – Column in your data that is the target variable (Y, dependent variable), what you want to predict.

  • custom_index (list[str] | None) – Columns in your data to use as the index, e.g. year, geography.

  • weight (str | None) – Optional name of the column to use as the weight.

  • sample_size_encode (bool | None) – Optional bool. If True, the data will be split based on sample size, and the variables with the largest sample size will be used as the reference class.

  • select_encode_values (bool | None) – Optional bool. If True, the data is split based on custom values set by the user in encode_values_to_drop.

  • encode_values_to_drop (list[str] | None) – If select_encode_values is True, this must be a list of strings with the same length as categorical_features. The first item in the list corresponds to the first variable in categorical_features, the second item to the second variable, and so on.

Return type:

pandas DataFrame of the input data, encoded and scaled.
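
Example:

As a rough illustration of how the arguments fit together, the sketch below builds a set of keyword arguments and checks the documented rule that encode_values_to_drop lines up with categorical_features position by position. The column names, the output path, and the pair_reference_values helper are hypothetical and not part of caf.brain. The final call is left commented out because it assumes caf.brain is installed and that data is a DataFrame containing these columns.

```python
def pair_reference_values(categorical_features, encode_values_to_drop):
    """Hypothetical helper: map each categorical feature to the value to drop.

    Mirrors the documented rule that encode_values_to_drop must have the same
    length as categorical_features, matched by position.
    """
    if len(encode_values_to_drop) != len(categorical_features):
        raise ValueError(
            "encode_values_to_drop must have one entry per categorical feature"
        )
    return dict(zip(categorical_features, encode_values_to_drop))


# Hypothetical column names, for illustration only.
categorical_features = ["region", "vehicle_type"]
encode_values_to_drop = ["north", "car"]

# Position one pairs with the first feature, position two with the second.
reference = pair_reference_values(categorical_features, encode_values_to_drop)

kwargs = dict(
    output_path="outputs",              # assumed output location
    categorical_features=categorical_features,
    numerical_features=["distance_km"],
    target="trips",
    custom_index=["year"],
    select_encode_values=True,          # use the custom values below
    encode_values_to_drop=encode_values_to_drop,
)

# Assuming caf.brain is installed and `data` is a pandas DataFrame
# holding the columns named above:
# from caf.brain.ml import transform_data
# transformed = transform_data(data, **kwargs)
```

Checking the list lengths before calling the function is a design choice of this sketch rather than something the library is known to require; it simply surfaces a mismatch early.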