caf.brAIn Configuration Guide#

This document describes the configuration options for the full machine learning pipeline. A populated config file is always required, whether the tool is run from the command line or from your IDE.

Configuration File Structure#

The configuration file uses YAML format and consists of four main sections:

paths: Input/output data locations
data_classification: Dataset structure and column definitions
transforming_inputs: Data preprocessing options
modelling: Model selection and training parameters

1. Paths#

Defines the locations of input data and output directories.

paths:
    file_path:
    folder_path:
    output_path:
    validation_path:

Parameters#

Parameter	Type	Description
file_path	str or null	Path to a single CSV or tabular file for model training
folder_path	str or null	Path to folder containing multiple CSVs (same structure)
output_path	str	Required. Output directory for all results
validation_path	str or null	Path to validation data containing ground truth

Notes#

Either file_path or folder_path must be provided, not both.
All CSVs in folder_path must have identical structures, e.g. each CSV holds a different year of your data.
Pre-processed train/test data should be placed in output_path, from where it can be read in and used (e.g. train.csv).
Arguments that are not populated (null) must be removed from your config file.
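
As a sketch of the folder_path variant described in the notes above, the block below points the tool at a folder of per-year CSVs and removes the keys that are not used. The folder and file names are hypothetical placeholders, not paths the tool expects.

```yaml
paths:
    # Hypothetical folder of CSVs with identical columns, e.g. one file per year
    folder_path: "data\\input\\yearly"
    # Required: all results are written here
    output_path: "outputs\\model_results"
    # file_path and validation_path are not used, so they are removed rather than left as null
```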

2. Data Classification#

Defines the structure of your dataset and identifies column types.

data_classification:
    target_column: null
    custom_index:
      - null
    categorical_features:
      - null
    numerical_features:
      - null
    weight_column: null
    is_time_series: False

Parameters#

Parameter	Type	Description
target_column	str	Required. Column to predict (dependent variable, Y)
custom_index	list[str] or null	Context columns not used for training (e.g. geography, year)
categorical_features	list[str] or null	Variables with <15 unique values
numerical_features	list[str] or null	Continuous variables with ≥15 unique values
weight_column	str or null	Sample weights for weighted training
is_time_series	bool	Set `True` for temporal data

Notes#

Categorical features are automatically encoded (encoding type can be specified).
At least one of categorical_features or numerical_features must be provided; otherwise the model has no explanatory variables (X).
If is_time_series is True, include time columns in custom_index.
custom_index columns help interpretation but do not affect training.
Remove null arguments from your config file.
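
The notes above can be illustrated with a time series dataset that has numerical features only. This is a minimal sketch; the column names are illustrative and borrowed from the regression example in this guide, not names the tool requires.

```yaml
data_classification:
    target_column: "trip_count"
    custom_index:
      - "year"          # time column, listed here because is_time_series is True
    numerical_features:
      - "income"
      - "population_density"
    is_time_series: True
    # categorical_features and weight_column are not used, so they are removed rather than left as null
```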

3. Transforming Inputs#

Controls data preprocessing and train/test splitting.

transforming_inputs:
    column_name_to_drop_rows:
      - null
    value_in_row:
      - null
    classification_prediction:
      - null
    split_by_value: null
    split_size: null
    sample_size_encode: False
    select_encode_values: False
    encode_values_to_drop: null

Parameters#

Parameter	Type	Description
column_name_to_drop_rows	list[str] or null	Columns containing values to filter out
value_in_row	list[str/int/float] or null	Values to remove from corresponding columns (in column_name_to_drop_rows)
classification_prediction	list[int] or null	Classes to predict for classification
split_by_value	str or null	Value used as train/test split point
split_size	float or null	Train/test split ratio (default 0.2)
sample_size_encode	bool	Drop most frequent category during encoding
select_encode_values	bool	Manually specify encoding reference values
encode_values_to_drop	list[str] or null	Values to drop when encoding

Example: Filtering Rows#

column_name_to_drop_rows:
  - purpose
  - mode
value_in_row:
  - 1
  - "car"

This removes rows where purpose == 1 and mode == "car".

Notes#

Order matters: value_in_row must have the same length and order as column_name_to_drop_rows, so that each value is paired with its column.
Leave encoding options as False if unsure (standard encoding is applied as a default).
For time series, use split_by_value with a date column (unless time ordering does not matter in your modelling).
encode_values_to_drop must match the order of categorical_features: the first value is dropped for the first categorical feature, the second for the second, and so on.
If choosing a custom encoding option, ensure only one is provided, e.g. set sample_size_encode to True and remove select_encode_values and encode_values_to_drop from the config file.
Remove null arguments from your config file.
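
A minimal sketch of the single-option rule in the notes above, assuming sample_size_encode is the chosen custom encoding. Only keys documented in this section are used.

```yaml
transforming_inputs:
    # Drop the most frequent category during encoding
    sample_size_encode: True
    # select_encode_values and encode_values_to_drop are removed,
    # so that only one custom encoding option is set
```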

Example: select_encode_values#

select_encode_values: True
encode_values_to_drop:
  - 0
  - 1
categorical_features:
  - "mode"
  - "adults_in_household"

4. Modelling#

Configures model selection and training parameters.

modelling:
    model_choice:
        - null
    full_transformations: False
    cv: null
    skip_feature_selection: False
    intensive_feature_selection: False

Parameters#

Parameter	Type	Description
model_choice	list[Models]	Required. Models to train
full_transformations	bool	Apply automatic data transformations
cv	str or null	Cross-validation strategy
skip_feature_selection	bool	Skip feature selection entirely
intensive_feature_selection	bool	Apply stricter feature selection

Available Models#

Regression Models:

random_forest_regressor
extra_trees_regressor
gradient_boosting_regressor
adaboost_regressor
bagging_regressor
svr
knn
ridge
lasso
elasticnet
linear_regressor
decision_tree_regressor

Classification Models:

logit_regressor_l1
logit_regressor_l2
logit_regressor_elasticnet
multinomial
gradient_boosting_classifier
random_forest_classifier
extra_trees_classifier
decision_tree_classifier
svm_classifier

Cross-Validation Options#

stratifiedkfold
repeatedkfold
repeatedstratifiedkfold
timeseriessplit (for time series)

Notes#

kfold is the default cross-validation strategy if no cv option is provided.
Do not mix model types (e.g. one from classification and one from regression).
If skip_feature_selection and intensive_feature_selection are both False, standard feature selection is used.
Feature selection removes features that are not predictive. If all features are required, skipping feature selection is advised.
Remove null arguments from your config file.
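
As a sketch of the defaults described in the notes above, the block below omits cv (so kfold is used) and skips feature selection so that every feature is kept. The choice of models is illustrative only; both come from the regression list.

```yaml
modelling:
    model_choice:
        - ridge
        - lasso
    full_transformations: False
    # cv is removed rather than set to null, so the default kfold is used
    skip_feature_selection: True    # keep all features
    intensive_feature_selection: False
```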

Example#

modelling:
    model_choice:
        - random_forest_regressor
        - gradient_boosting_regressor
    full_transformations: True
    cv: timeseriessplit
    skip_feature_selection: False
    intensive_feature_selection: True

Regression Example Configuration#

paths:
    file_path: "data\\input\\training_data.csv"
    output_path: "outputs\\model_results"
    validation_path: "data\\validation\\validation_2025.csv"

data_classification:
    target_column: "trip_count"
    custom_index:
      - "year"
      - "geography"
    categorical_features:
      - "household_size"
      - "car_ownership"
    numerical_features:
      - "income"
      - "population_density"
    weight_column: "sample_weight"
    is_time_series: True

transforming_inputs:
    column_name_to_drop_rows:
      - "purpose"
    value_in_row:
      - 99
    split_by_value: "2020"
    sample_size_encode: False
    select_encode_values: False

modelling:
    model_choice:
        - random_forest_regressor
        - gradient_boosting_regressor
    full_transformations: True
    cv: timeseriessplit
    skip_feature_selection: False
    intensive_feature_selection: True

Classification Example Configuration#

paths:
    file_path: "data\\input\\training_data.csv"
    output_path: "outputs\\model_results"

data_classification:
    target_column: 'numcarvan'
    custom_index:
        - "householdid"
        - "soc"
        - "ns"
    categorical_features:
        - "hh_child"
    numerical_features:
        - "trips"
    is_time_series: false

transforming_inputs:
    classification_prediction:
        - 0
        - 1
    sample_size_encode: false
    select_encode_values: false

modelling:
    model_choice:
        - logit_regressor_elasticnet
        - gradient_boosting_classifier
    full_transformations: true
    skip_feature_selection: false
    intensive_feature_selection: true

Quick Start Checklist#

Set output_path (required)
Provide either file_path OR folder_path
Define target_column
Classify features as categorical_features or numerical_features
Add time columns to custom_index if is_time_series: True (geography columns also belong in custom_index)
Choose at least one model in model_choice
Use cv: timeseriessplit for time series data
Remove null arguments from your config file

Common Pitfalls#

Forgetting is_time_series: True for temporal data
Not including time columns in custom_index for time series
Mixing regression and classification models
Leaving target_column as null
Mismatched lengths between column_name_to_drop_rows and value_in_row
Leaving null values in the config file