caf.brAIn Configuration Guide#
This document describes the configuration options for the full machine learning pipeline. A populated config file is always required, whether the tool is run from the command line or from your IDE.
Configuration File Structure#
The configuration file uses YAML format and consists of four main sections:
- paths: Input/output data locations
- data_classification: Dataset structure and column definitions
- transforming_inputs: Data preprocessing options
- modelling: Model selection and training parameters
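As a quick sanity check, the four-section layout can be verified once the YAML has been parsed into a Python dict. The sketch below is illustrative only: `missing_sections` is a hypothetical helper, not part of caf.brAIn, and it assumes that all four sections are expected to be present and that the file has already been loaded with a YAML parser of your choice.

```python
# Hypothetical helper (not part of caf.brAIn): checks the top-level layout
# of a config that has already been parsed from YAML into a dict.
REQUIRED_SECTIONS = ("paths", "data_classification", "transforming_inputs", "modelling")


def missing_sections(config: dict) -> list[str]:
    """Return the names of the main sections absent from the config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


example = {"paths": {"output_path": "outputs"}, "modelling": {"model_choice": ["ridge"]}}
print(missing_sections(example))  # ['data_classification', 'transforming_inputs']
```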
1. Paths#
Defines the locations of input data and output directories.
paths:
  file_path:
  folder_path:
  output_path:
  validation_path:
Parameters#
| Parameter | Type | Description |
|---|---|---|
| file_path | str or null | Path to a single CSV or tabular file for model training |
| folder_path | str or null | Path to a folder containing multiple CSVs (same structure) |
| output_path | str or null | Required. Output directory for all results |
| validation_path | str or null | Path to validation data containing ground truth |
Notes#
- Either file_path or folder_path must be provided, not both.
- All CSVs in folder_path must have identical structures, e.g. each CSV represents a different year of your data.
- Pre-processed train/test data should be placed in output_path, from where it can be read in and used (e.g. train.csv).
- Arguments that are not populated (null) must be removed from your config file.
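The path rules above can be checked before a run. The following is a minimal sketch, assuming the `paths` section has already been parsed into a dict; `check_paths` is a hypothetical helper and not part of caf.brAIn.

```python
# Hypothetical pre-flight check (not part of caf.brAIn) for the `paths` section,
# applied to a config already parsed into a dict.
def check_paths(paths: dict) -> list[str]:
    """Return a list of problems found in the `paths` section."""
    problems = []
    has_file = paths.get("file_path") is not None
    has_folder = paths.get("folder_path") is not None
    # Exactly one of the two inputs must be given: not both, not neither.
    if has_file == has_folder:
        problems.append("provide exactly one of file_path or folder_path")
    if paths.get("output_path") is None:
        problems.append("output_path is required")
    # Unpopulated (null) arguments should be removed rather than left in place.
    problems += [f"remove null argument: {key}" for key, value in paths.items() if value is None]
    return problems


print(check_paths({"file_path": "data.csv", "output_path": "outputs"}))  # []
```

An empty list means none of the rules in this sketch were violated; it says nothing about whether the paths exist on disk.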
2. Data Classification#
Defines the structure of your dataset and identifies column types.
data_classification:
  target_column: null
  custom_index:
    - null
  categorical_features:
    - null
  numerical_features:
    - null
  weight_column: null
  is_time_series: False
Parameters#
| Parameter | Type | Description |
|---|---|---|
| target_column | str | Required. Column to predict (dependent variable, Y) |
| custom_index | list[str] or null | Context columns not used for training (e.g. geography, year) |
| categorical_features | list[str] or null | Variables with <15 unique values |
| numerical_features | list[str] or null | Continuous variables with ≥15 unique values |
| weight_column | str or null | Sample weights for weighted training |
| is_time_series | bool | Set to True for time series data |
Notes#
- Categorical features are automatically encoded (encoding type can be specified).
- At least one of categorical_features or numerical_features must be provided; otherwise the model has no explanatory variables (X).
- If is_time_series is True, include time columns in custom_index.
- custom_index columns help interpretation but do not affect training.
- Remove null arguments from your config file.
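The 15-unique-value rule of thumb from the parameter table can be applied to a column before filling in the config. This is an illustrative heuristic only: `suggest_feature_list` is a hypothetical helper, not part of caf.brAIn, and the sample values are made up.

```python
# Illustrative heuristic only (not part of caf.brAIn): applies the
# 15-unique-value rule of thumb to one column's values.
def suggest_feature_list(values, threshold: int = 15) -> str:
    """Suggest which config list a column belongs in, based on unique values."""
    unique_count = len(set(values))
    return "categorical_features" if unique_count < threshold else "numerical_features"


car_ownership = [0, 1, 2, 1, 0, 3]              # 4 distinct made-up values
income = [18_000 + 750 * i for i in range(40)]  # 40 distinct made-up values
print(suggest_feature_list(car_ownership))  # categorical_features
print(suggest_feature_list(income))         # numerical_features
```

Treat the result as a starting point: a column with few unique values may still be better modelled as numerical, and the final choice is yours.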
3. Transforming Inputs#
Controls data preprocessing and train/test splitting.
transforming_inputs:
  column_name_to_drop_rows:
    - null
  value_in_row:
    - null
  classification_prediction:
    - null
  split_by_value: null
  split_size: null
  sample_size_encode: False
  select_encode_values: False
  encode_values_to_drop: null
Parameters#
| Parameter | Type | Description |
|---|---|---|
| column_name_to_drop_rows | list[str] or null | Columns containing values to filter out |
| value_in_row | list[str/int/float] or null | Values to remove from corresponding columns (in column_name_to_drop_rows) |
| classification_prediction | list[int] or null | Classes to predict for classification |
| split_by_value | str or null | Value used as train/test split point |
| split_size | float or null | Train/test split ratio (default 0.2) |
| sample_size_encode | bool | Drop most frequent category during encoding |
| select_encode_values | bool | Manually specify encoding reference values |
| encode_values_to_drop | list[str] or null | Values to drop when encoding |
Example: Filtering Rows#
column_name_to_drop_rows:
  - purpose
  - mode
value_in_row:
  - 1
  - "car"
This removes rows where purpose == 1 and mode == "car".
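The two lists in the example are read as pairs by position, which is why their order and length have to agree. A minimal sketch of that pairing is shown below; `pair_filters` is a hypothetical helper, not part of caf.brAIn, and it only shows how entries line up, not how the tool applies the filter.

```python
# Hypothetical helper (not part of caf.brAIn): pairs each filter column with its
# value by position, which is why the order of the two lists matters.
def pair_filters(columns: list, values: list) -> list[tuple]:
    """Return (column, value) pairs, or raise if the lists differ in length."""
    if len(columns) != len(values):
        raise ValueError(
            f"column_name_to_drop_rows has {len(columns)} entries "
            f"but value_in_row has {len(values)}"
        )
    return list(zip(columns, values))


print(pair_filters(["purpose", "mode"], [1, "car"]))
# [('purpose', 1), ('mode', 'car')]
```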
Notes#
- Order matters: value_in_row must match column_name_to_drop_rows.
- Leave encoding options as False if unsure (standard encoding is applied by default).
- For time series, use split_by_value with a date column (unless time ordering does not matter in your modelling).
- encode_values_to_drop must match the order of categorical_features.
- If choosing a custom encoding option, ensure only one is provided, e.g. set sample_size_encode to True and remove select_encode_values and encode_values_to_drop from the config file.
- Remove null arguments from your config file.
Example: select_encode_values#
select_encode_values: True
encode_values_to_drop:
  - 0
  - 1
categorical_features:
  - "mode"
  - "adults_in_household"
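Because encode_values_to_drop follows the order of categorical_features, each value applies to the feature at the same position. The sketch below makes that link explicit. `drop_value_by_feature` is a hypothetical helper, not part of caf.brAIn, and it assumes one entry per categorical feature.

```python
# Hypothetical helper (not part of caf.brAIn): shows the positional link between
# categorical_features and encode_values_to_drop.
def drop_value_by_feature(categorical_features: list, encode_values_to_drop: list) -> dict:
    """Map each categorical feature to the value listed at the same position."""
    # Assumption: the two lists are expected to have the same length.
    if len(categorical_features) != len(encode_values_to_drop):
        raise ValueError("encode_values_to_drop needs one entry per categorical feature")
    return dict(zip(categorical_features, encode_values_to_drop))


print(drop_value_by_feature(["mode", "adults_in_household"], [0, 1]))
# {'mode': 0, 'adults_in_household': 1}
```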
4. Modelling#
Configures model selection and training parameters.
modelling:
  model_choice:
    - null
  full_transformations: False
  cv: null
  skip_feature_selection: False
  intensive_feature_selection: False
Parameters#
| Parameter | Type | Description |
|---|---|---|
| model_choice | list[Models] | Required. Models to train |
| full_transformations | bool | Apply automatic data transformations |
| cv | str or null | Cross-validation strategy |
| skip_feature_selection | bool | Skip feature selection entirely |
| intensive_feature_selection | bool | Apply stricter feature selection |
Available Models#
Regression Models:
- random_forest_regressor
- extra_trees_regressor
- gradient_boosting_regressor
- adaboost_regressor
- bagging_regressor
- svr
- knn
- ridge
- lasso
- elasticnet
- linear_regressor
- decision_tree_regresor
Classification Models:
- logit_regressor_l1
- logit_regressor_l2
- logit_regressor_elasticnet
- multinomial
- gradient_boosting_classifier
- random_forest_classifier
- extra_trees_classifier
- decision_tree_classifier
- svm_classifier
Cross-Validation Options#
- stratifiedkfold
- repeatedkfold
- repeatedstratifiedkfold
- timeseriessplit (for time series)
Notes#
- kfold is the default cross-validation used if no option is provided.
- Do not mix model types (e.g. one from classification and one from regression).
- If skip_feature_selection and intensive_feature_selection are both False, standard feature selection is used.
- Feature selection will remove features that are not predictive. If all features are required, skipping feature selection is advised.
- Remove null arguments from your config file.
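The rule against mixing model types can be checked mechanically. The following is a sketch under stated assumptions: `model_family` is a hypothetical helper, not part of caf.brAIn, and the two name sets are copied from the Available Models lists in this guide, spelled as they appear there.

```python
# Hypothetical check (not part of caf.brAIn). Model names are copied from the
# "Available Models" lists in this guide, spelled as listed there.
REGRESSION_MODELS = {
    "random_forest_regressor", "extra_trees_regressor", "gradient_boosting_regressor",
    "adaboost_regressor", "bagging_regressor", "svr", "knn", "ridge", "lasso",
    "elasticnet", "linear_regressor", "decision_tree_regresor",
}
CLASSIFICATION_MODELS = {
    "logit_regressor_l1", "logit_regressor_l2", "logit_regressor_elasticnet",
    "multinomial", "gradient_boosting_classifier", "random_forest_classifier",
    "extra_trees_classifier", "decision_tree_classifier", "svm_classifier",
}


def model_family(model_choice: list[str]) -> str:
    """Return 'regression' or 'classification'; raise on unknown, empty, or mixed input."""
    chosen = set(model_choice)
    unknown = chosen - REGRESSION_MODELS - CLASSIFICATION_MODELS
    if unknown:
        raise ValueError(f"unknown model names: {sorted(unknown)}")
    if chosen and chosen <= REGRESSION_MODELS:
        return "regression"
    if chosen and chosen <= CLASSIFICATION_MODELS:
        return "classification"
    raise ValueError("model_choice is empty or mixes regression and classification models")


print(model_family(["random_forest_regressor", "gradient_boosting_regressor"]))  # regression
```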
Example#
modelling:
  model_choice:
    - random_forest_regressor
    - gradient_boosting_regressor
  full_transformations: True
  cv: timeseriessplit
  skip_feature_selection: False
  intensive_feature_selection: True
Regression Example Configuration#
paths:
  file_path: "data\\input\\training_data.csv"
  output_path: "outputs\\model_results"
  validation_path: "data\\validation\\validation_2025.csv"
data_classification:
  target_column: "trip_count"
  custom_index:
    - "year"
    - "geography"
  categorical_features:
    - "household_size"
    - "car_ownership"
  numerical_features:
    - "income"
    - "population_density"
  weight_column: "sample_weight"
  is_time_series: True
transforming_inputs:
  column_name_to_drop_rows:
    - "purpose"
  value_in_row:
    - 99
  split_by_value: "2020"
  sample_size_encode: False
  select_encode_values: False
modelling:
  model_choice:
    - random_forest_regressor
    - gradient_boosting_regressor
  full_transformations: True
  cv: timeseriessplit
  skip_feature_selection: False
  intensive_feature_selection: True
Classification Example Configuration#
paths:
  file_path: "data\\input\\training_data.csv"
  output_path: "outputs\\model_results"
data_classification:
  target_column: 'numcarvan'
  custom_index:
    - "householdid"
    - "soc"
    - "ns"
  categorical_features:
    - "hh_child"
  numerical_features:
    - "trips"
  is_time_series: false
transforming_inputs:
  classification_prediction:
    - 0
    - 1
  sample_size_encode: false
  select_encode_values: false
modelling:
  model_choice:
    - logit_regressor_elasticnet
    - gradient_boosting_classifier
  full_transformations: true
  skip_feature_selection: false
  intensive_feature_selection: true
Quick Start Checklist#
- Set output_path (required)
- Provide either file_path OR folder_path
- Define target_column
- Classify features as categorical_features or numerical_features
- Add time columns to custom_index if is_time_series: True (geography columns are another example of what should be in custom_index)
- Choose at least one model in model_choice
- Use cv: timeseriessplit for time series data
- Remove null arguments from your config file
Common Pitfalls#
- Forgetting is_time_series: True for temporal data
- Not including time columns in custom_index for time series
- Mixing regression and classification models
- Leaving target_column as null
- Mismatched lengths between column_name_to_drop_rows and value_in_row
- Leaving null values in the config file
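Since leftover null values are a recurring pitfall, it can help to strip them programmatically before saving a config. The sketch below is one way to do that on a config already parsed into nested dicts and lists. `strip_nulls` is a hypothetical helper, not part of caf.brAIn, and it assumes that a list holding only null (such as `- null`) counts as unpopulated.

```python
# Hypothetical helper (not part of caf.brAIn): removes unpopulated entries from a
# config that has already been parsed from YAML into nested dicts and lists.
def strip_nulls(node):
    """Recursively drop None values and keys whose list ends up empty."""
    if isinstance(node, dict):
        cleaned = {key: strip_nulls(value) for key, value in node.items()}
        # False and 0 are legitimate settings, so only None and [] are dropped.
        return {key: value for key, value in cleaned.items() if value not in (None, [])}
    if isinstance(node, list):
        return [strip_nulls(item) for item in node if item is not None]
    return node


config = {
    "data_classification": {
        "target_column": "trip_count",
        "custom_index": [None],
        "weight_column": None,
        "is_time_series": False,
    }
}
print(strip_nulls(config))
# {'data_classification': {'target_column': 'trip_count', 'is_time_series': False}}
```

Section keys themselves are kept even if they end up empty, so the four-section layout of the file is preserved.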