Training and Testing your Model

01 Oct 2025

A data-driven model is a powerful tool for detecting patterns and predicting future trends and behaviours. That is why modelling and simulation are increasingly used across industries to anticipate outcomes and changes, and to decide what the next move should be. Where you need high-accuracy insights, or where multiple variables affect a product, building a model is essential.

Modelling and simulation are powerful tools that can transform raw data into actionable insight. They enable more rapid and accurate diagnosis, reveal behavioural trends, and even support reliable predictions of future outcomes. These techniques are applied across industries wherever understanding and anticipating change creates value.

So, what does it mean to build a model? Building a data-driven model means teaching a system to recognise patterns in historic data so that it can predict future trends and behaviours. To train a model, the data is typically split into two sets, commonly in a ratio of around 70:30. The larger share, 70% in this case, is the training data, which is used to teach the model the relationship between inputs and outputs. For example, in medical imaging, a model could be trained so that when a particular feature or colour is detected in an MRI scan, a corresponding diagnosis is suggested.
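
As a minimal sketch of that split, the Python snippet below shuffles a small made-up dataset and divides it 70:30. The function name `train_test_split`, the seed, and the data itself are assumptions chosen for illustration, not part of any particular toolkit.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows and split them into (training, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset: ten (input, output) pairs
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```

Shuffling before cutting matters when the data is stored in some order (by date or by batch, for instance), because otherwise the test set may not resemble the training set.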

Before training the model, the data needs to undergo verification and preparation. This includes cleansing to remove errors or inconsistencies because, remember, garbage in, garbage out. It can also entail normalisation (e.g., converting measurements to the same units or scale) so that you compare like with like. You also need to ensure the data is in the right format, as otherwise the model could produce misleading results. These steps are essential so that the model is "taught" properly, that is, from information which is correct. Even the most advanced models can produce misleading results if trained on flawed information.
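
To make the preparation step concrete, here is a small sketch that drops unusable records and converts every length to metres. The records, the unit table `TO_METRES`, and the function `prepare` are all hypothetical; real data preparation is usually more involved.

```python
import math

# Hypothetical raw records: (length value, unit)
raw = [(1.80, "m"), (175, "cm"), (None, "cm"),
       (float("nan"), "m"), (162, "cm"), (3, "ft")]

# Assumption for this sketch: only these two units are trusted
TO_METRES = {"m": 1.0, "cm": 0.01}

def prepare(records):
    """Drop unusable records and express every length in metres."""
    prepared = []
    for value, unit in records:
        if value is None or math.isnan(value):
            continue  # cleansing: missing measurement
        if unit not in TO_METRES:
            continue  # cleansing: unit we cannot convert reliably
        prepared.append(round(value * TO_METRES[unit], 4))
    return prepared

lengths_m = prepare(raw)
print(lengths_m)  # [1.8, 1.75, 1.62]
```

Without the conversion, a model would treat 175 and 1.80 as wildly different lengths even though they describe similar measurements.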

Once the training stage is complete, the testing stage begins, using the remaining 30% of your dataset. The purpose is to evaluate how well the model performs on data it has not seen before, which is why that 30% is kept out of training. The model's predicted outputs are compared against the actual values, and a good model should give predictions that are very close to them.
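
The sketch below illustrates the idea with the simplest possible model, a straight line fitted by ordinary least squares. The numbers are invented for the example, and `fit_line` is a helper written here for illustration; any other model could take its place.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: 7 training rows and 3 held-out test rows
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1), (6, 12.9), (7, 15.0)]
test = [(8, 17.1), (9, 18.9), (10, 21.2)]

# Train on the training rows only
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# Test: compare predictions against actual values the model never saw
errors = []
for x, actual in test:
    predicted = slope * x + intercept
    errors.append(actual - predicted)
    print(f"x={x}: predicted={predicted:.2f}, actual={actual:.2f}")
```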

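Popular measures of performance are R² and RMSE. R², or the coefficient of determination, is the share of the variation in the actual values that the model explains. It usually lies between 0 and 1, with values close to 1 indicating that predictions align with actual outcomes. On test data it can even fall below 0 when the model predicts worse than simply using the mean of the actual values. RMSE (Root Mean Square Error) is the square root of the average squared difference between predicted and actual values. It is expressed in the same units as the output, with values closer to 0 indicating higher accuracy.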
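
Both measures are short enough to write out by hand. The sketch below uses the standard definitions, with made-up actual and predicted values; the function names are chosen for this example. The last line shows that R² computed this way can drop below 0 when predictions are worse than just guessing the mean.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot

# Hypothetical test-set values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

print(round(r_squared(actual, predicted), 3))  # 0.975
print(round(rmse(actual, predicted), 3))       # 0.354

# Predictions that run in the wrong direction give a negative R²
print(r_squared(actual, [9.0, 7.0, 5.0, 3.0]))  # -3.0
```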

With categorical variables, predictions are inherently limited to the categories present in the training dataset, so you cannot reliably predict outcomes for categories outside the design space of the dataset. In other words, the model can only work with the categories it has already seen during training.
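
A deliberately simple sketch shows why. The "model" below just stores the average output per category, and the materials and values are hypothetical. Asked about a category it never saw, it has nothing to go on.

```python
from collections import defaultdict

def fit_category_means(rows):
    """Learn the mean output for each category seen in training."""
    grouped = defaultdict(list)
    for category, value in rows:
        grouped[category].append(value)
    return {c: sum(v) / len(v) for c, v in grouped.items()}

def predict(model, category):
    """Return the learned mean, refusing categories absent from training."""
    if category not in model:
        raise KeyError(f"category {category!r} was not in the training data")
    return model[category]

# Hypothetical training rows: (material, measured value)
training = [("steel", 200.0), ("steel", 210.0),
            ("aluminium", 70.0), ("aluminium", 68.0)]
model = fit_category_means(training)

print(predict(model, "steel"))  # 205.0
try:
    predict(model, "titanium")
except KeyError as err:
    print("Cannot predict:", err)
```

More sophisticated models may return a number for an unseen category rather than raise an error, but that number is not backed by any training evidence.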

On the other hand, with continuous variables, for example when you map different colours to a numeric scale (as in the MRI example above), the model can produce predictions for values outside the original training range. This allows extrapolation beyond the observed data, but such predictions should be treated with caution because nothing in the training data confirms that the learned relationship still holds there.
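
The risk is easy to demonstrate with a toy case. In the sketch below, the true relationship is assumed to be y = x², but the model is a straight line fitted only on x from 0 to 5. The setup is invented for illustration: the line is tolerable inside the training range and badly wrong outside it.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def true_value(x):
    """Assumed ground truth for this toy example."""
    return x ** 2

train_x = [0, 1, 2, 3, 4, 5]  # training range: 0 to 5
slope, intercept = fit_line(train_x, [true_value(x) for x in train_x])

def predict(x):
    return slope * x + intercept

inside_error = abs(predict(3) - true_value(3))     # interpolation
outside_error = abs(predict(10) - true_value(10))  # extrapolation
print(f"error at x=3:  {inside_error:.1f}")
print(f"error at x=10: {outside_error:.1f}")
```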

Model training is essentially about identifying patterns and trends in data. Patterns make sense only if you understand why they occur, and most of the time this requires domain knowledge and insight. Once you have identified historical behaviour, you can extend it into the future with predictive techniques ranging from statistical tools to advanced machine learning models, the latter being especially useful for detecting subtle, non-linear relationships in large datasets. That said, predicting future outcomes is always risky, and any prediction should be accompanied by a confidence margin.
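
One rough way to attach such a margin, sketched below, is to look at how far off the model was on the test set and assume future errors will behave similarly. This is a simplification: it assumes the errors are roughly normally distributed and centred on zero, and it says nothing about extrapolation. The values and the function name are hypothetical.

```python
import statistics

def margin_from_residuals(actual, predicted, z=1.96):
    """Approximate 95% margin: z times the standard deviation of test errors.

    Assumes roughly normal, unbiased errors, which may not hold in practice.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    return z * statistics.stdev(residuals)

# Hypothetical test-set results
actual = [10.2, 11.8, 14.1, 15.9, 18.3]
predicted = [10.0, 12.0, 14.0, 16.0, 18.0]

margin = margin_from_residuals(actual, predicted)
new_prediction = 20.0  # hypothetical new model output
print(f"{new_prediction:.1f} ± {margin:.2f}")
```

Reporting "20.0, give or take about 0.4" tells the decision-maker far more than the bare number does.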

Ultimately, a model that predicts trends, results and behaviours is there to help others make informed, smarter decisions. The value lies not only in recognising the trend but also in knowing how to respond and what your next steps should be.
