
03 Sept 2025
A model is only as good as the data behind it. In this article, six different subsets of the same dataset, all generated from the same underlying function, were each fitted with quadratic, cubic and quartic curves. At first glance the curves looked like a great fit, but looking closely at the predictions, the story changes.
The takeaway is that even a perfect fit can be misleading - without enough well-distributed data points, you risk choosing the wrong model, making the wrong predictions and, ultimately, the wrong decisions.
In modelling work the importance of a robust dataset cannot be overstated. As the saying goes, garbage in, garbage out: if the data is flawed, the predictions will be too, and so will any decisions that follow.
A robust dataset requires not only a sufficient number of datapoints but also an even distribution across the full range of interest. The example in the image above illustrates a toy case of y = x^2 * noise. The noise is included deliberately, as real-world data collection is never perfect and inevitably contains a degree of variability.
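For reference, here is a minimal sketch of how such a toy dataset could be generated in Python. The range, sample size and noise scale below are assumptions, as the article doesn't state them:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Assumed range and sample size - not stated in the article.
x = np.linspace(0, 10, 50)

# Multiplicative noise, matching y = x^2 * noise: each point is scaled
# by a factor drawn from a normal distribution centred on 1.
y = x**2 * rng.normal(loc=1.0, scale=0.05, size=x.size)
```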
As you can see in the image, the same dataset was processed in six different ways, as follows (one possible way of selecting these subsets is sketched in code after the list):
top left = few datapoints, lower range
top right = few datapoints, upper range
middle left = sparse datapoints, spread across whole range
middle right = few datapoints, mid range
bottom left = few datapoints, concentrated at the ends of the range
bottom right = full dataset
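Continuing from the snippet above, the six subsets could be carved out of the full (x, y) arrays with boolean masks. The exact cut-offs below are illustrative assumptions, since the article doesn't give them:

```python
# Boolean masks for the six panels; the article's exact selections
# are not given, so these cut-offs are illustrative assumptions.
subsets = {
    "few datapoints, lower range":       x < 3,
    "few datapoints, upper range":       x > 7,
    "sparse, spread across whole range": np.arange(x.size) % 8 == 0,
    "few datapoints, mid range":         (x > 4) & (x < 6),
    "few datapoints, at the ends":       (x < 1.5) | (x > 8.5),
    "full dataset":                      np.ones(x.size, dtype=bool),
}

# Example: pull out the top-left panel's data.
x_sub = x[subsets["few datapoints, lower range"]]
y_sub = y[subsets["few datapoints, lower range"]]
```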
Next, quadratic, cubic and quartic curves were fitted to each of the datasets, and the results are strikingly different. Take the top-left panel, where the errors (measured as the sum of the absolute differences between actual and predicted y) were almost negligible, so all three curves pass almost exactly through the datapoints. The real story, however, appears at the upper end of x: the predicted values diverge sharply, not just from each other but also from the predictions made using the complete dataset (bottom right). It is a clear reminder that even when a model fits the data at hand perfectly, it may still tell a very different story outside the range you've measured.
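As a sketch of the fitting step, assuming numpy's polynomial fitting (the article doesn't name a library), with the error measured as the sum of absolute residuals described above:

```python
import numpy as np

def fit_and_score(x_sub, y_sub, x_full):
    """Fit quadratic, cubic and quartic polynomials to a subset.

    Returns, per degree, the in-sample error (sum of absolute
    differences between actual and predicted y) and the predictions
    over the full x range, where extrapolation can diverge sharply.
    """
    results = {}
    for degree in (2, 3, 4):
        poly = np.poly1d(np.polyfit(x_sub, y_sub, deg=degree))
        results[degree] = {
            "error": np.abs(y_sub - poly(x_sub)).sum(),
            "predictions": poly(x_full),
        }
    return results
```

Running fit_and_score on the top-left subset would show near-zero errors for all three degrees, while the predictions arrays fan out at the upper end of x - exactly the divergence visible in the image.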
When the full dataset is used, all three fitted curves end up looking almost identical. With the reduced datasets, however, the picture changes dramatically: the quality of the predictions ranges from a completely off-base shape with wildly inaccurate predictions (top right) to curves that look reasonable but still predict poorly, drifting noticeably from the results of the full dataset (middle left).
What's striking is that many of these curves appear to fit the data well. The problem is that without enough points spread across the full range, there is no reliable way to tell which fit is actually correct. That uncertainty can easily lead to the wrong model and the wrong predictions or conclusions. That is why a solid dataset isn't just useful - it's essential.
