
09 Apr 2025
When working with large datasets, visualisation is key to gaining insights. This is also important when presenting to other business stakeholders. It makes all the difference when data is presented in a clear and meaningful way.
Complex datasets do not need to be overwhelming. Today we explore the concept of clustering - how to identify patterns in unstructured and unlabelled data.
Data collection is an important part of data analysis and visualisation. If you collect your data in the wrong way, it can lead to a misleading interpretation. The data then needs to be sorted, so that data of the same type is placed together. In the LEGO image below, you see the LEGO pieces of different colours grouped together. Not only do you need to sort the data, you also need to arrange it, for instance by converting it (formatting, unit conversion, etc.) so that it is uniform and can be compared and used. The data is then presented in a way that is understandable to analytical and non-analytical internal (and possibly external) stakeholders.
Remember, in an organisation, some functions that are not analytical in nature may still need to read and understand the data for strategy and/or decision-making. Once data is presented visually, it needs to be analysed and explained so that one can reach an outcome.
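To make the sort-and-arrange step a little more concrete, here is a minimal sketch in Python. The records, field names and conversion table are all hypothetical and invented purely for illustration. It simply shows one way of making formatting and units uniform before grouping the same type of data together.

```python
# Hypothetical raw records: the same kind of measurement, but with
# inconsistent formatting and mixed units (illustrative data only).
raw_records = [
    {"colour": " Red", "length": 2.0, "unit": "cm"},
    {"colour": "blue", "length": 16.0, "unit": "mm"},
    {"colour": "RED ", "length": 32.0, "unit": "mm"},
]

# Conversion factors to a single common unit (millimetres).
TO_MM = {"mm": 1.0, "cm": 10.0}


def normalise(record):
    """Return a copy of the record with uniform formatting and units."""
    return {
        "colour": record["colour"].strip().lower(),
        "length_mm": record["length"] * TO_MM[record["unit"]],
    }


# Arrange: make every record comparable.
clean = [normalise(r) for r in raw_records]

# Sort: place the same type of data together, here by colour.
grouped = {}
for rec in clean:
    grouped.setdefault(rec["colour"], []).append(rec["length_mm"])

print(grouped)
```

Once the colours and units are uniform, " Red" and "RED " end up in the same group, which would not happen if the raw values were grouped as they were collected.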

There are many visualisation methods one can use, from bar charts to scatter plots, but let’s take a more scientific approach to visualisation, one which is mostly used when you have large unstructured datasets to work with: clustering visualisation.
Supervised clustering is when you group your data around clusters which you have defined yourself. You define these clusters by understanding and finding a pattern or common element in unstructured and unlabelled datasets.
So let’s take an easy example and imagine that we have the following data:
Cat, Dog, Kitchen, Donkey, Sofa, Wardrobe, Door, Table, Horse, Bird, Chair.
We immediately see that there are two clusters: furniture (let’s call it Cluster A) and animals (Cluster B). So, all of the above data will be grouped around the clusters we established, either Cluster A for furniture or Cluster B for animals.
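The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two lookup sets and the `assign_cluster` helper are hypothetical names, and placing "Kitchen" and "Door" under furniture is our own reading of the statement that every item falls into one of the two clusters.

```python
# The data from the example above.
items = ["Cat", "Dog", "Kitchen", "Donkey", "Sofa", "Wardrobe",
         "Door", "Table", "Horse", "Bird", "Chair"]

# Hand-defined clusters. Which item belongs where is our own choice
# (assumption: "Kitchen" and "Door" are counted as furniture).
FURNITURE = {"Kitchen", "Sofa", "Wardrobe", "Door", "Table", "Chair"}
ANIMALS = {"Cat", "Dog", "Donkey", "Horse", "Bird"}


def assign_cluster(item):
    """Return the name of the cluster an item has been defined to belong to."""
    if item in FURNITURE:
        return "Cluster A"
    if item in ANIMALS:
        return "Cluster B"
    return "Unassigned"


# Group every item around the clusters we established.
clusters = {}
for item in items:
    clusters.setdefault(assign_cluster(item), []).append(item)

for name in sorted(clusters):
    print(name, clusters[name])
```

Note that nothing here is learned from the data. The grouping works only because we defined the clusters ourselves after spotting the common element.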
