
09 Apr 2025
When working with large datasets, visualisation is key to gaining insights. This is also important when presenting to other business stakeholders. It makes all the difference when data is presented in a clear and meaningful way.
Complex datasets do not need to be overwhelming. Today we explore the concept of clustering - how to identify patterns in unstructured and unlabelled data.
Data collection is an important part of data analysis and visualisation. If you collect your data in the wrong way, it can lead to a misleading interpretation. The data then needs to be sorted, so that data of the same type is placed together. In the LEGO image below, you see the LEGO pieces of different colours grouped together. You need not only to sort the data but also to arrange it, for instance by converting it (formatting, unit conversion and so on) so that it is uniform and can be compared and used. The data is then presented in a way which is understandable to analytical and non-analytical internal (and possibly external) stakeholders.
Remember, in an organisation, some functions which are not analytical in nature still need to be able to read and understand the data for strategy and/or decision-making. Once data is presented visually, it needs to be analysed and explained, and only then can one reach a conclusion.

There are many visualisation methods one can use, from bar charts to scatter plots, but let’s take a more scientific approach to visualisation, one which is mostly used when you have large unstructured datasets to work with: clustering visualisation.
Clustering is an unsupervised technique in which you group your data around reference points. These reference points are defined by understanding and finding a pattern, or common element, in unstructured and unlabelled datasets.
So let’s take an easy example and imagine that we have the following data:
Cat, Dog, Kitchen, Donkey, Sofa, Wardrobe, Door, Table, Horse, Bird, Chair.
We immediately see that there are two clusters: furniture (let’s call it Cluster A) and animals (Cluster B). So, all of the above data will be grouped around the clusters we established, either Cluster A for furniture or Cluster B for animals.
If we add a candleholder to this data, it will sit somewhere outside the range of these clusters, because it is neither furniture nor an animal. However, it will be closer to Cluster A (furniture) than to Cluster B (animals).
If we then add a glass bowl to the dataset, this too, like the candleholder, would be outside of the range. Having said that, the glass bowl might be slightly further away from Cluster A than the candleholder would be. This is because a domestic fish could live in a glass bowl, so there is a link to Cluster B, albeit not a strong one.
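The furniture and animals example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two scores per item (how furniture-like and how animal-like it is) are invented here to mirror the reasoning above and are not measured data, and the helper names `centroid`, `distances` and `nearest_cluster` are hypothetical. Each cluster is summarised by the average of its members, and a new item is compared with both averages using straight-line distance.

```python
from math import dist
from statistics import mean

# Hypothetical scores (furniture-likeness, animal-likeness), invented for illustration only.
furniture = {
    "Kitchen": (0.8, 0.0), "Sofa": (0.9, 0.0), "Wardrobe": (1.0, 0.0),
    "Door": (0.8, 0.0), "Table": (1.0, 0.0), "Chair": (0.9, 0.0),
}
animals = {
    "Cat": (0.0, 1.0), "Dog": (0.0, 1.0), "Donkey": (0.0, 0.9),
    "Horse": (0.0, 0.9), "Bird": (0.0, 0.8),
}

def centroid(points):
    """Return the average position of a group of 2-D points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))

centroids = {
    "A (furniture)": centroid(furniture.values()),
    "B (animals)": centroid(animals.values()),
}

def distances(point):
    """Straight-line distance from a point to each cluster centre."""
    return {name: dist(point, centre) for name, centre in centroids.items()}

def nearest_cluster(point):
    """Name of the cluster whose centre is closest to the point."""
    d = distances(point)
    return min(d, key=d.get)

# New items, again with made-up scores: the glass bowl gets a small
# animal-likeness score because a pet fish could live in it.
new_items = {"Candleholder": (0.5, 0.0), "Glass bowl": (0.45, 0.2)}

for name, point in new_items.items():
    print(name, "->", nearest_cluster(point), distances(point))
```

With these assumed scores, both new items land nearer Cluster A than Cluster B, and the glass bowl sits a little further from Cluster A than the candleholder does. Different scores would of course give different distances; the point is only to show how "closer to" can be turned into a number.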

Understanding a pattern is crucial when assigning data points to clusters. In machine learning, for instance, clustering is about grouping raw data. There are many applications for clustering across many industries, from fraud detection in banking and anomaly detection in healthcare, to market segmentation and many more.
Let’s take another simple example of how to make sense of unstructured data. Imagine we are to analyse a list of phone numbers and receive this data:
729698782172106674475298921152340587
What we know for sure is that phone numbers will start either with 7 (in the case of a mobile number) or with 5 (in the case of a landline). Mobile numbers and landline numbers are of different lengths, but each type will always contain the same number of digits. Furthermore, the area code (normally found in the first few digits of a phone number) has to be common to all the numbers, since this data comes from the same geographical area. Looking at the data, the digit that recurs directly after a leading 7 or 5 is 2, so we treat 2 as the area code. From there we can work out a consistent length for the mobile numbers (10 digits) and for the landline numbers (8 digits). With this knowledge we can split and structure the dataset as below - the first two numbers in the list are mobile numbers and the last two are landline numbers.
7296987821
7210667447
52989211
52340587
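The splitting rules above can be written down as a short Python sketch. It assumes exactly the rules stated in this example (a leading 7 means a 10-digit mobile number, a leading 5 means an 8-digit landline number) and nothing about real numbering plans; the function name `split_phone_numbers` and the `RULES` table are hypothetical names chosen for illustration.

```python
# Assumed rules from the example: leading digit -> (type, number of digits).
RULES = {"7": ("mobile", 10), "5": ("landline", 8)}

def split_phone_numbers(digits: str) -> dict[str, list[str]]:
    """Cut one long digit string into phone numbers, grouped by type."""
    result = {"mobile": [], "landline": []}
    pos = 0
    while pos < len(digits):
        first = digits[pos]
        if first not in RULES:
            raise ValueError(f"unexpected leading digit {first!r} at position {pos}")
        kind, length = RULES[first]
        number = digits[pos:pos + length]
        if len(number) < length:
            raise ValueError(f"incomplete {kind} number at position {pos}")
        result[kind].append(number)
        pos += length  # jump to the start of the next number
    return result

raw = "729698782172106674475298921152340587"
groups = split_phone_numbers(raw)
for kind, numbers in groups.items():
    print(kind, numbers)
```

Reading the string from left to right works here because the first digit of each number tells us how many digits to take. If the rules were less clear-cut, for example if the lengths were unknown, a simple pass like this would not be enough.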
The larger and more complex the data, the more important it is to visualise it. If you have lots of data to present for interpretation, visualising it is the way to make sense of it. Visualisation is the key to letting your data help you and to making your data count.