TipsMake

Data Clusters in Machine Learning

  • A cluster is a collection of similar data.
  • Clustering is a type of unsupervised learning.
  • The correlation coefficient describes the strength and direction of the linear relationship between two variables.

Data clusters

Clusters are groups of data points formed on the basis of similarity.

Data points grouped together in a chart can often be categorized into clusters.

In the diagram below, we can distinguish three different clusters:

Data Clusters in Machine Learning Picture 1

Identify data clusters

Clusters can contain a wealth of valuable information, but they come in many different shapes, so how can we recognize them?

The two main methods are:

  • Use visualization
  • Use clustering algorithms

Clustering

Clustering is a type of unsupervised learning.

Clustering aims to:

  • Collect similar data into the same group.
  • Place dissimilar data in different groups.

Clustering methods

  • Density method
  • Hierarchical method
  • Partitioning method
  • Grid-based method

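Density methods treat regions where points are packed closely together as clusters and use lower-density regions to separate them. Points in the same dense region are considered more similar to each other than to points in sparser regions. These methods are generally accurate, can find clusters of arbitrary shape, and can merge neighboring dense regions into a single cluster. Two popular algorithms are DBSCAN and OPTICS.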
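
The density idea can be sketched in a few lines of plain Python. This is a simplified DBSCAN-style routine written for illustration, not a reference implementation: the function name `dbscan`, the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense point), and the sample points are all assumptions chosen for this example.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep expanding
        cluster += 1
    return labels

# Hypothetical sample data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2), (5.2, 5.1),
    (9.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

With these values the two dense groups receive their own cluster ids and the isolated point is labeled as noise. Results depend heavily on `eps` and `min_pts`, so both usually need tuning for real data.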

 

Hierarchical methods organize clusters in a tree-like structure. New clusters are formed from previously formed ones, either by merging smaller clusters or by splitting larger ones. Two popular algorithms are CURE and BIRCH.
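
To show how new clusters are built from earlier ones, here is a minimal bottom-up (agglomerative) sketch using single linkage. It is neither CURE nor BIRCH, which add sampling and summarization steps to scale to large data. The function name `agglomerative` and the sample points are assumptions made for this example.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []  # record of merges, i.e. the tree built from the bottom up

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical sample data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)  # lists of point indices, one list per cluster
print(merges)    # each entry is the pair of clusters that was merged
```

The `merges` list is the tree in compact form: reading it from first to last replays how single points were combined into larger and larger clusters.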

Grid-based methods divide the data space into a finite number of cells that form a grid structure, then cluster the cells rather than the individual points. Two common algorithms are CLIQUE and STING.
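
The grid idea can be illustrated by binning points into square cells and keeping the cells that hold enough points. This sketch only covers that first step and is not CLIQUE or STING. The function names, `cell_size`, `min_count`, and the sample points are assumptions made for this example.

```python
from collections import Counter
from math import floor

def grid_cells(points, cell_size):
    """Count how many points fall into each square cell of the grid."""
    return Counter(
        (floor(x / cell_size), floor(y / cell_size)) for x, y in points
    )

def dense_cells(points, cell_size, min_count):
    """Return the set of cells that hold at least min_count points."""
    counts = grid_cells(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_count}

# Hypothetical sample data: two crowded cells and one stray point.
points = [
    (0.2, 0.3), (0.7, 0.9), (0.5, 0.1),
    (3.1, 3.4), (3.8, 3.9), (3.3, 3.6),
    (7.5, 0.5),
]
dense = dense_cells(points, cell_size=1.0, min_count=3)
print(dense)
```

A fuller grid-based method would go on to join adjacent dense cells into clusters. Because the work is done per cell, the cost depends mostly on the number of cells rather than the number of points.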

Partitioning methods divide the objects into k clusters, where k is chosen in advance and each partition forms one cluster. A common algorithm is CLARANS.
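
As an illustration of partitioning into k clusters, the sketch below uses k-means (Lloyd's algorithm), a widely known partitioning method. It is not CLARANS, which instead searches for representative data points (medoids) using randomized search. The function name `kmeans`, the starting centroids, and the sample points are assumptions made for this example.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the partition of its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its partition.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break  # no centroid moved, so the partitions are stable
        centroids = updated
    return centroids, groups

# Hypothetical sample data with k = 2 and hand-picked starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, groups = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(groups)
```

The outcome of k-means depends on the starting centroids, which is why practical implementations usually try several random starts.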

Correlation coefficient

The correlation coefficient (r) describes the strength and direction of the linear relationship between the x and y variables on a scatter plot.
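
For reference, the standard (Pearson) definition of r for n paired observations is given here. The article does not state a formula, so the symbols are assumptions: x_i and y_i are the paired values, and x̄ and ȳ are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how x and y vary together, and the denominator rescales it by the spread of each variable, which is what keeps r within a fixed range.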

The value of r is always between -1 and +1:

r        Slope               Relationship
-1.00    Perfect downhill    Negative linear relationship
-0.70    Strong downhill     Negative linear relationship
-0.50    Moderate downhill   Negative linear relationship
-0.30    Weak downhill       Negative linear relationship
 0.00    None                No linear relationship
+0.30    Weak uphill         Positive linear relationship
+0.50    Moderate uphill     Positive linear relationship
+0.70    Strong uphill       Positive linear relationship
+1.00    Perfect uphill      Positive linear relationship
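
The values on this scale can be reproduced with a short calculation. The sketch below computes the Pearson correlation coefficient from its usual definition. The function name `pearson_r` and the sample data are assumptions made for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Co-variation of x and y around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
uphill = pearson_r(xs, [2, 4, 6, 8, 10])    # y rises in a straight line
downhill = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls in a straight line
no_linear = pearson_r(xs, [3, 1, 2, 1, 3])  # symmetric pattern, no linear trend
print(uphill, downhill, no_linear)
```

The three sample series correspond to the two ends and the middle of the scale. Note that r near 0 only rules out a linear relationship; the variables may still be related in a non-linear way.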

Perfect uphill, r = +1.00:

Data Clusters in Machine Learning Picture 2

Perfect downhill, r = -1.00:

Data Clusters in Machine Learning Picture 3

Moderate-to-strong uphill, r = +0.61:

Data Clusters in Machine Learning Picture 4

No linear relationship, r = 0:

Data Clusters in Machine Learning Picture 5

Micah Soto

Updated 10 March 2026