
Data Clusters in Machine Learning

  1. A cluster is a collection of similar data.
  2. Clustering is a type of unsupervised learning.
  3. The correlation coefficient describes the strength and direction of the linear relationship between two variables.

Data clusters

Clusters are groups of data points that are similar to one another.

Data points grouped together in a chart can often be categorized into clusters.

In the diagram below, we can distinguish three different clusters:

Data Clusters in Machine Learning Picture 1

Identify data clusters

Clusters can contain a wealth of valuable information, but they come in many different shapes, so how can we recognize them?

The two main methods are:

  1. Visualization
  2. Clustering algorithms

Clustering

Clustering is a type of unsupervised learning.

Clustering aims to:

  1. Group similar data together.
  2. Place dissimilar data in different groups.

Clustering methods

  1. Density method
  2. Hierarchical method
  3. Partitioning method
  4. Grid-based method

Density methods treat points in a high-density region as similar to one another and as different from points in lower-density regions. These methods generally give good accuracy and can merge clusters. Two popular algorithms are DBSCAN and OPTICS.
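
The density idea can be illustrated with a minimal DBSCAN-style sketch in plain Python. The function name dbscan, the parameters eps (neighborhood radius) and min_pts (minimum neighbors for a dense region), and the sample points are all assumptions chosen for illustration; production code would normally rely on a tested library rather than this simplified version.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise; may become a border point
            continue
        cluster += 1                 # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)   # j is also a core point, so keep expanding
    return labels

# Hypothetical sample data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

In this sketch the isolated point is labeled -1 (noise) because its neighborhood never reaches min_pts, while each dense group receives its own cluster number.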

 

Hierarchical clustering methods build clusters in a tree-like structure, where each new cluster is formed from previously formed clusters. Two popular algorithms are CURE and BIRCH.
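
To show how new clusters are built from earlier ones, here is a minimal sketch of bottom-up (agglomerative) clustering with single linkage. This is not CURE or BIRCH, which are considerably more elaborate; it is a simpler technique swapped in for illustration. The function name single_linkage, its parameters, and the sample points are assumptions.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]   # start with one cluster per point
    merges = []                                    # merge history, i.e. the tree

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(gap(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)                       # closest pair of clusters
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]    # new cluster built from two old ones
        del clusters[b]
    return [sorted(c) for c in clusters], merges

# Hypothetical sample data, referred to by index in the result.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = single_linkage(points, n_clusters=3)
print(clusters)
print(merges)
```

The merges list records which clusters were joined at each step, which is the information a dendrogram would display.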

Grid-based methods divide the data space into a finite number of cells that form a grid structure, and then cluster on the cells rather than on individual points. Two common algorithms are CLIQUE and STING.
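
The core step of a grid-based method, assigning points to cells and keeping the dense ones, might be sketched as follows. This only illustrates the binning idea and is not an implementation of CLIQUE or STING; the function name dense_cells, the parameters cell_size and min_count, and the sample points are assumptions.

```python
from collections import Counter
import math

def dense_cells(points, cell_size, min_count):
    """Map each 2-D point to a grid cell; return cells holding >= min_count points."""
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}

# Hypothetical sample data on a grid of 1.0 x 1.0 cells.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9),
          (3.1, 3.4), (3.7, 3.9),
          (7.5, 0.2)]
cells = dense_cells(points, cell_size=1.0, min_count=2)
print(cells)
```

A fuller method would go on to join adjacent dense cells into clusters; that step is omitted here to keep the sketch short.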

Partitioning methods divide objects into k clusters, with each partition forming a cluster. A common algorithm is CLARANS.
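
As a rough illustration of partitioning into k clusters, the sketch below uses k-means (Lloyd's algorithm), a widely known partitioning method, in place of CLARANS. The function name k_means, the choice to pass starting centers explicitly, and the sample points are assumptions made to keep the example deterministic.

```python
import math

def k_means(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the partition of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its partition.
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:   # no center moved, so the partitions are stable
            break
        centers = new_centers
    return centers, groups

# Hypothetical sample data with k = 2 and hand-picked starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, groups = k_means(points, centers=[points[0], points[3]])
print(centers)
print(groups)
```

Here k is simply the number of starting centers supplied. In practice the result depends on how those centers are chosen, so implementations usually try several initializations.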

Correlation coefficient

The correlation coefficient (r) describes the strength and direction of the linear relationship between the x and y variables on a scatter plot.
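
For readers who want to compute r directly, the following is a minimal sketch of the Pearson correlation coefficient using only the standard library. The function name pearson_r and the sample data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample data: one perfectly rising and one perfectly falling line.
x = [1, 2, 3, 4, 5]
uphill = pearson_r(x, [2, 4, 6, 8, 10])
downhill = pearson_r(x, [10, 8, 6, 4, 2])
print(uphill, downhill)
```

Points that lie exactly on a rising line give r = +1, and points that lie exactly on a falling line give r = -1.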

The value of r is always between -1 and +1:

r        Slope              Relationship
-1.00    Perfect downhill   Negative linear relationship
-0.70    Strong downhill    Negative linear relationship
-0.50    Moderate downhill  Negative linear relationship
-0.30    Weak downhill      Negative linear relationship
 0.00    (none)             No linear relationship
+0.30    Weak uphill        Positive linear relationship
+0.50    Moderate uphill    Positive linear relationship
+0.70    Strong uphill      Positive linear relationship
+1.00    Perfect uphill     Positive linear relationship

Perfect uphill, r = +1.00:

Data Clusters in Machine Learning Picture 2

Perfect downhill, r = -1.00:

Data Clusters in Machine Learning Picture 3

Strong uphill, r = +0.61:

Data Clusters in Machine Learning Picture 4

There is no linear relationship:

Data Clusters in Machine Learning Picture 5

Micah Soto
Updated 10 March 2026