Data Clusters in Machine Learning

Table of Contents

Data clusters
Identify data clusters
Clustering
Clustering methods
Correlation coefficient

A cluster is a collection of similar data.
Clustering is a type of unsupervised learning.
The correlation coefficient describes the strength of the relationship.

Data clusters

Clusters are sets of data based on similarity.

Data points grouped together in a chart can often be categorized into clusters.

In the diagram below, we can distinguish three different clusters:

images 1 of Data Clusters in Machine Learning

Identify data clusters

Clusters can contain a wealth of valuable information, but they come in many different shapes, so how can we recognize them?

The two main methods are:

Use visualization
Using clustering algorithms

Clustering

Clustering is a type of unsupervised learning.

Clustering aims to:

Collect similar data into groups.
Collect dissimilar data into other groups.

Clustering methods

Density method
Hierarchical method
Partitioning method
Grid-based method

Density methods consider points in high-density regions to have more similarities and differences compared to points in lower-density regions. Density methods have good accuracy. They also have the ability to merge clusters. Two popular algorithms are DBSCAN and OPTICS.

Hierarchical clustering methods create clusters in a tree-like structure. New clusters are formed by utilizing previously formed clusters. Two popular algorithms are CURE and BIRCH.

Grid-based methods format data into a finite number of cells forming a grid structure. Two common algorithms are CLIQUE and STING.

Partitioning methods divide objects into k clusters, with each partition forming a cluster. A common algorithm is CLARANS.

Correlation coefficient

The correlation coefficient (r) describes the strength and direction of the linear relationship between the x/y variables on the scatter plot .

The value of r is always between -1 and +1:

-1.00	Completely downhill	Inverse linear relationship.
-0.70	Steep slope	Inverse linear relationship.
-0.50	A moderate slope	Inverse linear relationship.
-0.30	A gentle slope	Inverse linear relationship.
0		There is no linear relationship.
+0.30	Slightly uphill	The relationship is linearly positive.
+0.50	Moderate incline	The relationship is linearly positive.
+0.70	Steep uphill	The relationship is linearly positive.
+1.00	Uphill	The relationship is linearly positive.