Data Clusters in Machine Learning

A cluster is a collection of similar data points. Clustering is a type of unsupervised learning. The correlation coefficient describes the strength and direction of the linear relationship between two variables.

Data clusters

Clusters are groups of data points that are similar to one another.

Data points grouped together in a chart can often be categorized into clusters.

In the diagram below, we can distinguish three different clusters:

Picture 1 of Data Clusters in Machine Learning

Identify data clusters

Clusters can contain a wealth of valuable information, but they come in many different shapes, so how can we recognize them?

The two main methods are:

  1. Using visualization
  2. Using clustering algorithms

Clustering

Clustering is a type of unsupervised learning.

Clustering aims to:

  1. Put similar data points into the same group.
  2. Keep dissimilar data points in separate groups.

Clustering methods

  1. Density method
  2. Hierarchical method
  3. Partitioning method
  4. Grid-based method

Density methods treat points in a high-density region as similar to one another and different from points in lower-density regions. They can find clusters of arbitrary shape, can merge neighboring dense regions into a single cluster, and generally give good accuracy. Two popular algorithms are DBSCAN and OPTICS.
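
As a rough illustration of the density idea, the sketch below is a simplified DBSCAN-style procedure in plain Python. The function name dbscan, the parameters eps and min_pts, and the sample points are assumptions made for this example and are not taken from any particular library. A point with at least min_pts neighbors within distance eps is treated as dense, and clusters grow outward from such points.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points (including i itself) within eps of point i.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is dense too, so keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical sample data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labeled -1 (noise) because its neighborhood never reaches min_pts. This sketch compares every pair of points, so it is only suitable for small data sets.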

Hierarchical clustering methods arrange clusters in a tree-like structure. New clusters are formed by merging or splitting previously formed clusters. Two popular algorithms are CURE and BIRCH.
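
To make the tree-building idea concrete, here is a minimal sketch of bottom-up (agglomerative) clustering with single linkage in plain Python. It is not CURE or BIRCH, which are considerably more elaborate. The function name single_linkage, the n_clusters parameter, and the sample points are assumptions chosen for illustration.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []                                    # the tree, recorded bottom up

    def gap(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest gap.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data, referred to by index in the result.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_linkage(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2, 3], [4]]
```

Each entry in merges records which two earlier clusters were combined, which is the information a dendrogram is drawn from.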

Grid-based methods divide the data space into a finite number of cells that form a grid structure, then cluster the cells rather than the individual points. Two common algorithms are CLIQUE and STING.
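
The following sketch shows the basic grid idea in plain Python: count the points in each cell, keep the cells that are dense enough, and join neighboring dense cells into clusters. It is a simplified illustration, not an implementation of CLIQUE or STING. The names dense_cells and grid_clusters, the parameters cell_size and min_count, and the sample points are all assumptions.

```python
from collections import Counter


def dense_cells(points, cell_size, min_count):
    """Return the grid cells that hold at least min_count points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}


def grid_clusters(points, cell_size, min_count):
    """Group dense cells that touch (including diagonally) into clusters."""
    dense = set(dense_cells(points, cell_size, min_count))
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        group, stack = [], [start]
        while stack:
            cx, cy = stack.pop()
            group.append((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(sorted(group))
    return clusters


# Hypothetical sample data on a grid of 1 x 1 cells.
points = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1), (1.2, 0.3), (1.5, 0.8),
          (7.1, 7.2), (7.5, 7.9), (4.0, 4.0)]
cells = grid_clusters(points, cell_size=1.0, min_count=2)
print(cells)  # [[(0, 0), (1, 0)], [(7, 7)]]
```

The work after the counting step depends on the number of cells, not the number of points, which is the main attraction of grid-based methods on large data sets.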

Partitioning methods divide the objects into k partitions, where k is chosen in advance and each partition forms one cluster. A common algorithm is CLARANS.
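
As an illustration of partitioning into k clusters, the sketch below uses k-means, a widely used partitioning method that is simpler than CLARANS. It is written in plain Python, and the function name k_means, the choice to seed the centroids with the first k points, and the sample data are assumptions made to keep the example short and repeatable.

```python
from math import dist


def k_means(points, k, iterations=100):
    """Split points into k groups around k centroids."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, groups


# Hypothetical sample data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, groups = k_means(points, k=2)
print(groups)
# [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```

In practice the starting centroids are usually picked at random or with a smarter seeding scheme, and the result can depend on that choice.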

Correlation coefficient

The correlation coefficient (r) describes the strength and direction of the linear relationship between the x and y variables on a scatter plot.
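
For readers who want to compute r themselves, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name pearson_r and the sample lists are assumptions made for illustration. The function assumes both lists have the same length and that neither is constant, since a constant list would cause a division by zero.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    co = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
r_up = pearson_r(xs, [1, 3, 5, 7, 9])     # y rises steadily with x
r_down = pearson_r(xs, [9, 7, 5, 3, 1])   # y falls steadily with x
r_curve = pearson_r(xs, [4, 1, 0, 1, 4])  # y = x squared, a curved relationship
print(round(r_up, 2), round(r_down, 2), round(r_curve, 2))  # 1.0 -1.0 0.0
```

The last case shows that r measures only linear relationships: y depends entirely on x, yet r is 0 because the relationship is curved rather than straight.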

The value of r is always between -1 and +1:

  r       Slope               Relationship
  -1.00   Perfect downhill    Inverse (negative) linear relationship
  -0.70   Steep downhill      Inverse (negative) linear relationship
  -0.50   Moderate downhill   Inverse (negative) linear relationship
  -0.30   Gentle downhill     Inverse (negative) linear relationship
   0.00   None                No linear relationship
  +0.30   Gentle uphill       Positive linear relationship
  +0.50   Moderate uphill     Positive linear relationship
  +0.70   Steep uphill        Positive linear relationship
  +1.00   Perfect uphill      Positive linear relationship

A perfect uphill slope, r = +1.00:

Picture 2 of Data Clusters in Machine Learning

A perfect downhill slope, r = -1.00:

Picture 3 of Data Clusters in Machine Learning

A strong uphill slope, r = +0.61:

Picture 4 of Data Clusters in Machine Learning

There is no linear relationship:

Picture 5 of Data Clusters in Machine Learning
