A Deep Dive into Density-Based Clustering!

This article explains the DBSCAN algorithm and walks through a project that uses it!

Please go through my article on Clustering before reading this one; it covers the basics of clustering and will make this article much easier to follow!

Introduction to Density-based clustering (DBSCAN) algorithm!

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

It groups points according to how densely they are packed within a given radius. It is one of the most commonly used clustering algorithms, and it works best on spatial data or on data that contains noise.

It works with 2 parameters:

1. Radius: This is the radius of the neighborhood around a point. If the neighborhood contains enough points, it is called a dense area. This is set through the eps parameter of the DBSCAN algorithm. Here “eps” means epsilon.

2. Minimum number of neighbors: This is the minimum number of points that must be present within the radius to define a cluster. This is set through the min_samples parameter of the DBSCAN algorithm. Here “min_samples” means the minimum number of samples required for a dense area.

DBSCAN can form arbitrarily shaped clusters, which centroid-based algorithms such as K-Means cannot.
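To make the two parameters concrete, here is a minimal sketch using scikit-learn's DBSCAN on a synthetic two-moons dataset. The dataset and the values eps=0.2 and min_samples=5 are my own assumptions for illustration; they are not taken from the weather-station project.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a simple example of non-convex clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the density threshold.
# Both values are illustrative guesses for this toy data.
model = DBSCAN(eps=0.2, min_samples=5)
labels = model.fit_predict(X)

# Label -1 marks noise; every other label is a cluster id.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```

Both parameters depend on the scale of the data, so values that suit this toy dataset should not be expected to suit another one.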

Types of points in DBSCAN

Internally, DBSCAN assigns every point to one of 3 types:

1. Core Point: A point is a core point if it has at least the minimum number of points, counting itself, within the given radius. Equivalently, a point is a core point if it has at least (n-1) other points within the radius, where “n” is the minimum number of points required to consider that region a dense area.

For example, if 7 points are required for a dense area, and a point has 6 or more other points within its radius, then that point is a core point.

2. Boundary Points: A point is a boundary point if it has fewer than the minimum number of points within its radius, but it lies within the radius of a core point.

For example, if 7 points are required for a dense area, and a point has fewer than 6 other points within its radius but one of them is a core point, then that point is a boundary point of that core point's cluster.

3. Outliers: These are the points that are neither core points nor boundary points, so they are not included in any of the clusters.
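The three point types can be read off a fitted scikit-learn DBSCAN model. The sketch below uses a tiny hand-made dataset (hypothetical, chosen so that each type appears once) and relies on the core_sample_indices_ and labels_ attributes. Note that scikit-learn counts the point itself towards min_samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a tight group of five points, one point on the
# edge of that group, and one point far away from everything.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [0.32, 0.05],   # close to two points of the group only
    [5.0, 5.0],     # isolated
])

model = DBSCAN(eps=0.25, min_samples=5).fit(X)

# Core points are listed explicitly by the fitted model.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[model.core_sample_indices_] = True

# Outliers (noise) get the label -1.
noise_mask = model.labels_ == -1

# Boundary points belong to a cluster but are not core points.
border_mask = ~core_mask & ~noise_mask

kinds = np.where(core_mask, "core", np.where(border_mask, "boundary", "outlier"))
for point, kind in zip(X, kinds):
    print(point, kind)
```

Here the sixth point has only three points within its radius (itself included), which is fewer than min_samples, but two of them are core points, so it joins their cluster as a boundary point.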

Cluster formation in DBSCAN

The process is simple. Small clusters are iteratively combined into bigger ones until there are no more clusters left to combine.

Two clusters are combined if a core point of one lies within the radius of a core point of the other. This process continues until there is no core point left to be clustered. The boundary points that fall within the radius of a core point belong to the same cluster as that core point, so they are carried along into the bigger cluster.

This algorithm separates outliers so that they can be easily identified, a feature that many other clustering algorithms, such as K-Means, lack.

Because of the above-explained cluster combination process, arbitrarily shaped clusters are formed.
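The combination process can be sketched from scratch in a few lines. This is a simplified illustration written for this article, not how any particular library implements DBSCAN: instead of merging small clusters pairwise, it grows one cluster at a time outward from a core point, which gives the same grouping of core points. The function name dbscan_sketch and the toy data are hypothetical.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Label each row of X with a cluster id, or -1 for an outlier."""
    n = len(X)
    # Pairwise Euclidean distances; acceptable for small inputs only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A point counts itself as a neighbor (its distance to itself is 0).
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Start a new cluster from this unassigned core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    # Only core points keep expanding the cluster;
                    # boundary points join it but do not spread it.
                    if is_core[q]:
                        frontier.append(q)
        cluster_id += 1
    return labels

# Two small dense groups and one far-away point.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [50, 50]], dtype=float)
labels = dbscan_sketch(X, eps=1.5, min_samples=3)
print(labels)
```

Because a cluster keeps spreading wherever one core point is within the radius of another, its final shape follows the dense region itself rather than a fixed form such as a circle around a centroid.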

Why choose Density-Based Clustering?

It has advantages over other algorithms in some conditions.

Other clustering algorithms like K-Means, hierarchical clustering, fuzzy clustering, etc. can be used to cluster data without supervision. But when it comes to arbitrarily shaped clusters, or a cluster within a cluster, the algorithms mentioned above may fail or give poor results.

Another scenario in which density-based clustering performs well is finding regions of high density and separating them from regions of low density, which other clustering algorithms are unlikely to do well.

Also, it lets us easily identify outliers, which act as noise in the data, and then remove them if required. The K-Means algorithm, on the other hand, assigns every point to a cluster, even if the point is an outlier (does not belong to any of the clusters).
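The difference in how outliers are treated can be seen on a small synthetic example. The data, the parameter values, and the position of the outlier below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)

# Two compact groups of points plus a single far-away outlier (last row).
group_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outlier = np.array([[20.0, -20.0]])
X = np.vstack([group_a, group_b, outlier])

# K-Means has no notion of noise: every point gets one of the k labels.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN labels points that are in no dense area as -1.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print("K-Means label of the outlier:", kmeans_labels[-1])
print("DBSCAN label of the outlier:", dbscan_labels[-1])
```

Once the outliers carry the label -1, removing them is a one-line filter such as X[dbscan_labels != -1].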

Advantages of DBSCAN

  • It can form arbitrarily shaped clusters.
  • It is robust to outliers, i.e. it identifies them instead of forcing them into a cluster.
  • It does not require the number of clusters to be specified; the number of clusters follows from the parameters (epsilon & minimum samples) specified.
  • It can handle the scenario of a cluster within a cluster.
  • It can separate regions of high density from regions of low density.

Example point plot on Map

The image below shows the example plot of points, which can further be clustered using DBSCAN.

The project of clustering weather-station data using DBSCAN!

Importing the required libraries for the project!

At line number “18”, SimpleImputer is initialized; it forms the basis for filling the missing values in the columns of the dataset.

At line number “21”, ColumnTransformer is initialized; it is used to fill the missing values in each column named in the list “a”. The parameter “remainder” is set to “passthrough”, which means that any column of the data frame that is not named in the list “a” is left untouched.

At line number “29”, the data is transformed by calling the transformer constructed above.
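The imputation step described above can be sketched as follows. This is not the project's original code, and its line numbers do not correspond to the ones quoted above. The data frame, the column names, and the “mean” strategy are assumptions made for illustration; only SimpleImputer, ColumnTransformer, the list “a”, and remainder="passthrough" come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the weather-station data frame.
df = pd.DataFrame({
    "temp_min": [8.2, np.nan, 6.0],
    "temp_max": [13.5, 15.0, np.nan],
    "station": ["A", "B", "C"],
})

# Columns whose missing values should be filled (hypothetical names).
a = ["temp_min", "temp_max"]

# Replace each NaN with the mean of its column (assumed strategy).
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Apply the imputer to the columns in "a" only; remainder="passthrough"
# keeps every other column unchanged instead of dropping it.
transformer = ColumnTransformer(
    transformers=[("impute", imputer, a)],
    remainder="passthrough",
)

# The result is an array: imputed columns first, passthrough columns after.
filled = transformer.fit_transform(df)
print(filled)
```

Note that the transformer returns an array rather than a data frame, with the imputed columns placed first, so the column order can differ from the original data frame.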

For more detailed information, check out this project on GitHub by clicking on the link given below!

I strongly suggest that before jumping directly into this project, you have a look at another project that also uses the DBSCAN algorithm, but implemented on randomly generated data. The link for that project is given below.

I hope my article explains everything related to Density-Based clustering, along with the explanation of the project. Thank you so much for investing your time in reading my article and boosting your knowledge!

Big Data Enthusiast, have a demonstrated history of delivering large and complex projects. Interested in working in the field of AI and Data Science.