Clustering is an unsupervised learning method and a common technique for statistical data analysis used in many fields. Clustering is the task of dividing a set of observations into subsets, called clusters, such that observations in the same cluster are similar to one another and dissimilar to observations in other clusters. In simple words, the main goal of clustering is to group the data on the basis of similarity and dissimilarity.
For example, the following diagram shows similar kinds of data grouped into different clusters:
K-Means algorithm
The K-means clustering algorithm is one of the well-known algorithms for clustering data. We need to assume that the number of clusters is already known; this is also called flat clustering. It is an iterative clustering algorithm that follows the steps given below:
Step 1: Specify the desired number of clusters, K.
Step 2: Fix the number of clusters and randomly assign each data point to a cluster; in other words, classify the data based on the number of clusters.
Step 3: Compute the cluster centroids. Since this is an iterative algorithm, the locations of the K centroids are updated at every iteration until they stop moving, i.e. the algorithm converges (to a local, not necessarily global, optimum), as sketched in the code below.
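Before moving to Scikit-learn, the steps above can be sketched directly in NumPy. This is a minimal, illustrative implementation rather than the library's version; the function name simple_kmeans, the iteration cap, and the random initialization strategy are assumptions made for this example:

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    # Steps 1-2: choose K and randomly pick K data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids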
The following code shows how to implement the K-means clustering algorithm in Python using the Scikit-learn module.
Let us import the necessary packages:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
The following lines of code generate a two-dimensional dataset containing four blobs, using make_blobs from the sklearn.datasets package.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=500, centers=4,
cluster_std=0.40, random_state=0)
We can visualize the dataset by using the following code:
plt.scatter(X[:, 0], X[:, 1], s=50);
plt.show()
Here, we initialize kmeans as an instance of the KMeans estimator, with the required parameter specifying how many clusters to find (n_clusters).
kmeans = KMeans(n_clusters=4)
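KMeans also accepts optional parameters such as init, n_init, and random_state, which control how the centroids are initialized and make the result reproducible; the values below are chosen only for illustration:

kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0)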
Next, we train the K-means model on the input data, predict a cluster label for each point, and plot the points colored by cluster.
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
The code given below plots the cluster centers found by the algorithm on top of the data, so we can see how well the points have been grouped into the requested number of clusters.
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
plt.show()
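Once fitted, the same model can also assign clusters to previously unseen points through its predict method; the coordinates below are made up purely for illustration:

new_points = np.array([[0.0, 4.0], [2.0, 1.0]])
print(kmeans.predict(new_points))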
That is all for this post; we will continue with clustering and discuss the Mean Shift algorithm in the next post.