K-Means Clustering

The dataset for this algorithm is available at the following link: https://github.com/ruthvikraja/mnist (each image in this dataset is 28 x 28 pixels).

The goal is to cluster the input images and find the 10 images that are closest to the centroid of each cluster.

K-means clustering is a technique for partitioning N observations into K clusters, in which each observation belongs to the cluster with the nearest mean. It is one of the simplest unsupervised algorithms and uses a distance metric to find the nearest centroid.
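
To make "nearest mean" precise, the objective can be written in symbols. The notation below is introduced only for this illustration and assumes the Euclidean distance: C_1, ..., C_K are the clusters and mu_k is the mean of the observations in C_k. K-means looks for the partition that minimizes the within-cluster sum of squared distances:

```latex
\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```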

Algorithm Workflow:
Step 1: Randomly choose K points as the cluster centers.
Step 2: Compute the distance from each observation to every center and assign each observation to its nearest center.
Step 3: Compute the new mean of each cluster and repeat Step 2.
Step 4: The process ends when the change is negligible, when no observations are reassigned to other clusters, or when any other stopping criterion is met.
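
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation used later: the function name kmeans_sketch, its parameters (tol, seed) and the toy data are all assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=20, tol=1e-4, seed=0):
    """Minimal k-means following the four workflow steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: distance from every point to every center, then nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop when no observation changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its assigned points;
        # an empty cluster keeps its previous center.
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # Step 4: also stop when the centers barely move.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Two well-separated groups of toy points (made up for this example).
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans_sketch(toy, k=2)
print(labels)
print(centers)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initialization.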

The following Python code implements K-means clustering:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Read the NumPy files with the load function from the numpy library:

x_train=np.load("/Users/ruthvikrajam.v/Desktop/ML/train_data.npy")
y_train=np.load("/Users/ruthvikrajam.v/Desktop/ML/train_labels.npy")

x_test=np.load("/Users/ruthvikrajam.v/Desktop/ML/test_data.npy")
y_test=np.load("/Users/ruthvikrajam.v/Desktop/ML/test_labels.npy")

# K-means clustering is unsupervised, so the independent variables x_train and x_test can be concatenated to fit the model,
# and y_train and y_test can be concatenated in the same way. np.concatenate takes the input arrays as a tuple:

x=np.concatenate((x_train,x_test))
y=np.concatenate((y_train,y_test))

# Define K-means clustering with 10 clusters and at most 20 iterations per run

# The KMeans class takes the number of clusters to form and several optional parameters; the input array is passed later to fit():

kmeans=KMeans(n_clusters=10, max_iter=20) # Passing random_state=0 as well would make every run start from the same random centroids
# The stopping criterion is the maximum number of iterations for a single run, unless the centroids converge earlier

kmeans.fit(x) # Fit the model to the input images to learn the clusters

x_predict=kmeans.predict(x) # Cluster label assigned to each image
# Alternatively, fit_predict() does both in one call:
# x_predict=kmeans.fit_predict(x)

print(kmeans.labels_) # Print the cluster labels assigned during fitting

centers=kmeans.cluster_centers_ # Centroid of each cluster is stored

# Find the 10 images nearest to the centroid of each cluster:
for c in range(10):
    # Select the data points that were assigned the cluster label c
    cluster_points = x[x_predict == c]

    # To preview a single image, reshape it to 28 x 28 and display it:
    # img = np.reshape(x[12], (28, 28))
    # plt.imshow(img, cmap="gray")

    # Euclidean distance between each data point in the cluster and the cluster centroid
    distances = np.linalg.norm(cluster_points - centers[c], axis=1)

    # argsort returns the indices that sort the distances in ascending order,
    # so the first 10 indices belong to the data points closest to the centroid
    top10_index = np.argsort(distances)[:10]
    top10_images = cluster_points[top10_index]

    # Display the 10 images in a grid of 5 rows and 2 columns
    fig = plt.figure(figsize=(10, 7)) # Create a figure object first, then add subplots to it
    rows, columns = 5, 2
    for position, image in enumerate(top10_images, start=1):
        fig.add_subplot(rows, columns, position) # position is the slot of the image in the figure
        plt.imshow(np.reshape(image, (28, 28)), cmap="gray")
        plt.axis("off") # Hide the axes
    print("Images with the cluster label", c)
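
The distance-and-sort step inside the loop can be isolated into a small helper and checked on toy data without downloading the dataset. This is only a sketch: the function name nearest_to_centroid and the sample points are made up for this example and are not part of the original code.

```python
import numpy as np


def nearest_to_centroid(points, centroid, k=10):
    """Return the indices of the k rows of `points` closest to `centroid`."""
    # Euclidean distance from every row to the centroid.
    distances = np.linalg.norm(points - centroid, axis=1)
    # Indices sorted by ascending distance; keep the first k.
    return np.argsort(distances)[:k]


# Toy data (made up): five 2-D points, with the centroid placed at the origin.
points = np.array([[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [0.5, 0.0], [2.0, 0.0]])
nearest = nearest_to_centroid(points, np.zeros(2), k=3)
print(nearest)  # -> [3 1 4]
```

The distances here are 5, 1, 10, 0.5 and 2, so the three closest points are at indices 3, 1 and 4, in that order.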

Original: "https://dev.to/ruthvikraja_mv/k-means-clustering-1idg"