This is the technique used to group data together. In this approach we identify clusters around some points known as centroids. These clusters then are known as different categories or classes. Say, we have to identify two clusters out of the data given, we will first create two centers randomly ( assign step) and then optimize these points such that they form the center points to the clusters using some optimization techniques (Optimization step). We will discuss the process in detail in this post below.

There are two stages in k-means clustering approach.

- Assign ( assign points to a centroid)
- Optimize (move centroid)

The biggest challenge with k-means is your clusters may be totally different based on initial placement of centroids. You may like to build several such clusters and then do ensembling.

Questions to ask are:

- How to select number of centroids in the data??
- What is good placement of centroid to start algorithm??
- How many iterations of assign and optimize you wants to perform??

Limitations:

- Given a fixed set of training data and fixed number of clusters, output may not be always same. (as it depend on the initial placement of the centroids)
- local minimum problem: (Local hidden climbing algorithm)

References:

- Visualize K-means
- https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/

**Single Linkage Clustering (SLC) or ****Hierarchical **Agglomerative

**Hierarchical**

**Clustering (HAC)**:

**Clustering (HAC)**:

- consider each object as a cluster itself
- define inter cluster distance as distance between closest two points in two clusters
- merge two closest clusters
- iterate n-k times to form k clusters ( n is total number of points)

SLC is deterministic.

**Average Link Clustering Algo**: take average distance between two cluster (point 2 above)

**Max Link Clustering Algo**: take distance from farthest two points in two clusters (point 2)

**Soft Clustering:**

Think of the scenario, where a point lies exactly in between two cluster. So with SLC, it might end up with any of the two clusters, based on how you start your algorithm.