Clustering#

What are clustering methods?#

A cluster is a group of things that are close together, and clustering is an important idea in machine learning in general. Clustering methods work well with feature vectors, i.e. vectors of numbers representing features. Because feature vectors that are close to one another are supposed to be related in a deeper way (their features are similar), clustering is very useful for deciding whether two data points belong to the same group (the same cluster).

Common clustering methods#

Most clustering methods are related to one of the following:

K-means#

K-means is a kind of partition-based method, which means that it partitions the (feature) space into different regions and predicts an arbitrary point in the (feature) space by locating which partition the point falls into.

K-means is one of the most popular clustering methods because of how simple it is. Its algorithm is as follows (see the sketch after the list):

  1. Select initial center points. These points will act as the centers of the clusters later.

  2. Assign a cluster label to each point using the rule that a point gets the label of its closest center.

  3. Shift each center to the mean of all the points assigned to it.

  4. Repeat steps 2 and 3 until the assignments no longer change.
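Below is a minimal sketch of these four steps, written with NumPy. The function name `kmeans`, the iteration cap, and the toy data are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal k-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Select initial centers by sampling k distinct points from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point the label of its closest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Shift each center to the mean of the points assigned to it.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4. Repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example usage on two synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
```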

DBSCAN#

DBSCAN, short for Density-Based Spatial Clustering of Applications with Noise, is a density-based method, which means that it clusters points based on density. If many points are packed together in a certain region, the algorithm assumes they all belong to the same cluster, while points lying in sparse regions are treated as noise.
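A short example using scikit-learn's `DBSCAN` is sketched below. The values of `eps` (neighborhood radius) and `min_samples` (density threshold) and the toy data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered outliers.
X = np.vstack([
    rng.normal(0, 0.3, (50, 2)),
    rng.normal(4, 0.3, (50, 2)),
    rng.uniform(-2, 6, (5, 2)),
])

# eps is the neighborhood radius, min_samples the density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Dense regions get cluster ids 0, 1, ...; sparse points get the noise label -1.
print(set(labels))
```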

GMM#

Gaussian Mixture Models, introduced previously, can do more than generate data. A GMM, composed of several Gaussians, can also be fit to existing data; the end result is a set of fitted Gaussians, each of which acts as a cluster and explains membership with probabilities.
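The sketch below fits a two-component GMM with scikit-learn and reads off both hard cluster labels and per-cluster probabilities; the synthetic data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Fit a mixture of two Gaussians; each fitted Gaussian acts as one cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely cluster per point
soft_labels = gmm.predict_proba(X)  # probability of each cluster per point
print(hard_labels[:5])
print(soft_labels[:5].round(3))
```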

Hierarchy methods#

Treating every point as its own cluster, we repeatedly merge the closest clusters into larger ones until there is one big cluster left. We can then see the hierarchy of the problem: clusters grouped later are inherently farther away from each other than clusters grouped earlier.
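A minimal sketch of this agglomerative procedure, using SciPy's hierarchical clustering utilities, is shown below. The toy data, the Ward linkage choice, and the cut into two flat clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# Agglomerative clustering: start with every point as its own cluster and
# repeatedly merge the two closest clusters until one big cluster remains.
Z = linkage(X, method="ward")

# Each row of Z records one merge; the third column is the merge distance,
# so later merges happen at larger distances than earlier ones.
print(Z[-3:, 2])

# Cut the hierarchy into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```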