Big Data Analysis Notes 3: Clustering #7

@SSK015

https://ssk015.github.io/dashuju3/

Slides (PDF)

What is Clustering?
Given a set of points and a notion of distance between them, we group the points into some number of clusters (a cluster is called “簇” in Chinese), so that points in the same cluster are close to each other.

Distance: mainly Euclidean or Jaccard.
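
As a minimal sketch of the two distances named above, assuming points are numeric tuples for the Euclidean case and Python sets for the Jaccard case (the function names are illustrative, not from the slides):

```python
import math


def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def jaccard_distance(s, t):
    """1 - |S ∩ T| / |S ∪ T|; treated as 0 when both sets are empty."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)
```

For example, `jaccard_distance({1, 2, 3}, {2, 3, 4})` shares 2 of 4 distinct elements, so the distance is 0.5.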

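Why is clustering hard?
Too many dimensions: in a high-dimensional space almost all pairs of points are at about the same distance, so every point looks isolated.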

Two main methods:
Hierarchical: bottom-up (agglomerative) or top-down (divisive)
Point assignment: assign each point to an existing cluster

Hierarchical (between clusters):
Three key questions:

1. How to represent a cluster.
2. How to determine the nearness of clusters.
3. When to stop merging clusters.
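Euclidean case:

Represent each cluster by its centroid (the average of its points).
Nearness of two clusters, either:
(1) the distance between their centroids
(2) the shortest distance between two points, one from each cluster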
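
The Euclidean case can be sketched as a naive bottom-up merge loop that uses option (1), the distance between centroids. This is only an illustration: stopping once `k` clusters remain is an assumed stopping rule, and the names `centroid` and `agglomerate` are hypothetical.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[tuple(p)] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

Each round scans every pair of clusters, which is fine for a sketch but slow for large inputs.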

Non-Euclidean case:

Approach 1

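Choose a clustroid (an existing point) to represent the cluster: the point closest to the other points.

"Closest" can mean the smallest maximum distance, average distance, or sum of squared distances to the other points.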

various distance and cohesion measures
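
A small sketch of picking a clustroid under each of the three criteria, assuming the caller supplies any distance function `dist(p, q)`. The `criterion` labels are made up for this example.

```python
def clustroid(cluster, dist, criterion="sum_sq"):
    """Return the existing point that minimizes the chosen score."""

    def score(p):
        # The distance from p to itself is 0, so including it does not
        # change which point wins under any of the three criteria.
        ds = [dist(p, q) for q in cluster]
        if criterion == "max":
            return max(ds)
        if criterion == "avg":
            return sum(ds) / len(ds)
        if criterion == "sum_sq":
            return sum(d * d for d in ds)
        raise ValueError(f"unknown criterion: {criterion}")

    return min(cluster, key=score)
```

Because the result is always one of the input points, this works even when averaging points makes no sense, for example with sets under Jaccard distance.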
Approach 2

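Represent the cluster by the collection of its points.
Define an inter-cluster distance.
For example, the minimum distance over pairs of points, one from each cluster, or the average over all such pairs.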
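
The two inter-cluster distances can be sketched as follows, assuming clusters are lists of points and `dist` is any pairwise distance function. The names `min_link` and `avg_link` are illustrative.

```python
from itertools import product


def min_link(a, b, dist):
    """Smallest distance between a point of a and a point of b."""
    return min(dist(p, q) for p, q in product(a, b))


def avg_link(a, b, dist):
    """Average distance over all pairs with one point from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))
```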
Approach 3

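Represent the cluster by the collection of its points.
Define a notion of cohesion and merge the two clusters whose union is most cohesive.
Cohesion measures: diameter (maximum distance between points in the cluster), average distance between points, density.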
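
A rough sketch of the cohesion approach, covering only the diameter and average-distance measures (density is left out). It assumes clusters are lists and that a lower score means a more cohesive union. The names `diameter`, `avg_pairwise`, and `best_merge` are hypothetical.

```python
from itertools import combinations


def diameter(cluster, dist):
    """Largest distance between any two points in the cluster."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def avg_pairwise(cluster, dist):
    """Average distance over all pairs of points in the cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def best_merge(clusters, dist, cohesion=diameter):
    """Indices (i, j) of the pair whose union has the lowest cohesion score."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: cohesion(clusters[ij[0]] + clusters[ij[1]], dist),
    )
```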
3.
design 1: convex clusters
design 2: concentric clusters.

K-Means(Assignment)

Definition: a method, not an algorithm.

Method: until convergence (no point changes cluster), repeat: assign each point to its nearest centroid, then update the centroids.

Selecting k: try different values of k and pick the one at which the average distance to the centroid stops dropping dramatically.
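
The assign-and-update loop can be sketched like this, assuming Euclidean points stored as tuples. Picking the initial centroids at random from the data is one common choice, not something the notes specify, and the `seed` and `max_iter` parameters are additions for this example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) after iterating until no point moves."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment
```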
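
One way to make "stops changing dramatically" concrete is to stop at the first k whose next step improves the average distance by less than some fraction. The 20% threshold `min_gain` below is an arbitrary assumption, and both function names are illustrative.

```python
import math


def avg_dist_to_centroid(points, centroids):
    """Average distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


def pick_k(avg_by_k, min_gain=0.2):
    """Pick the first k after which the relative improvement is below min_gain.

    avg_by_k maps each tried k to the average distance to centroid for that k.
    """
    ks = sorted(avg_by_k)
    for k, nxt in zip(ks, ks[1:]):
        prev = avg_by_k[k]
        if prev == 0 or (prev - avg_by_k[nxt]) / prev < min_gain:
            return k
    return ks[-1]  # the curve never flattened within the tried range
```

In practice the values in `avg_by_k` would come from running k-means once per candidate k and calling `avg_dist_to_centroid` on each result.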
