merging methods - single link, complete link, group average
Partition Clustering
let d be distance between two instance
select k random seeds
until convergence or stop condition
for each x in nodes
assign x to cluster c_j with d(x, c_j) is the smallest
for each c_i in clusters
s_i = centroid(c_i)
set fixed number of clusters before algorithm starts
different cluster assignment for different runs
faster than HAC
How to get the best of both worlds?
Buckshot algorithm
Use HAC to select k clusters
Select $ \sqrt{k \cdot n} $ documents
Apply HAC until we get k clusters (complexity $ O(k \cdot n) $)
Apply K-means to create document classifier
Much more robust initial set of seeds
Linear complexity
Online Document Search
Scatter / Gather
Needs fast clustering algorithm to categorize documents in user friendly time
Separate documents to initial clusters (Scatter)
use words with highest weight in whole group to create description
User selects clusters she is interested in
Group clusters together (Gather)
Do reclustering on merged groups (Scatter)
An Illustration
Search 5000 articles from New York Times collected during August 1990
User wants to find out what happened in August?
Conventional algorithms don't work because of:
Vague search query, e.g. What happened in August?
User does not know words to describe topic
Words to describe topic might not be used to discuss topic, e.g. international event