On Clustering Documents Unsupervisedly
A few thoughts and experiences on clustering documents with no supervision.
Say you have 1,000,000+ documents that you want to cluster. Your goal is to explore whether similar documents can be grouped together, which implies the following assumptions:
- The number of clusters is not known in advance; ideally it is below 1,000, preferably 30~300.
- Roughly 90% of the documents are unique and cannot be clustered; they can be seen as noise, or as a single huge cluster, depending on how you look at it.
First of all, clean the data. Specifically, normalize frequently-used tokens. My favorite routine is to normalize URLs, dates, etc., using `TextPreProcessor` from `ekphrasis`. If the text originates from a formal business or governmental source, normalization alone should be sufficient -- unpacking hashtags and correcting spelling won't be necessary.
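Here is a minimal sketch of such a normalization pass, loosely following the ekphrasis README; the option values are illustrative assumptions, not a prescription:

```python
# Sketch of a normalization pass with ekphrasis; options are illustrative.
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    # replace these entities with placeholder tokens like <url> and <date>
    normalize=['url', 'email', 'date', 'time', 'number', 'money', 'phone'],
    fix_html=True,               # unescape HTML entities
    unpack_hashtags=False,       # not needed for formal/business text
    spell_correct_elong=False,   # likewise, skip spelling correction
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

cleaned = " ".join(text_processor.pre_process_doc(
    "Payment of $120 is due 2019-05-01, see https://example.com"))
```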
I'm a huge word vector guy, so I'll first vectorize all documents into 300-dimensional vectors with Doc2Vec. Note that I do not deduplicate documents before training the Doc2Vec model, so that the algorithm gets a more accurate sense of which words are used together. ("Not applicable." repeated a million times says a lot more about those two words than the same phrase appearing only once, right?)
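A minimal sketch of that step with gensim's Doc2Vec, feeding in the full, non-deduplicated corpus; the hyperparameters and the `cleaned_docs` list are assumptions:

```python
# Sketch: train Doc2Vec on the full, non-deduplicated corpus (gensim).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `cleaned_docs` is assumed to be the list of normalized strings from the
# cleaning step, duplicates included.
corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(cleaned_docs)]

model = Doc2Vec(vector_size=300, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a 300-dimensional vector for any cleaned text later on.
vector = model.infer_vector(cleaned_docs[0].split())
```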
After training the Doc2Vec model, you can group by cleaned text, count occurrences, and save that count as a `weight` for each unique text.

For a noisy dataset with unknown geometry, I would always go for DBSCAN (you can get a gut feeling why here). If you have only a couple of CPU cores to spare, try `HDBSCAN`. It is a variation of DBSCAN that does not seem to parallelize very well (at least not with a convenient `n_jobs` parameter I can tamper with), yet it appears to outperform (read: halve the computation time of) `sklearn.cluster.DBSCAN`.
However, if you are working on a 24-core server, I found that parallelism is king -- I would rather go with `sklearn.cluster.DBSCAN` with `n_jobs` tuned up to full thrust (`-1`) than stick with `HDBSCAN`.

To be fair, `HDBSCAN` has a `core_dist_n_jobs` option for parallelization, but it seems to apply to only a small part of the computation and does not contribute much to the overall time saving.
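For reference, the two setups being compared look roughly like this; the parameter values are placeholders, not recommendations:

```python
# Sketch of the two setups being compared; parameter values are placeholders.
import hdbscan
from sklearn.cluster import DBSCAN

# HDBSCAN: core_dist_n_jobs only parallelizes the core-distance computation.
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=10,
                             core_dist_n_jobs=24).fit_predict(X)

# sklearn DBSCAN: n_jobs=-1 uses every core for the neighborhood queries,
# and sample_weight lets the occurrence counts stand in for the duplicates.
db_labels = DBSCAN(eps=0.5, min_samples=5,
                   n_jobs=-1).fit_predict(X, sample_weight=weights)
```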
However, DBSCAN has problems of its own. Its memory complexity is O(n^2), and it does not support caching (unlike agglomerative clustering), which leads to memory errors even on a machine with 250GB of RAM. The best I can do for now is to sub-sample my dataset to 10,000 documents and work from there.

Some have proposed OPTICS as a replacement, but -- as is the case with `HDBSCAN` and agglomerative clustering -- its `fit` method does not support `sample_weight`. This implies that, to get accurate clustering, you cannot deduplicate documents at all. In my case, that means 7x the amount of data. Not realistic.
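For completeness, here is a sketch of the sub-sampling workaround described above; the sample size, `eps`, and the cosine metric are stand-ins, not tuned values:

```python
# Sketch of the current workaround: sub-sample unique texts to a size that
# fits DBSCAN's O(n^2) memory footprint, keeping the weights alongside.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
idx = rng.choice(len(X), size=10_000, replace=False)

labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine",
                n_jobs=-1).fit_predict(X[idx], sample_weight=weights[idx])
# label -1 marks noise, i.e. the ~90% of documents that don't cluster
```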