Clustering of news articles

My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment), and I know that some of them cover the same story/topic. I would now like to group these articles based on shared story/topic, i.e., based on their similarity.

What I have done so far is to apply basic NLP techniques, including stopword removal and stemming. I also calculated a tf-idf vector for each article, and from these vectors I can compute, e.g., the pairwise cosine similarity. But with the actual grouping of the articles I struggle a bit (a condensed sketch of my pipeline follows after the list below). I see two principal ways -- probably related -- to do it:

1) Machine learning / clustering: I have already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are (for me, at least) not intuitive to specify.

2) Graph algorithms: I can represent my data as a graph, with the articles being the nodes and weighted edges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then apply graph algorithms to look for densely connected subgraphs (in the simplest case, connected components); a rough sketch of this idea is included below.
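To make this concrete, here is a rough, self-contained sketch of my current pipeline plus the graph idea from option 2 (assuming scikit-learn and networkx; the sample articles and the similarity threshold are just placeholders, and I left the stemming step out for brevity):

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Central bank raises interest rates, stocks tumble",
    "Stocks fall as markets react to the interest rate hike",
    "Local team wins the championship after a dramatic final",
]

# Stopword removal + tf-idf vectors (stemming omitted for brevity).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
similarity = cosine_similarity(tfidf)  # pairwise cosine similarities

THRESHOLD = 0.1  # arbitrary cut-off; would need tuning on real data

# Build a graph: articles as nodes, edges only above the threshold.
G = nx.Graph()
G.add_nodes_from(range(len(articles)))
for i in range(len(articles)):
    for j in range(i + 1, len(articles)):
        if similarity[i, j] >= THRESHOLD:
            G.add_edge(i, j, weight=similarity[i, j])

# Since cosine similarity is symmetric, the graph is undirected, so
# plain connected components are the natural "groups" to extract.
for component in nx.connected_components(G):
    print(component)  # e.g. {0, 1} and {2}: the two rate stories group together
```

With ~1k articles the quadratic edge loop is still cheap; for much larger collections one would want a sparser construction.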

In short, I'm not sure where best to go from here -- I'm still pretty new to this area. I wonder if there are any best practices for this, or some kind of guidelines for which methods/algorithms can (or cannot) be applied in certain scenarios.

(EDIT: forgot to link to a related question of mine)

Fritts answered 10/8, 2014 at 11:39 Comment(4)
I don't think there's a single "best way" to perform this task; in fact, from the way you posed your question, you could potentially apply dozens of different algorithms and obtain qualitatively similar results. Have you read any papers on text categorization?Vedic
Yeah, I started reading about similarity measures, (unsupervised) learning/clustering, and related topics. But in parallel I would also like to just try some things out -- you know, "learning by doing", or, if it happens, "learning by failing". I don't expect THE one best way to do it. However, it seems like a rather common task, and I was hoping for some best-practice approaches.Fritts
I really want to know how you clustered your dataset, because I have a set of resumes that I want to cluster and categorize, and I am pretty new to the field. How can I contact you? Sorry, because I'm not answering here, but you are my last hope. Sorry again.Emptyhanded
@Abeerzaroor -- See this StackOverflow question of mine with a minimal, now-working example. It's essentially just a simplified version of the more elaborate example on the SciKit-Learn website. Both links should help you get rolling.Fritts

Try the class of Hierarchical Agglomerative Clustering (HAC) algorithms with single and complete linkage.

These algorithms do not need the number of clusters as input.

The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stopping based on a threshold criterion. A closely related class is the divisive clustering algorithms, which first build the complete minimal spanning tree and then prune off branches of the tree based on inter-cluster similarity ratios.
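A rough sketch of threshold-cut HAC in Python with SciPy (the toy similarity matrix stands in for the tf-idf cosine similarities from the question, and the cut threshold is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy stand-in for the question's tf-idf cosine-similarity matrix.
similarity = np.array([[1.0, 0.9, 0.1],
                       [0.9, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])

# HAC works on distances, so convert and condense the matrix.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)
condensed = squareform(distance, checks=False)

Z = linkage(condensed, method="single")  # or method="complete"

# Cut the dendrogram at a distance threshold instead of choosing k.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # e.g. [1 1 2]: articles 0 and 1 grouped together
```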

Gratify answered 10/8, 2014 at 20:51 Comment(2)
Debasis, I will look into that! Are there any off-the-shelf tools or libraries I can use to get some first ideas?Fritts
You can use Weka. It has both the single-link and the complete-link implementations of HAC: weka.sourceforge.net/doc.dev/weka/clusterers/…Gratify

You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).

http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
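For intuition, here is a bare-bones Python sketch of the canopy idea for estimating k (the thresholds, the Euclidean distance, and the random data are placeholders; real canopy clustering uses a cheap, approximate distance measure):

```python
import numpy as np

def canopy_count(points, t1=1.0, t2=0.5):
    """Count canopies; the count serves as a rough estimate for k."""
    assert t1 > t2, "loose threshold t1 must exceed tight threshold t2"
    remaining = list(range(len(points)))
    count = 0
    while remaining:
        center = points[remaining[0]]
        dists = np.linalg.norm(points[remaining] - center, axis=1)
        # Points within the tight threshold t2 leave the candidate pool;
        # everything within the loose threshold t1 would form the canopy.
        remaining = [i for i, d in zip(remaining, dists) if d > t2]
        count += 1
    return count

# Toy data: two well-separated blobs, so the estimate should be ~2.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (50, 2)),
                    rng.normal(3, 0.1, (50, 2))])
print(canopy_count(points, t1=1.0, t2=0.5))  # feed the result into k-means as k
```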

Will you be recomputing over time, or do you only care about a static set of news? I ask because your k may change a bit over time.

Warthog answered 11/8, 2014 at 14:52 Comment(1)
For the time being I'm just trying to play around with off-the-shelf algorithms/libraries to get some idea of how stuff works and what results to expect. But you're right, in the long run the set of documents will be dynamic. I have already started reading about incremental variants of clustering algorithms.Fritts

Since you can model your dataset as a graph, you could apply stochastic clustering based on Markov models. Here are links to resources on the MCL algorithm (a toy sketch of the core iteration follows after the links):

Official thesis description and code base

Gephi plugin for MCL (to experiment with and evaluate the method)
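If you want to see the mechanics before installing anything, here is a toy, from-scratch Python sketch of the MCL iteration (expansion followed by inflation); the adjacency matrix and parameters are made up, and a real implementation such as the official one linked above handles pruning and convergence checks properly:

```python
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iterations=50):
    """Toy Markov Clustering: alternate expansion and inflation steps."""
    # Add self-loops, then column-normalize into a stochastic matrix.
    M = adjacency + np.eye(adjacency.shape[0])
    M = M / M.sum(axis=0)
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread random-walk flow
        M = np.power(M, inflation)                # inflation: strengthen strong flows
        M = M / M.sum(axis=0)                     # re-normalize columns
    # Read clusters off the surviving rows (attractors); deduplicate.
    clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in M if row.max() > 1e-6}
    return [list(c) for c in clusters]

# Two obvious groups: a triangle {0,1,2} and a pair {3,4}.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
print(mcl(A))  # e.g. [[0, 1, 2], [3, 4]]
```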

Inchon answered 12/8, 2014 at 2:28 Comment(2)
This looks pretty interesting, thanks! I installed the Gephi plugin and applied it to some sample data.Fritts
Christian, I'd be interested in knowing more about your experiment and results with graph-based approaches and otherwise.Inchon
