Online clustering of news articles
Is there a common online algorithm to classify news dynamically? I have a huge data set of news articles classified by topic, and I treat each of those topics as a cluster. Now I need to classify breaking news, and I will probably need to create new topics, i.e. new clusters, dynamically.

The algorithm I'm using is the following:

1) I go through a group of feeds from news sites and recognize links to news articles.

2) For each new link, I extract the content using dragnet and then tokenize it.

3) I compute the vector representations of all the old news articles plus the new one using TfidfVectorizer from sklearn.

4) I find the nearest neighbor in my dataset by computing the Euclidean distance between the new article's vector representation and the vector representations of all the old articles.

5) If that distance is smaller than a threshold, I put the article in the cluster its neighbor belongs to. Otherwise, I create a new cluster containing the breaking news article.

Each time a news article arrives, I re-fit all the data with TfidfVectorizer, because new dimensions (new words) may appear. I can't afford to re-fit only once per day, because I need to detect breaking events, which can be related to previously unknown topics. Is there a common approach that is more efficient than the one I am using?
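The steps above can be sketched as follows. The threshold value and the toy documents are illustrative assumptions, not part of the original question:

```python
# Minimal sketch of the pipeline described above: re-fit TF-IDF over the whole
# corpus for every incoming article, then assign it to its nearest neighbour's
# cluster if the distance is below a threshold, else open a new cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = [
    "stock markets fall after rate hike",
    "central bank raises interest rates",
    "local team wins the championship final",
]
clusters = [0, 0, 1]   # cluster label of each stored article
THRESHOLD = 1.2        # assumed value; tune on held-out data

def assign(new_doc):
    # Re-fit on all documents so new vocabulary gets its own dimensions.
    corpus = docs + [new_doc]
    X = TfidfVectorizer().fit_transform(corpus)
    dists = euclidean_distances(X[-1], X[:-1]).ravel()
    i = dists.argmin()
    if dists[i] < THRESHOLD:
        label = clusters[i]          # join the neighbour's cluster
    else:
        label = max(clusters) + 1    # breaking news: open a new cluster
    docs.append(new_doc)
    clusters.append(label)
    return label

label = assign("markets tumble as interest rates rise")
```

Note that the full re-fit makes each insertion O(corpus size), which is exactly the cost the question is trying to avoid.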

Samons answered 3/4, 2018 at 20:43 Comment(6)
It does not even work reliably off-line, and you want an online algorithm already?Arsenide
Yes, the algorithm I'm using works off-line. TF-IDF vectorization with kNN clustering is a common approach, and it is well known to work fine. I don't understand why you downvoted my question; I'm researching another topic, online clustering, and I need some ideas.Samons
There is no "knn clustering". Only kNN classification.Arsenide
ok thanks for your comments!Samons
I had downvoted because the question wasn't self-contained, and thus likely not useful for future visitors. You have improved the question now, so I un-downvoted. Nevertheless, the "clustering" you do is still unreproducible. The second document will have the first as its nearest neighbor, so everything ends up in the same "cluster" (whatever a cluster is here anyway)Arsenide
ok I'm going to explain it betterSamons

If you build the vectorization yourself, adding new data will be much easier.

  1. You can trivially add new words as new columns that are simply 0 for all earlier documents.
  2. Don't bake the idf weights into the stored vectors; apply them only as dynamic weights at query time.
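A minimal sketch of these two points, with illustrative names, might look like this: store raw term counts per document, grow the vocabulary as new words appear (old documents implicitly have 0 in the new columns), and recompute idf on the fly instead of storing it.

```python
import math
from collections import Counter

vocab = {}       # word -> column index
doc_counts = []  # one Counter of raw term counts per document
df = Counter()   # document frequency per word

def add_document(tokens):
    counts = Counter(tokens)
    for w in counts:
        if w not in vocab:
            vocab[w] = len(vocab)   # new column; old docs stay 0 there
        df[w] += 1
    doc_counts.append(counts)

def idf(word):
    # Smoothed idf, recomputed dynamically from current document frequencies.
    return math.log((1 + len(doc_counts)) / (1 + df[word])) + 1

def weighted_vector(counts):
    # Apply idf as a dynamic weight at query time only.
    return {w: c * idf(w) for w, c in counts.items()}

add_document("rates markets fall".split())
add_document("rates rise again".split())
vec = weighted_vector(doc_counts[0])
```

Because the stored counts never change, adding a document is O(document length) rather than a full re-fit of the corpus.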

There are well known, and very fast, implementations of this.

For example, Apache Lucene. It can add new documents online, and it uses a variant of tf-idf for search.
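A hypothetical mini inverted index, sketched in Python, shows the idea that makes engines like Lucene fast: a new article is only compared against documents that share at least one term with it, not against the whole archive.

```python
from collections import defaultdict

index = defaultdict(set)   # term -> ids of documents containing it

def index_document(doc_id, tokens):
    # Add the document id to the postings list of each of its terms.
    for t in set(tokens):
        index[t].add(doc_id)

def candidates(tokens):
    # Union of postings lists: the only documents worth scoring in full.
    ids = set()
    for t in set(tokens):
        ids |= index[t]
    return ids

index_document(0, "rates markets fall".split())
index_document(1, "team wins final".split())
cands = candidates("markets tumble".split())
```

With sparse vectors, most archived documents share no terms with a new article, so the candidate set is typically a tiny fraction of the corpus.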

Arsenide answered 6/4, 2018 at 19:46 Comment(1)
Ok, that approach could be very useful!Samons
