Generating dictionaries to categorize tweets into pre-defined categories using NLTK

I have a list of twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology - based on their area of interest. I have extracted the last 100 tweets of each user in Python and created a corpus per user after cleaning the tweets.

As mentioned in Tweet classification into multiple categories on (Unsupervised data/tweets), I am trying to generate dictionaries of common words under each category so that I can use them for classification.

Is there a method to generate these dictionaries for a custom set of words automatically?

Then I can use these dictionaries to classify the Twitter data with a tf-idf classifier and get the degree of correspondence of each tweet to each of the categories. The highest value will give us the most probable category of the tweet.

But since the categorisation is based on these pre-generated dictionaries, I am looking for a way to generate them automatically for a custom list of categories.

Sample dictionaries:

Education - ['book','teacher','student'....]

Automobiles - ['car','auto','expo',....]

Example I/O:

**Input:**
UserA - "students visited share learning experience eye opening 
article important preserve linaugural workshop students teachers 
others know coding like know alphabets vision driving codeindia office 
initiative get students tagging wrong people apologies apologies real 
people work..."
...
UserN - <another corpus of cleaned tweets>


**Expected output:**
UserA - Education (61%)
UserN - Automobiles (43%)
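
For the scoring step described above, here is a minimal sketch of the tf-idf idea with scikit-learn. The category word lists and the user corpus below are made-up placeholders, and the sketch still assumes the dictionaries exist, which is exactly the open question of this post:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical hand-built dictionaries; generating these
# automatically is the open question of this post
categories = {
    'Education': 'book teacher student school learning workshop coding',
    'Automobiles': 'car auto expo engine driving motor',
}

# Cleaned tweet corpus of one user (placeholder text)
user_corpus = 'students visited share learning experience workshop teachers coding'

# Fit on the category "documents" plus the user corpus so they share one vocabulary
docs = list(categories.values()) + [user_corpus]
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity of the user row (last) against each category row
scores = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
for name, score in zip(categories, scores):
    print(f'UserA - {name} ({score * 100:.0f}%)')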
Representative asked 23/2/2020 at 6:05 - Comments (10):
If you can provide the input you are working with and the output you expect as text, there is a higher probability of getting a suggestion/solution. - Jodijodie
@Jodijodie I have added examples to clarify. - Representative
Do you ask how to build dictionaries of a specific topic (e.g. education)? Something like topic-related headwords? Or do you ask how to extract those topic-related words from the tweets in your corpus? - Waldrup
I am asking how I can build these dictionaries of topic-related headwords. - Representative
You could use the most frequent lemmatized tokens from texts which you know to be related to the specific topic. The way you try to classify the texts sounds like tf-idf would be the correct way for categorizing anyway, such that you get a rating like in a search engine for each tweet compared to the bag of words of each class. Or use cosine similarity of word vectors. A different approach would be annotation of a subset of your data, meaning that you classify some of the texts and then compare them to the rest of your corpus for classifying them. - Waldrup
@Waldrup Yes, I have planned the same approach using a tf-idf classifier and then cosine similarity for further enhancement. The initial challenge I am facing is to find a source from which I can fetch a big enough list of lemmatized tokens on any topic so that the model is robust, instead of randomly picking up any document on sports, automobiles, etc. for building these bags of words. - Representative
A colleague of mine is doing research on this. He is using an approach called Latent Dirichlet allocation. After some research, I think this is exactly what you are looking for: datacamp.com/community/tutorials/lda2vec-topic-model (a minimal sketch follows these comments). - Waldrup
The better question is this: "If the final labels don't care what words are in the sentence, why do you have to extract a dictionary of words that belongs to a category?". Then the expected response: "Because I don't have labelled data and I want to create labelled data for supervised learning in an unsupervised manner." Finally, we ask: "Then is supervised learning the right task to perform on your dataset if you don't have labels?" - Designing
@Designing I have been having the same concerns. What alternative approach do you suggest? - Representative
It is not clear how you would use the dictionary. If the tweets are big enough, you can label the tweets (by hand and with the help of another algorithm), then feed the labels and the tweets into BERT or word2vec and apply scikit-learn KMeans; see scikit-learn.org/stable/tutorial/machine_learning_map/… - Combinative
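
Following up on the Latent Dirichlet allocation suggestion above, here is a minimal sketch with gensim of how topic-word lists could be extracted automatically. The toy corpora and the number of topics are placeholders; note that LDA topics come unlabeled, so mapping them onto the 7 categories remains a manual step:

from gensim import corpora
from gensim.models import LdaModel

# Placeholder: one cleaned, tokenized corpus per user
user_corpora = [
    ['students', 'teacher', 'learning', 'workshop', 'coding'],
    ['car', 'auto', 'expo', 'driving', 'engine'],
]

dictionary = corpora.Dictionary(user_corpora)
bow = [dictionary.doc2bow(tokens) for tokens in user_corpora]

# Learn topics; each topic's top words can seed a candidate category dictionary
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id in range(lda.num_topics):
    print(topic_id, [word for word, _ in lda.show_topic(topic_id, topn=5)])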

TL;DR

Labels are necessary for supervised machine learning. And if you don't have training data that contains Xs (input texts) and Y (output labels) then (i) supervised learning might not be what you're looking for or (ii) you have to create a dataset with texts and their corresponding labels.

In Long

Let's try to break it down and reflect on what you're looking for.

I have a list of twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology

So your ultimate task is to label tweets into 7 categories.

I have extracted last 100 tweets of the users in Python and created a corpus for each user after cleaning the tweets.

100 data points is definitely insufficient to do anything if you want to train a supervised machine learning model from scratch.

Another thing is the definition of a corpus. A corpus is a body of text, so it's not wrong to call any list of strings a corpus. However, to do any supervised training, each text should come with the corresponding label(s).

But I see some people do unsupervised classification without any labels!

Now, that's an oxymoron =)

Unsupervised Classification

Yes, there is "unsupervised learning", which often means learning a representation of the inputs; generally, the representation of the inputs is used to (i) generate or (ii) sample.

Generation from a representation means to create, from the representation, a data point that is similar to the data which the unsupervised model has learnt from. In the case of text processing / NLP, this often means to generate new sentences from scratch, e.g. https://transformer.huggingface.co/

Sampling a representation means to give the unsupervised model a text, where the model is expected to provide some signal based on what it has learnt. E.g. given a language model and a novel sentence, we want to estimate the probability of the sentence, then we use this probability to compare across different sentences' probabilities.
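
As an illustration (not part of the original answer), a minimal sketch of that sampling idea using a pre-trained GPT-2 from the transformers library:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

def avg_log_likelihood(sentence):
    # Mean log-probability per token under the language model
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        out = model(**inputs, labels=inputs['input_ids'])
    return -out.loss.item()

# A higher (less negative) score means the model finds the sentence more plausible
print(avg_log_likelihood('The car drove down the road.'))
print(avg_log_likelihood('Road the down drove car the.'))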

Algorithmia has a nice summary blogpost https://algorithmia.com/blog/introduction-to-unsupervised-learning and a more modern perspective https://sites.google.com/view/berkeley-cs294-158-sp20/home

That's a whole lot of information but you don't tell me how to #$%^&-ing do unsupervised classification!

Yes, the oxymoron explanation isn't finished. If we look at text classification, what exactly are we doing?

We are fitting the input text into some pre-defined categories. In your case, the labels are pre-defined, but:

Q: Where exactly would the signal come from?

A: From the tweets, of course, stop distracting me! Tell me how to do classification!!!

Q: How do you tell the model that a tweet should be this label and not another label?

A: From the unsupervised learning, right? Isn't that what unsupervised learning is supposed to do? To map the input texts to the output labels?

Precisely, that's the oxymoron:

Supervised learning maps the input texts to output labels, not unsupervised learning.

So what do I do? I need to use unsupervised learning and I want to do classification.

Then the question to ask is:

How about all these AI tools I keep hearing about, where I can do classification with 3 lines of code?

Don't they use unsupervised language models that sound like Sesame Street characters, e.g. ELMo, BERT, ERNIE?

I guess you mean something like https://github.com/ThilinaRajapakse/simpletransformers#text-classification

from simpletransformers.classification import ClassificationModel
import pandas as pd


# Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)

eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)

# Create a ClassificationModel ('bert-base' alone is not a valid pre-trained model name; 'bert-base-cased' is)
model = ClassificationModel('bert', 'bert-base-cased') # You can set class weights by using the optional weight argument

# Train the model
model.train_model(train_df)

Take careful notice of the comment:

Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.

Yes, that's the more modern approach:

  • First use a pre-trained language model to convert your texts into input representations
  • Then feed the input representations and their corresponding labels to a classifier

Note: you still can't avoid the fact that you need labels to train the supervised classifier.
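
To make those two steps concrete, a minimal sketch using sentence-transformers for the representations and scikit-learn for the classifier; the texts, labels, and model name below are placeholders, and the labels still have to come from somewhere:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Step 1: a pre-trained language model converts texts into numerical representations
encoder = SentenceTransformer('all-MiniLM-L6-v2')

texts = ['students visited the workshop', 'the car expo opens today']
labels = ['Education', 'Automobiles']  # manually defined labels; unavoidable

X = encoder.encode(texts)

# Step 2: representations + labels train a supervised classifier
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(['teachers and students love books'])))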

Wait a minute, you mean all these AI tools I keep hearing about are not "unsupervised classification"?

Exactly. There's really no such thing as "unsupervised classification" (yet); somehow, (i) the labels need to be manually defined and (ii) the mapping between the inputs and the labels has to exist.

The right word to define the paradigm would be transfer learning, where the language model is

  • learned in a self-supervised manner (it's actually not truly unsupervised) so that the model learns to convert any text into some numerical representation

  • then the numerical representations are used together with labelled data to produce the classifier.

Designing answered 25/2/2020 at 1:47 - Comment (1):
Great answer! I would just like to add that, with unsupervised learning, you could map the inputs (in this case, texts) to different groups with intrinsic differences, but they won't be labeled at all nor have any meaning beyond the one that the user gives them (the algorithm could tell that some texts belong to different groups, but not the meaning of these groups, nor "deduce" labels for them based on the content). - Sapphirine
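
To illustrate that point, a minimal sketch with placeholder data: KMeans will happily split tweet vectors into groups, but the cluster ids it returns are arbitrary integers with no attached meaning:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ['students teacher learning', 'car auto expo',
          'book student school', 'engine driving car']
X = TfidfVectorizer().fit_transform(tweets)

# Two clusters emerge, but nothing says which one is 'Education'
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)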
