How to convert multiple pandas text columns into tensors?

Hi, I am working on the Key Point Analysis task shared by IBM; here is the link. The dataset has more than one text column. Can anyone please tell me how to convert the text columns into tensors and assign them back to the same DataFrame? I need to keep them there because the DataFrame also contains other columns of data.

Problem

I have never seen this kind of data before, with multiple text columns. How can I convert all of those columns into tensors and then apply a model? Most of the time the data has one text column and the other columns are labels, for example movie reviews or toxic comment classification.

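def clean_text(text):
    """
    text: a string

    return: modified initial string
    """
    # REPLACE_BY_SPACE_RE and BAD_SYMBOLS_RE (compiled regexes) and STOPWORDS
    # (a collection of words) are assumed to be defined elsewhere.
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace matched symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete matched symbols
    text = text.replace('x', '')  # note: removes every 'x' character
    # text = re.sub(r'\W+', '', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # drop stopwords
    return text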
Verism answered 18/7, 2021 at 13:9 Comment(6)
Have you tried using Word2vec to convert the texts into tensors? I think it would work. - Foulup
@HakanAkgün Can you suggest an article on that? - Verism
This is the link to the Word2vec model in the gensim library: radimrehurek.com/gensim/models/word2vec.html . However, if you are going to train a model to predict something, then I suggest you also check Hugging Face's pretrained RoBERTa and BERT tokenizers: huggingface.co/transformers/model_doc/roberta.html - Foulup
Yes, I have read the details, but that is the model itself. I don't think this is what I am looking for. If you still think it is the best choice, could you please take a look at the data and then suggest something? I have attached the link in the post. - Verism
You are trying to replace the texts with their embeddings, right? If so, I can provide an example of it with those models. - Foulup
@HakanAkgün Thank you so much. Please post it as an answer; I am waiting for it. - Verism

If I understood your question correctly, you can do something like the following:

[Image: prior data]

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
DF["args"]=DF["args"].apply(lambda x:tokenizer(x)['input_ids'])

This converts each sentence in the column into a list of token IDs.
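Since the question is about several text columns, the same `apply` call can be looped over each of them. The following is only a minimal sketch: `tokenize_columns`, the `key_point` column name, and `toy_tokenizer` are hypothetical names introduced for illustration. The toy tokenizer is a stand-in that mimics the dict returned by a Hugging Face tokenizer so the example runs offline; in practice you would pass the `RobertaTokenizer` instance from the snippet above instead.

```python
import pandas as pd


def tokenize_columns(df, columns, tokenizer):
    """Return a copy of df in which each listed text column holds lists of token IDs."""
    out = df.copy()
    for col in columns:
        # Same pattern as the single-column snippet, applied per column.
        out[col] = out[col].apply(lambda text: tokenizer(text)["input_ids"])
    return out


# Hypothetical stand-in tokenizer: assigns an integer ID to each new lowercase word.
# It only imitates the {"input_ids", "attention_mask"} dict shape; it is NOT RoBERTa.
vocab = {}


def toy_tokenizer(text):
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


# Made-up example rows; "key_point" is an assumed column name.
df = pd.DataFrame({
    "args": ["Schools should ban phones", "Phones help learning"],
    "key_point": ["Phones distract students", "Phones help learning"],
    "label": [1, 0],
})

encoded = tokenize_columns(df, ["args", "key_point"], toy_tokenizer)
print(encoded)
```

The non-text columns (here `label`) are left untouched, and the original `df` still holds the raw strings because the function works on a copy.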


Uniocular answered 18/7, 2021 at 19:54 Comment(12)
Yes, I think that would definitely work. Thank you so much. - Verism
Can you please explain what ['input_ids'] is in the lambda function? - Verism
Basically, RobertaTokenizer returns a dict containing "input_ids" and "attention_mask", so taking ["input_ids"] selects the token IDs. (You are welcome, by the way.) - Foulup
What if you have tabular data like the above? How would you apply a classification model to it? - Verism
@irfan Yes, I came across the same problem: how can you apply the model in these scenarios? - Surculose
You can add padding with the tf module or another module. After padding these arrays to the same length, you can just fit a model (probably one that includes LSTMs or RNNs). - Foulup
@HakanAkgün Can you give a link to an article that explains how to do that, or update the answer with some more code? - Surculose
There is a good explanation of sequence classification with LSTMs at the following link; you can check it out: machinelearningmastery.com/…. Hope that helps. - Foulup
@HakanAkgün Thanks for the link. I have studied this machinelearningmastery.com/… and it explains how to make the sequences the same length. My question is different: there are several text columns, and after converting them with the code you provided we now have several columns of tensors. How do I combine them and apply an LSTM? - Surculose
If each column always holds arrays of the same shape, for example column1 always (x,), column2 always (y,), and column3 always (z,), then you can concatenate them into one array of shape (x+y+z,) per row and feed that into your LSTM layer. (If I understood your question correctly, this should answer it.) - Foulup
@HakanAkgün You are a mind reader; this is what I was looking for. It will help me find an article that explains how to create a matrix with x+y+z. If you know of such an article, please let me know. - Surculose
Concatenation can be done with different libraries, but the most reasonable options for this case are np.concatenate before feeding the data in, or tf.keras.layers.Concatenate (if you know how to use TensorFlow's functional API). Other than these, I don't have a specific article about this. - Foulup
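The padding and concatenation steps discussed in the comments could be sketched roughly as below, using plain Python for the padding and `np.concatenate` for joining the columns. `pad_ids` and `build_features` are hypothetical helper names, the token IDs are made up, and the fixed lengths (4 and 3) are arbitrary choices for illustration. The padding ID of 0 is also an assumption: with a Hugging Face tokenizer you would normally pass `tokenizer.pad_token_id` instead.

```python
import numpy as np


def pad_ids(ids, length, pad_id=0):
    """Truncate or right-pad a list of token IDs to exactly `length` items."""
    ids = list(ids)[:length]
    return ids + [pad_id] * (length - len(ids))


def build_features(rows, lengths, pad_id=0):
    """Build one fixed-width feature vector per row.

    rows:    list of tuples, each holding one token-ID list per text column
    lengths: fixed length chosen for each column, e.g. (x, y, z)
    returns: integer array of shape (n_rows, sum(lengths))
    """
    features = []
    for row in rows:
        # Pad each column to its own fixed length, then join them end to end.
        parts = [np.array(pad_ids(ids, n, pad_id)) for ids, n in zip(row, lengths)]
        features.append(np.concatenate(parts))
    return np.stack(features)


# Made-up token IDs for two rows and two text columns.
rows = [
    ([5, 8, 2], [7, 1]),
    ([4], [9, 3, 6, 2]),
]

X = build_features(rows, lengths=(4, 3))
print(X.shape)
print(X)
```

Each row of `X` has length x+y, matching the (x+y+z,) idea from the comment above for the two-column case. Sequences longer than the chosen length are truncated here, which is one possible design choice rather than the only one.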
