How to convert multiple pandas text columns into tensors?

Asked 18/7, 2021 at 13:9 Answered 18/7, 2021 at 19:54

Solved machine-learning deep-learning nlp data-preprocessing

Hi, I am working on the Key Point Analysis task shared by IBM; here is the link. The dataset has more than one column of text. Can anyone please tell me how to convert the text columns into tensors and assign them back to the same DataFrame? The DataFrame also contains other columns of data.

Problem

The problem is that I have never seen this kind of data before, with multiple text columns. How can I convert all of those columns into tensors and then apply a model? Most of the time the data has one text column and the other columns are labels, for example movie reviews or toxic comment classification.

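import re

# REPLACE_BY_SPACE_RE and BAD_SYMBOLS_RE (compiled regex patterns) and
# STOPWORDS (a collection of stop words) are not shown in the post;
# they are assumed to be defined before this function is called.

def clean_text(text):
    """
    text: a string

    return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace matched symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete matched symbols
    text = text.replace('x', '')  # note: removes every letter "x", including inside words
    # text = re.sub(r'\W+', '', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # drop stop words
    return text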

Verism asked 18/7, 2021 at 13:9 Comment(6)

Have you tried using Word2vec to convert the texts into tensors? I think it would work. – Foulup 18/7, 2021 at 13:25

@HakanAkgün Can you suggest an article on that? – Verism 18/7, 2021 at 16:55

This is the link to the Word2vec model in the gensim library: radimrehurek.com/gensim/models/word2vec.html. However, if you are going to train a model to predict something, then I suggest you also check Hugging Face's pretrained RoBERTa and BERT tokenizers: huggingface.co/transformers/model_doc/roberta.html – Foulup 18/7, 2021 at 18:1

Yeah, I have read the details, but that is the model... I don't think this is what I am looking for. But if you still think it's the best choice, can you please take a look at the data and then kindly suggest something? I have attached the link in the post. – Verism 18/7, 2021 at 18:53

You are trying to replace texts with their embeddings, right? If yes I can provide an example of it with those models. – Foulup 18/7, 2021 at 18:55

@HakanAkgün Thank you so much... Please post it as an answer; I am waiting for you, sir. – Verism 18/7, 2021 at 19:38

If I understood your question correctly, you would do something like the following:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# DF is the DataFrame from the question; "args" stands for one of its text columns
DF["args"] = DF["args"].apply(lambda x: tokenizer(x)["input_ids"])

This converts each sentence in the column into a list of token IDs.
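The same idea extends to several text columns at once. Below is a minimal sketch, assuming a DataFrame with two hypothetical text columns named `argument` and `key_point` and one non-text column `label`. `toy_encode` is a made-up stand-in so the snippet runs without downloading a model; with the tokenizer from the answer you would pass `lambda s: tokenizer(s)["input_ids"]` instead.

```python
import pandas as pd

def encode_text_columns(df, text_columns, encode):
    """Return a copy of df whose text columns hold lists of token IDs."""
    out = df.copy()
    for col in text_columns:
        # Apply the encoder cell by cell; other columns are left untouched.
        out[col] = out[col].apply(encode)
    return out

# Hypothetical stand-in for a real tokenizer: maps each distinct lowercase
# word to an integer ID in order of first appearance (IDs start at 1).
_vocab = {}

def toy_encode(sentence):
    return [_vocab.setdefault(word, len(_vocab) + 1)
            for word in sentence.lower().split()]

# Hypothetical data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "argument": ["School uniforms reduce bullying",
                 "Uniforms limit self expression"],
    "key_point": ["Uniforms reduce bullying", "Uniforms limit expression"],
    "label": [1, 1],
})

encoded = encode_text_columns(df, ["argument", "key_point"], toy_encode)
print(encoded)
```

The lists still have different lengths per row, so they need padding before they can be stacked into a single tensor.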

Uniocular answered 18/7, 2021 at 19:54 Comment(12)

ummm yes I think that would definitely work. Thank you so much – Verism 18/7, 2021 at 19:59

Sir, can you please explain what ['input_ids'] is in the lambda function? – Verism 18/7, 2021 at 20:1

Basically, RobertaTokenizer returns a dict containing "input_ids" and "attention_mask", so taking ["input_ids"] simply selects the token IDs. (You are welcome, btw.) – Foulup 18/7, 2021 at 22:8

What if you have tabular data like the above? How would you apply a classification model to that? – Verism 2/9, 2021 at 6:48

@irfan Yeah, I came across the same problem. How can you apply the model in these scenarios? – Surculose 2/9, 2021 at 6:51

You can add padding with the tf module or another module. After adding padding and making these arrays the same length, you can just fit a model (probably one that includes LSTMs or RNNs). – Foulup 2/9, 2021 at 8:57
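The padding step can be sketched with NumPy alone. This is an illustration under assumptions, not the commenter's code: `pad_to_length` is a hypothetical helper, and `pad_id=0` is a placeholder (with a real tokenizer you would normally use its own `pad_token_id`). It also builds the kind of 0/1 mask that "attention_mask" represents.

```python
import numpy as np

def pad_to_length(sequences, max_len, pad_id=0):
    """Right-pad (or truncate) lists of token IDs to max_len.

    Returns (ids, mask). ids has shape (len(sequences), max_len);
    mask is 1 where ids holds a real token and 0 where it holds padding.
    """
    ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:max_len]  # truncate sequences that are too long
        ids[row, :len(kept)] = kept
        mask[row, :len(kept)] = 1
    return ids, mask

# Three sequences of different lengths, padded or truncated to length 4.
ids, mask = pad_to_length([[5, 8, 2], [7], [1, 2, 3, 4, 5]], max_len=4)
print(ids)
print(mask)
```

Keras also ships a `pad_sequences` utility for the same job, which is probably what "the tf module" refers to here.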

@HakanAkgün Can you give the link to an article that explains how to do that? Or update the answer with some more code. – Surculose 2/9, 2021 at 9:52

There is a good explanation of sequence classification with LSTMs at the following link; you can check it out: machinelearningmastery.com/…. Hope that helps. – Foulup 2/9, 2021 at 13:16

@HakanAkgün Sir, thanks for the lecture. I have also studied this machinelearningmastery.com/… and it explains how to make the sequences the same length. My question was about the different text columns that we converted into tensors: we now have several columns of tensors, which I converted with the code you provided. How do I combine them and apply an LSTM? – Surculose 7/9, 2021 at 8:53

If every array within a column has the same shape, for example column1 is always (x,), column2 is always (y,), and column3 is always (z,), then you can concatenate them into one array of shape (x+y+z,) per row and feed that into your LSTM layer. (If I understood your question correctly, this should answer it.) – Foulup 7/9, 2021 at 13:50

@HakanAkgün Ohh bro, seriously, you are a mind reader. I was looking for exactly this. Now it will help me find an article that explains how to create a matrix with x+y+z. If you have an article like that, please let me know. – Surculose 10/9, 2021 at 6:28

Concatenation can be done with different libraries, but the most reasonable options for this case are np.concatenate before feeding the model, or tf.keras.layers.Concatenate (if you know how to use TensorFlow's functional API). Other than these, I don't have a specific article about this. – Foulup 10/9, 2021 at 13:56
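A small sketch of the np.concatenate route, assuming each text column has already been tokenized and padded to a fixed length. The arrays `col1`, `col2`, `col3` and their values are hypothetical, with x = 4, y = 3, and z = 2.

```python
import numpy as np

# Hypothetical padded token-ID arrays for 2 rows of the DataFrame.
col1 = np.array([[1, 2, 3, 4], [2, 5, 6, 7]])  # shape (2, 4), x = 4
col2 = np.array([[2, 3, 4], [2, 5, 7]])        # shape (2, 3), y = 3
col3 = np.array([[9, 0], [8, 1]])              # shape (2, 2), z = 2

# axis=1 joins the arrays side by side, so every row becomes a single
# vector of length x + y + z.
features = np.concatenate([col1, col2, col3], axis=1)
print(features.shape)
print(features)
```

Each row of `features` is the (x+y+z,) array described in the comments, and the full array has one such row per DataFrame row. With the Keras functional API, tf.keras.layers.Concatenate would join the columns inside the model instead, which lets each column have its own input branch.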
