Classification based on categorical data

I have a dataset:

Inp1    Inp2        Output
A,B,C   AI,UI,JI    Animals
L,M,N   LI,DO,LI    Noun
X,Y     AI,UI       Extras

I need to apply an ML algorithm to these values. Which algorithm would be best suited to finding relations between these groups, so that an output class can be assigned to them?

Excisable answered 30/4, 2022 at 12:38 Comment(0)
A
1

Assuming each cell is a list (since you have multiple strings stored in each) and that you are not looking for a specific encoding, the following should work. It can also be adjusted to suit different encodings.

import pandas as pd
A = [["Inp1", "Inp2", "Inp3", "Output"],
[["A","B","C"], ["AI","UI","JI"],["Apple","Bat","Dog"],["Animals"]],
[["L","M","N"], ["LI","DO","LI"], ["Lawn", "Moon", "Noon"], ["Noun"]]]

dataframe = pd.DataFrame(A[1:], columns=A[0])

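def my_encoding(column):
    # DataFrame.apply passes one column at a time by default (axis=0),
    # so each element here is a list of strings.
    encoded_column = []
    for ls in column:
        encoded_ls = []
        for s in ls:
            # Interpret the UTF-8 bytes of the string as a little-endian integer.
            sbytes = s.encode('utf-8')
            sint = int.from_bytes(sbytes, 'little')
            encoded_ls.append(sint)
        encoded_column.append(encoded_ls)
    return encoded_column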

print(dataframe.apply(my_encoding))

Output:

           Inp1  ...               Output
0  [65, 66, 67]  ...  [32488788024979009]
1  [76, 77, 78]  ...         [1853189966]

If my assumptions are incorrect or this is not what you're looking for, let me know.

Alroi answered 3/5, 2022 at 18:15 Comment(0)
L
1

Since you mentioned that you are going to apply an ML algorithm (say, classification), I think one-hot encoding (OHE) is what you are looking for.

Requested format:

Inp1     Inp2    Inp3      Output
7,44,87  4,65,2  47,36,20  45

This format won't help you train a model, because each cell still holds multiple labels. You would have to pre-process it again, for example with OHE.

Suggested format:

A  B  C  L  M  N  X  Y  AI  DO  JI  LI  UI  Apple  Bat  Dog  Lawn  Moon  Noon  Yemen  Zombie
1  1  1  0  0  0  0  0   1   0   1   0   1      1    1    1     0     0     0      0       0
0  0  0  1  1  1  0  0   0   1   0   1   0      0    0    0     1     1     1      0       0
0  0  0  0  0  0  1  1   1   0   0   0   1      0    0    0     0     0     0      1       1

After that, you can label-encode or one-hot encode the output field, as your model requires.

Happy learning!
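As a minimal sketch of how the suggested format could be produced, assuming each cell is stored as a comma-separated string as in the question (the answer itself does not say how the table was built): `Series.str.get_dummies` splits every cell on a separator and returns one 0/1 indicator column per distinct token. The names `df`, `features`, `codes` and `classes` are illustrative only, and `pd.factorize` is just one possible way to label-encode the output column.

```python
import pandas as pd

# Toy data copied from the question; every cell is a comma-separated string.
df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp2": ["AI,UI,JI", "LI,DO,LI", "AI,UI"],
    "Output": ["Animals", "Noun", "Extras"],
})

# One indicator column per distinct token, built column by column and
# then placed side by side. Repeated tokens in a cell still give a single 1.
features = pd.concat(
    [df[col].str.get_dummies(sep=",") for col in ["Inp1", "Inp2"]],
    axis=1,
)

# Label-encode the target: integer codes plus the class name for each code.
codes, classes = pd.factorize(df["Output"])

print(features)
print(codes, list(classes))
```

If the cells are Python lists rather than strings, they would first need to be joined (for example with `",".join`) before this approach applies.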

Lousewort answered 9/5, 2022 at 5:42 Comment(2)
Hi, how can I get the OHE format shown above? - Excisable
@Excisable One easy way: pandas.pydata.org/docs/reference/api/pandas.get_dummies.html - Lousewort
B
1

Binary cross-entropy (BCE) is for multi-label classification, whereas categorical cross-entropy (CE) is for multi-class classification, where each example belongs to a single class. For your task, you need to work out whether a single example ends up in exactly one class (CE) or may end up in multiple classes (BCE). Probably the second is true, since an animal can also be a noun. ;)
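To make the difference concrete, here is a small pure-Python sketch of the two losses. The probabilities and targets are made up for illustration and do not come from any trained model; the function names are hypothetical helpers, not a library API.

```python
import math

def categorical_cross_entropy(probs, target_index):
    """Multi-class loss: probs sum to 1 and exactly one class is correct."""
    return -math.log(probs[target_index])

def binary_cross_entropy(probs, targets):
    """Multi-label loss: every class is an independent yes/no decision."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

classes = ["Animals", "Noun", "Extras"]

# Multi-class case: softmax-style output, only "Animals" is the true class.
ce = categorical_cross_entropy([0.7, 0.2, 0.1], target_index=0)

# Multi-label case: sigmoid-style outputs, "Animals" and "Noun" are both true.
bce = binary_cross_entropy([0.9, 0.8, 0.1], targets=[1, 1, 0])

print(f"CE = {ce:.4f}, BCE = {bce:.4f}")
```

In the multi-label case the target is a 0/1 vector with one entry per class, so several entries can be 1 at once; in the multi-class case the target is a single class index.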

Bushire answered 11/8, 2022 at 7:44 Comment(4)
Yes, multiple classes can be assigned. Can I get any leads on which algorithm would be suitable for an example like the one above, given training data? - Excisable
These algorithms are chosen based on the targets (outputs), under the premise of supervised ML. The training data is irrelevant to that choice, since you can always encode your features (inputs). - Bushire
Can you please elaborate more? - Excisable
OK, but this will be my last update. ML models cannot work directly with text. They convert text to numbers somehow. - Bushire
