sklearn utils compute_class_weight function for large dataset
I am training a TensorFlow Keras sequential model on roughly 20+ GB of text-based categorical data stored in a PostgreSQL database, and I need to pass class weights to the model. Here is what I am doing:

class_weights = sklearn.utils.class_weight.compute_class_weight('balanced', classes=classes, y=y)

model.fit(x, y, epochs=100, batch_size=32, class_weight=class_weights, validation_split=0.2, callbacks=[early_stopping])

Since I can't load the whole dataset into memory, I figured I could use the fit_generator method of the Keras model.

However, how can I calculate the class weights on this data? sklearn does not provide a special function for this; is it even the right tool for the job?

I thought of computing them on multiple random samples, but is there a better approach that uses the whole dataset?

Eventide answered 26/2, 2020 at 7:34 Comment(0)
You can use generators and still compute the class weights.

Let's say you have a generator like this:

train_generator = train_datagen.flow_from_directory(
        'train_directory',
        target_size=(224, 224),
        batch_size=32,
        class_mode = "categorical"
        )

and the class weights for the training set can be computed like this (with keyword arguments, as recent scikit-learn versions require):

from sklearn.utils import class_weight
import numpy as np

class_weights = class_weight.compute_class_weight(
        class_weight='balanced',
        classes=np.unique(train_generator.classes),
        y=train_generator.classes)
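One detail worth noting: compute_class_weight returns a NumPy array, while Keras' model.fit(class_weight=...) expects a dict mapping class index to weight. A minimal sketch of the conversion, using toy labels in place of train_generator.classes and assuming a scikit-learn version where the arguments are keyword-only:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels standing in for train_generator.classes
y = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2])
classes = np.unique(y)

# 'balanced' weights each class as n_samples / (n_classes * count)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# Convert the array to the dict shape Keras expects
class_weights = dict(zip(classes, weights))
```

With these counts (4, 2, 3 over 9 samples), the rarest class gets the largest weight.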

[EDIT 1] Since you mentioned PostgreSQL in the comments, I am adding a prototype answer here.

First fetch the count for each class from PostgreSQL with a separate query and use it to compute the class weights manually. The basic logic: the least frequent class gets weight 1, and every other class gets a weight < 1 based on its count relative to that least frequent class.

For example, if you have 3 classes A, B, C with counts 100, 200, 150, the class weights become {A: 1, B: 0.5, C: 0.66}.

Let's compute it manually after fetching the values from PostgreSQL.

[Query]

cur.execute("SELECT class, count(*) FROM table GROUP BY class ORDER BY 2")
rows = cur.fetchall()

The above query returns (class name, count) tuples ordered from the smallest count to the largest.

Then the code below builds the class-weights dictionary:

class_weights = {}
for row in rows:
    class_weights[row[0]] = rows[0][1] / row[1]
    # dividing the least count by the current count,
    # so that the least frequent class gets weight 1
    # and the others get weights < 1
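To tie the loop above back to the A/B/C example, here is a self-contained sketch with rows hard-coded as the (class, count) tuples the query would return, already ordered by ascending count:

```python
# (class, count) tuples as returned by the query, ordered by ascending count
rows = [('A', 100), ('C', 150), ('B', 200)]

class_weights = {}
for cls, count in rows:
    # least count / current count: the rarest class gets 1, the rest < 1
    class_weights[cls] = rows[0][1] / count

# class_weights -> {'A': 1.0, 'C': 0.666..., 'B': 0.5}
```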
Sclerous answered 26/2, 2020 at 7:38 Comment(2)
This method seems to take class names from directory names. I am working on text data. If I write my own generator to yield values read from a PostgreSQL connection, I need to iterate over them; however, compute_class_weight expects the whole list of training labels. Is there any way to deal with this?Eventide
You should have mentioned Postgres in your question. How about writing a query to the Postgres DB to get the count of each class (SELECT count(*) FROM table GROUP BY class) and using the result to compute the class weights?Sclerous
sklearn isn't built for large-scale processing like this. Ideally you should implement the computation yourself, especially when it is part of a pipeline you run regularly.
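One way to "implement it ourselves" for data that doesn't fit in memory is a single streaming pass: tally label counts batch by batch, then apply the same formula sklearn's 'balanced' mode uses, n_samples / (n_classes * count). A minimal sketch, with toy in-memory batches standing in for database cursor fetches:

```python
from collections import Counter

def streaming_class_weights(label_batches):
    """Compute 'balanced'-style class weights from an iterable of label batches."""
    counts = Counter()
    for batch in label_batches:   # each batch is any iterable of labels
        counts.update(batch)
    total = sum(counts.values())
    n_classes = len(counts)
    # Same formula as sklearn's 'balanced' mode: n_samples / (n_classes * count)
    return {cls: total / (n_classes * c) for cls, c in counts.items()}

# Toy usage: two batches standing in for successive DB fetches
weights = streaming_class_weights([[0, 0, 1], [0, 0, 1, 2, 2, 2]])
```

Because only the running counts are kept, memory stays constant no matter how large the table is.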

Erlindaerline answered 24/11, 2023 at 9:48 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Volsci

© 2022 - 2024 — McMap. All rights reserved.