"max_df corresponds to < documents than min_df" error in Ridge classifier

2

11

I trained a Ridge classifier on a large amount of data, using a TF-IDF vectorizer to vectorize it, and it used to work fine. But now I am getting this error:

'max_df corresponds to < documents than min_df'

The data is stored in MongoDB.
I tried various options to solve it, and finally, when I deleted a MongoDB collection that had only 1 document (1 record), it worked normally and completed the training as usual.

But I need a solution which does not require deleting the record as I need that record.

Also, I don't understand the error, as it occurs only on my machine. The script used to work fine on my system even while this record was present in the DB, and it works fine on other systems as well.

Could someone help please?

Hrutkay answered 3/10, 2016 at 9:26 Comment(0)
22

That error is telling you that your max_df value, once converted to a number of documents, is less than your min_df value. For example:

max_df = 0.7 # Removes terms whose DF is higher than 70% of the documents

min_df = 5 # Terms must have DF >= 5 to be considered

Suppose that the total number of documents in your corpus is 7. Then max_df becomes 0.7 * 7 = 4.9 while min_df is still 5, so max_df < min_df. That should never happen, because it means that 0 terms would be considered: no term can have a DF that is at most 4.9 and at least 5 at the same time.
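The arithmetic above can be checked in a few lines of plain Python. This is only a sketch of the rule as described in this answer, not scikit-learn's own code, and the term names and document frequencies in document_frequencies are made up for illustration.

```python
# Values from the example above: 7 documents, max_df as a proportion,
# min_df as an absolute document count.
n_docs = 7
max_df = 0.7
min_df = 5

max_doc_count = max_df * n_docs   # roughly 4.9 documents
min_doc_count = min_df            # 5 documents

# This is the condition that the error message describes.
conflict = max_doc_count < min_doc_count

# Hypothetical document frequencies, only to show that nothing can survive.
document_frequencies = {"alpha": 1, "beta": 4, "gamma": 5, "delta": 7}
kept = [
    term
    for term, df in document_frequencies.items()
    if min_doc_count <= df <= max_doc_count
]

print("conflict:", conflict)
print("terms kept:", kept)
```

Whatever frequencies are put in the dictionary, the kept list stays empty, because the lower bound lies above the upper bound.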

Cirque answered 26/7, 2018 at 19:3 Comment(2)
Hi @Andres: I have the same error. My corpus has 52428 documents in total, my min_df is 10 and my max_df is 1. So my max_df should be 52428, which is bigger than 10, but I still get the same error. Weever
Check whether you are passing an integer value as the parameter: integer values are interpreted as absolute counts. If you mean 100% of the documents, you have to pass 1.0 (a float) as the parameter. Cirque
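To illustrate the integer versus float distinction from the comment above, here is a minimal sketch. The helper to_doc_count is hypothetical (it is not part of scikit-learn) and only mimics the interpretation described here: integers are absolute document counts, floats are proportions of the corpus.

```python
from numbers import Integral


def to_doc_count(value, n_docs):
    """Hypothetical helper: turn a min_df/max_df setting into a document count."""
    if isinstance(value, Integral):
        return value           # integer: absolute number of documents
    return value * n_docs      # float: proportion of the corpus


n_docs = 52428
min_df = 10

as_int = to_doc_count(1, n_docs)      # max_df=1 means "at most 1 document"
as_float = to_doc_count(1.0, n_docs)  # max_df=1.0 means "at most 100% of documents"

print("max_df=1   conflicts with min_df=10:", as_int < min_df)
print("max_df=1.0 conflicts with min_df=10:", as_float < min_df)
```

With max_df=1 the upper bound is a single document, which is below min_df=10, so the error appears no matter how large the corpus is.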
1

After conversion to document counts, max_df must never be less than min_df. A float is treated as a proportion of the corpus and an integer as an absolute number of documents, so small max_df fractions such as 0.009 or 0.0009 only make sense when the corpus is very large (thousands, millions or billions of documents).

max_df = 0.9 # default from sklearn is 1.0
min_df = 10  # can be raised or lowered to tune how rare a term may be

The two values must not cross. For example, if the corpus has 1000 documents, max_df is 0.9 and min_df is 10, then the effective max_df is 0.9 * 1000 = 900 and min_df is 10, so max_df is still greater than min_df and the vectorizer works.

If max_df is lowered to 0.009, the effective max_df becomes 0.009 * 1000 = 9, which is less than min_df = 10, and that is exactly what raises the error. To use such a small max_df, either lower min_df or use a larger corpus.

# Experiment 1: raises the error on a 1000-document corpus (9 < 10)
max_df = 0.009
min_df = 10

# Experiment 2: raises the error on a 1000-document corpus (90 < 100)
max_df = 0.09
min_df = 100
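Since the error is raised whenever max_df * n_docs is less than min_df, it can help to compute the smallest corpus size at which a given pair of settings is valid. The sketch below assumes min_df is an integer count and max_df is a float proportion; min_corpus_size is a hypothetical helper, not a scikit-learn function.

```python
import math


def min_corpus_size(min_df: int, max_df: float) -> int:
    """Smallest n_docs for which max_df * n_docs >= min_df (hypothetical helper)."""
    if not 0.0 < max_df <= 1.0:
        raise ValueError("max_df must be a proportion in (0, 1]")
    n_docs = max(math.ceil(min_df / max_df), 1)
    # Correct for floating-point rounding in either direction.
    while max_df * n_docs < min_df:
        n_docs += 1
    while n_docs > 1 and max_df * (n_docs - 1) >= min_df:
        n_docs -= 1
    return n_docs


for min_df, max_df in [(10, 0.9), (10, 0.009), (100, 0.09)]:
    print(f"min_df={min_df}, max_df={max_df}: "
          f"needs at least {min_corpus_size(min_df, max_df)} documents")
```

Any corpus smaller than the reported size, such as a collection holding a single document, would trigger the error with those settings.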
Nevers answered 2/8, 2021 at 14:17 Comment(0)
