Save progress between multiple instances of partial_fit in Python SGDClassifier

I've successfully followed this example for my own text classification script.

The problem is that, unlike the example, I'm not processing chunks of a single large, already-existing data set in one loop of partial_fit calls. I want to be able to add data as it becomes available, even if my Python script is shut down in the meantime.

Ideally I'd like to do something like this:

sometime in 2015:
    model2015 = partial_fit(dataset2015)
    save_to_file(model2015)
    shut down my python script

sometime in 2016:
    open my python script again
    load_from_file(model2015)
    partial_fit(dataset2016 incorporating model2015)
    save_to_file(model2016)

sometime in 2017:
    open my python script again
    etc...

Is there any way I can do this in scikit-learn? Or in some other package (Tensorflow perhaps)?

Tweed answered 26/2, 2016 at 22:05

Simply pickle your model and save it to disk. The other way is to dump the .coef_ and .intercept_ fields (which are just two arrays) and pass them as initializers when you call .fit.
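A minimal sketch of both options, using made-up file names and synthetic data for illustration:

```python
import pickle
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1, 2])            # the full label set, known up front
X2015, y2015 = rng.rand(30, 5), np.arange(30) % 3
X2016, y2016 = rng.rand(30, 5), np.arange(30) % 3

# Option 1: pickle the whole estimator, then keep calling partial_fit later.
clf = SGDClassifier()
clf.partial_fit(X2015, y2015, classes=classes)
with open("model2015.pkl", "wb") as f:
    pickle.dump(clf, f)

# ...later, possibly in a new process...
with open("model2015.pkl", "rb") as f:
    clf = pickle.load(f)
clf.partial_fit(X2016, y2016)            # continues from the saved weights

# Option 2: save only the weight arrays and feed them back into fit(),
# which accepts coef_init and intercept_init.
coef, intercept = clf.coef_.copy(), clf.intercept_.copy()
clf2 = SGDClassifier()
clf2.fit(X2016, y2016, coef_init=coef, intercept_init=intercept)
```

Note that with option 2, fit() expects coef_init to match the number of classes present in the new data, which is what the comments below run into.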

Anticline answered 26/2, 2016 at 22:16
Pickle and joblib definitely don't work when you want to fit again (they do work when you just want to predict): they get overridden by a new call to fit or partial_fit. But I'll take a look at using .fit (I guess partial_fit isn't really necessary) with the .coef_ and .intercept_ from the previous iteration; that sounds like it might work. Thank you! – Tweed
OK, using the .coef_ and .intercept_ arguments did improve accuracy when I called fit again, so this seems to be doing what I wanted. It did create a new problem: the different data sets can have differing numbers of categories (classes). With partial_fit I could simply tell it to use all possible categories (I have a file with a list of them), but with fit I can't, and I get an error when the .coef_ array has a different size than the number of categories in the current data set. So I need a way to set classes in fit like in partial_fit, or a way to set coef and intercept in partial_fit. – Tweed
A "hack" solution is to create dummy points with the remaining labels (for example with "0" on all features) to force it to accept the labeling. Furthermore, as both coef and intercept are class-number dependent, you could just start from vectors of a size corresponding to your whole label set (so you call the first fit with random weights as inits, but with a predefined size). – Anticline
The hack solution worked (and it's the only thing that works, it seems): pasting artificial examples to the front of the training data set does the trick. Problem solved. – Tweed
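The workaround from the comments above can be sketched like this; the label set, feature count, and "previous" weights are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = np.array([0, 1, 2, 3])   # full label set, e.g. read from a file
n_features = 4

# This year's data happens to contain only classes 0 and 1.
rng = np.random.RandomState(0)
X_new = rng.rand(20, n_features)
y_new = np.arange(20) % 2

# One all-zero dummy row per class forces fit() to see every label,
# so coef_init with shape (n_classes, n_features) is accepted.
X_dummy = np.zeros((len(all_classes), n_features))
X_train = np.vstack([X_dummy, X_new])
y_train = np.concatenate([all_classes, y_new])

# Weights saved from a previous model (random here for illustration).
prev_coef = rng.rand(len(all_classes), n_features)
prev_intercept = np.zeros(len(all_classes))

clf = SGDClassifier()
clf.fit(X_train, y_train, coef_init=prev_coef, intercept_init=prev_intercept)
```

The dummy rows slightly distort the training data, so in practice you may want to weigh that against simply pickling the estimator and sticking with partial_fit, which accepts the full class list directly.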
@lejlot, when I loaded just the coef and intercepts for the new model like here: ideone.com/mxMKJj, I got the error "provided coef_init does not match dataset". x2 and y2 are my new data. How did this work for you? Could you please share more details? – Annexation
@Annexation if you have a question, please ask a question. This is not a comment to the answer. – Anticline
Hi @lejlot, sorry, I asked it here too: #49133688. Please have a look. – Annexation
