Does Vowpal Wabbit shuffle data in multiple online passes?

Does Vowpal Wabbit automatically shuffle its data after every epoch/pass? I'm hoping the created cache file will contain the shuffling meta-data that is necessary for online algorithms like VW's default online SGD method. E.g.

vw -d train.txt -c --passes 50 -f train.model

If not, I have a backup script that manually shuffles the data before every pass:

# Pass 1: create the initial regressor file
vw -d train.txt -f train.model
# Passes 2-50: shuffle, then update the regressor file
for i in {1..49}
do
    # Shuffle step (GNU shuf shown as a stand-in for any shuffling script)
    shuf train.txt > shuffled_data.txt
    # Continue training from the saved regressor and overwrite it
    vw -d shuffled_data.txt -i train.model -f train.model
done

If VW doesn't shuffle automatically, is there a more efficient way to do what the code block above does? Unfortunately, VW's wiki is unclear on this point. Thanks.

Rickierickman asked 6/1, 2014 at 0:42. Comments (5):
I was seconds away from voting to close when I checked the tag! - Thalamencephalon
Why? What did I do wrong? - Rickierickman
Nothing! I was just ignorant about the existence of the library being asked about, good question +1 - Thalamencephalon
It is that "vowpal wabbit" name. Strikes humor (and skepticism?) in the mind of the previously uninitiated. - Unloose
Shuffling like this seems to me like it will just overfit the training set. - Slideaction

No, it doesn't shuffle. I'd bet it's not worth shuffling the data either, because shuffling is very I/O intensive. Two passes with different shuffle orders might converge better than two passes without shuffling, but the shuffling itself probably costs about as much as 10 unshuffled passes.
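If example order does turn out to matter for a given dataset, a cheaper middle ground than reshuffling before every pass is to shuffle once and let the cache serve all passes. The following is only a sketch under stated assumptions: GNU `shuf` is installed, each example occupies exactly one line, and `train.shuf.txt` and `shuffle_once` are hypothetical names. The `vw` call is skipped when `vw` is not on the PATH, and `-k` is added so that a stale cache from an earlier run is not reused.

```shell
#!/bin/sh
# Shuffle the training file once, then run every pass over the shuffled copy.
# Assumes one example per line and GNU shuf (which shuffles in memory).

shuffle_once() {
    # $1: input file, $2: output file (hypothetical names chosen by the caller)
    shuf "$1" > "$2"
}

if [ -f train.txt ]; then
    shuffle_once train.txt train.shuf.txt
    if command -v vw >/dev/null 2>&1; then
        # Same flags as in the question, plus -k to rebuild the cache
        # from the freshly shuffled file instead of reusing an old one.
        vw -d train.shuf.txt -c -k --passes 50 -f train.model
    fi
fi
```

This pays the shuffling I/O cost once rather than 49 times, at the price of every pass seeing the same (random) order.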

Henze answered 6/1, 2014 at 2:39. Comments (2):
Would that rule of thumb extend to cases where there are millions of examples with sparse features? If I don't shuffle, the algorithm seems to have already converged on the very first pass, no matter how many passes I perform. - Rickierickman
If VW (run from the shell) doesn't shuffle the training data, maybe that's why I get almost zero accuracy when I give it a training file with ordered labels. However, when I launch VW through Python sklearn, it seems to shuffle, because the accuracy is just fine. - Leastways
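
The ordered-labels symptom described in the last comment can be checked before training. A rough sketch, assuming the label is the first whitespace-separated token on each line, as in VW's plain-text input format; `label_runs` and `distinct_labels` are hypothetical helper names. If the two counts are equal, the file is fully grouped by label, which is the case where a one-time shuffle is most likely to help.

```shell
#!/bin/sh
# Compare the number of consecutive label runs with the number of distinct labels.
# Assumes the label is the first whitespace-separated field of every line.

label_runs() {
    # uniq collapses only adjacent duplicates, so this counts runs of equal labels
    awk '{print $1}' "$1" | uniq | wc -l
}

distinct_labels() {
    # sort -u counts each label once regardless of position
    awk '{print $1}' "$1" | sort -u | wc -l
}

if [ -f train.txt ]; then
    echo "label runs: $(label_runs train.txt), distinct labels: $(distinct_labels train.txt)"
fi
```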
