Text classification with scikit-learn and a large dataset
First of all, I started with Python yesterday. I'm trying to do text classification with scikit-learn and a large dataset (250,000 tweets). For the algorithm, every tweet is represented as a 4000 x 1 vector, so the input is 250,000 rows by 4000 columns. When I try to construct this in Python, I run out of memory after 8,500 tweets (when building a list and appending to it), and when I preallocate the memory I just get a MemoryError (np.zeros((250000, 4000))). Is scikit-learn not able to work with datasets this large? Am I doing something wrong (it is only my second day with Python)? Is there another way of representing the features so that they fit in memory?

edit: I want to use Bernoulli NB.

edit2: Maybe it is possible with online learning? Read a tweet, let the model learn from it, remove it from memory, read another, let the model learn... but I don't think Bernoulli NB allows for online learning in scikit-learn.
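For what it's worth, later scikit-learn releases do expose a partial_fit method on BernoulliNB, so an out-of-core loop like the one described is possible there. A rough sketch, assuming such a version and using HashingVectorizer so the feature space stays fixed without building a vocabulary (tweet_batches and all_classes are placeholders for your own data loading):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# HashingVectorizer needs no fitting, so batches can be transformed independently
vectorizer = HashingVectorizer(n_features=4000, binary=True, norm=None, alternate_sign=False)
clf = BernoulliNB()

for texts, labels in tweet_batches:            # e.g. 10,000 tweets per batch
    X_batch = vectorizer.transform(texts)      # sparse matrix, small memory footprint
    clf.partial_fit(X_batch, labels, classes=all_classes)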

Govern answered 6/12, 2012 at 10:20 Comment(0)
I assume that these 4000 x 1 vectors are bag-of-words representations. If that is the case, then that 250,000 by 4000 matrix contains mostly zeros, because each tweet contains only a handful of words. Such matrices are called sparse matrices, and there are efficient ways of storing them in memory. See the SciPy documentation and the scikit-learn documentation for sparse matrices to get started; if you need more help after reading those links, post again.
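To see why this matters: a dense 250,000 x 4000 array of 8-byte floats needs roughly 8 GB, while a sparse matrix stores only the nonzero entries. A minimal sketch with scipy.sparse (the indices below are made up purely to show the construction):

import numpy as np
from scipy.sparse import csr_matrix

# each tweet contributes only the indices of the words it actually contains
rows = np.array([0, 0, 1, 2])        # tweet index of each nonzero entry
cols = np.array([5, 120, 7, 3999])   # word index of each nonzero entry
data = np.ones(len(rows))            # 1 = word present (bag-of-words)

X = csr_matrix((data, (rows, cols)), shape=(250000, 4000))
print(X.shape, X.nnz)  # full logical shape, but only 4 values actually stored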

Seicento answered 6/12, 2012 at 10:27 Comment(1)
Actually, the scikits.sparse package is irrelevant to this problem, and scikit-learn includes quite a lot of functionality to hide the complexity of scipy.sparse from the user, especially in the case of document classification. -1 for suggesting the OP roll their own.Sclerenchyma
If you use scikit-learn's vectorizers (CountVectorizer or TfidfVectorizer are good as a first attempt), you get a sparse matrix representation. From the documentation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)  # returns a scipy.sparse matrix

clf = BernoulliNB()  # initialize your classifier
clf.fit(X_train, y_train)
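A quick way to check that the result really is sparse, plus a hypothetical follow-up for prediction (data_test is assumed to mirror data_train):

print(type(X_train))                 # scipy.sparse matrix, not a dense ndarray
print(X_train.shape, X_train.nnz)    # logical shape vs. stored nonzeros

X_test = vectorizer.transform(data_test.data)   # reuse the fitted vocabulary
predicted = clf.predict(X_test)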
Giddy answered 6/12, 2012 at 11:47 Comment(1)
Also worth mentioning the text feature extraction documentation, which explicitly deals with sparsity issues in text data.Tattle