I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to documents with different lengths. My goal is to compute the top tfidf for each document independently, but I keep getting this error message:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.
2000 is the size of the training data. This is the main code:
book_summary= Pipeline([
('selector', ItemSelector(key='book')),
('tfidf', TfidfVectorizer(analyzer='word', ngram_range(1,3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])
book_contents= Pipeline([('selector3', book_content_count())])
ppl = Pipeline([
('feats', FeatureUnion([
('book_summary', book_summary),
('book_contents', book_contents)])),
('clf', SVC(kernel='linear', class_weight='balanced') ) # classifier with cross fold 5
])
I wrote two classes to handle each pipeline function. My problem is with book_contents pipeline which is mainly dealing with each sample and return TFidf matrix for each book independently.
class book_content_count():
def count_contents2(self, bookid):
book = open('C:/TheCorpus/'+str(int(bookid))+'_book.csv', 'r')
book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1',error_bad_lines=False,dtype=str)
corpus=(str([user_data['text']]).strip('[]'))
return corpus
def transform(self, data_dict, y=None):
data_dict['bookid'] #from here take the name
text=data_dict['bookid'].apply(self.count_contents2)
vec_pipe= Pipeline([('vec', TfidfVectorizer(min_df = 1,lowercase = False, ngram_range = (1,1), use_idf = True, stop_words='english'))])
Xtr = vec_pipe.fit_transform(text)
return Xtr
def fit(self, x, y=None):
return self
Sample of data (example):
title Summary bookid
The beauty and the beast is a traditional fairy tale... 10
ocean at the end of the lane is a 2013 novel by British 11
Then each id will refer to a text file with the actual contents of these books
I have tried toarray
and reshape
functions but with no luck. Any idea how to solve this issue.
Thanks
book_content_count()
, which I cannot identify from your code. – Kazim'book_summary'
will work on whole training data and return a matrix of 2000 rows. But your second part'book_contents'
will return only a single row. How will you combine such data ? – Backstroke