How to combine features with different dimensions output using scikit-learn

I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to documents with different lengths. My goal is to compute the top tfidf for each document independently, but I keep getting this error message:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.

2000 is the size of the training data. This is the main code:

book_summary= Pipeline([
   ('selector', ItemSelector(key='book')),
   ('tfidf', TfidfVectorizer(analyzer='word', ngram_range(1,3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])

book_contents= Pipeline([('selector3', book_content_count())]) 

ppl = Pipeline([
    ('feats', FeatureUnion([
         ('book_summary', book_summary),
         ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced') ) # classifier with cross fold 5
])

I wrote two classes to handle each pipeline function. My problem is with book_contents pipeline which is mainly dealing with each sample and return TFidf matrix for each book independently.

class book_content_count(): 
  def count_contents2(self, bookid):
        book = open('C:/TheCorpus/'+str(int(bookid))+'_book.csv', 'r')       
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1',error_bad_lines=False,dtype=str)
                      corpus=(str([user_data['text']]).strip('[]')) 
        return corpus

    def transform(self, data_dict, y=None):
        data_dict['bookid'] #from here take the name 
        text=data_dict['bookid'].apply(self.count_contents2)
        vec_pipe= Pipeline([('vec', TfidfVectorizer(min_df = 1,lowercase = False, ngram_range = (1,1), use_idf = True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self

Sample of data (example):

title                         Summary                          bookid
The beauty and the beast      is a traditional fairy tale...    10
ocean at the end of the lane  is a 2013 novel by British        11

Then each id will refer to a text file with the actual contents of these books

I have tried toarray and reshape functions but with no luck. Any idea how to solve this issue. Thanks

class BookContentCount(BaseStep): def transform(self, data_dict, y=None): transformed = do_things(...) # be sure to use SKLearnWrapper if you wrap sklearn items. return transformed def fit(self, x, y=None): return self

3. Create a joiner to join the resuts of the feature union the way you wish:

class CustomJoiner(NonFittableMixin, BaseStep): def __init__(self): BaseStep.__init__(self) NonFittableMixin.__init__(self) # def fit: is inherited from `NonFittableMixin` and simply returns self. def transform(self, data_inputs): # TODO: insert your own concatenation method here. result = np.concatenate(data_inputs, axis=-1) return result

4. Finally create your pipeline by passing the joiner to the FeatureUnion:

book_summary= Pipeline([ ItemSelector(key='book'), TfidfVectorizer(analyzer='word', ngram_range(1,3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True) ]) p = Pipeline([ FeatureUnion([ book_summary, BookContentCount() ], joiner=CustomJoiner() ), SVC(kernel='linear', class_weight='balanced') ])

Note: if you want your Neuraxle pipeline to become a scikit-learn pipeline back, you can do p = p.tosklearn().

1. Import Neuraxle's classes.

2. Define your custom class by inheriting from BaseStep:

3. Create a joiner to join the resuts of the feature union the way you wish:

4. Finally create your pipeline by passing the joiner to the FeatureUnion:

Recommended topics

Hot tags