How to combine features with different dimensions output using scikit-learn
I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to documents of different lengths. My goal is to compute the top TF-IDF features for each document independently, but I keep getting this error message:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.

2000 is the size of the training data. This is the main code:

book_summary = Pipeline([
   ('selector', ItemSelector(key='book')),
   ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])

book_contents= Pipeline([('selector3', book_content_count())]) 

ppl = Pipeline([
    ('feats', FeatureUnion([
         ('book_summary', book_summary),
         ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced') ) # classifier with cross fold 5
]) 

I wrote two classes to handle each pipeline function. My problem is with the book_contents pipeline, which mainly deals with each sample and returns a TF-IDF matrix for each book independently.

class book_content_count():
    def count_contents2(self, bookid):
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', error_bad_lines=False, dtype=str)
        corpus = str([book_data['text']]).strip('[]')
        return corpus

    def transform(self, data_dict, y=None):
        # the 'bookid' column holds the file name to load
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self

Sample of data (example):

title                         Summary                          bookid
The beauty and the beast      is a traditional fairy tale...    10
ocean at the end of the lane  is a 2013 novel by British        11

Each bookid then refers to a text file with the actual contents of that book.

I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this issue? Thanks

Tonietonight answered 20/5, 2018 at 12:4 Comment(11)
Can you please provide some sample data? – Backstroke
I've added an example of the data. – Tonietonight
Ideally, you would provide a minimal working example that reproduces your error. Currently you refer to book_content_count(), which I cannot identify from your code. – Kazim
This cannot be done inside a FeatureUnion. It uses numpy.hstack internally, which requires the number of rows to be equal for all parts. The first part here, 'book_summary', will work on the whole training data and return a matrix of 2000 rows. But your second part, 'book_contents', will return only a single row. How will you combine such data? – Backstroke
Now, if you can first calculate the individual tfidf of each book, then combine them into 2000 rows, and then send that to FeatureUnion, it will work. But then each tfidf matrix will have different columns (feature words detected by the Vectorizer) depending on the book contents, so how will you stack them? – Backstroke
@VivekKumar That's right! By limiting the number of top features to, let's say, 10 features per book's contents. My main issue is with combining such matrices. How to do so? – Tonietonight
But then again: let's say the first book has top features "this", "that", "something", "random", etc., but the other book has top features "other", "random", "this", "only". Then what to do? I mean, the top 10 features of one book may not be the same as the top 10 of another book. Combining them will not make any sense. – Backstroke
@DavidDale book_content_count() is the class for transforming and fitting book contents. – Tonietonight
@VivekKumar The main purpose of this is to get the TFIDF value for each book independently. I know it is going to produce a sparse matrix for each book. – Tonietonight
Let us continue this discussion in chat. – Backstroke
Just curious, did you manage to find a solution? Note that "tfidf for each document independently" is equivalent to a CountVectorizer. – Buddha

You can use Neuraxle's FeatureUnion with a custom joiner that you would need to code yourself. The joiner is a class passed to Neuraxle's FeatureUnion to merge the results together in the way you expect.

1. Import Neuraxle's classes.

import numpy as np

from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion

2. Define your custom class by inheriting from BaseStep:

class BookContentCount(BaseStep):
    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items
        return transformed

    def fit(self, x, y=None):
        return self
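As a concrete sketch of what such a transform step could return to avoid the original row-dimension error (this is an illustrative assumption, not part of the answer's API): if each book is reduced to a fixed-length vector of its top-k TF-IDF scores, every sample contributes exactly one row of the same width, so the blocks become stackable. Note that this keeps only the scores, not which terms they belong to, so columns are not aligned across books, which is the concern raised in the comments. The helper `top_k_scores` below is hypothetical:

```python
import numpy as np

def top_k_scores(tfidf_row, k=10):
    """Return the k largest TF-IDF scores of one document as a
    fixed-length, descending-sorted vector, zero-padded when the
    document has fewer than k terms. The fixed length means every
    book contributes exactly one row of k columns, which is what
    FeatureUnion-style stacking requires."""
    scores = np.sort(tfidf_row)[::-1]  # sort descending
    out = np.zeros(k)
    n = min(k, scores.size)
    out[:n] = scores[:n]
    return out

# Example: two "books" with different vocabulary sizes both map
# to rows of the same width, so they can be stacked.
book_a = np.array([0.9, 0.1, 0.5])
book_b = np.array([0.7, 0.2, 0.4, 0.6, 0.3])
stacked = np.vstack([top_k_scores(book_a, k=4),
                     top_k_scores(book_b, k=4)])
```

Here `stacked` has shape (2, 4) even though the two books have 3 and 5 terms respectively.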

3. Create a joiner to join the results of the feature union the way you wish:

class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
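One possible body for that concatenation method, sketched with plain NumPy (the zero-padding strategy is an assumption of this sketch, not something Neuraxle prescribes): right-pad every block with zero columns up to the widest block's width, so blocks whose feature counts differ can still be merged column-wise. Row counts must still agree across blocks:

```python
import numpy as np

def pad_and_concat(blocks):
    """Hypothetical joiner body: right-pad each 2-D block with zero
    columns up to the width of the widest block, then concatenate
    column-wise. Blocks may differ in column count, but every block
    must have the same number of rows (one per sample)."""
    n_rows = blocks[0].shape[0]
    assert all(b.shape[0] == n_rows for b in blocks), "row counts must agree"
    width = max(b.shape[1] for b in blocks)
    padded = [np.pad(b, ((0, 0), (0, width - b.shape[1]))) for b in blocks]
    return np.concatenate(padded, axis=-1)

# Example: a 3-column block and a 1-column block for the same 2 samples.
a = np.ones((2, 3))
b = np.ones((2, 1))
merged = pad_and_concat([a, b])  # b is padded to 3 columns, giving 6 total
```

With sparse TF-IDF matrices you would first densify (toarray) or use scipy.sparse equivalents instead of np.pad/np.concatenate.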

4. Finally create your pipeline by passing the joiner to the FeatureUnion:

book_summary = Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])

p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ], 
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
]) 

Note: if you want to convert your Neuraxle pipeline back into a scikit-learn pipeline, you can do p = p.tosklearn().

To learn more on Neuraxle: https://github.com/Neuraxio/Neuraxle

More examples from the documentation: https://www.neuraxle.org/stable/examples/index.html

Rule answered 25/8, 2019 at 18:52 Comment(0)
