Unable to use FeatureUnion in scikit-learn due to different dimensions

I'm trying to use FeatureUnion to extract different features from a data structure, but it fails because of mismatched dimensions:

    ValueError: blocks[0,:] has incompatible row dimensions


Implementation

My FeatureUnion is built the following way:

    features = FeatureUnion([
        ('f1', Pipeline([
            ('get', GetItemTransformer('f1')),
            ('transform', vectorizer_f1)
        ])),
        ('f2', Pipeline([
            ('get', GetItemTransformer('f2')),
            ('transform', vectorizer_f1)
        ]))
    ])

GetItemTransformer is used to get different parts of the data out of the same structure. The idea is described in the scikit-learn issue tracker.

The structure itself is stored as {'f1': data_f1, 'f2': data_f2}, where data_f1 and data_f2 are lists with different lengths.
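For reference, a minimal sketch of what such a selector might look like. The question does not show the implementation of GetItemTransformer, so the class body and the sample data below are assumptions made for illustration. FeatureUnion stacks the outputs of its branches side by side, so every field has to yield one entry per sample.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class GetItemTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: select one field from a dict-like container."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        return X[self.field]


# Made-up sample data: both fields describe the same three samples.
data = {
    'f1': ['red apple', 'green pear', 'red cherry'],
    'f2': ['sweet', 'sour', 'sweet and sour'],
}

f1 = GetItemTransformer('f1').transform(data)
f2 = GetItemTransformer('f2').transform(data)

# The branches of a FeatureUnion can only be combined if this holds.
same_length = len(f1) == len(f2)
```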


Question

I assume the error occurs because the y vector has a different length than the data fields, but how can I scale the vector so that it fits both cases?
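The row mismatch can be reproduced outside scikit-learn, which may help narrow down where it comes from. The sketch below assumes the feature blocks are combined with scipy.sparse.hstack; the block sizes are made up, and truncating the longer block is only one possible way to align them, not necessarily the right one for this data.

```python
import numpy as np
from scipy import sparse

# Two feature blocks built from fields of different lengths (3 vs. 5 rows).
block_f1 = sparse.csr_matrix(np.ones((3, 2)))
block_f2 = sparse.csr_matrix(np.ones((5, 4)))

try:
    sparse.hstack([block_f1, block_f2])
    raised = False
except ValueError:
    # Horizontal stacking needs the same number of rows in every block.
    raised = True

# Once both blocks cover the same samples, stacking succeeds.
aligned = sparse.hstack([block_f1, block_f2[:3]])
```

Note that y plays no part in this step: the blocks only have to agree with each other, and with y afterwards when the estimator is fitted.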

Goldsworthy answered 11/9, 2014 at 19:22 Comment(1)
A short and ugly solution would be to concat data_f1 and data_f2 to the length of data_f2 and set the length of the y vector to that of data_f2. (Goldsworthy)

Here's what worked for me:

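    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import FeatureUnion, Pipeline

    class ArrayCaster(BaseEstimator, TransformerMixin):
        """Turn a 1-D sequence of n values into an (n, 1) column."""
        def fit(self, x, y=None):
            return self

        def transform(self, data):
            # Same (n, 1) shape as np.transpose(np.matrix(data)) for 1-D input,
            # but returned as a plain ndarray.
            return np.asarray(data).reshape(-1, 1)

    # ItemSelector is not defined in this answer; it is assumed to pick
    # one field out of the input by key.
    FeatureUnion([
        ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('vect', CountVectorizer(ngram_range=(1, 1), binary=True, min_df=3)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('other data', Pipeline([
            ('selector', ItemSelector(key='has_foriegn_char')),
            ('caster', ArrayCaster()),
        ])),
    ])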
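To check the idea end to end, here is a small self-contained sketch. The ItemSelector body, the sample data, and the use of reshape(-1, 1) in place of np.matrix are assumptions for illustration; the text branch is shortened to a plain CountVectorizer so that it works on three tiny documents.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: pick one field out of a dict of columns."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


class ArrayCaster(TransformerMixin, BaseEstimator):
    """Cast a 1-D sequence of n values to an (n, 1) float column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float).reshape(-1, 1)


# Made-up data: three samples, one text field and one numeric flag.
data = {
    'text': ['plain ascii text', 'some accented text', 'more plain text'],
    'has_foriegn_char': [0, 1, 0],
}

union = FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('vect', CountVectorizer()),
    ])),
    ('other data', Pipeline([
        ('selector', ItemSelector(key='has_foriegn_char')),
        ('caster', ArrayCaster()),
    ])),
])

combined = union.fit_transform(data)

# The flag ends up as the last column, next to the 6 vocabulary columns.
flag_column = combined.toarray()[:, -1].tolist()
```

Without the caster, the second branch would return a flat sequence of length 3 instead of a (3, 1) column, and the two branches could not be stacked.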
Fairlie answered 14/10, 2016 at 3:23 Comment(0)

I don't know if this applies to your question, but we ran into the same error in a slightly different situation and just solved it.

Our f1 entries were each lists of 15 numeric values and we needed to do tf-idf on f2. This generated the same error about incompatible row dimensions.

After running it through the debugger, we found that the shapes of our matrices were subtly different going into the hstack() call in FeatureUnion: (2659,) and (2659, 706).

Once we cast f1 to a 2D numpy array, its shape changed to (2659, 15) and the hstack() call worked.

The cast was something like this: f1 = np.array(list(f1)).
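A small sketch of the shape change, assuming f1 arrives as a 1-D object array whose entries are equal-length lists (a pandas Series of lists behaves similarly). The sizes here are made up and much smaller than the real data.

```python
import numpy as np
from scipy import sparse

n_samples = 4

# A 1-D container holding one list of 3 numbers per sample.
f1 = np.empty(n_samples, dtype=object)
for i in range(n_samples):
    f1[i] = [float(i), float(i) + 1.0, float(i) + 2.0]

shape_before = f1.shape      # one dimension only

# Rebuilding the array from a list of lists yields a proper 2-D array.
f1_2d = np.array(list(f1))
shape_after = f1_2d.shape    # samples x values per sample

# Stand-in for the sparse tf-idf output computed on f2.
f2 = sparse.csr_matrix(np.eye(n_samples, 5))

# With matching row counts and two dimensions each, the blocks stack.
stacked = sparse.hstack([f1_2d, f2]).tocsr()
```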

Potsdam answered 5/2, 2016 at 22:9 Comment(0)
