How do I create a sklearn.datasets.base.Bunch object in scikit-learn from my own data?

3

20

In most of the scikit-learn examples, the data is loaded as a Bunch object. In many examples in the tutorial, load_files() or other functions are used to populate the Bunch object. Functions like load_files() expect the data to be present in a certain format, but my data is stored in a different format, namely a CSV file with strings in each field.

How do I parse this and load data in the Bunch object format?

Countermand answered 10/12, 2013 at 3:39 Comment(2)
To be sure: none of the algorithms load Bunch objects. The example scripts use those, but the algorithms all want arrays or sparse matrices. – Melly
@Blake, the fit method of the classifier takes a couple of list objects: a list of data (Bunch.data) followed by a list of targets (Bunch.target), i.e. clf.fit(<list>, <list>). – Ergot
T
20

You don't have to create Bunch objects. They are just useful for loading the internal sample datasets of scikit-learn.

You can directly feed a list of Python strings to your vectorizer object.
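As a minimal sketch of what this looks like in practice (the texts, labels, and the choice of CountVectorizer with MultinomialNB are made up for illustration, not taken from the question), plain Python lists can go straight into a vectorizer and then a classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: plain Python lists, no Bunch involved.
texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to noon",
]
labels = [1, 0, 1, 0]  # made-up labels: 1 = spam, 0 = not spam

# The vectorizer turns the list of strings into a sparse count matrix,
# one row per string.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# The classifier only needs the matrix and the list of labels.
clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before predicting.
X_new = vectorizer.transform(["buy cheap pills"])
prediction = clf.predict(X_new)
```

The classifier never sees a Bunch here; it only receives the matrix produced by the vectorizer and the label list.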

Tensile answered 10/12, 2013 at 10:14 Comment(2)
Thanks. Is there any utility function to load a .CSV? I have a CSV with four columns (all strings). Right now I am using the Python csv reader: docs.python.org/2/library/csv.html – Countermand
I would recommend pandas: pandas.pydata.org. You need to convert the pandas.DataFrame to a NumPy array before feeding it to sklearn, though. – Mouser
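A rough sketch of the pandas route suggested in the comment above. The column names and rows are invented, and an in-memory buffer stands in for the real file so that the snippet runs as-is:

```python
import io

import pandas as pd

# Hypothetical CSV with four string columns. With a real file, pass its
# path to read_csv instead of this in-memory buffer.
csv_text = """title,author,body,label
Hello,ann,some text,ham
Offer,bob,another example text,spam
Note,cat,example 3,ham
"""

# dtype=str keeps every field as a string instead of guessing numeric types.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

texts = df["body"].tolist()      # list of Python strings for a vectorizer
labels = df["label"].to_numpy()  # NumPy array of string labels
```

From there, texts can be passed to a vectorizer and labels to a classifier's fit method.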
P
24

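You can do it like this (in recent scikit-learn versions Bunch lives in sklearn.utils; sklearn.datasets.base has been removed):

import numpy as np
from sklearn.utils import Bunch

# The data is just a list of strings, one per example.
examples = []
examples.append('some text')
examples.append('another example text')
examples.append('example 3')

# One integer class label per example.
target = np.zeros((3,), dtype=np.int64)
target[0] = 0
target[1] = 1
target[2] = 0
dataset = Bunch(data=examples, target=target)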
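For context, a Bunch behaves like a dictionary whose keys can also be read as attributes. The short sketch below assumes a recent scikit-learn release in which Bunch is importable from sklearn.utils; the variable names are illustrative only:

```python
import numpy as np
from sklearn.utils import Bunch

bunch = Bunch(
    data=["some text", "another example text", "example 3"],
    target=np.array([0, 1, 0]),
)

# Keys are reachable both as attributes and as dictionary entries.
same_object = bunch.data is bunch["data"]
keys = sorted(bunch.keys())
```

So bunch.data and bunch.target are simply the list and array that were passed in, which is why they can also be handed to a vectorizer or classifier directly.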
Plenipotentiary answered 21/12, 2016 at 12:15 Comment(1)
@MachineEpsilon The question shows a misunderstanding of how you feed data to a classifier in scikit-learn, so even though this answers the literal question, it doesn't clear up the original misunderstanding. This link makes it pretty clear: scikit-learn.org/stable/… – Oriya
B
0

This example uses the Breast Cancer Wisconsin (Diagnostic) Data Set; the CSV file is available on Kaggle:

  1. Columns 2 through 31 of the CSV file (zero-based, selected with usecols=range(2,32)) hold the feature data, i.e. the X_train and X_test values. They are stored under the Bunch key named data.

    from numpy import genfromtxt
    data = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1,  usecols=range(2,32))
    
  2. Column B of the CSV file (index 1 in the NumPy array, selected with usecols=(1)) is the output, i.e. the y_train and y_test values. It is stored under the Bunch key named target.

    import pandas as pd
    target = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1, usecols=(1), dtype=str)
    

    A few tricks are needed to transform the target into the form sklearn uses. This could all be done with a single variable; the separate variables target, target2, ... are there only to explain each step.

  3. First, convert the NumPy array into a pandas Series

    target2 = pd.Series(target)
    
  4. This is so the rank function can be used to turn the labels into numbers; step 5 can be skipped

    target3 = target2.rank(method='dense', axis=0)
    
  5. This only transforms the target into 0 or 1, as in the example in the book

    target4 = (target3 % 2 == 0) * 1 
    
  6. Get the values back into a NumPy array

    target5 = target4.values
    

Here I copied Hugh Perkins's solution:

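# In recent scikit-learn versions Bunch lives in sklearn.utils
# (sklearn.datasets.base has been removed).
from sklearn.utils import Bunch
dataset = Bunch(data=data, target=target5)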
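As an alternative to the rank and modulo steps, the string labels can be encoded with scikit-learn's LabelEncoder. This is a different technique from the one in this answer, shown only as a sketch. The small arrays below are made-up stand-ins for the values read from the CSV, and the import path assumes a recent scikit-learn release:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import Bunch

# Made-up stand-ins for the feature and label arrays read from the CSV.
data = np.array([[17.9, 10.4], [13.5, 14.4], [20.6, 17.8]])
labels = np.array(["M", "B", "M"])

# LabelEncoder sorts the distinct labels and maps them to 0, 1, ...
encoder = LabelEncoder()
target = encoder.fit_transform(labels)

# Keeping the class names in the Bunch makes the integer codes readable later.
dataset = Bunch(data=data, target=target, target_names=encoder.classes_)
```

Storing encoder.classes_ under target_names mirrors the layout of the built-in sample datasets, but any key name would work since a Bunch is just a dictionary.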
Balneology answered 14/10, 2017 at 5:53 Comment(0)
