Information Gain calculation with Scikit-learn
I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix.

  • The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy.
  • In Weka, this would be calculated with InfoGainAttribute.
  • But I haven't found this measure in scikit-learn.

(It was suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition on Wikipedia. Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this task?)
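
To make the equivalence concrete, here is a minimal sketch (toy data and variable names chosen only for illustration) that computes H(Class) - H(Class | Attribute) directly and compares it with scikit-learn's mutual_info_score:

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

class_ = np.array([0, 0, 1, 1, 1, 0])
attr = np.array([0, 0, 0, 1, 1, 1])

# H(Class): entropy of the class distribution (natural log, like mutual_info_score)
_, class_counts = np.unique(class_, return_counts=True)
h_class = entropy(class_counts)

# H(Class | Attribute) = sum over attribute values a of p(a) * H(Class | A = a)
h_cond = 0.0
for a in np.unique(attr):
    mask = attr == a
    _, counts = np.unique(class_[mask], return_counts=True)
    h_cond += mask.mean() * entropy(counts)

print(h_class - h_cond)                 # information gain
print(mutual_info_score(class_, attr))  # same value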

Gadmann answered 15/10, 2017 at 7:17 Comment(2)
They are the same; see "Information gain and mutual information: different or equal?" and "Feature Selection: Information Gain VS Mutual Information", at least when we're talking about Pointwise Mutual Information, not Expected Mutual Information. – Dihedron
@NickMorgan: they are the same when talking about PMI; also, it's unhelpful to quote an ephemeral source (a table in a third-party paper, the link to which has now expired) instead of CV or Wikipedia. – Dihedron
E
32

You can use scikit-learn's mutual_info_classif; here is an example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

categories = ['talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

X, Y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=10000,
                     stop_words='english')
X_vec = cv.fit_transform(X)

# discrete_features=True because document-term counts are discrete.
# On scikit-learn < 1.0, use cv.get_feature_names() instead of get_feature_names_out().
res = dict(zip(cv.get_feature_names_out(),
               mutual_info_classif(X_vec, Y, discrete_features=True)
               ))
print(res)

This will output a dictionary with each attribute (i.e. each item in the vocabulary) as a key and its information gain as the value.

Here is a sample of the output:

{'bible': 0.072327479595571439,
 'christ': 0.057293733680219089,
 'christian': 0.12862867565281702,
 'christians': 0.068511328611810071,
 'file': 0.048056478042481157,
 'god': 0.12252523919766867,
 'gov': 0.053547274485785577,
 'graphics': 0.13044709565039875,
 'jesus': 0.09245436105573257,
 'launch': 0.059882179387444862,
 'moon': 0.064977781072557236,
 'morality': 0.050235104394123153,
 'nasa': 0.11146392824624819,
 'orbit': 0.087254803670582998,
 'people': 0.068118370234354936,
 'prb': 0.049176995204404481,
 'religion': 0.067695617096125316,
 'shuttle': 0.053440976618359261,
 'space': 0.20115901737978983,
 'thanks': 0.060202010019767334}
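
If you only want to keep the highest-scoring terms rather than print the full dictionary, the same scorer can be plugged into SelectKBest. A minimal sketch (the value of k is an arbitrary choice), reusing cv, X_vec and Y from the code above:

from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep only the 100 terms with the highest information gain
score_func = partial(mutual_info_classif, discrete_features=True)
selector = SelectKBest(score_func, k=100)
X_reduced = selector.fit_transform(X_vec, Y)

top_terms = cv.get_feature_names_out()[selector.get_support()]
print(top_terms[:20])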
Equipotential answered 15/10, 2017 at 13:12 Comment(3)
Is this information gain or mutual information gain? Are they the same? @Equipotential – Finality
In the above code, if I keep binary=True while vectorizing the text, I get different mutual information results. Ideally, mutual information on categorical variables works based solely on presence or absence, irrespective of the count, right? – Inkblot
Is there a way to calculate MI between each feature/term and each category with scikit-learn, e.g. the MI between "bible" and 'talk.religion.misc'? A chapter from the IR book provides some math for that, but I cannot find a way to do it with scikit-learn. – Hallway
F
5

Here is my proposal for calculating the information gain using pandas:

from scipy.stats import entropy
import pandas as pd

def information_gain(members, split):
    '''
    Measures the reduction in entropy of `members` after the split.
    :param members: Pandas Series of the class labels
    :param split: Pandas Series (same length) of the attribute values to split on
    :return: information gain of the split (natural-log entropies)
    '''
    # Entropy of the class distribution before the split
    entropy_before = entropy(members.value_counts(normalize=True))
    split.name = 'split'
    members.name = 'members'
    # Class distribution within each split group (rows: split values, columns: classes)
    grouped_distrib = members.groupby(split) \
                        .value_counts(normalize=True) \
                        .reset_index(name='count') \
                        .pivot_table(index='split', columns='members', values='count').fillna(0)
    # Per-group entropy, weighted by each group's relative size
    # (sort_index keeps the weights aligned with pivot_table's sorted index)
    entropy_after = entropy(grouped_distrib, axis=1)
    entropy_after *= split.value_counts(normalize=True).sort_index().values
    return entropy_before - entropy_after.sum()

members = pd.Series(['yellow','yellow','green','green','blue'])
split = pd.Series([0,0,1,1,0])
print(information_gain(members, split))
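
As a quick sanity check, scipy.stats.entropy uses the natural logarithm by default, so the value printed above should match scikit-learn's mutual_info_score on the same data:

from sklearn.metrics import mutual_info_score
print(mutual_info_score(members, split))  # ≈ 0.673, same as information_gain(members, split)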
Formic answered 24/1, 2020 at 10:46 Comment(0)
G
0

Using pure Python:

import math

def ig(class_, feature):
    classes = set(class_)

    # Overall entropy of the class distribution, H(Class), in bits
    Hc = 0
    for c in classes:
        pc = list(class_).count(c) / len(class_)
        Hc += -pc * math.log(pc, 2)
    print('Overall Entropy:', Hc)

    # Conditional entropy H(Class | Feature)
    feature_values = set(feature)
    Hc_feature = 0
    for feat in feature_values:
        pf = list(feature).count(feat) / len(feature)
        indices = [i for i in range(len(feature)) if feature[i] == feat]
        classes_of_feat = [class_[i] for i in indices]
        for c in classes:
            pcf = classes_of_feat.count(c) / len(classes_of_feat)
            if pcf != 0:
                Hc_feature += -pf * pcf * math.log(pcf, 2)

    # Information gain = H(Class) - H(Class | Feature)
    return Hc - Hc_feature
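
For example, on the same toy data as the pandas answer above (my own choice of input):

class_ = ['yellow', 'yellow', 'green', 'green', 'blue']
feature = [0, 0, 1, 1, 0]
print(ig(class_, feature))  # ≈ 0.97 bits (log base 2; the pandas answer's ≈ 0.67 is the same quantity in nats)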
Gamecock answered 19/11, 2020 at 15:30 Comment(0)
