Learning and using augmented Bayes classifiers in python

I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in Python (preferably Python 3, but Python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification, obtaining class probabilities even for instances with missing feature data. (This is why plain discrete classification and even good naive Bayes classifiers are not very useful to me.)

The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.

There seem to be a few scattered and unmaintained Python packages that go roughly in this direction, but I haven't seen anything moderately recent (for example, I would expect using pandas for these calculations to be reasonable, but OpenBayes barely even uses numpy), and augmented classifiers seem completely absent from everything I have seen.

So, where should I look to save myself some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a Python class, or would that be inappropriate for an augmented Bayes classifier anyway? Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language that could be translated to Python?


Existing packages I know of but found unsuitable:

  • milk, which does support classification, but not with Bayesian classifiers (and I definitely need probabilities for the classification and for unspecified features)
  • pebl, which only does structure learning
  • scikit-learn, which only learns naive Bayes classifiers
  • OpenBayes, which has barely changed since somebody ported it from numarray to numpy, and whose documentation is negligible.
  • libpgm, which claims to support yet another set of features. According to the main documentation it does inference as well as structure and parameter learning, except there do not seem to be any methods for exact inference.
  • Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am led to conclude that it is mostly a spam classifier following Robinson's and similar methods, not a Bayesian classifier.
  • eBay's bayesian Belief Networks lets one build generic Bayesian networks and implements inference on them (both exact and approximate), which means it can be used to build a TAN; but there is no learning algorithm in it, and the way BNs are built from functions means that implementing parameter learning is more difficult than it might be for a hypothetical different implementation.
Gull answered 16/2, 2013 at 23:36 Comment(4)
Have you looked at milk?Situate
The reason I want a Bayesian classifier is that for this particular application, I need to know the probabilities of each possible class. milk does not seem to support Bayesian classifiers at all (or at least I don't see how – if any of milk's classifiers gives me probabilities, please tell me), and is therefore out of scope for this question.Gull
Reverend is also a Bayesian classifier you might add to the list, though it might not meet the requirements. (?)Deepen
Seems like I need to mash together bayesian and libpgm and add my own stuff on top to get what I want. Unfortunately this is only a minor side project, so that may take some time.Gull

I'm afraid there is no out-of-the-box implementation of a Random Naive Bayes classifier (none that I am aware of), because it is still an academic matter. The following paper presents a method to combine RF and NB classifiers (behind a paywall): http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35

I think you should stick with scikit-learn, which is one of the most popular statistical modules for Python (along with NLTK) and which is really well documented.

scikit-learn has a Random Forest module: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees. There is a submodule which may (I insist on the uncertainty) be used as a pipeline step towards an NB classifier:

RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely random trees, RandomTreesEmbedding encodes the data by the indices of the leaves a data point ends up in. This index is then encoded in a one-of-K manner, leading to a high dimensional, sparse binary coding. This coding can be computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the coding contains one entry of one. The size of the coding is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest.

As neighboring data points are more likely to lie within the same leaf of a tree, the transformation performs an implicit, non-parametric density estimation.
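
For concreteness, here is a minimal sketch of what such a pipeline could look like; piping the embedding into BernoulliNB is my assumption about the combination, not an established recipe, and the toy data is just a placeholder:

# Sketch only: RandomTreesEmbedding -> BernoulliNB is an assumed combination.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The embedding yields a sparse binary coding, which BernoulliNB can consume.
clf = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0),
    BernoulliNB(),
)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))  # per-class probabilities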

And of course there is an out-of-core implementation of the Naive Bayes classifier, which can be used incrementally: http://scikit-learn.org/stable/modules/naive_bayes.html

Discrete naive Bayes models can be used to tackle large scale text classification problems for which the full training set might not fit in memory. To handle this case both MultinomialNB and BernoulliNB expose a partial_fit method that can be used incrementally as done with other classifiers as demonstrated in Out-of-core classification of text documents.
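
A minimal sketch of that incremental usage (the batching loop and random toy data stand in for however the data actually arrives):

# Sketch only: toy batches stand in for incrementally arriving data.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
all_classes = np.array([0, 1])  # every class must be declared on the first call

for batch in range(3):  # pretend each iteration is a newly arrived batch
    X_batch = np.random.randint(0, 5, size=(20, 4))
    y_batch = np.random.randint(0, 2, size=20)
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

print(clf.predict_proba(X_batch[:5]))  # per-class probabilities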

Rail answered 26/11, 2013 at 10:20 Comment(1)
Random Forest Bayes classifiers were not what I asked for, simply because I had not encountered them before. Thanks for pointing them out, I'll see if they (or something related) might be suitable for my application.Gull

I was similarly confused as to how to do exact inference with libpgm. However, it turns out it is possible. For example (from the libpgm docs):

import json

from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization

# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")

# toporder graph skeleton
skel.toporder()

# specify the evidence and the query
evidence = dict(Letter='weak')
query = dict(Grade='A')

# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)

# load factorization
fn = TableCPDFactorization(bn)

# calculate probability distribution
result = fn.condprobve(query, evidence)

# output: result.vals holds the probability values over the query's scope
print(json.dumps(result.vals, indent=2))
print(json.dumps(result.scope, indent=2))
print(json.dumps(result.card, indent=2))
print(json.dumps(result.stride, indent=2))

To get the example to work, here is the datafile (I replaced None with null and saved as a .json).

I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, I would have commented, but I just signed up for SO to answer this and my rep isn't high enough.)

Japan answered 15/7, 2014 at 1:22 Comment(0)

R's bnlearn has implementations of both Naive Bayes and Tree-Augmented Naive Bayes classifiers. You can use rpy2 to drive these from Python.

http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf
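
A minimal sketch of what that could look like, assuming R and bnlearn are installed; tree.bayes and bn.fit are bnlearn's documented functions, but I have not verified this exact snippet:

# Sketch only: assumes R plus the bnlearn package are available, and uses
# bnlearn's bundled discrete data set learning.test with class variable "A".
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

bnlearn = importr('bnlearn')  # importr exposes R's tree.bayes as tree_bayes

data = ro.r('bnlearn::learning.test')

tan = bnlearn.tree_bayes(data, 'A')  # TAN structure learning
fitted = bnlearn.bn_fit(tan, data)   # parameter learning

# Classification via R's generic predict(); prob=True should attach the
# per-class probabilities to the result.
pred = ro.r['predict'](fitted, data=data, prob=True)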

Opah answered 12/3, 2015 at 1:4 Comment(1)
Welcome to the Stack, and thank you for that pointer! I had vaguely heard of bnlearn before, but with rpy2 that might indeed be something to have a close look at!Gull

There seems to be no such thing yet.

The closest thing currently seems to be eBay's open source Belief Network implementation, bayesian. It implements inference (two exact methods, and approximate inference), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.

  • Advantages:
    • It works. That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
    • With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
  • Disadvantages:
    • There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
    • Even more so as bayesian assumes that nodes are given by factors which are fixed Python functions (see the sketch after this list). Parameter learning requires some wrapping code to enable tweaking the probabilities.
    • bayesian is written in pure Python, using dicts etc. as basic structures, not making use of any speedup that numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I built.
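
To illustrate that factor-function style, here is a minimal sketch; the two-variable toy network (class c with a single attribute x) and all probabilities are invented for illustration, only build_bbn and query come from the bayesian package:

# Sketch only: every node is a fixed python function returning a probability,
# which is what makes parameter learning awkward. All numbers are made up.
from bayesian.bbn import build_bbn

def f_c(c):
    # prior over the class variable
    return 0.3 if c else 0.7

def f_x(c, x):
    # conditional probability of the attribute given the class
    if c:
        return 0.8 if x else 0.2
    return 0.4 if x else 0.6

g = build_bbn(f_c, f_x,
              domains=dict(c=[True, False], x=[True, False]))

# Query with evidence x=True; the result contains the marginal class
# probabilities, which is exactly what the question is after.
g.query(x=True)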
Gull answered 2/12, 2013 at 12:20 Comment(0)

I know it's a bit late in the day, but the Octave Forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed, so you could easily port it to Python.

Schlesien answered 10/2, 2014 at 1:45 Comment(0)
