PHP implementation of a Bayes classifier: Assign topics to texts

Table "news" (one row per text):

- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports*

Table "bayes" (word frequencies per topic):

- word: [string] a word for which the frequencies are given, e.g.: *real estate*
- topic: [string] same content as the "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents are assigned to "topic"), e.g.: *100*

<?php
include 'mysqlLogin.php';

$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);

// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END

// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END

while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) {
            $pTextInTopics[$topic] = 1;
        }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>

Your code looks basically right, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word (including a fresh array_sum() over the topic's counts each time), although these values could easily be calculated beforehand. (I'm assuming you want to classify multiple documents here; if you're only classifying a single document, this is fine, since you don't calculate it for words that aren't in the document.)

Similarly, the calculation of p(topic) could be moved outside of the loop.

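Finally, you don't need to sort the entire array to find the maximum; a single pass that keeps track of the best score so far is enough. Note also that asort() followed by each() picks the topic with the lowest value, whereas you want the highest.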
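As a rough sketch of that last point, the best topic can be picked in linear time with PHP's built-in max() and array_search(). The function name pickBestTopic() and the scores in the usage lines are made up for illustration; they are not from your code.

```php
<?php
// Hypothetical helper: returns the key (topic) with the highest score,
// or null if there are no scores at all.
function pickBestTopic(array $scores)
{
    if (count($scores) === 0) {
        return null;
    }
    // max() scans the array once; array_search() returns the first key
    // holding that value. No sorting is needed.
    return array_search(max($scores), $scores);
}

// Usage with invented scores (topic => p(topic) * product of p(word|topic))
$pTextInTopics = array(
    'Sports'    => 0.0004,
    'Economics' => 0.0021,
    'Politics'  => 0.0009,
);
$chosenTopic = pickBestTopic($pTextInTopics);
```

If several topics share the highest score, this returns the first one in array order.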

All small points! But that's what you asked for :)

I've written some untested PHP code below showing how I'd implement this:

<?php

// Get word counts from database
// (mystery_sql() is a placeholder; it should return array(topic => array(word => count)))
$nWordPerTopic = mystery_sql();

// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
    // Get total word count in topic
    $nTopic = array_sum($wordCounts);

    // Calculate p(word|topic)
    $pWordPerTopic[$topic] = array();
    foreach($wordCounts as $word => $count)
        $pWordPerTopic[$topic][$word] = $count / $nTopic;

    // Save $nTopic for next step
    $nTopics[$topic] = $nTopic;
}

// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
    $pTopics[$topic] = $nTopic / $nTotal;

// Classify ($documents: the unclassified rows, each with a 'title' key)
foreach($documents as $document)
{
    $title = $document['title'];
    $tokens = tokenizer($title);
    $pMax = -1;
    $selectedTopic = null;
    foreach($pTopics as $topic => $pTopic)
    {
        $p = $pTopic;
        foreach($tokens as $word)
        {
            if (!array_key_exists($word, $pWordPerTopic[$topic]))
                continue;
            $p *= $pWordPerTopic[$topic][$word];
        }

        if ($p > $pMax)
        {
            $selectedTopic = $topic;
            $pMax = $p;
        }
    }
    // $selectedTopic now holds the most likely topic for this document;
    // store or output it here
}
?>
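One thing to be aware of: because of the `continue`, a word that was never seen in a topic leaves that topic's score untouched, so a topic that matches none of the words keeps its full p(topic) and can outrank topics that matched several. A common remedy, which is not part of the code above, is add-one (Laplace) smoothing, where every word gets a small non-zero probability in every topic. Here is a minimal sketch; the function name and the $vocabularySize parameter (the number of distinct words over all topics) are my own assumptions.

```php
<?php
// Hypothetical helper: add-one (Laplace) smoothed estimate of p(word|topic).
// $wordCounts     : array(word => count) for a single topic
// $vocabularySize : number of distinct words across all topics (assumption)
function smoothedWordProbability($word, array $wordCounts, $vocabularySize)
{
    // Unseen words count as 0 instead of being skipped
    $count = isset($wordCounts[$word]) ? $wordCounts[$word] : 0;

    // Add 1 to every count; the denominator grows by the vocabulary size
    // so the probabilities over the whole vocabulary still sum to 1
    return ($count + 1) / (array_sum($wordCounts) + $vocabularySize);
}
```

With this, the inner loop would multiply by smoothedWordProbability() for every token rather than skipping the unseen ones.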

As for the maths...

You're trying to maximize p(topic|words), so find

arg max p(topic|words)

(i.e. the topic for which p(topic|words) is the highest)

Bayes' theorem says

                  p(topic)*p(words|topic)
p(topic|words) = -------------------------
                        p(words)

So you're looking for

         p(topic)*p(words|topic)
arg max -------------------------
               p(words)

Since p(words) of a given document is the same for every topic, this is the same as finding

arg max p(topic)*p(words|topic)

The naive Bayes assumption (which is what makes this a naive Bayes classifier) is that the words are independent of each other given the topic, so that

p(words|topic) = p(word1|topic) * p(word2|topic) * ...

So using this, you need to find

arg max p(topic) * p(word1|topic) * p(word2|topic) * ...

Where

p(topic) = number of words in topic / number of words in total

And

                   p(word, topic)                         1
p(word | topic) = ---------------- = p(word, topic) * ----------
                      p(topic)                         p(topic)

      number of times word occurs in topic     number of words in total
   = -------------------------------------- * --------------------------
            number of words in total           number of words in topic

      number of times word occurs in topic 
   = --------------------------------------
            number of words in topic
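To make the formulas above concrete, here is a small worked example in PHP. The word counts and the token list are invented for illustration, and scoreTopics() is a hypothetical helper rather than part of the code above; like that code, it skips words that do not occur in a topic.

```php
<?php
// Hypothetical helper: returns array(topic => p(topic) * product of p(word|topic))
function scoreTopics(array $nWordPerTopic, array $tokens)
{
    // number of words in each topic (array_map keeps the topic keys)
    $nTopics = array_map('array_sum', $nWordPerTopic);
    // number of words in total
    $nTotal = array_sum($nTopics);

    $scores = array();
    foreach ($nWordPerTopic as $topic => $wordCounts) {
        // p(topic) = number of words in topic / number of words in total
        $p = $nTopics[$topic] / $nTotal;
        foreach ($tokens as $word) {
            if (isset($wordCounts[$word])) {
                // p(word|topic) = times word occurs in topic / words in topic
                $p *= $wordCounts[$word] / $nTopics[$topic];
            }
        }
        $scores[$topic] = $p;
    }
    return $scores;
}

// Invented counts, same shape as the "bayes" table
$nWordPerTopic = array(
    'Sports'    => array('match' => 6, 'goal' => 3, 'market' => 1),  // 10 words
    'Economics' => array('market' => 20, 'bank' => 8, 'goal' => 2),  // 30 words
);
$scores = scoreTopics($nWordPerTopic, array('market', 'goal'));
```

Here the number of words in total is 40, so p(Sports) = 10/40 and p(Economics) = 30/40. For the tokens "market" and "goal", Sports scores 0.25 * 1/10 * 3/10 = 0.0075 and Economics scores 0.75 * 20/30 * 2/30 = 1/30, so Economics is the arg max.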
