PHP implementation of Bayes classificator: Assign topics to texts
Asked Answered
H

1

6

In my news page project, I have a database table news with the following structure:

 - id: [integer] unique number identifying the news entry, e.g.: *1983*
 - title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
 - topic: [string] category which should be chosen by the classificator, e.g: *Sports*

Additionally, there's a table bayes with information about word frequencies:

 - word: [string] a word which the frequencies are given for, e.g.: *real estate*
 - topic: [string] same content as "topic" field above, e.h. *Economics*
 - count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g: *100*

Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.

Is this the correct implementation? Can you improve it?

<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>

The training is done manually, it isn't included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all words (you,can,make,...) are inserted into the table bayes with "Economics" as the topic and 1 as standard count. If the word is already there in combination with the same topic, the count is incremented.

Sample learning data:

word topic count

kaczynski Politics 1

sony Technology 1

bank Economics 1

phone Technology 1

sony Economics 3

ericsson Technology 2

Sample output/result:

Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry

Politics

....phone ....test ....sony ....ericsson ....aspen ....sensitive ....winberry

Technology

....phone FOUND ....test ....sony FOUND ....ericsson FOUND ....aspen ....sensitive ....winberry

Economics

....phone ....test ....sony FOUND ....ericsson ....aspen ....sensitive ....winberry

Result: The text belongs to topic Technology with a likelihood of 0.013888888888889

Thank you very much in advance!

Horripilate answered 26/8, 2010 at 13:40 Comment(4)
"Would it work"? is a slightly strange question. Have you tried it? Did it work?Cora
No, it doesn't work, "Parse error: syntax error, unexpected ')'...". You might want to refactor the script into a slightly more digestible form. Also test data (in form of CREATE TABLE ...., INSERT INTO ...) and expected results may increase your chance of someone dealing with the problem.Oas
Thank you for your comments. The question wasn't good, you're right. I didn't want to refer to the syntax - I could easily check this. I meant the mathematics and the performance.Horripilate
I've corrected the syntax and added some sample data now.Horripilate
C
7

It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word while you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here, if you're only doing a single document I suppose this is okay since you don't calculate it for words not in the document)

Similarly, the calculation of p(topic) could be moved outside of the loop.

Finally, you don't need to sort the entire array to find the maximum.

All small points! But that's what you asked for :)

I've written some untested PHP-code showing how I'd implement this below:

<?php

// Get word counts from database
$nWordPerTopic = mystery_sql();

// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
    // Get total word count in topic
    $nTopic = array_sum($wordCounts);

    // Calculate p(word|topic)
    $pWordPerTopic[$topic] = array();
    foreach($wordCounts as $word => $count)
        $pWordPerTopic[$topic][$word] = $count / $nTopic;

    // Save $nTopic for next step
    $nTopics[$topic] = $nTopic;
}

// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
    $pTopics[$topic] = $nTopic / $nTotal;

// Classify
foreach($documents as $document)
{
    $title = $document['title'];
    $tokens = tokenizer($title);
    $pMax = -1;
    $selectedTopic = null;
    foreach($pTopics as $topic => $pTopic)
    {
        $p = $pTopic;
        foreach($tokens as $word)
        {
            if (!array_key_exists($word, $pWordPerTopic[$topic]))
                continue;
            $p *= $pWordPerTopic[$topic][$word];
        }

        if ($p > $pMax)
        {
            $selectedTopic = $topic;
            $pMax = $p;
        }
    }
} 
?>

As for the maths...

You're trying to maximize p(topic|words), so find

arg max p(topic|words)

(IE the argument topic for which p(topic|words) is the highest)

Bayes theorem says

                  p(topic)*p(words|topic)
p(topic|words) = -------------------------
                        p(words)

So you're looking for

         p(topic)*p(words|topic)
arg max -------------------------
               p(words)

Since p(words) of a document is the same for any topic this is the same as finding

arg max p(topic)*p(words|topic)

The naive bayes assumption (which makes this a naive bayes classifier) is that

p(words|topic) = p(word1|topic) * p(word2|topic) * ...

So using this, you need to find

arg max p(topic) * p(word1|topic) * p(word2|topic) * ...

Where

p(topic) = number of words in topic / number of words in total

And

                   p(word, topic)                         1
p(word | topic) = ---------------- = p(word, topic) * ----------
                      p(topic)                         p(topic)

      number of times word occurs in topic     number of words in total
   = -------------------------------------- * --------------------------
            number of words in total           number of words in topic

      number of times word occurs in topic 
   = --------------------------------------
            number of words in topic
Coupling answered 2/9, 2010 at 12:18 Comment(4)
As an afterthought... Did you think of removing stop words (the, and it, etc) from the word count list?Coupling
Thank you very much for this answer. Removing stop words is a nice way to improve the results.Horripilate
I'm not sure this is calculating the result correctly. To find p(word|topic), I think $nTopic should be the total number of occurrences of the word in across all topics. But this code calculates $nTopic as the the total number of occurrences of all of the words in a given topic, which is meaningless when trying to calculate the probability of a word appearing in a given topic. Also, I don't think p(topic) is calculated correctly either since that's supposed to be the probability that any document belongs in a given topic (based on the ratio of past documents that ended up in each topic).Peart
nTopic is meant to be the number of words in each topic.Coupling

© 2022 - 2024 — McMap. All rights reserved.