Extract keyphrases from text (1-4 word ngrams)


3

11

What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl to extract n-grams, but I'm writing this in Node so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can just write it myself?

Cult answered 16/8, 2011 at 21:47 Comment(0)


19

I like the idea, so I've implemented it; see below (descriptive comments are included).
Preview at: https://jsfiddle.net/WsKMx

/* @author Rob W, created on 16-17 September 2011, on request for Stack Overflow (https://mcmap.net/q/980951/-extract-keyphrases-from-text-1-4-word-ngrams/938089)
 * Modified on 17 July 2012: fixed an IE bug by replacing [,] with [null]
 * This script counts how often each word and phrase occurs. For simplicity and
 * efficiency, it makes only one pass through the block of text.
 * 100% accuracy would require much more computing power, which is usually unnecessary.
 **/


var text = "A quick brown fox jumps over the lazy old bartender who said 'Hi!' as a response to the visitor who presumably assaulted the maid's brother, because he didn't pay his debts in time. In time in time does really mean in time. Too late is too early? Nonsense! 'Too late is too early' does not make any sense.";

var atLeast = 2;       // Show results with at least .. occurrences
var numWords = 5;      // Show statistics for one to .. words
var ignoreCase = true; // Case-sensitivity
var REallowedChars = /[^a-zA-Z'\-]+/g;
 // Matches every run of characters that is NOT allowed in a word. Each run is replaced with a space

var i, j, k, textlen, len, s;
// Prepare key hash
var keys = [null]; //"keys[0] = null", a word boundary with length zero is empty
var results = [];
numWords++; //for human logic, we start counting at 1 instead of 0
for (i=1; i<=numWords; i++) {
    keys.push({});
}

// Remove all irrelevant characters
text = text.replace(REallowedChars, " ").replace(/^\s+/,"").replace(/\s+$/,"");

// Create a hash
if (ignoreCase) text = text.toLowerCase();
text = text.split(/\s+/);
for (i=0, textlen=text.length; i<textlen; i++) {
    s = text[i];
    keys[1][s] = (keys[1][s] || 0) + 1;
    for (j=2; j<=numWords; j++) {
        if(i+j <= textlen) {
            s += " " + text[i+j-1];
            keys[j][s] = (keys[j][s] || 0) + 1;
        } else break;
    }
}

// Prepares results for advanced analysis
for (var k=1; k<=numWords; k++) {
    results[k] = [];
    var key = keys[k];
    for (var i in key) {
        if(key[i] >= atLeast) results[k].push({"word":i, "count":key[i]});
    }
}

// Result parsing
var outputHTML = []; // Buffer data. This data is used to create a table using `.innerHTML`

var f_sortDescending = function(x,y) {return y.count - x.count;}; // highest count first
for (k=1; k<numWords; k++) {
    results[k].sort(f_sortDescending);//sorts results
    
    // Customize your output. For example:
    var words = results[k];
    if (words.length) outputHTML.push('<td colSpan="3" class="num-words-header">'+k+' word'+(k==1?"":"s")+'</td>');
    for (i=0,len=words.length; i<len; i++) {
        
        // Characters have been validated, so there is no XSS risk
        outputHTML.push("<td>" + words[i].word + "</td><td>" +
           words[i].count + "</td><td>" +
           Math.round(words[i].count/textlen*10000)/100 + "%</td>");
           // textlen defined at the top
           // The relative occurrence is rounded to 2 decimal places.
    }
}
outputHTML = '<table id="wordAnalysis"><thead><tr>' +
              '<td>Phrase</td><td>Count</td><td>Relativity</td></tr>' +
              '</thead><tbody><tr>' +outputHTML.join("</tr><tr>")+
               "</tr></tbody></table>";
document.getElementById("RobW-sample").innerHTML = outputHTML;
/*
CSS:
#wordAnalysis td{padding:1px 3px 1px 5px}
.num-words-header{font-weight:bold;border-top:1px solid #000}

HTML:
<div id="RobW-sample"></div>
*/
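
Since the question asks for Node, the same single-pass counting can be sketched without the DOM output. This is only a rough sketch, not the author's code: `countPhrases` and its parameters `maxWords` and `atLeast` are names made up here, and a `Map` is assumed instead of plain objects so that words such as "constructor" cannot clash with inherited property names.

```javascript
// Hypothetical helper (not part of the answer above):
// counts every phrase of 1..maxWords consecutive words.
function countPhrases(text, maxWords, atLeast) {
  // Same cleanup idea as above: anything that is not a letter,
  // apostrophe or hyphen becomes a space.
  const words = text
    .toLowerCase()
    .replace(/[^a-z'\-]+/g, " ")
    .trim()
    .split(/\s+/)
    .filter(Boolean); // an empty input would otherwise yield [""]

  const counts = new Map(); // phrase -> number of occurrences
  for (let i = 0; i < words.length; i++) {
    let phrase = "";
    // Grow the phrase one word at a time, starting at position i.
    for (let n = 1; n <= maxWords && i + n <= words.length; n++) {
      phrase = n === 1 ? words[i] : phrase + " " + words[i + n - 1];
      counts.set(phrase, (counts.get(phrase) || 0) + 1);
    }
  }

  // Keep frequent phrases only, most frequent first.
  return [...counts]
    .filter(([, count]) => count >= atLeast)
    .map(([phrase, count]) => ({ phrase, count }))
    .sort((a, b) => b.count - a.count);
}

const sample = "In time in time does really mean in time. Too late is too early?";
console.log(countPhrases(sample, 4, 2));
```

Like the browser version, this counts overlapping phrases, so "in time" is counted once for every position where it starts.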

Bewick answered 16/9, 2011 at 23:10 Comment(3)

I've updated the code to fix a bug in IE8. The bug was reported by email; I've pasted the email and my response (which offers the fix and includes a detailed explanation) here: pastebin.com/7Edx88Gp. – Bewick 17/7, 2012 at 15:1

beautiful, several years later you are still helping people – Leavening 18/3, 2015 at 8:25

It would be nice to exclude so-called stop words, such as: the, a, they, is, etc. – Pentachlorophenol 1/6, 2019 at 14:11
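
One possible way to act on the stop-word suggestion is to filter the phrases after counting. The sketch below is an assumption, not something from the answer: the `STOP_WORDS` list is a tiny illustrative sample, and `isKeyphrase` is a hypothetical helper that rejects any phrase beginning or ending with a stop word.

```javascript
// Illustrative stop-word list; a real tool would need a much longer one.
const STOP_WORDS = new Set(["the", "a", "an", "they", "is", "in", "to", "of"]);

// Hypothetical filter: keep a phrase unless its first or last word is a stop word.
// Stop words in the middle ("king of spain") are allowed.
function isKeyphrase(phrase) {
  const words = phrase.toLowerCase().split(" ");
  return !STOP_WORDS.has(words[0]) && !STOP_WORDS.has(words[words.length - 1]);
}

const phrases = ["the", "lazy old bartender", "is too", "too late", "brown fox"];
console.log(phrases.filter(isKeyphrase));
// [ 'lazy old bartender', 'too late', 'brown fox' ]
```

Checking only the first and last word is a design choice: it drops fragments like "is too" while keeping phrases that merely contain a stop word.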


0

I don't know of such a library in JavaScript, but the logic is:

split the text into an array
then sort and count

Alternatively:

split the text into an array
create a secondary array
traverse each item of the first array
check whether the current item exists in the secondary array
if it does not exist, add it as a key
otherwise, increment the value stored under that key. HTH
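
Both variants above can be sketched roughly as follows. The function names `countBySorting` and `countWithLookup` are made up for illustration, and the "secondary array" is assumed to be a key/value lookup object. Like the steps above, this handles single words only.

```javascript
// Variant 1: sort the words, then count runs of equal neighbours.
function countBySorting(words) {
  const sorted = [...words].sort(); // copy, so the input array is left untouched
  const result = [];
  for (const word of sorted) {
    const last = result[result.length - 1];
    if (last && last.word === word) last.count += 1;
    else result.push({ word: word, count: 1 });
  }
  return result;
}

// Variant 2: one pass with a lookup object holding the running counts.
function countWithLookup(words) {
  const counts = Object.create(null); // no inherited keys such as "constructor"
  for (const word of words) {
    if (!(word in counts)) counts[word] = 1; // first time seen: add it as a key
    else counts[word] += 1;                  // seen before: increase its value
  }
  return counts;
}

const tokens = "too late is too early".split(/\s+/);
console.log(countBySorting(tokens));
console.log(countWithLookup(tokens));
```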

Ivo Stoykov

Claus answered 24/8, 2011 at 8:57 Comment(2)

This doesn't do what I want because it doesn't extract multi-word n-grams; it works for single words only. – Cult 24/8, 2011 at 15:50

Look here -> valuetype.wordpress.com/2011/08/24/… This is a sample with a one-word count, but it could easily be extended to 3 or 4 words. – Claus 24/8, 2011 at 19:31


0

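// Returns every run of n consecutive items of seq, each joined into one string
function ngrams(seq, n) {
  const to_return = []
  for (let i=0; i<seq.length-(n-1); i++) {
      let cur = []
      for (let j=i; j<seq.length && j<=i+(n-1); j++) {
          cur.push(seq[j])
      }
      to_return.push(cur.join('')) // use join(' ') to keep word n-grams readable
  }
  return to_return
}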

> ngrams(['a', 'b', 'c'], 2)
['ab', 'bc']
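
For the 1-4 word keyphrases the question asks about, the same idea can be applied to an array of words. The sketch below is a hypothetical variant, not the function above: `wordNgrams` is a made-up name, and it joins with a space so the phrases stay readable.

```javascript
// Hypothetical word-level variant: each n-gram is n consecutive words joined by a space.
function wordNgrams(words, n) {
  const out = [];
  for (let i = 0; i + n <= words.length; i++) {
    out.push(words.slice(i, i + n).join(" "));
  }
  return out;
}

const words = "too late is too early".split(" ");
for (let n = 1; n <= 4; n++) {
  console.log(n, wordNgrams(words, n));
}
// The last line printed is: 4 [ 'too late is too', 'late is too early' ]
```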

Plainsong answered 9/1, 2023 at 23:9 Comment(0)
