The n-gram that is the most frequent one among all the words

I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram

Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds

Example

Input: text: "aaaab a0a baaab c", ngramLength: 3

Output: aaa

Explanation

For the input presented above, the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1

If I have only one hour to solve the problem and I choose to use the C language: is it a good idea to implement a hash table to count the frequency of the n-grams in that amount of time? I ask because the C standard library has no hash table implementation...

If yes, I was thinking of implementing a hash table using separate chaining with ordered linked lists. But writing such an implementation eats into the time you have to solve the problem...
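
For concreteness, a minimal sketch of such a separate-chaining table in C (the bucket count, the maximum n-gram length, and all names here are illustrative assumptions; the chains are left unordered for brevity, whereas the question proposes keeping them ordered):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS (1u << 18)  /* bucket count is an assumption */
#define MAX_NGRAM 8          /* assumes n-grams of fewer than 8 chars */

/* one (n-gram, count) node in a bucket's chain */
struct ChainNode {
    char ngram[MAX_NGRAM];
    unsigned count;
    struct ChainNode *next;
};

static struct ChainNode *buckets[NBUCKETS];

/* djb2-style hash of an n-gram of length n */
static unsigned hash_ngram(const char *s, int n) {
    unsigned h = 5381;
    for (int i = 0; i < n; i++)
        h = h * 33 + (unsigned char)s[i];
    return h & (NBUCKETS - 1);
}

/* bump the count for an n-gram, inserting a fresh node at the head
   of its chain if it has not been seen before */
void bump(const char *s, int n) {
    unsigned b = hash_ngram(s, n);
    for (struct ChainNode *p = buckets[b]; p; p = p->next)
        if (memcmp(p->ngram, s, n) == 0) { p->count++; return; }
    struct ChainNode *node = calloc(1, sizeof *node);
    memcpy(node->ngram, s, n);
    node->count = 1;
    node->next = buckets[b];
    buckets[b] = node;
}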

Is that the fastest option possible?

Thank you!!!

Histogen answered 4/9, 2014 at 0:27 Comment(6)
Is this for an actual coding interview? – Unclog
Did you determine that a binary tree (say, AVL) could not do the job? – Cleveite
Are 3-grams the most you'll be asked for? There are (26+26+10)^3 = 238,328 possible 3-grams with only alphanumeric characters, so a straight-up LUT looks feasible. – Cleveite
I would allocate the needed number of buckets in advance, in a single array (which is possible as you have an upper bound on the length of the text), and only store pointers to them in the hash table. Use move-to-front/insert-at-back heuristics to make hash table retrieval faster. And sort the array at the end. Using a tree is slower in practice. – Scandic
Maybe not directly relevant, but fun nonetheless. – Arciform
Think: in 1,000 characters of text, how many 3-grams are there? – Wera

If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.

This should execute fast enough, and there is no need to code any fancy data structures.
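
A minimal sketch of this approach (the function name and the alphanumeric check are my assumptions, not part of the answer):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int glen;  /* n-gram length, shared with the comparator */

/* compare two n-grams through their start pointers */
static int cmp_ngram(const void *a, const void *b) {
    return strncmp(*(const char **)a, *(const char **)b, glen);
}

void most_frequent_ngram(const char *text, int n) {
    size_t count = 0;
    const char **starts = malloc(strlen(text) * sizeof *starts);
    glen = n;

    /* collect a pointer to every position that starts an in-word n-gram */
    for (const char *p = text; *p; p++) {
        int ok = 1;
        for (int i = 0; i < n; i++)
            if (!isalnum((unsigned char)p[i])) { ok = 0; break; }
        if (ok) starts[count++] = p;
    }

    /* sort by the pointed-to n-grams, then count runs of equal n-grams;
       the strict '>' keeps the first (lexicographically smallest) winner */
    qsort(starts, count, sizeof *starts, cmp_ngram);
    const char *winner = NULL;
    size_t best = 0, run = 0;
    for (size_t i = 0; i < count; i++) {
        run = (i > 0 && strncmp(starts[i], starts[i - 1], n) == 0) ? run + 1 : 1;
        if (run > best) { best = run; winner = starts[i]; }
    }
    if (winner) printf("%.*s\n", n, winner);
    free(starts);
}

Called as most_frequent_ngram("aaaab a0a baaab c", 3), this should print aaa.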

Heterotaxis answered 4/9, 2014 at 1:9 Comment(5)
The only thing I can think of that might be faster than this is using offsets instead of actual pointers. Assuming 64-bit (native) pointers, you could halve the RAM with 4-byte offsets. A clever encoding might squeeze the needed 18 bits into 2 bytes for more. – Arciform
@Arciform The sort is O(n log(n)) while a hash-based solution is O(n). So this should not have the best possible performance. It is just a very simple approach. – Heterotaxis
On small scales the asymptotic limit matters less than the particulars of hardware. I suspect that cache-line efficiency and the details of the chosen hashing function will matter a lot. Again, only guessing. – Arciform
@Arciform I see no reason why jumping to a random bucket in a hash is a less efficient use of cache than comparing two random pointers from a region. So hashing is always a win. But if you make an array of n-grams and then sort that (rather than using pointers), you might beat hashing. – Heterotaxis
Great answer. After sorting, you can count with O(1) space in one pass from larger array index to smaller. – Grossularite

Sorry for posting Python, but this is what I would do; you might get some ideas for the algorithm. Note that this program handles an input an order of magnitude larger than the stated limit.

from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)
n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all the logic here
        # add to an ordered list where the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])

ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []

for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]
# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
Flew answered 4/9, 2014 at 2:8 Comment(2)
In my test, maintaining a dictionary of ngram/count was twice as fast as your solution, and it was simpler code. – Heterotaxis
I would expect both; that's why I mentioned it inline with comments. I've added an answer in C++ to my post. This was a fun little problem, I think. – Flew

So the basic recipe for this problem would be:

  1. Find all n-grams in string
  2. Map all duplicate entries into a new structure that has the n-gram and the number of times it occurs

You can find my C++ solution here: http://ideone.com/MNFSis

Given:

const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM+1;
//we will need a maximum of "the length of our string" - "the length of our n-gram" + 1
//places to store our n-grams, and each n-gram needs NGRAM+1 chars for its '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };

Then, for the first step - this is the code:

const char *ptr = str;
int idx = 0;
//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) { 
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalnum() (words are alphanumeric)
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM
        //characters are valid alphanumerics
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}

At this point, we have a list of all n-grams. Let's find the most popular one:

FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list

for (int i=0; i<NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;
    //insertFreqNode takes a start node, this where we will start to search for duplicates
    //the simplest description is like this:
    //  1 we search from head down each child, if we find a node that has text equal to
    //    ngrams[i], then we update its frequency count
    //  2 if the freq is >= to the current winner we place this as head.next
    //  3 after program is complete, our most popular nodes will be the first nodes
    //    I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}

//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;

Good luck to you!

Flew answered 4/9, 2014 at 4:4 Comment(0)
S
1

Just for fun, I wrote a SQL version (SQL Server 2012):

if object_id('dbo.MaxNgram','IF') is not null
    drop function dbo.MaxNgram;
go

create function dbo.MaxNgram(
     @text      varchar(max)
    ,@length    int
) returns table with schemabinding as
return
    with 
    Delimiter(c) as ( select ' '),
    E1(N) as (
        select 1 from (values 
            (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
        )T(N)
    ),
    E2(N) as (
        select 1 from E1 a cross join E1 b
    ),
    E6(N) as (
        select 1 from E2 a cross join E2 b cross join E2 c
    ),
    tally(N) as (
        select top(isnull(datalength(@text),0))
             ROW_NUMBER() over (order by (select NULL))
        from E6
    ),
    cteStart(N1) as (
        select 1 union all
        select t.N+1 from tally t cross join delimiter 
            where substring(@text,t.N,1) = delimiter.c
    ),
    cteLen(N1,L1) as (
        select s.N1,
               isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
        from cteStart s
        cross join delimiter
    ),
    cteWords as (
        select ItemNumber = row_number() over (order by l.N1),
               Item       = substring(@text, l.N1, l.L1)
        from cteLen l
    ),
    mask(N) as ( 
        select top(@length) row_Number() over (order by (select NULL))
        from E6
    ),
    topItem as (
        select top 1
             substring(Item,m.N,@length) as Ngram
            ,count(*)                    as Length
        from cteWords   w
        cross join mask m
        where m.N     <= datalength(w.Item) + 1 - @length
          and @length <= datalength(w.Item) 
        group by 
            substring(Item,m.N,@length)
        order by 2 desc, 1 
    )
    select d.s
    from (
        select top 1 NGram,Length
        from topItem
    ) t
    cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go

which, when invoked with a slightly extended version of the OP's sample input,

set nocount on;
select s as [ ] from MaxNgram(
    'aaaab a0a baaab c aab'
   ,3
);
go

yields as desired

------------------------------
aaa
3
Sumatra answered 4/9, 2014 at 6:1 Comment(0)

If you're not bound to C: I wrote this Python script in about 10 minutes, and it processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in 0.4 s (apart from printing the values on the screen).
The text used for the test is Ulysses by James Joyce; you can find it for free at https://www.gutenberg.org/ebooks/4300

Word separators here are both the space and the newline \n

import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]
ngrams = {}
for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1
ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)
ngrams_by_freq = sorted(ngrams_by_freq.items())
for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
Corpulent answered 28/1, 2017 at 18:29 Comment(0)

You can convert each trigram into a RADIX50 code. See http://en.wikipedia.org/wiki/DEC_Radix-50

In radix50, the output value for a trigram fits into a 16-bit unsigned int.

Thereafter, you can use the radix-encoded trigram as an index into an array.

So, your code would be like:

#include <stdint.h>
#include <string.h>

uint16_t counters[1 << 16]; // 64K counters, one per possible radix-50 trigram

memset(counters, 0, sizeof(counters)); // portable equivalent of bzero()

// note: with SPACE in the radix-50 alphabet, this also counts trigrams
// that span a word boundary
for (const char *p = txt; p[2] != '\0'; p++)
    counters[radix50(p)]++;

Thereafter, just search for the max value in the array, and decode that index back into a trigram.
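
For reference, a sketch of what a radix50() encoder might look like, assuming the DEC Radix-50 alphabet from the linked article (SPACE, A-Z, $, ., UNDEF, 0-9) with lowercase folded to uppercase:

/* map one character to its base-40 digit */
static unsigned radix50_char(char c) {
    if (c == ' ') return 0;
    if (c >= 'a' && c <= 'z') c -= 'a' - 'A';        /* fold case */
    if (c >= 'A' && c <= 'Z') return 1 + (c - 'A');  /* 1..26 */
    if (c == '$') return 27;
    if (c == '.') return 28;
    if (c >= '0' && c <= '9') return 30 + (c - '0'); /* 30..39 */
    return 29;                                       /* UNDEF */
}

/* three base-40 digits: max value 39*1600 + 39*40 + 39 = 63999 < 2^16 */
static unsigned radix50(const char *p) {
    return (radix50_char(p[0]) * 40 + radix50_char(p[1])) * 40
         + radix50_char(p[2]);
}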

I used this trick when I implemented the Wilbur-Khovayko algorithm for fuzzy search, about 10 years ago.

You can download source here: http://itman.narod.ru/source/jwilbur1.tar.gz.

Colter answered 4/9, 2014 at 1:15 Comment(3)
This is case-insensitive: A-Z, 0-9, SPACE, DOLLAR, DOT, and UNDEF. Enough to compute trigrams for a text string. – Colter
But you don't know the ngramLength parameter beforehand, and this heavily depends on the fact that n=3. Clever solution for a trigram, though. – Ponzo
Start with radix50; for each c in counters.sorteddesc, find the most frequent starting -> counters2. Now the only thing is to see if something could be bigger among the rest of the counters: if max(counters2) > next in counters and bigger than the local max found before, RESULT; else store the local max and proceed to the next item. – Epilimnion

You can solve this problem in O(nk) time where n is the number of words and k is the average number of n-grams per word.

You're correct in thinking that a hash table is a good solution to the problem.

However, since you have limited time to code a solution, I'd suggest using open addressing instead of linked lists. The implementation may be simpler: if you reach a collision, you just walk farther along the table.

Also, be sure to allocate enough memory for your hash table: about twice the expected number of n-grams should be fine. Since the expected number of n-grams is <= 250,000, a hash table of 500,000 slots should be more than sufficient.
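
A minimal sketch of that open-addressing idea with linear probing (the slot layout, table size, and hash function are my assumptions):

#include <stdint.h>
#include <string.h>

#define NGRAM_LEN 3
#define TABLE_SIZE (1u << 19)  /* ~524k slots, above the suggested 500,000 */

typedef struct {
    char key[NGRAM_LEN + 1];   /* an empty string marks a free slot */
    unsigned count;
} Slot;

static Slot table[TABLE_SIZE]; /* static, so zero-initialized */

/* FNV-1a hash over the n-gram bytes */
static uint32_t fnv1a(const char *s, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* on a collision, just walk farther along the table (linear probing) */
void count_ngram(const char *ngram) {
    uint32_t i = fnv1a(ngram, NGRAM_LEN) & (TABLE_SIZE - 1);
    while (table[i].key[0] != '\0' &&
           memcmp(table[i].key, ngram, NGRAM_LEN) != 0)
        i = (i + 1) & (TABLE_SIZE - 1);
    if (table[i].key[0] == '\0')
        memcpy(table[i].key, ngram, NGRAM_LEN);
    table[i].count++;
}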

In terms of coding speed, the small input length (250,000 characters) makes sorting and counting a feasible option. The quickest way to write is probably to generate an array of pointers to each n-gram, sort the array using an appropriate comparator, and then walk along it keeping track of which n-gram appeared the most.

Turin answered 28/1, 2017 at 19:34 Comment(0)

One simple Python solution for this question:

your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for word in str_list:
    start = 0
    end = ngram_len
    len_word = len(word)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(word[start:end]):
                str_hash[word[start:end]] = str_hash.get(word[start:end]) + 1
            else:
                str_hash[word[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print('"%s" with a frequency of %s' % (ngram[0], ngram[1]))
Varicolored answered 6/3, 2021 at 16:3 Comment(0)
