How to make my hashing algorithm faster
My question relates to a task from CS50, pset5. For those who don't know it, a short explanation: nothing very special. I need to write a function that takes a dictionary file (prepared beforehand; all words in it are uppercase) containing more than 20K words, and sorts them somehow. I wrote a simple, naive algorithm that builds a hash table sorting the words by their first letter. It passes all of the CS50 checks, so the program works correctly, but compared to the course's implementation it is too slow: the staff solution runs in 0.1s, while mine takes 5.0s - 7.0s. What can I improve in this code to make it faster? Or should I change everything completely? I have no experience with optimization, because I've just started learning. It would be great to learn from any of you =) Thanks in advance!

// Some constant values, which are declared before the function
#define LENGTH  46
#define ALPHALENGTH  26
/* Definition of node struct. Nothing special, in fact =) */
typedef struct node {
    char word[LENGTH +1];
    struct node *next;
} node;

node *hashTable[ALPHALENGTH];

bool load(const char *dictionary) { 
    FILE *f = fopen(dictionary, "r");

    if (f == NULL) {
        return false;
    }

    char word[LENGTH + 1];    
    int hash = 0;

    for (int i = 0; i < ALPHALENGTH; i++) {
        hashTable[i] = NULL;
    }

    // 46 is LENGTH; a macro name can't be expanded inside a string
    // literal directly, so the field width is written out by hand
    while (fscanf(f, "%46s", word) == 1) {
        // make every letter lowercase to get result from 0 - 25
        hash = tolower(word[0]) - 'a';
        node *new_node = malloc(sizeof(node));
        strcpy(new_node->word, word);

        // check if element is first in the list
        if (hashTable[hash] == NULL) {
            new_node->next = NULL;
            hashTable[hash] = new_node;
        } else {
            node *ptr = hashTable[hash];

            do {
                if (ptr->next == NULL) {
                    break;
                }
                ptr = ptr->next;
            } while (true);

            ptr->next = new_node;
            new_node->next = NULL;
        }
    }

    fclose(f);

    return true;
}
Apparitor answered 14/7, 2016 at 5:43 Comment(2)
Is there any reason why you're using a linked list? Have you had a look at using any other data structure? Do you know the complexity of lookups for different data structures?Inapplicable
@Martinn Yeah, I understand that the complexity is O(n). It was a data structure we covered in lecture, so I decided to use it. We also had tries, but I understood absolutely nothing about them :) Do you think the problem is the linked list itself, as a structure?Apparitor
The problem is not in your hash function, nor in the size of your hash table; it is in your list management: your method of appending words to the corresponding lists has a complexity of O(N²).

By the way, your hash function is not used for hashing, but for dispatching. You are sorting the table only on the first letter of each word, keeping the words with the same initial in the same order. If you meant to sort the dictionary completely, you would still need to sort each list.

You can drastically improve the performance while keeping the same semantics by prepending to the lists and reversing the lists at the end of the parsing phase.

For a dictionary with 20 thousand words, the code below runs 50 times faster (as expected by the CS50 site):

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LENGTH  46
#define ALPHALENGTH  26
typedef struct node {
    struct node *next;
    char word[LENGTH +1];
} node;

node *hashTable[ALPHALENGTH];

bool load(const char *dictionary) {
    FILE *f = fopen(dictionary, "r");

    if (f == NULL) {
        return false;
    }

    char word[LENGTH + 1];
    int hash = 0;

    for (int i = 0; i < ALPHALENGTH; i++) {
        hashTable[i] = NULL;
    }

    while (fscanf(f, "%46s", word) == 1) {
        node *new_node = malloc(sizeof(node));
        if (new_node == NULL)
            return false;
        // make every letter lowercase to get result from 0 - 25
        hash = tolower(word[0]) - 'a';
        strcpy(new_node->word, word);
        /* prepending to the list */
        new_node->next = hashTable[hash];
        hashTable[hash] = new_node;
    }

    for (int i = 0; i < ALPHALENGTH; i++) {
        node *n, *prev, *next;
        /* reverse list */
        for (prev = NULL, n = hashTable[i]; n != NULL; ) {
            next = n->next;
            n->next = prev;
            prev = n;
            n = next;
        }
        hashTable[i] = prev;
    }

    fclose(f);

    return true;
}

void save(void) {
    for (int i = 0; i < ALPHALENGTH; i++) {
        for (node *n = hashTable[i]; n != NULL; n = n->next) {
            puts(n->word);
        }
    }
}

int main(int argc, char *argv[]) {
    if (argc > 1) {
        if (load(argv[1]))
            save();
    }
}

Changing the fscanf() to a simpler fgets() might provide a marginal performance improvement, at the cost of more restrictive semantics for the dictionary format.
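As a rough illustration of that trade-off, here is a minimal sketch of an fgets()-based reader. It assumes exactly one word per line (which CS50-style dictionaries provide, but fscanf("%46s") does not require), and the helper name read_word is made up for this example:

```c
#include <stdio.h>
#include <string.h>

#define LENGTH 46

/* Read one word per line; fgets() keeps the trailing newline,
   so strip it (fscanf("%46s") discards whitespace by itself).
   Buffer is LENGTH word chars + newline + NUL terminator. */
static int read_word(FILE *f, char word[LENGTH + 2]) {
    if (fgets(word, LENGTH + 2, f) == NULL)
        return 0;                          // EOF or read error
    word[strcspn(word, "\n")] = '\0';      // drop the newline, if any
    return 1;
}
```

The load() loop would then call read_word(f, word) instead of fscanf(f, "%46s", word), at the cost of breaking on dictionaries that put several words on one line.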

Sarad answered 14/7, 2016 at 16:15 Comment(1)
Thanks so much for the help! That is exactly the kind of solution that was lying right on the surface, and it is so easy once you see it!Apparitor
Your problem isn't your hash function; it's that your hash table is way too small.

From the sound of things, you have about 26 hash buckets for over 20,000 words. That averages out to roughly 770 words per bucket, and probably far more in some, as the hash function you're using is not uniform: very few words start with x or q, for instance.

Try expanding the hash table to 1000 entries (for instance), so that there are around 20 entries in each bucket. You will need a new hash function to do this; anything will work, but to work well it will need to generate values up to the size of the table. (Adding together the values of all the letters won't work, for instance, as it'll almost never reach 1000.)
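One common choice that meets that requirement is a djb2-style string hash reduced modulo the table size. This is just a sketch of the idea; the TABLESIZE value of 1000 is the hypothetical figure from above, not anything the course mandates:

```c
#define TABLESIZE 1000  /* hypothetical bucket count, per the suggestion above */

/* djb2-style string hash (Dan Bernstein's h = h*33 + c),
   reduced modulo the table size so it always yields a valid
   bucket index. Any reasonably uniform string hash would do. */
static unsigned hash_word(const char *word) {
    unsigned long h = 5381;
    for (const char *p = word; *p != '\0'; p++)
        h = h * 33 + (unsigned char)*p;
    return (unsigned)(h % TABLESIZE);
}
```

Because every character of the word feeds into the result, words sharing a first letter no longer pile into the same bucket.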

Recollected answered 14/7, 2016 at 6:12 Comment(1)
Expanding the "hash table" would change the semantics (aka the sort order), but would not change the complexity of the algorithm. Increasing the dictionary size would still lead to potentially long running times. In fact, a custom built dictionary with 20k times the same word would show the same slow timings.Sarad

© 2022 - 2024 — McMap. All rights reserved.