.NET library for text algorithms?
Asked Answered
P

6

28

Do you know any .NET library for text algorithms??
Especially I'm interested in strings match, and full-text-search algorithms like

  • Bitap algorithm
  • Levenshtein distance
  • Damerau–Levenshtein distance

I know the one I have mentioned are pretty simple to code, but there are hundreds of text algorithms, i don't want to code them all by myself.
If there is no such .NET library known, you can mention C, C++ library, coding wrapper will be easer than coding from zero.

Paraldehyde answered 22/12, 2010 at 10:39 Comment(0)
P
4

I managed to find implementations of most algorithms i need using combination of WikiPedia + Google Code search.

http://en.wikipedia.org/wiki/Category:Algorithms_on_strings
http://www.google.com/codesearch

Though it's strange that no one has created project on this subject, where interested people could collaborate on this.

Paraldehyde answered 22/12, 2010 at 12:33 Comment(0)
P
18

You may be interested in checking out the google-diff-match-patch library on Google Code. They have an implementation of Myer's diff algorithm and it claims to also implement a Bitap algorithm "at the heart".

It has the C# source that you're looking for as well as implementations in Java, C++, Lua & Python. Although I don't have the best understanding of how to use Bitap in practice (there are demos in the Google Code project) I think you'll be most interested in the match functions starting around line 1476 of the current version.

UPDATE:

A little digging found an implementation of Levenshtein in C# on CodeProject.

Also, this C# class file contains an implementation of Levenshtein on SourceForge. The implementation is part of the Corsis (aka Tenka Text) project. Author claims that the YetiLevenshtein method (around line 741) is 2x to 10x faster than the implementation used in the CodeProject version of the algorithm referenced above.

UPDATE #2:

I just discovered the wikibook Algorithm implementation with it's C# version of Levenshtein Distance and had to include it because it looks pretty straight and to the point. This wikibook looks like a great reference to keep on hand in general.

Levenshtein Distance in C# (courtesy of Wikibooks)

    private Int32 levenshtein(String a, String b)
    {

        if (string.IsNullOrEmpty(a))
        {
            if (!string.IsNullOrEmpty(b))
            {
                return b.Length;
            }
            return 0;
        }

        if (string.IsNullOrEmpty(b))
        {
            if (!string.IsNullOrEmpty(a))
            {
                return a.Length;
            }
            return 0;
        }

        Int32 cost;
        Int32[,] d = new int[a.Length + 1, b.Length + 1];
        Int32 min1;
        Int32 min2;
        Int32 min3;

        for (Int32 i = 0; i <= d.GetUpperBound(0); i += 1)
        {
            d[i, 0] = i;
        }

        for (Int32 i = 0; i <= d.GetUpperBound(1); i += 1)
        {
            d[0, i] = i;
        }

        for (Int32 i = 1; i <= d.GetUpperBound(0); i += 1)
        {
            for (Int32 j = 1; j <= d.GetUpperBound(1); j += 1)
            {
                cost = Convert.ToInt32(!(a[i-1] == b[j - 1]));

                min1 = d[i - 1, j] + 1;
                min2 = d[i, j - 1] + 1;
                min3 = d[i - 1, j - 1] + cost;
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];

    }
Phalanstery answered 22/12, 2010 at 11:13 Comment(3)
Yeah, that's a good start, this lib contains lot's of useful code.Paraldehyde
Considering that this library is used by Google Docs (according to the project home page) it's bound to be good enough for your string matching needs.Phalanstery
You know what else would be an interesting read? The source code for the diff feature used on SO (Example: stackoverflow.com/posts/4508581/revisions)Phalanstery
P
4

I managed to find implementations of most algorithms i need using combination of WikiPedia + Google Code search.

http://en.wikipedia.org/wiki/Category:Algorithms_on_strings
http://www.google.com/codesearch

Though it's strange that no one has created project on this subject, where interested people could collaborate on this.

Paraldehyde answered 22/12, 2010 at 12:33 Comment(0)
D
3

If you're doing string matching, Lucene.Net is worth a look.

However, I know this is not exactly what you're after, and while you can find most of those algorithms in some C# form around, I know of no library incorporating them (I've tended to keep a couple of these in my personal library).

Out of interest, why would you ever need more than one of these full-match algorithms with a couple of threshold parameters?

Disrepair answered 22/12, 2010 at 10:58 Comment(0)
F
2

I suggest SimMetrics library, it has many different algorithms for string matching. Available also on NuGet.

Short description:

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).

GPLv2 license.

Fringe answered 28/6, 2016 at 8:19 Comment(0)
H
1

here is one I implemented for Levenshtein / Damerau–Levenshtein distance:

    public static int GetDistance(string left, string right, bool isDamerauDistanceApplied)
    {
        if (left.Length == 0) return right.Length;
        if (right.Length == 0) return left.Length;

        var lenLeft = left.Length;
        var lenRight = right.Length;

        var matrix = new int[lenLeft + 1, lenRight + 1];

        for (var i = 0; i <= lenLeft; i++)
            matrix[i, 0] = i;

        for (var i = 0; i <= lenRight; i++)
            matrix[0, i] = i;

        for (var i = 1; i <= lenLeft; i++)
        {
            for (var j = 1; j <= lenRight; j++)
            {
                var cost = (left[i - 1] == right[j - 1]) ? 0 : 1;

                matrix[i, j] = Math.Min(Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1), matrix[i - 1, j - 1] + cost);

                if (isDamerauDistanceApplied)
                {
                    // Fixed for string base 0 index.
                    if (i > 1 && j > 1 && left[i - 1] == right[j - 2] && left[i - 2] == right[j - 1])
                    {
                        matrix[i, j] = Math.Min(matrix[i, j], matrix[i - 2, j - 2] + cost);
                    }
                }
            }
        }

        return matrix[lenLeft, lenRight];
    }
Honorary answered 7/1, 2014 at 21:39 Comment(0)
D
0

I found and used the following .NET library implementing Aho-Corasick text matching from Tom Petricek on a problem I had. It worked great for me.

Dewberry answered 22/12, 2010 at 13:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.