How to keep string data for having best performance when selecting string list via LINQ by StartsWith and EndsWith queries

Asked 22/2, 2012 at 1:54 Answered 22/2, 2012 at 3:28

Solved c#performance linq c#-4.0 startswith

Now the question is pretty hard. I have a linq queries like the way below

var lstSimilars = from similarWords in lstAllWords

where similarWords.StartsWith(srWordLocal)

select similarWords;

foreach (string srVar in lstSimilars)
{
string srTempWord = srVar.Replace(srWordLocal, "");
if (dtWords.ContainsKey(srTempWord) == true)
{
csWords.updateSplitWord(srWordLocal + ";" + srTempWord, dtWords[srVar]);
}
}

lstSimilars = from similarWords in lstAllWords

where similarWords.EndsWith(srWordLocal)

select similarWords;

foreach (string srVar in lstSimilars)
{
string srTempWord = srVar.Replace(srWordLocal, "");
if (dtWords.ContainsKey(srTempWord) == true)
{
csWords.updateSplitWord(srWordLocal + ";" + srTempWord, dtWords[srVar]);
}
}

Now lstAllWords is a string list variable generated like the way below

        List<string> lstAllWords = new List<string>();


        for (int i = 0; i < dsWordsSplit.Tables[0].Rows.Count; i++)
        {
            lstAllWords.Add(dsWordsSplit.Tables[0].Rows[i]["Word"].ToString());
        }

My question is how should i keep that Words data for having best LINQ selection performance. I mean currently i am keeping it as a string list. But can i keep it different way and have better performance ?

dtWords is a dictionary object

C# C#-4.0 LINQ

Vexed answered 22/2, 2012 at 1:54 Comment(0)

If all you want is efficiently finding words that start or end with given substring, employing the SortedSet will help you do that in O(log(N)) time.

The idea is to put words in two SortedSets:

one for original words and
the other for reversed words.

Toy implementation:

class WordSet {

    public WordSet(IEnumerable<string> words) {
        m_OriginalWords = new SortedSet<string>(words);
        m_ReverseWords = new SortedSet<string>(words.Select(ReverseString));
    }

    /// <summary>
    ///     Finds all words that start with given prefix.
    /// </summary>
    public IEnumerable<string> FindPrefix(string prefix) {
        return FindImp(m_OriginalWords, prefix);
    }

    /// <summary>
    ///     Finds all words that end with the given suffix.
    /// </summary>
    public IEnumerable<string> FindSuffix(string suffix) {
        return FindImp(m_ReverseWords, ReverseString(suffix)).Select(ReverseString);
    }

    static IEnumerable<string> FindImp(SortedSet<string> word_set, string s) {
        if (s.CompareTo(word_set.Max) <= 0) {
            foreach (string word in word_set.GetViewBetween(s, word_set.Max)) {
                if (!word.StartsWith(s))
                    break;
                yield return word;
            }
        }
    }

    static string ReverseString(string src) {
        return new string(src.Reverse().ToArray());
    }

    readonly SortedSet<string> m_OriginalWords;
    readonly SortedSet<string> m_ReverseWords;

}

class Program {

    static void TestImp(string s, IEnumerable<string> words) {
        Console.Write(s);
        foreach (var word in words) {
            Console.Write('\t');
            Console.Write(word);
        }
        Console.WriteLine();
    }

    static void TestPrefix(WordSet word_set, string prefix) {
        TestImp(prefix, word_set.FindPrefix(prefix));
    }

    static void TestSuffix(WordSet word_set, string suffix) {
        TestImp(suffix, word_set.FindSuffix(suffix));
    }

    static void Main(string[] args) {

        var word_set = new WordSet(
            new[] {
                "a",
                "b",
                "ba",
                "baa",
                "bab",
                "bba",
                "bbb",
                "bbc",
            }
        );

        Console.WriteLine("Prefixes:");
        TestPrefix(word_set, "a");
        TestPrefix(word_set, "b");
        TestPrefix(word_set, "ba");
        TestPrefix(word_set, "bb");
        TestPrefix(word_set, "bc");

        Console.WriteLine("\nSuffixes:");
        TestSuffix(word_set, "a");
        TestSuffix(word_set, "b");
        TestSuffix(word_set, "ba");
        TestSuffix(word_set, "bb");
        TestSuffix(word_set, "bc");

    }

}

This prints:

Prefixes:
a       a
b       b       ba      baa     bab     bba     bbb     bbc
ba      ba      baa     bab
bb      bba     bbb     bbc
bc

Suffixes:
a       a       baa     ba      bba
b       b       bab     bbb
ba      ba      bba
bb      bbb
bc      bbc

If you have to search for infixes as well, then the above is not enough - you'll need a suffix tree or array, but this is no picnic implementing correctly and efficiently.

BTW, If the data happens to be in the database, you can let the DBMS do essentially the same thing by:

creating a computed column or virtual column that is reverse of the original word column,
indexing both original word column and the reversed word column (or, alternatively, using a function-based index if that's what your DBMS supports).
querying for prefix as: ORIGINAL_WORD_COLUMN LIKE 'pefix%'
and for suffix as: REVERSED_WORD_COLUMN LIKE 'reversed_suffix%'.

Posse answered 22/2, 2012 at 3:28 Comment(2)

Hello. Thanks a lot for the answer though i later discovered that i don't need to do both startswith or endswith. I just need one of them. So what would be your best suggestion for doing this. Using C# objects or MS-SQL ? Though your sorted set improved the software already. – Orectic 22/2, 2012 at 11:38

@MonsterMMORPG In the client-server environment, client-level filtering requires all the rows to be read from server's disk and transferred from server to client. If the server does the filtering, only the filtered rows are transferred (and if you did your indexing correctly, only filtered rows are even read from the disk). Obviously, the second scenario scales much better, so always prefer to do the filtering in the database. Whether you'll do it through "naked" SQL and ADO.NET, or through LINQ to SQL or Entity Framework or NHibernate etc... is up to you. – Posse 22/2, 2012 at 13:15

A string list should be sufficiently performant for selecting from, but you're adding some boxing/unboxing operations by selecting into and then iterating over a var. You can use a strongly-typed List<string> as your recipient of LINQ query results for a performance boost, but it'll likely only be noticeable for very large datasets.

Birdsall answered 22/2, 2012 at 1:59 Comment(5)

Thanks for the answer. I compared SQL server selection and LINQ selection. Currently LINQ is about 18% faster. – Orectic 22/2, 2012 at 2:0

Also when i do that it gives the error at this image : img40.imageshack.us/img40/6820/errorjj.png Please look at the image. Because of that error i can not define it as List<string> – Orectic 22/2, 2012 at 2:2

List<string> lstSimilars = (from similarWords in lstAllWords where similarWords.StartsWith(srWordLocal) select similarWords).ToList<string>(); – Birdsall 22/2, 2012 at 2:3

thanks working now though it did not make any speed effect :D – Orectic 22/2, 2012 at 2:12

Sorry, but the boxing information is incorrect. var has nothing to do with boxing. Strings are reference types, and boxing does not apply to reference types. Furthermore, if you're only going to iterate the "where" iterator once, there's no performance benefit to creating a list; in fact, there is a performance penalty of allocating the storage for the list. – Carrasco 22/2, 2012 at 21:0

Recommended topics

Hot tags