How to Use Effeciently Where Clause or Select in LINQ Parallel in Large Dataset
Asked Answered
O

2

7

I'm having approx 250,000 records as marked as Boss, each Boss has 2 to 10 Staff. Daily I need to get the details of the Staff. Approx there are 1,000,000 staff. I'm using Linq to get the Unique list of Staff who are worked in daily basis. Consider the following C# LINQ and Models

void Main()
{

    List<Boss> BossList = new List<Boss>()
    {
        new Boss()
        {
            EmpID = 101,
            Name = "Harry",
            Department = "Development",
            Gender = "Male",
            Employees = new List<Person>()
            {
                new Person() {EmpID = 102, Name = "Peter", Department = "Development",Gender = "Male"},
                new Person() {EmpID = 103, Name = "Emma Watson", Department = "Development",Gender = "Female"},

            }
        },
        new Boss()
        {
            EmpID = 104,
            Name = "Raj",
            Department = "Development",
            Gender = "Male",
            Employees = new List<Person>()
                    {
                        new Person() {EmpID = 105, Name = "Kaliya", Department = "Development",Gender = "Male"},
                        new Person() {EmpID = 103, Name = "Emma Watson", Department = "Development",Gender = "Female"},

                    }
        },

        ..... ~ 250,000 Records ......

    };

    List<Person> staffList = BossList
    .SelectMany(x =>
        new[] { new Person { Name = x.Name, Department = x.Department, Gender = x.Gender, EmpID = x.EmpID } }
        .Concat(x.Employees))
    .GroupBy(x => x.EmpID) //Group by employee ID
    .Select(g => g.First()) //And select a single instance for each unique employee
    .ToList();
}

public class Person
{
    public int EmpID { get; set; }
    public string Name { get; set; }
    public string Department { get; set; }
    public string Gender { get; set; }
}

public class Boss
{
    public int EmpID { get; set; }
    public string Name { get; set; }
    public string Department { get; set; }
    public string Gender { get; set; }
    public List<Person> Employees { get; set; }
}

In the above LINQ I'm getting the List of Distinct Employees or Staff, the list contains more than 1,000,000 records. From the Obtained List I need to search "Raj"

staffList.Where(m => m.Name.ToLowerInvariant().Contains("Raj".ToLowerInvariant()));

For this operation, it took more than 3 to 5 minutes to get the result.

How could I make it more efficient. Kindly assist me...

Oxytocin answered 9/3, 2016 at 11:6 Comment(20)
What's the objective here? Why do you need to get that many records to find raj?Ashurbanipal
In my real project there are so many properties are there in Person Model, for example LedgerAccounts, NoOfDaysPresent, etc., If Raj cames and asks the attendence, now I need to fetch the Raj's Record...Oxytocin
Ok. So Raj is the boss? and you are trying to get attendance of his subordinates?Ashurbanipal
For this kind of operation, my suggestion is putting data on elastic search, rather than putting in C# collection on memory, this way would be better for full text index searchQuincunx
@KosalaW Here we are not speaking about boss or assistant. all are the Employees of the company. I need to fetch all those details.Oxytocin
From database perspective, you must have an index on the Name, since you need to search by name. Try to narrow down the search by using more fields if possible. Limit the search results to 100-200. Even if a search returns more than that, no one is going to look at them at once.Ashurbanipal
Where are you persisting the data? If it's in a database then you should push the querying out of your application memory down to the database. You can optimise with indexes then too.Erymanthus
Try same query in SQL. Time it and check the query plan. Then do suggested optimisations.Ashurbanipal
@KosalaW Ok I try it. Is there is any way in Parallel Linq to optimize the search?Oxytocin
Parallel linq will not help in this case. The performance issue is at the database level.Ashurbanipal
Parallel does help, though. you just need to partitions for search. The performance will improve a little bit. But The best approach is using full text index.Quincunx
I was talking about doing the "whole thing" using a single query. So partitioning should not be required at all.Ashurbanipal
you need to write the sql for this.... in other words "The performance issue is at the database level", dont try and load all of them in mem, no one is going to view 10 000 at once. think about it logically, as to what one person would view at one sitting.Bedcover
@CuongLe the performance increase from Parallel processing in the case wouldn't warrant the effort!!!. the problem is the amount of records and the logic used to retrieve them. This is a data logical issue not coding one.. its not CPU intensive... its data heavy! not! CPU heavy.Bedcover
@Bedcover Yes I agree with you, no one can view 10,000 records at once. But I need to search a Particular record.Oxytocin
@B.Balamanigandan, usually what helps a lot!!!... is create a simple UI frontend mockup... that will help you with working out what you actually need from the db. So you have a search box.... what field does this search on. is it more than one..., then implement paging. no more than 100 results at a time. Don't try and do this in code.... build the SQL then see if you can convert it to a model ef based way. FYI at CuongLe "full text index" is NOT needed, Hes not searching on paraphrase of text.Bedcover
@Oxytocin how do you know what person belong to which bosses from the db? there seems to be a design flaw with the code given.Bedcover
@Bedcover - In the real scenario, It shows the result with Name and Mobile Number. Based on mobile number and name we find the record. We are hitting the Service Once, after that we did the search query in the local, In our database we are having more than 30,00,000 Records in the Employee Table.Oxytocin
@Oxytocin Include the sql which you would use to query what you want... dnt get harped up on the amount of records...sql is extremely quick.. sounds like you need and index on name and mobile number tho.Bedcover
Can you explain the misplaced commas in here 2,50,000 and 10,00,000 and 30,00,000? Are those integers of 250000 and 1000000 and 3000000 or properly 250,000 and 1,000,000 and 3,000,000 or is that a non-English decimal separator? (which would seem odd) So for 250,000 * 5 (the 2-10 staff) that is 500,000 staff minimum or 2.5 million max with 10 staff...Daigle
K
2

If you change Boss to inherit from Person ( public class Boss : Person ) not only do you not need to duplicate your properties in Person and Boss, you don't have to create all new Person instances for each Boss, because a Boss is already a Person:

IEnumerable<Person> staff = BossList 
    .Concat(BossList
        .SelectMany(x => x.Employees)
    )
    .DistinctBy(p => p.EmpId)
    .ToList()

Where DistinctByis defined as

public static IEnumerable<TSource> DistinctBy<TSource, TKey>
    (this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    var seenKeys = new HashSet<TKey>();
    foreach (TSource element in source)
    {
        if (seenKeys.Add(keySelector(element)))
        {
            yield return element;
        }
    }
}

Also, in your comparison, you're converting every Name to lowercase and doing the comparison - that's a lot of string creation that you don't need. Instead, try something like

staffList.Where(m => m.Name.Equals("Raj", StringComparison.InvariantCultureIgnoreCase));

Also, be aware that your use of Contains would also match names like Rajamussenand mirajii - possibly not what you were expecting.

Koel answered 13/3, 2018 at 19:53 Comment(0)
F
0

Would it work for you to change staffList to a Dictionary? A better search algorithm as those from Dictionary and SortedList would get you the most improvement.

I've tested the code below and it runs in just a few seconds.

    private static void Main()
    {

        List<Boss> BossList = new List<Boss>();
        var b1 = new Boss()
        {
            EmpID = 101,
            Name = "Harry",
            Department = "Development",
            Gender = "Male",
            Employees = new List<Person>()
            {
                new Person() {EmpID = 102, Name = "Peter", Department = "Development", Gender = "Male"},
                new Person() {EmpID = 103, Name = "Emma Watson", Department = "Development", Gender = "Female"},

            }
        };

        var b2 = new Boss()
        {
            EmpID = 104,
            Name = "Raj",
            Department = "Development",
            Gender = "Male",
            Employees = new List<Person>()
            {
                new Person() {EmpID = 105, Name = "Kaliya", Department = "Development", Gender = "Male"},
                new Person() {EmpID = 103, Name = "Emma Watson", Department = "Development", Gender = "Female"},

            }
        };

        Random r = new Random();
        var genders = new [] {"Male", "Female"};

        for (int i = 0; i < 1500000; i++)
        {
            b1.Employees.Add(new Person { Name = "Name" + i, Department = "Department" + i, Gender = genders[r.Next(0, 1)], EmpID = 200 + i });
            b2.Employees.Add(new Person { Name = "Nam" + i, Department = "Department" + i, Gender = genders[r.Next(0, 1)], EmpID = 1000201 + i });
        }

        BossList.Add(b1);
        BossList.Add(b2);

        Stopwatch sw = new Stopwatch();
        sw.Start();

        var emps = BossList
            .SelectMany(x =>
                new[] {new Person {Name = x.Name, Department = x.Department, Gender = x.Gender, EmpID = x.EmpID}}
                    .Concat(x.Employees))
            .GroupBy(x => x.EmpID) //Group by employee ID
            .Select(g => g.First());

        var staffList =  emps.ToList();
        var staffDict = emps.ToDictionary(p => p.Name.ToLowerInvariant() + p.EmpID);
        var staffSortedList = new SortedList<string, Person>(staffDict);

        Console.WriteLine("Time to load staffList = " + sw.ElapsedMilliseconds + "ms");

        var rajKeyText = "Raj".ToLowerInvariant();
        sw.Reset();
        sw.Start();

        var rajs1 = staffList.AsParallel().Where(p => p.Name.ToLowerInvariant().Contains(rajKeyText)).ToList();
        Console.WriteLine("Time to find Raj = " + sw.ElapsedMilliseconds + "ms");

        sw.Reset();
        sw.Start();

        var rajs2 = staffDict.AsParallel().Where(kvp => kvp.Key.Contains(rajKeyText)).ToList();
        Console.WriteLine("Time to find Raj = " + sw.ElapsedMilliseconds + "ms");

        sw.Reset();
        sw.Start();

        var rajs3 = staffSortedList.AsParallel().Where(kvp => kvp.Key.Contains(rajKeyText)).ToList();
        Console.WriteLine("Time to find Raj = " + sw.ElapsedMilliseconds + "ms");

        Console.ReadLine();
    }

    public class Person
    {
        public int EmpID { get; set; }
        public string Name { get; set; }
        public string Department { get; set; }
        public string Gender { get; set; }
    }

    public class Boss
    {
        public int EmpID { get; set; }
        public string Name { get; set; }
        public string Department { get; set; }
        public string Gender { get; set; }
        public List<Person> Employees { get; set; }
    }
}

Output1:

enter image description here

Output2 (using .AsParallel() on searches):

enter image description here

In other words, if you can't use some faster data structure, up can speed your search up just by changing form

staffList.Where(m => m.Name.ToLowerInvariant().Contains("Raj".ToLowerInvariant()));

to

staffList.AsParallel().Where(m => m.Name.ToLowerInvariant().Contains("Raj".ToLowerInvariant()));
Farlie answered 9/3, 2016 at 13:16 Comment(15)
No, here I'm using Search Text Box based on Search Text, It returns the result. I don't input the exact name... That's the issue...Oxytocin
I've made some adjustments to compare performance. Also, why can't you use Dictionary or SortedList? You want to find all employees where their name contains a substring, right? You won't get the O(1) search, but it will be faster than the list search.Farlie
Thanks for your time spending. The Dictionary can check the key is exist or not. If the Key is exist then we can get the result. Here I feed only the rough text. Then how it could return the result ? you are checking kvp.Key.Contains(rajKeyText) the exact phrase of Key... Here its not possible. Because more than one Raj will be there...Oxytocin
That is why I concatenated the EmpID to then Name to get the keys. You could even remove the EmpID text from the key before checking if key.Contains(some_text).Farlie
If you need a more flexible search method, I recommend you do something like @Cuong Le suggested when he commented on your question.Farlie
If I use Dictionary means I need to give the exact word, otherwise it won't return the result am I right?Oxytocin
Not really. That is true only if you want to get the value associated to a key, which is not what we need to do in your case. We need to get all items (Person) in the dictionary where it`s key contains some text.Farlie
As an alternative, you could just add the AsParallel() to your original code. That should speed things up, as you can see in Output2 from my answer.Farlie
@EduardoCMB: AsParallel is absolutely not required in this case. This should be done in a simple LIKE operator with top 100 or 200 in SQL terms. Linq Contains gets translated in to sql LIKE operator. You can combine that with Take(100).Ashurbanipal
Could you please provide more details to what it is you are really trying to do and some context? If you are running a search on a in-memory object, then AsParallel would speed things up. If you are running a SQL query (under the hood somehow), then you should look into indexing and full-text search.Farlie
Also, is there something I'm missing from your question's title "How to Use Effeciently Where Clause or Select in LINQ Parallel in Large Dataset"?Farlie
Even after your update, I can't see why you can't do: staffList.AsParallel().Where(m => m.Name.ToLowerInvariant().Contains("Raj".ToLowerInvariant())).ToList();Farlie
@Oxytocin you are right a dictionary is useless here. Look what you need to be doing, it the search on the database names instead. 1) Index whatever fields you will query on the database 2) make the interface show current results whilst typing if possible (limited to 5 or so), so the user types and get's a dropdown with basic information about the first 5 users it found, one more thing, the user will be at a specific location right? so you can use that as another search filter.Choe
3) You use linq to sQL right? Don't be doing this search in lists. Do it with IQueryable objects and then use .ToList when you have the conditions all set up, then it will send the one query to the database, and return just 5 users, every time the user presses "search", then use whatever ID the user selects to get the one record from the databaseChoe
Can you please show us how you are getting the data from the databaseChoe

© 2022 - 2024 — McMap. All rights reserved.