Why does LINQ GroupBy produce different results when preceded by ToArray()?
Asked Answered
A

2

6

I'm pulling some data from SQL using Entity Framework.

I've written some code that looks like the snippet below. Note that Something is a varchar value from the database. Furthermore, I think it may be relevant that every value in Something consists of ten digits, a hyphen, and then two more digits, such as "0123456789-01". (Don't remind me this is bad design; I'm not the architect of this database.)

var X = Entities.MyView
    .Select(x => x.Something)
    .Distinct();
// 5850 elements

var Y = Entities.MyView
    .GroupBy(x => x.Something);
// 5850 elements

var Z = Entities.MyView
    .ToArray()
    .GroupBy(x => x.Something);
// 5727 elements

// Added in response to user995219's Answer
var ZZ = Entities.MyView
    .GroupBy(x => x.Something)
    .ToArray();
    // 5850 elements

The first statement pulls down the distinct values of the Something column from the view MyView. It gets me the result 5850 elements, which is what I expect.

The second statement pulls down the whole view and groups it by unique values of Something, yielding 5850 groupings, which is what I expect.

The third statement is exactly the same, except it calls ToArray on the values from MyView. Normally, I would do this when I expect the calling code to use the entirety of the data and don't want to deal with any Lazy Loading SNAFUs. However, when I run this code in the debugger, this third statement yields 5727 groupings, 123 less than I expect.

Edit:The fourth statement simply reverses the order of the GroupBy and ToArray method chaining. Making this small change, I get the number of rows that I expect, but the real problem that this question cannot be solved this way because the real problem has code that associates data from this SQL call with other data and returns them in a single object.

I wish to understand how it is that using ToArray causes me to get the wrong number of results.

Follow-up: In response to user995219's answer, I rewrote MyView so that it has an additional column: SomethingInt, which simply containins a 32-bit integer formed by omitting the hyphen from Something and considering the result as a single integer. Then, the LINQ code does the GroupBy on the newly minted integer.

However, even with this change, I'm still having the same problem.

var A = Entities.MyView
    .ToArray();
// Returns 17893 elements, the same as if I ran SELECT * FROM MyView

var array0 = A.Select(x => x.SomethingInt).Distinct();
// Returns 5727 elements when I expect 5850

In SQL, SELECT COUNT(DISTINCT(SomethingInt)) FROM MyView; returns 5850, as I expect.

This should take any kind of string-comparison in the .net framework out of the picture, but the problem persists.

Antihelix answered 7/2, 2014 at 1:37 Comment(4)
has third statement correct count before GroupBy ? group by needs to IEqualityComparer for abjects when do that in memory, however your group by key is a primitive type, whereas you deal with entityframework could you answer this: what are EdmFunctions for?Sciolism
I have edited my post to clarify that Something that I group by is a varchar in SQL. And I'm sorry, but I don't know the answer to the question you link to.Antihelix
I would say that when group by is executed without toarray it is converted to sql expression. When you apply toarray group by is executed on object's property. So if you did code first and made some extra work like trim string, data could be grouped more. Please +1 for my fantasy.Mexican
I have updated my question with the information that Something contains strings of 10 digits, a hyphen, and then two digits.Antihelix
A
2

I found the answer. user995219's answer was useful, but not the full explanation.

Apparently, LINQ methods check the contents of what they are operating on. In my case, I'm using classes generated by Entity Framework. These have "Entity Keys", which allow the .net framework to distinguish between two rows that have the same contents and two instances of the same row.

In my case, I was using a complicated view and the .net framework inferred the entity keys incorrectly and then discarded rows because it thought they were the same.

The solution for me was to modify my view so that there is a GUID that uniquely identifies each row and use the GUIDs as an entity key.

Antihelix answered 13/2, 2014 at 2:46 Comment(0)
H
5

The ToArray statement executes the query and the group by is performed in memory using string equality comparer. The first two statements perform the group by at SQL level. There are differences: String Comparison differences between .NET and T-SQL?

Particularly concerning collation.

Hemipode answered 7/2, 2014 at 16:28 Comment(0)
A
2

I found the answer. user995219's answer was useful, but not the full explanation.

Apparently, LINQ methods check the contents of what they are operating on. In my case, I'm using classes generated by Entity Framework. These have "Entity Keys", which allow the .net framework to distinguish between two rows that have the same contents and two instances of the same row.

In my case, I was using a complicated view and the .net framework inferred the entity keys incorrectly and then discarded rows because it thought they were the same.

The solution for me was to modify my view so that there is a GUID that uniquely identifies each row and use the GUIDs as an entity key.

Antihelix answered 13/2, 2014 at 2:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.