Is it possible to use a group by count in the havingPredicate for a CoreData fetch (for dupe detection)?
Asked Answered
P

3

5

For reference, the problem I'm trying to solve is efficiently finding and removing duplicates in a table that could have a lot of entries.

The table I am working with is called PersistedDay with a dayString object in it (it's a string. :-P). There are more columns that aren't relevant to this question. I'd like to find any PersistedDay's that have duplicates.

In SQL, this is one of the efficient ways you can do that (FYI, I can do this query on the CoreData backing SQLite DB):

SELECT ZDAYSTRING FROM ZPERSISTEDDAY GROUP BY ZDAYSTRING HAVING COUNT(ZDAYSTRING) > 1;

This returns ONLY the dayStrings that have duplicates and you can then get all of the fields for those objects by querying using the resulting day strings (you can use it as a sub query to do it all in one request).

NSFetchRequest seems to have all of the required pieces to do this too, but it doesn't quite seem to work. Here's what I tried to do:

NSManagedObjectContext *context = [self managedObjectContext];

NSFetchRequest *request = [[NSFetchRequest alloc] init];
NSEntityDescription *entity = [NSEntityDescription entityForName:@"PersistedDay" inManagedObjectContext:context];
[request setEntity:entity];

NSPropertyDescription* dayStringProperty = entity.propertiesByName[@"dayString"];

request.propertiesToFetch = @[dayStringProperty];
request.propertiesToGroupBy = @[dayStringProperty];
request.havingPredicate = [NSPredicate predicateWithFormat: @"dayString.@count > 1"];
request.resultType = NSDictionaryResultType;

NSArray *results = [context executeFetchRequest:request error:NULL];

That doesn't work. :-P If I try that I get an error "Unsupported function expression count:(dayString)" when trying to do the fetch. I don't think the dayString in "dayString.@count" even matters in that code above...but, I put it in for clarity (SQL count just operates on the grouped rows).

So, my question is: is this possible and, if so, what is the syntax to do it? I couldn't find anything in the CoreData docs to indicate how to do this.

I found one similar SO posts that I now unfortunately can't find again that was about running a count in a having clause (I don't think there was a group by). But, the poster gave up and did it a different way after not finding a solution. I'm hoping this is more explicit so maybe someone has an answer. :)

For reference, this is what I am doing for now that DOES work, but requires returning almost all the rows since there are very few duplicates in most cases:

NSManagedObjectContext *context = [self managedObjectContext];

NSFetchRequest *request = [[NSFetchRequest alloc] init];
NSEntityDescription *entity = [NSEntityDescription entityForName:@"PersistedDay"
                                          inManagedObjectContext:context];
[request setEntity:entity];

NSPropertyDescription* dayStringProperty = entity.propertiesByName[@"dayString"];

// Get the count of dayString...
NSExpression *keyPathExpression = [NSExpression expressionForKeyPath: @"dayString"]; // Does not really matter
NSExpression *countExpression = [NSExpression expressionForFunction: @"count:" arguments: [NSArray arrayWithObject:keyPathExpression]];
NSExpressionDescription *expressionDescription = [[NSExpressionDescription alloc] init];
[expressionDescription setName: @"dayStringCount"];
[expressionDescription setExpression: countExpression];
[expressionDescription setExpressionResultType: NSInteger32AttributeType];

request.propertiesToFetch = @[dayStringProperty, expressionDescription];
request.propertiesToGroupBy = @[dayStringProperty];
request.resultType = NSDictionaryResultType;

NSArray *results = [context executeFetchRequest:request error:NULL];

I then have to loop over the result and only return the results that have dayStringCount > 1. Which is what the having clause should do. :-P

NOTE: I know CoreData isn't SQL. :) Just would like to know if I can do the equivalent type of operation with the same efficiency as SQL.

Puiia answered 16/4, 2014 at 17:51 Comment(6)
I'm guessing the answer has something to do with using SUBQUERY in the NSPredicate. Have you tried that?Crybaby
Take a look at atomicbird.com/blog/icloud-complications-part-2, it seems to have some thought put into it :)Crybaby
I'm not sure how a subquery would help me here. Unless I could do a subquery that did a group by with a having clause with the count in it. ;-) You could also do a join with the same table to detect dupes, but I'm also not clear how you would do that with core data. Regarding that site, preventing dupes I am familiar with. The problem here is cleaning up existing dupes. Looks like his solution was to do what I did above (include the count in the query results). That still stinks since it will return a lot of results if the table is big. Thanks though!Puiia
Yeah, looking at it closer I don't see anyway that subquery helps either. From just playing around with it, I can't ever see a reason that you would use havingPredicate as it seems to be implemented. The best solution I could come up with is below, but it still results in a dictionary per unique key in the dictionary. It's not a full record, but it's still conceivably a lot of memory. You might get better results by using fetchBatchSize and a for..in loop.Crybaby
Something else you might consider if it's really an issue is using sqlite directly to build the list of duplicates to prune and then going back to core data to actually prune them. It's really not an ideal solution and obviously has the possibility of breaking in the future.Crybaby
Yeah, I thought about doing that...still considering it. Might be worth a try.Puiia
H
7

Yes it is possible. You cannot reference count as key path, however you can reference it as variable. Just like in SQL. In my example I have cities created with duplicate names.

let fetchRequest = NSFetchRequest(entityName: "City")

let nameExpr = NSExpression(forKeyPath: "name")
let countExpr = NSExpressionDescription()
let countVariableExpr = NSExpression(forVariable: "count")

countExpr.name = "count"
countExpr.expression = NSExpression(forFunction: "count:", arguments: [ nameExpr ])
countExpr.expressionResultType = .Integer64AttributeType

fetchRequest.resultType = .DictionaryResultType
fetchRequest.sortDescriptors = [ NSSortDescriptor(key: "name", ascending: true) ]
fetchRequest.propertiesToGroupBy = [ cityEntity.propertiesByName["name"]! ]
fetchRequest.propertiesToFetch = [ cityEntity.propertiesByName["name"]!, countExpr ]

// filter out group result and return only groups that have duplicates
fetchRequest.havingPredicate = NSPredicate(format: "%@ > 1", countVariableExpr)

Complete playground file at: https://gist.github.com/pronebird/cca9777af004e9c91f9cd36c23cc821c

Haletky answered 11/7, 2016 at 18:25 Comment(2)
I see. This basically looks exactly what I was trying to do except the expression is plopped in as an object in the format string instead of using the key path. Well, and in swift since it didn't exist when I originally wrote this. :-P Thanks for the playground...I don't have easy access to the original code. Appears to work so I'll accept this as the now correct answer. I'm curious if Apple just fixed the behavior since I wrote this or if I just had the syntax wrong. Oh well. :-PPuiia
@Puiia I have Objective-C version too, but it's much easier to run and test in Swift playground. I think playground runs some of iOS 8 sims, so this trick should have been working back then too.Haletky
C
5

Best I can come up with is:

NSError*                error;

NSManagedObjectContext* context = self.managedObjectContext;
NSEntityDescription*    entity = [NSEntityDescription entityForName:@"Event" inManagedObjectContext:context];

// Construct a count group field
NSExpressionDescription*    count = [NSExpressionDescription new];
count.name = @"count";
count.expression = [NSExpression expressionWithFormat:@"count:(value)"];
count.expressionResultType = NSInteger64AttributeType;

// Get list of all "value" fields (only)
NSPropertyDescription*  value = [entity propertiesByName][@"value"];

NSFetchRequest*         request = [[NSFetchRequest alloc] initWithEntityName:@"Event"];
request.propertiesToFetch = @[ value, count];
request.propertiesToGroupBy = @[ value ];
request.resultType = NSDictionaryResultType;
NSArray*                values = [context executeFetchRequest:request error:&error];

// Filter count > 1
values = [values filteredArrayUsingPredicate:[NSPredicate predicateWithFormat:@"count > 1"]];

// slice to get just the values
values = [values valueForKeyPath:@"value"];

But that's not really much different from what you're using.

Crybaby answered 16/4, 2014 at 22:13 Comment(3)
Yeah, if I don't get any better answers I'll just accept this one. Filtering is a little nicer than a for loop.Puiia
Not really sure why havingPredicate is really even there given that I can't find any cases where it has a different effect than just using predicate. I don't even find much mention of it on the web other than the documentation.Crybaby
Ok. I'll accept your answer. Maybe someone in the future can find some way to actually make a more efficient query, but for now I'm going with this approach. :-PPuiia
M
0

The best way finding duplicates in Core Data depends on your data. According to Efficiently Importing Data and assuming that you have to import less than 1000 PersistedDays, I suggest this solution:

NSFetchRequest* fetchRequest = [NSFetchRequest new];

[fetchRequest setEntity:[NSEntityDescription entityForName:@"PersistedDay" inManagedObjectContext:myMOC]];
[fetchRequest setSortDescriptors:@[[NSSortDescriptor sortDescriptorWithKey:@"dayString" ascending:NO]]];

NSArray* persistedDays = [myMOC executeFetchRequest:fetchRequest error:nil];

for (NSUInteger i = persistedDays.count - 1; i > 0; --i) {

    PersistedDay *currentDay = persistedDays[i];
    PersistedDay *nextDay = persistedDays[i-1];

    if ([currentDay.dayString isEqualToString:nextDay.dayString]) {
        /* Do stuff/delete with currentDay */
    }
}

For speed up can index dayString in Core Data.

You also can reduce the the data set if you remember a timestamp or a date of the last duplicate clean up:

[fetchRequest setPredicate:[NSPredicate predicateWithFormat:@"importDate > %@", lastDuplicateCleanUp];
Mayes answered 5/6, 2015 at 13:15 Comment(1)
Thanks, but this seems less efficient than using group by since you can than just use an integer check for count > 1 instead of a string comparison on every entry. I definitely will have > 1000 entries.Puiia

© 2022 - 2024 — McMap. All rights reserved.