I have the following statement to find unambiguous names in my data (~1 Million entries):
select Prename, Surname from person p1
where Prename is not null and Surname is not null
and not exists (
select * from person p2 where (p1.Surname = p2.Surname OR p1.Surname = p2.Altname)
and p2.Prename LIKE CONCAT(CONCAT('%', p1.Prename), '%') and p2.id <> p1.id
) and inv_date IS NULL
Oracle shows a huge cost of 1477315000 and execution does not end after 5 minutes. Simply splitting the OR into an own exists subclause boosts performance to 0,5 s and costs to 45000:
select Prename, Surname from person p1
where Prename is not null and Surname is not null
and not exists (
select * from person p2 where p1.Surname = p2.Surname and
p2.Prename LIKE CONCAT(CONCAT('%', p1.Prename), '%') and p2.id <> p1.id
) and not exists (
select * from person p2 where p1.Surname = p2.Altname and
p2.Prename LIKE CONCAT(CONCAT('%', p1.Prename), '%') and p2.id <> p1.id
) and inv_date IS NULL
It's not my question to tweak this to the best, as it is only a seldomly executed query, and I know CONTACT is surpassing any index, but I just wonder where this high cost comes from. Both queries seem semantically equivalent to me.
EXISTS
short circuits andOR
does not (at least in SQL Server, I'm assuming Oracle is similar). By including theOR
in theEXISTS
sub it checks both options each time. Separating means it only checks the second if the first is false. – Canal