Using "IN" in a WHERE clause where the number of items in the set is very large

I have a situation where I need to update a very large set of rows that I can only identify by their IDs (the target records are selected by the user and have nothing in common other than being the set of records the user wanted to modify). The same property is being updated on all of these records, so I would like to make a single UPDATE call.

Is it bad practice or is there a better way to do this update than using "WHERE IN (1,2,3,4,.....10000)" in the UPDATE statement?

Would it make more sense to use individual update statements for each record and stick them into a single transaction? Right now I'm working with SQL Server and Access but, if possible, I'd like to hear broader best-practice solutions that apply to any kind of relational database.

Mapp answered 10/2, 2009 at 12:55
Weismannism: What kind of "very large set" are we talking about here? Is it thousands, millions, or trillions of values?
Mapp: Yeah, thousands. My sample up there has "10000" as the last index. I think that will be the maximum number, at least in the short term.

I would always use

WHERE id IN (1,2,3,4,.....10000)

unless your IN clause was stupidly large, which shouldn't really happen from user input.

Edit: Rails, for instance, does this a lot behind the scenes.

It would definitely not be better to do separate update statements in a single transaction.

Alms answered 10/2, 2009 at 12:59
Bravado: You'd need 10,000 SQL parameters to do this in a safe way; I'd use a table-valued parameter instead of WHERE id IN (@val1, @val2, @val3, @val4, ..... @val10000). That would be much more efficient.
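For illustration, a minimal sketch of the table-valued-parameter approach that comment describes, assuming SQL Server 2008 or later; the type, procedure, table, and column names here are hypothetical:

-- Hypothetical table type to hold the selected IDs
CREATE TYPE dbo.IdList AS TABLE (id int NOT NULL PRIMARY KEY);
GO
-- One parameterized call updates every selected row
CREATE PROCEDURE dbo.UpdateSelected
    @ids dbo.IdList READONLY,
    @newValue varchar(50)
AS
BEGIN
    UPDATE t
    SET    t.SomeValue = @newValue
    FROM   dbo.TargetTable t
    INNER JOIN @ids i ON i.id = t.id;
END
GO

The client fills @ids with one row per selected record, so the statement text never varies with the number of IDs and its plan can be reused.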

Another alternative is to store those numbers in a temp table and use it in a join to do the update. If you are able to execute a single update statement, that is definitely better than executing one statement per record.
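A minimal sketch of that approach on SQL Server (the table and column names are placeholders; the multi-row VALUES list assumes SQL Server 2008 or later):

-- Stage the user-selected IDs in a temp table
CREATE TABLE #SelectedIds (id int NOT NULL PRIMARY KEY);

INSERT INTO #SelectedIds (id) VALUES (1), (2), (3); -- ...one row per selected ID

-- A single UPDATE joined to the staged IDs
UPDATE t
SET    t.SomeValue = 'Whatever'
FROM   dbo.TargetTable t
INNER JOIN #SelectedIds s ON s.id = t.id;

DROP TABLE #SelectedIds;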

Kearse answered 10/2, 2009 at 12:57
Mapp: I like this idea a lot, but I'm not clear if it will be supported on Access. Sadly, we're chained to supporting Access.
Marvismarwin: Access will do this. I often use a temp table, even a permanently defined one, bound to a UI control. So as the user selects the items to be processed, those items get flagged in or inserted into the temp table. When the user clicks OK, one UPDATE query joined to the temp table does the work.
Insomnolence: Just be sure that if the table you are updating is a linked table on a SQL Server, the temporary table is also there. Otherwise, Access may try to pull the whole table into memory to do the update and then send it back to SQL Server. You can also use a permanent table with a transaction identifier.
Insomnolence: (cont'd) Let me know if you want more information on the permanent-table method and I'll post it as an answer.
Walley: Doesn't populating the temp table raise the same question as updating the original table? You either have to insert many IDs in one statement or do them in many statements. How is that better than the original?
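For the Access scenario discussed in these comments, a hedged sketch in Jet SQL; LinkedTable is the (possibly linked) table being updated and SelectedIDs is a hypothetical local table the UI writes the chosen IDs into:

UPDATE LinkedTable AS t
INNER JOIN SelectedIDs AS s ON t.ID = s.ID
SET t.SomeValue = 'Whatever';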

How do you generate the IN clause?

If there is another SELECT statement that generates those values, you could simply plug it into the UPDATE like so:

UPDATE TARGET_TABLE T
SET
  SOME_VALUE = 'Whatever'
WHERE T.ID_NUMBER IN(
                    SELECT ID_NUMBER  --this SELECT generates your ID #s.
                    FROM SOURCE_TABLE
                    WHERE SOME_CONDITIONS
                    )

In some RDBMSes, you'll get better performance by using the EXISTS syntax, which would look like this:

UPDATE TARGET_TABLE T
SET
  SOME_VALUE = 'Whatever'
WHERE EXISTS (
             SELECT ID_NUMBER  --this SELECT generates your ID #s.
             FROM SOURCE_TABLE S
             WHERE SOME_CONDITIONS
               AND S.ID_NUMBER =  T.ID_NUMBER
             )
Adriel answered 10/2, 2009 at 13:10
Mapp: I'm definitely going to keep this in mind for future queries but, no, the items are coming straight from the user interface as the user selects multiple items in a list.
Adriel: Ah, that's a shame. Then I'd go for the temp-table approach suggested by ocdecio. Just add stuff to the temp table as the user makes their selections.

In Oracle there is a limit on the number of values you can put into an IN clause. So you are better off using OR: x=1 OR x=2 ... those are not limited, as far as I know.

Tomy answered 10/2, 2009 at 12:59
Dominga: 1000 is the limit in Oracle for one set, but you can do: where id in (1,2,...,1000) or id in (1001,...,2000) or id in (....) ... It is, however, better to use jimmyorr's solution.
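Spelled out, the workaround from that comment looks like the following; the table and column names are placeholders, and each list must stay under Oracle's 1,000-literal cap:

UPDATE target_table
SET    some_value = 'Whatever'
WHERE  id IN (1, 2, 3)       -- ...up to 1,000 literals per list
   OR  id IN (1001, 1002);   -- ...the next chunk, and so on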

I would use a table-variable / temp-table; insert the values into this, and join to it. Then you can use the same set multiple times. This works especially well if you are (for example) passing down a CSV of IDs as varchar. As a SQL Server example:

DECLARE @ids TABLE (id int NOT NULL)

INSERT @ids
SELECT value
FROM dbo.SplitCsv(@arg) -- need to define separately

UPDATE t
SET    t.SomeValue = 'Whatever' -- SomeValue is a placeholder column
FROM   [TABLE] t
INNER JOIN @ids i ON i.id = t.id
Fransen answered 10/2, 2009 at 13:02
Franchot: There is a UDF that could be used for SplitCsv here: #520269
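For completeness, a minimal sketch of such a function, assuming SQL Server 2016 or later where the built-in STRING_SPLIT is available (on older versions you would need a loop- or XML-based splitter like the one referenced above):

CREATE FUNCTION dbo.SplitCsv (@csv varchar(max))
RETURNS TABLE
AS
RETURN
    -- one int row per comma-separated token
    SELECT CAST(value AS int) AS value
    FROM STRING_SPLIT(@csv, ',');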

Without knowing what a "very large" number of IDs might be, I'd venture a guess. ;-)

Since you are using Access as a database, the number of IDs can't be that high. Assuming we're talking about fewer than, say, 10,000 numbers (and knowing the limitations of whatever containers hold the IDs on the front end), I'd stick to one UPDATE statement if that is the most readable option and the easiest to maintain later. Otherwise I'd split them into multiple statements using some clever logic: something like one, ten, a hundred, a thousand... IDs per statement.

Then I'd leave it to the DB optimiser to execute the statement(s) as efficiently as possible. I would probably do an 'explain' on the query or queries to make sure nothing silly is going on, though.

But in my experience, it is quite often OK to leave this kind of optimisation to the database manager itself. The thing that takes the most time is usually the actual connection to the database, so if you can execute all the queries within the same connection there are normally no problems. Just make sure you send off all the UPDATE statements before you start waiting for any result sets to come back. :-)
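On SQL Server, one way to do the 'explain' step mentioned above is to request the estimated plan instead of executing the statement; MyTable and SomeValue are placeholder names:

SET SHOWPLAN_ALL ON;
GO
-- compiled and its plan returned, but not executed
UPDATE MyTable SET SomeValue = 'Whatever' WHERE ID IN (1, 2, 3);
GO
SET SHOWPLAN_ALL OFF;
GO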

Weismannism answered 10/2, 2009 at 13:40
Yoga: A Jet database's Autonumber field is a special case of a Jet long integer field and has a range of -2,147,483,648 to 2,147,483,647 -- that's a lot of numbers. That seems like a VERY LARGE number to me.
Weismannism: Yes, that is a rather large number (or a long integer...). But have you tried to use anywhere near that number of rows in an Access database? ;-) That was more what I was thinking about.

In general there are several things to consider.

  1. The statement parsing cache in the DB. Each statement with a different number of items in the IN clause has to be parsed separately. You ARE using bound variables instead of literals, right?
  2. Some databases have a limit on the number of items in the IN clause. For Oracle it's 1,000.
  3. Updating locks records. If you have multiple separate update statements you can get deadlocks, so you have to be careful about the order in which you issue your updates.
  4. Round-trip latency to the database can be high, even for a very fast statement, so it's often better to manipulate lots of records at once and save trip time.

We recently changed our system to limit the size of the IN clauses and always use bound variables, because this reduced the number of distinct SQL statements and thus improved performance. Basically, we generate our SQL statements and execute multiple statements if the IN clause exceeds a certain size. We don't do this for updates, so we haven't had to worry about the locking. You will.
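As a sketch of that chunking-with-bound-variables idea on SQL Server, each fixed-size chunk can go through sp_executesql so the statement text, and therefore the cached plan, is identical across chunks; the table and column names are hypothetical:

-- one 4-ID chunk; the client repeats this call per chunk,
-- padding the final chunk with a repeated ID if needed
EXEC sp_executesql
    N'UPDATE dbo.TargetTable SET SomeValue = @v
      WHERE id IN (@p0, @p1, @p2, @p3)',
    N'@v varchar(50), @p0 int, @p1 int, @p2 int, @p3 int',
    @v = 'Whatever', @p0 = 1, @p1 = 2, @p2 = 3, @p3 = 4;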

Using a temp table may not improve performance because you have to populate the temp table with the IDs. Experimentation and performance tests can tell you the answer here.

A single IN clause is very easy to understand and maintain, and that is probably what you should worry about first. If you find that the performance of the query is poor, you might want to try a different strategy and see if it helps, but don't optimize prematurely. The IN clause is semantically correct, so leave it alone if it isn't broken.

Asturias answered 10/2, 2009 at 14:26

If you were on Oracle, I'd recommend using table functions, similar to Marc Gravell's post.

-- first create a user-defined collection type, a table of numbers
create or replace type tbl_foo as table of number;
/

declare
  temp_foo tbl_foo;
  -- in PL/SQL a SELECT needs somewhere to put its rows, so collect them
  type emp_tab is table of emp%rowtype;
  emps emp_tab;
begin
  -- this could be passed in as a parameter; for simplicity I am hardcoding it
  temp_foo := tbl_foo(7369, 7788);

  -- here I use a table function to treat my temp_foo variable as a table,
  -- and I join it to the emp table as an alternative to a massive "IN" clause
  select e.*
    bulk collect into emps
    from emp e,
         table(temp_foo) foo
   where e.empno = foo.column_value;
end;
/
Handstand answered 10/2, 2009 at 15:10
Handstand: Hm. He asked for "broad best-practice solutions across any kind of relational database". I'm not sure if Oracle is the only DB with table functions, but this advice would apply to any DB that supports that sort of feature. I gave an Oracle example because that's what I work on mostly. :\

There are multiple ways of accommodating a large set of values in a WHERE condition.

  1. Using temp tables

Insert the values into a temp table with a single column.

Create a UNIQUE INDEX on that column.

INNER JOIN the required table with the newly created temp table.

  2. Using array-like functionality in SQL Server

SQL Server does support array-like functionality, via XML.

Check the SQL Server XML documentation for the full details.

SAMPLE SYNTAX:

CREATE TABLE #IDs (id int NOT NULL)
DECLARE @x varchar(max) = '1,2,3'  -- placeholder: the comma-separated IDs
DECLARE @xParam XML;
SELECT @xParam = CAST('<i>' + REPLACE(@x, ',', '</i><i>') + '</i>' AS XML)
INSERT INTO #IDs
SELECT x.i.value('.', 'int') AS id FROM @xParam.nodes('//i') x(i)
CREATE UNIQUE INDEX IX_IDs ON #IDs (id ASC)

Query it using:

SELECT A.Name, A.Age FROM [Table] A INNER JOIN #IDs i ON i.id = A.[Key]
Negligent answered 24/9, 2018 at 15:24

I don't know the type of values in your IN list. If they are most of the values from 1 to 10,000, you might be able to process them to get something like:

WHERE MyID BETWEEN 1 AND 10000 AND MyID NOT IN (3,7,4656,987)

Or, if the NOT IN list would still be long, processing the list and generating a bunch of BETWEEN statements:

WHERE MyID BETWEEN 1 AND 343 OR MyID BETWEEN 344 AND 400 ...

And so forth.

Last of all, you don't have to worry about how Jet will process an IN clause if you use a pass-through query. You can't do that in code, but you could have a saved QueryDef that is defined as a pass-through and alter its WHERE clause in code at runtime to use your IN list. Then it's all passed off to SQL Server, and SQL Server will decide how best to process it.

Yoga answered 11/2, 2009 at 21:57
