Clustered indexes on non-identity columns to speed up bulk inserts?
My two questions are:

  • Can I use clustered indexes to speed up bulk inserts in big tables?
  • Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?

To elaborate, I have a database with a couple of very big tables (between 100 and 1000 million rows each) containing company data. Typically such a table holds data for 20-40 companies, each stored as its own "chunk" marked by a "CompanyIdentifier" (INT). Each company also has about 20 departments, each with its own "subchunk" marked by a "DepartmentIdentifier" (INT).

It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.

I have been working to optimize these tables for the following functions:

  1. Queries that are run on subchunks
  2. "Benchmarking" queries that are run on the table as a whole
  3. Inserting/removing big chunks of data.

For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.

But for 3) I have struggled to find a good solution. My first strategy was to always disable the indexes, bulk insert a big chunk, and rebuild the indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, rebuilding the indexes takes a very long time each time.
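For reference, that disable/load/rebuild pattern might be sketched in T-SQL as follows (the table, index, and file names are hypothetical; minimal logging during the load additionally depends on the recovery model):

```sql
-- Sketch of the disable / bulk load / rebuild pattern.
-- dbo.CompanyData and IX_CompanyData_Dept are invented names for illustration.
ALTER INDEX IX_CompanyData_Dept ON dbo.CompanyData DISABLE;

BULK INSERT dbo.CompanyData
FROM 'C:\load\company_42.dat'
WITH (TABLOCK);   -- table lock is one prerequisite for minimal logging

ALTER INDEX IX_CompanyData_Dept ON dbo.CompanyData REBUILD;
```

The REBUILD step re-reads and re-sorts the entire table, not just the new chunk, which is why this approach slows down as the table grows.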

At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.

I seem to have noticed that adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier makes loading new "chunks" into the table faster. I had previously abandoned this strategy in favour of a clustered index on an IDENTITY column, since several articles pointed out that the clustered key is contained in all other indexes and should therefore be as small as possible. But now I am thinking of reviving the old strategy to speed up the inserts. My question: would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts, or is that just my imagination?

I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?
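The composite-key scheme from the question could look something like this (all names are invented for illustration; whether it beats the IDENTITY alternative depends on the workload):

```sql
-- Hypothetical sketch: a composite clustered primary key and a child table
-- referencing it via a matching composite foreign key.
CREATE TABLE dbo.Department (
    CompanyIdentifier    INT      NOT NULL,
    DepartmentIdentifier INT      NOT NULL,
    RowSequence          SMALLINT NOT NULL,  -- the "[uniquifier]" tiebreaker
    CONSTRAINT PK_Department
        PRIMARY KEY CLUSTERED (CompanyIdentifier, DepartmentIdentifier, RowSequence)
);

CREATE TABLE dbo.DepartmentDetail (
    CompanyIdentifier    INT      NOT NULL,
    DepartmentIdentifier INT      NOT NULL,
    RowSequence          SMALLINT NOT NULL,
    Detail               VARCHAR(100) NULL,
    CONSTRAINT FK_DepartmentDetail_Department
        FOREIGN KEY (CompanyIdentifier, DepartmentIdentifier, RowSequence)
        REFERENCES dbo.Department (CompanyIdentifier, DepartmentIdentifier, RowSequence)
);
```

So yes, foreign keys can reference a composite key; the trade-off is that every child table must repeat all three columns, so its key and indexes are wider than with a single INT IDENTITY.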

Thanks a lot for any suggestions or explanations.

Jewelfish answered 17/9, 2010 at 8:26 Comment(4)
Have you looked into partitioned Views to deal with your "chunks" issue or are these not suitable?Cholecalciferol
I don't think I can use them in SQL Server Standard Edition.Jewelfish
Yes these are available in standard edition.Cholecalciferol
@Martin Smith: to use partitioned views would probably mean to divide my data over several tables, one per chunk, and concatenate them with UNION ALL in a view, right? I have tried this, it works well. Sort of a poor man's table partitioning.Jewelfish
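The "poor man's table partitioning" described in that comment might be sketched as follows (names are hypothetical); with a CHECK constraint on each member table, the optimizer can skip tables that cannot match the predicate:

```sql
-- Hypothetical partitioned-view sketch: one member table per company "chunk".
CREATE TABLE dbo.CompanyData_1 (
    CompanyIdentifier    INT NOT NULL CHECK (CompanyIdentifier = 1),
    DepartmentIdentifier INT NOT NULL,
    Amount               MONEY NULL
);
CREATE TABLE dbo.CompanyData_2 (
    CompanyIdentifier    INT NOT NULL CHECK (CompanyIdentifier = 2),
    DepartmentIdentifier INT NOT NULL,
    Amount               MONEY NULL
);
GO
CREATE VIEW dbo.CompanyData AS
SELECT CompanyIdentifier, DepartmentIdentifier, Amount FROM dbo.CompanyData_1
UNION ALL
SELECT CompanyIdentifier, DepartmentIdentifier, Amount FROM dbo.CompanyData_2;
```

A query with `WHERE CompanyIdentifier = 2` then touches only `CompanyData_2`, and removing a chunk becomes a `TRUNCATE TABLE` or `DROP TABLE` instead of a large logged `DELETE`.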
Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.

Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index at all. Deleting a chunk is faster than in either of those situations.

I think the tables are faster because all the records I want to delete or insert are guaranteed to sit together on one part of the hard disk; that would seem logical to me.


Update: After a year of experience with this design I can say that for this approach to work, it is necessary to schedule regular rebuilding of all the indexes (we do it once a week). Otherwise, the indexes become fragmented very quickly and performance is lost. Nevertheless, we are in the process of migrating to a new database design with partitioned tables, which is basically better in every way - except for the Enterprise Server license cost, but we've already forgotten about it by now. At least I have.

Jewelfish answered 29/9, 2010 at 21:34 Comment(2)
That's exactly right. Look at just the Data Model in the links at the end of this answer. CIs were designed for Relational databases; notice the keys. Particularly good for any range query; data distribution (what you call "chunking"); insert load spread; self-trimming at the page & extent level. The one thing you should not do is cluster on a monotonic key (the opposite of its design). There's a bit more between the CI and the "hard disk", but you will get to that in time.Sitnik
Thanks. Still, the CI on my "chunks" is not giving me the performance I want and after some preliminary testing with SQL Server Enterprise Table Partitioning I am inclined to go for that - especially since it will allow me to remove and insert "chunks" without locking the whole table for a long time. Would you follow that reasoning or would you suggest something different?Jewelfish
A clustered index is a physical index, a physical data structure, a row order. If you insert into the middle of the clustered index, the data will be physically inserted into the middle of the existing data, and I would expect a serious performance issue in that case. I only know this from theory; I have never tried it in practice, because according to that theory it would be a mistake.

Therefore, I only use (and advise the use) of clustered indexes on fields that are always, physically, inserted at the end, preserving the order.

A clustered index can be placed on a datetime field which marks the moment of insertion or something like that, because physically they will be ordered after appending a row. Identity is a good clustered index also, but not always relevant for querying.

In your solution you place a [uniquifier] field, but why do this when you can use an identity that will do just that? It will be unique, physically ordered, and small (which for foreign keys in other tables means a smaller index), and in some cases faster.

Can't you try this and experiment? I have a similar situation here: a table with 4 billion rows and constant inserts (up to 100 per second), no primary key and no clustered index, so the propositions in this topic are VERY interesting for me too.

Dwayne answered 17/9, 2010 at 8:43 Comment(6)
Clustered indexes are only in physical order if there is zero fragmentation. It is possible for the logical and physical order to be different.Cholecalciferol
Thank you for that correction. Am I enforcing lack of fragmentation by using these methods? How bad is fragmentation anyway?Dwayne
It is a meaningless correction. Clustered indices are in physical order by definition. That the table may be fragmented is another level of inquiry; it does not change the definition. Saying that is as silly as saying that if you are using RAID5, then everything is fragmented, implying: so don't bother with any indices. If you follow the PageChain, it is in order. If you read serially (which is not possible) it will appear fragmented. If you have no CI, you have a Heap. Read up on it. Never cluster on IDENTITY; you guarantee a hotspot on the last (insert) page.Sitnik
@Sitnik - It is not a meaningless distinction at all. If people believe that the clustered index is always in physical order then they will probably end up having entirely the wrong idea about what happens when data is inserted into the middle of a full clustered index. (What actually happens is a page split and a new page allocation that may be in a completely different extent.). The level of fragmentation of the clustered index is very important for performance (of range scans in particular).Cholecalciferol
@Martin(both). 1) Yes, I am aware of that. So use FILLFACTOR and RESERVEPAGEGAP, that's what they are there for, for interspersed INSERTs. And for a monotonic column your comment is irrelevant 2) The CI is in physical order by definition; fragmentation has to be considered, but that is another level, it does not change the definition (it is silly to suggest the definition is false just because you found out about fragmentation; just as silly as saying that if you're using RAID5, then everything is fragmented (true) and therefore the CI is fragmented (false) ).Sitnik
3) Page Chain fragmentation is only one type of fragmentation. There is a difference between perfect physical order due to freshly recreating and CI, which is incidental, and various forms of fragmentation due to use. The definition does not change and you do not have the privilege to try. 4) Are you using two SO handles ?Sitnik
Can I use clustered indexes to speed up bulk inserts in big tables?

Never! Imagine another million rows that you need to put into that table and keep physically ordered: it is a colossal loss in performance in the long run.

Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?

Absolutely. By the way, a clustered index is no silver bullet and may be slower than an ordinary index.

Ovoid answered 17/9, 2010 at 9:2 Comment(8)
Do you disagree with this statement? "It is faster to insert rows into a table with a clustered index as the primary key than it is to insert the same data into a heap that has a non-clustered index as its primary key. This is true whether or not the primary key is monotonically increasing or not."Cholecalciferol
Why wouldn't I want to have them physically ordered on Company/Department? I am only adding chunks based on that combination, I will not be adding mixed sets (ever). So to add them I would only have to touch one physical part of the harddisk. Furthermore, I have a lot of queries that run only on a unique subset based on those columns.Jewelfish
@Martin Smith, @Jewelfish I'm talking about data warehousing, where you have to deal with millions of records that have to be inserted among existing data if there's a clustered index on that table. When you remove that clustered index, the data gets appended to the end of the table, which is obviously faster.Ovoid
So you would remove the clustered index altogether, not even on the IDENTITY column?Jewelfish
@Jewelfish I would, and an existing clustered index really defeats the purpose of this bulk insert operation.Ovoid
Db question. That is incorrect. If you insert large amounts of data (bulk copy); drop the CI first !!! and add it after the bulk load. Yes, a CI is always faster than a Heap+NCI. Er, laws of physics: 2 x write is slower than 1 x write.Sitnik
@PDBA: Dropping and recreating the CI is costing me around 4 hours to complete, inserting is taking me 10 minutes. How does that stand in relation to your argument for the former?Jewelfish
@LG. I am not arguing, I am stating that all these things must be taken into consideration; not 1 item in isolation; there is no universal right or wrong; only a Right for your specific context. Let me provide an Answer instead of comments.Sitnik
Have a look at the System.Data.SqlClient.SqlBulkCopy API. Given your requirement to move significant numbers of rows in and out of the database, it might be what you need.

Bulk copy streams the data into the table in a single operation, then performs the index check once. I use it to copy 500,000 rows in and out of a database table, and its performance is an order of magnitude better than any other technique I've tried, assuming that your application can be structured to make use of the API.

Roswell answered 29/9, 2010 at 21:39 Comment(1)
As far as I know, both SSIS and the BULK INSERT operation make use of the same technique as this API. I am using SSIS now when reading from files, and when copying between tables I just use normal SQL. Can this API copy between tables?Jewelfish
I've been playing around with some ETL stuff lately. I went through just regularly inserting into the table, then removing and re-adding indexes before and after the insert, then tried MERGE statements, and finally tried SSIS. I'm sold on SSIS. Just yesterday I managed to cut an ETL process (~24 million records, ~6 GB) from ~1-1.5 hours per run to ~24 minutes, just by letting SSIS handle the inserts.

I believe that with Advanced Services you should be able to use SSIS.

Ulane answered 17/9, 2010 at 15:33 Comment(1)
As far as I know, SSIS is not faster than doing a BULK INSERT operation.Jewelfish
(Given you have already chosen the Answer and given yourself the points, this is provided as a free service, a charitable act !)

A little knowledge is a dangerous thing. There are many issues to be considered, and they must be considered together. Taking any one issue and examining it in isolation is a very fragmented way to go about administering a database: you will forever be finding some new truth and changing everything you thought before. Before launching into it, please read this question/answer for context.

Do not forget, these days anyone with a keyboard and a modem can get their "papers" published. Some of them work for MS, evangelising the latest "enhancement"; others publish glowing reports of features they have never used, or used only once, in one context, but they publish that it works in every context. (Look at Spence's answer: he is enthusiastic and "sold" but under scrutiny, the statements are false; he is not a bad person, just typical of the masses in the MS world and how they operate; how they publish.)

  • Note: I use the term MicroSofties to describe those people who believe in the gatesian notion that any unqualified person can administer a database; and that MS will fix everything. It is not intended as an insult, more as an endearment, because of the belief in magic, and the suspension of the laws of physics.

Clustered Indices

Were designed for Relational databases, by real engineers (Sybase, before MS acquired the code) who have more brains than all of MS put together. Relational databases have Relational Keys, not Idiot keys. These are multi-column keys, that automatically distribute the data, and therefore the insert load, eg. inserting Invoices for various Companies all the time (although not in our discussed case of "chunks").

  • if you have good Relational keys, CIs provide Range Queries (your (1) & (2) ), and other advantages, that NCIs simply do not have.

  • Starting off with Id columns, before modelling and normalising the data, severely hinders the modelling and normalisation processes.

  • If you have an Idiot database, then you will have more indices than not. The contents of many MS databases are not "relational", they are commonly just unnormalised filing systems, with way more indices than a Normalised database would have. Therefore there is a big push, a lot of MS "enhancements" to try and give these abortions a bit of speed. Fix the symptom but don't go anywhere near the problem that caused the symptom.

  • In SQL 2005 and again in 2008 MS has screwed around with CIs, and the result is they are now better in some ways, but worse in other ways; the universality of CIs has been lost.

  • It is not correct that NCIs carry the CI (the CI is the basic single storage structure; the NCIs are secondary, and dependent on the CI; that's why when you re-create a CI, all the NCIs are automatically re-created). The NCIs carry the CI Key at the leaf level.

  • Microsoft has its problems, which change with the major releases (but are not eliminated):

    • The NCIs carry the CI key at the leaf level, and in MS this is not efficiently done, so the NCI index size is large; in enterprise DBMSs where this is done efficiently, it is not a consideration.

    • In the MS world, therefore, it is only half true that the CI key should be as short as possible. If you understand that the consideration is the size of the NCIs, and you are willing to incur that expense in return for a table that is very fast due to a carefully constructed CI, then that is the best option.

    • The common advice that the CI should be the Idiot column is totally and completely wrong. The worst candidate for a CI key is a monotonically increasing value (IDENTITY, DATETIME, etc). Why? Because you have guaranteed that all concurrent inserts will fight for the current insert location, the last page of the index.

    • The real purpose of Partitioning (which MS provided 10 years after the enterprise vendors) is to spread this load. Sure, they then have to provide a method of allocating the Partitions on, guess what, nothing but a Relational Key; but to start with, now the Idiot key is spread across 32 or 64 Partitions, providing better concurrency.

  • the CI must be Unique. Relational dbs demand Unique keys, so that is a no-brainer.

    • But for the amateurs who have poured non-relational contents into the database, if they do not know this rule, but they know that the CI spreads the data (a little knowledge is a dangerous thing), they keep their Idiot key in a NCI (good) but they create the CI on an almost-but-not-quite Unique Key. Deadly. CIs must be Unique; that is a design demand. Duplicate (remember we are talking CI Key here) rows are off-page, located in Overflow pages, and the (then) last page; and constitute a method of badly fragmenting the Page Chain.

    • Update, since this point is being questioned elsewhere. I have already stated the MS keeps changing the methods without fixing the problem.

      • The MS Online manual, with their pretty pictures (not technical diagrams) tells us that in 2008, they have replaced (substituted one for another) Overflow Pages with the adorable "Uniqueifier".

      • That totally satisfies the MicroSofties. Non-Unique CIs are not a problem. It is handled by magic. Case closed.

      • But there is no logic or completeness to the statements, and qualified people will ask the obvious questions: where is this "Uniqueifier" located? On every row, or just the rows needing "Uniqueifying"? DBCC PAGE shows it is on every row. So MS has just added a 4-byte secret column (including handling overhead) to every row, instead of a few Overflow Pages for the non-unique rows only. That's MS's idea of engineering.

      • End Update

    • Anyway, the point remains that Non-Unique CIs have a substantial overhead (now more than before) and should be avoided. You would be better off adding a 1- or 2-byte column yourself, to force uniqueness.

  • Therefore, unchanged from the beginning (1984), the best candidate for a CI is a multi-column unique Relational key (I cannot say that yours is for sure, but it certainly looks like it).

  • And put any monotonically increasing keys (IDENTITY, DATETIME) in an NCI.

  • Remember also that the CI is a single storage structure, which eliminates the (otherwise) Heap; the CI B-Tree is married to the rows at the Leaf level; the Leaf Level entry is the row. That guarantees one less read on every access.

    • So it is not possible that a NCI+Heap can be faster than a CI. Another common myth in the MS world that defies the laws of physics: navigating a B-Tree and writing to the one place you are already in has got to be faster than additionally writing the row to a separate storage structure. But MicroSofties do believe in magic; they've suspended the laws of physics.
  • There are many other features you need to learn and use, I will mention at least FILLFACTOR and RESERVEPAGEGAP, to give this post a bit of completeness. Do not use these features until you understand them. All performance features have a cost that you need to understand and accept.

  • CIs are also self-trimming at both the Page and Extent level, there is no wasted space. PageSplits are something to monitor for (Random inserts only), and that is easily modulated by FILLFACTOR and RESERVEPAGEGAP.

  • And read the SO site for Clustered Indices, but keep in mind all the above, esp. the first two paras.
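In SQL Server terms, the "force uniqueness yourself" advice above might look like the following sketch (names are hypothetical; RESERVEPAGEGAP is a Sybase option with no direct SQL Server equivalent, so only FILLFACTOR is shown):

```sql
-- Hypothetical sketch: a Unique composite clustered index with an explicit
-- SMALLINT tiebreaker (RowSequence) instead of the hidden uniquifier,
-- plus FILLFACTOR to leave room on each page for interspersed inserts.
CREATE UNIQUE CLUSTERED INDEX CIX_CompanyData
ON dbo.CompanyData (CompanyIdentifier, DepartmentIdentifier, RowSequence)
WITH (FILLFACTOR = 80);
```

The FILLFACTOR value is a tuning knob, not a recommendation: it only applies when the index is built or rebuilt, and lower values trade wasted space for fewer page splits.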

Your Specific Case

  • By all means, get rid of your surrogate keys (Idiot columns), and replace them with true natural Relational keys. Surrogates are always an additional key and index; that is a price that should not be forgotten or taken lightly.

  • CompanyIdentifier+DepartmentIdentifier+[uniquifier] is exactly what I am talking about. Now notice that they are already INTs, and very fast, so it is very silly to add a NUMERIC(10,0) Idiot Key. Use a 1- or 2-byte column to force Uniqueness.

  • If you get this right, you may not need a Partition licence.

  • The CompanyIdentifier+DepartmentIdentifier+[uniquifier] is the perfect candidate (not knowing anything about your db other than that which you have posted) for a CI, in the context that you perform mass delete/insert periodically. Detailed above.

    • Contrary to what others have stated, this is a good thing, and does not fragment the CI. Let's say you have 20 Companies, and you delete 1, which constitutes 5% of the data. That entire PageChain, which was reasonably contiguous, is now relegated to the FreePageChain, contiguous and intact. To be precise, you have a single point of fragmentation, but not fragmentation in the sense of the normal use of the word. And guess what, if you turn around and perform a mass insert, where do you think that data will go? That's right: the exact same physical location as the deleted rows. And the FreePageChain moves to the PageChain, an extent and page at a time.
  • but what is alarming is that you did not know about the demand for the CI to be Unique. Sad that the MicroSofties write rubbish, but not why/what each simplistic rule is based on; not the core information. The exact symptom of a non-unique CI is that the table will be very fast immediately after DROP/CREATE CI, and then slow down over time. A good Unique CI will hold its speed, and it would take a year to slow down (2 years on my large, active banking dbs).

  • 4 hours is a very long time for 1 billion rows (I can recreate a CI on 16 billion rows with a 6-column key in 3 minutes on an enterprise platform). But in any case, that means you have to schedule it as regular weekly or on-demand maintenance.

  • Why aren't you using the WITH SORTED_DATA option? Wasn't your data sorted before the drop? This option rewrites the CI non-leaf pages but not the leaf pages (containing the rows). It can only do that if it is confident that the data was sorted. Not using this option rewrites every page, in physical order.
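The mass delete/insert cycle discussed above might be sketched as follows (table, column, and file names are hypothetical); the ORDER hint tells BULK INSERT that the file is already sorted in clustered-key order, which lets the engine skip a re-sort:

```sql
-- Hypothetical sketch of replacing one company "chunk" under a composite CI.
DELETE FROM dbo.CompanyData
WHERE CompanyIdentifier = 17;   -- frees a contiguous range of pages

BULK INSERT dbo.CompanyData
FROM 'C:\load\company_17.dat'
WITH (
    ORDER (CompanyIdentifier, DepartmentIdentifier, RowSequence),  -- file pre-sorted in CI key order
    TABLOCK
);
```

If the ORDER hint does not match the actual sort order of the file, the load still succeeds but the optimization is lost, so the pre-sort step matters.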

Now, please be kind. Before you ask me twenty questions, read up a little and understand all the issues I have defined here.

Sitnik answered 2/12, 2010 at 4:24 Comment(5)
+1 for the energy you put in, +1 for suggesting a unique CI, +1 for WITH SORTED_DATA, -1 for too much off-topic information, -1 for disrespect to Spence, -1 for 'MicroSofties', and -1 for the annoying use of 'Idiot' for Id columns. You really make me angry, because you are giving useful information, but it comes with a huge package of insults. I cannot in good conscience upvote, accept or even use this answer, because if I do I will be encouraging your behaviour. I have flagged it as offensive, so let the ops deal with it.Jewelfish
as a side-note, re monotonically increasing values - a lot of systems are query intensive, not insert intensive. In such cases, having the CI reflect the most common "I need this range" (which is typically a row-id range or a time-range) with optimum performance is ideal.Sequester
@Marc. I work with both. That is incorrect. Query-intensive systems need range queries, yes, but on real relational keys, not Ids. Time series should always be handled as a child of the parent (which has real keys). Ids force unnecessary joins, which can be eliminated. If you are handling time series and Ids as the leading or only column of the table, you have a data heap, not a database; so sure, it is very slow, and you need all the enhancement you can get. But if you go back and handle the causative problem (Idiot keys instead of Relational Keys), you will gain far greater performance.Sitnik
@littlegreen: I have inserted an update re your other question, and added a couple of paras at the end.Sitnik
You don't get the bonus points when you accept your own answer, just FYI. People are also free to change the accepted answer if you provide a better one.Bunton
