Sql Server Legacy Database To Clustered index or not

Asked 20/8, 2010 at 21:25 Answered 21/8, 2010 at 2:57

Solved sql sql-server indexing clustered-index identity-column

We have a legacy database which is a sql server db (2005, and 2008).

All of the primary keys in the tables are UniqueIdentifiers.

The tables currently have no clustered index created on them and we are running into performance issues on tables with only 750k records. This is the first database i've worked on with unique identifiers as the sole primary key and I've never seen sql server be this slow with returning data.

I don't want to create a clustered index on the uniqueidentifier as they are not sequential and will therefore slow the apps down when it comes to inserting data.

We cannot remove the uniqueidentifier as that is used for remote site record identity management purposes.

I had thought about adding a big integer identity column to the tables and creating the clustered index on this column and including the unique identifier column.

i.e.

int identity - First column to maintain insert speeds unique identifier - To ensure the application keeps working as expected.

The goal is to improve the identity query and joined table query performance.

Q1: Will this improve the query performance of the db or will it slow it down?

Q2: Is there an alternative to this that I haven't listed?

Thanks Pete

Edit: The performance issues are on retrieving data quickly through select statements, especially if a few of the more "transactional / changing" tables are joined together.

Edit 2: The joins between tables are generally all between the primary key and foreign keys, for tables that have foreign keys they are included in the non-clustered index to provide a more covering index.

The tables all have no other values which would provide a good clustered index.

I'm leaning more towards adding an additional identity column on each of the high load tables and then including the current Guid PK column within the clustered index to provide the best query performance.

Edit 3: I would estimate that 80% of the queries are performed on primary and foreign keys alone through the data access mechanism. Generally our data model has lazy loaded objects which perform the query when accessed, these queries use the objects id and the PK column. We have a large amount of user driven data exclusion / inclusion queries which use the foreign key columns as a filter based on the criteria of for type X exclude the following id's. The remaining 20% is where clauses on Enum (int) or date range columns, very few text based queries are performed in the system.

Where possible I have already added covering indexes to cover the heaviest queries, but as yet i'm still dissapointed by the performance. As bluefooted says the data is being stored as a heap.

Phobia answered 20/8, 2010 at 21:25 Comment(6)

Do you currently have a non-clustered index on the uniqueidentifiers? – Cadmium 20/8, 2010 at 21:35

Yes we have non-clustered indexes on the unique identifiers. – Phobia 20/8, 2010 at 21:39

Since you have at least one index on that column you are already incurring a performance penalty on the insert. Depending on the structure of the table you may just be able to drop the non clustered index and switch to clustered with little impact to what you are currently seeing. – Cadmium 20/8, 2010 at 22:1

My understanding of clustered indexes is that the data is stored in the order specified by the index which in the case of non-sequential guids it has a huge impact on insert performance. However non-clustered is just pointers to the records which doesn't degrade insert performance dramatically. I'm going to double check that information and make sure i'm correct in my understanding. – Phobia 20/8, 2010 at 22:26

Peter, that's not 100% correct. non-clustered indexes also have to be ordered (note that for all these indexes this is each page being ordered, not the rows inside each page as compared to other rows in the same page). It's true that generally nonclustered indexes are narrower so they are more efficient, but you're still getting page splits all over the place when you insert. – Spicate 21/8, 2010 at 2:54

Emtucifor, I see the ordering technique of the non-clustered index. Again my understanding fill-factor is the key part that aids sql server to not have to "re-organize" the pages on an insert unless the index page reaches capacity. – Phobia 21/8, 2010 at 3:28

If you don't have a clustered index on the table, it is being stored as a heap rather than a b-tree. Heap data access is absolutely atrocious in SQL Server so you definitely need to add a clustered index.

I agree with your analysis that the GUID column is a poor choice for clustering, especially since you don't have the ability to use NEWSEQUENTIALID(). You could create a new artificial integer key if you like, but if there is another column or combination of columns that would make sense as a clustered index, that is fine as well.

Do you have a field that is used frequently for range scans? Which columns are used for joins? Is there a combination of columns that also uniquely identifies the row aside from the GUID? Posting a sample of the data model would help us to suggest a good candidate for clustering.

Paternoster answered 21/8, 2010 at 2:13 Comment(8)

Blue unfortunately I can't post the data model as my company is extremely strict when it comes to this sort of thing. In some tables there are date columns which are used for range scans. Generally the primary keys (guids) and foreign keys (guids) are used for joins, there isn't any joins on plain text columns. I was wondering about using the date created, but I only just added that in the last release and therefore not every row has this value. I think adding the artificial integer key with the identity turned on is the path i will take. The guid is the PK in every table. – Phobia 21/8, 2010 at 3:26

Date is often a good candidate for a clustering key, especially if queries are frequently restricted by date range. If you just added it, I suppose it's not used too often, but at least with date created you can ensure ordered insertion and therefore minimize page splits and fragmentation. My only concern about the artificial key is that it will probably never be useful unless you change your data model to join on the new key instead of the GUID. – Paternoster 21/8, 2010 at 3:51

Whatever you end up choosing, be sure to examine your non-clustered indexes as well. If your GUID index contains only the GUID column, it may not be terribly useful. You might want to investigate adding some columns to the INCLUDE clause to cover your most common queries. – Paternoster 21/8, 2010 at 3:52

Oh sorry, I just read your Edit about the covering indexes :) – Paternoster 21/8, 2010 at 3:54

Blue, on a table that has a max of 10 columns, with a maximum of 100 records (only edited during setup), would a clustered index be ok on the pk of this table even if it was a guid? save speeds are not important for these types of table. – Phobia 21/8, 2010 at 3:56

I was thinking the new identity (artificial key) clustered index would include the current primary key as the 2nd column within the index. This way the Guid id's could be referenced quicker by sql server. Is this thinking incorrect? – Phobia 21/8, 2010 at 3:58

Well, here's the thing with clustered indexes. The clustered index key is used to lookup the data from your non-clustered indexes, so the entire key is added to the leaf level of all your non-clustered indexes. This means that if you have a really wide clustered index key, not only does it bloat your clustered index, it bloats all your non-clustered indexes as well. On a table with only 100 records, it's probably not a big deal since depending on the size of the rows a 100 record table is probably only a few pages anyway, but keep it in mind. – Paternoster 21/8, 2010 at 4:17

If you add the GUID to your new clustered indexes as a second column, you won't have the same page split/fragmentation hit, but you'll end up with a really wide key. Also keep in mind that the leaf level of the clustered index contains all the data in the row, so the GUID column is already going to be part of the clustered index on the leaf level, not sure if it would actually be helpful as part of the key unless you use the artificial key for lookups in conjunction with the GUID. – Paternoster 21/8, 2010 at 4:21

I'm not sure where your GUIDs come from, but if they're being generated during the insert, using the NEWSEQUENTIALID() in SQL Server instead of NEWID() will help you avoid fragmentation issues during insert.

Regarding the choice of a clustered index, as Kimberly L. Tripp states here: "the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits)." A GUID falls short on the narrow requirement when compared to an INT or even BIGINT.

Kimberly also has an excellent article on GUIDs as PRIMARY KEYs and/or the clustering key.

Apoloniaapolune answered 20/8, 2010 at 21:31 Comment(1)

The identifiers are generated through .Net Frameworks Guid.NewGuid(); as the architecture of this system was built to have the objects generate their own id. (Again a legacy system.) – Phobia 20/8, 2010 at 21:40

It's not 100% clear to me: is your number 1 access pattern to query the tables by the GUID or by other columns? And when joining to other tables, what columns (and data types) are most often used?

I can't really give you any solid recommendations until I understand more about how these GUIDs are used. I realize you said they're primary keys, but that doesn't guarantee they are used as the primary conditions on queries or in joins.

UPDATE

Now that I know a little more, I have a crazy suggestion. Do cluster those tables on the GUIDs, but set the fill factor to 60%. This will ameliorate the page split problem and give you better performance querying on those puppies.

As for using Guid.NewGuid(), it seems that you can do sequentialGUIDs in C# after all. I found the following code here on SO:

[DllImport("rpcrt4.dll", SetLastError = true)]
static extern int UuidCreateSequential(out Guid guid);

public static Guid SequentialGuid()
{
    const int RPC_S_OK = 0;
    Guid g;
    if (UuidCreateSequential(out g) != RPC_S_OK)
        return Guid.NewGuid();
    else
        return g;
}

newsequentialID() is actually just a wrapper for UuidCreateSequential. I'm sure if you can't use this directly on the client you can figure out a way to make a quick round-trip to the server to get a new sequential id from there, perhaps even with a "dispenser" table and a stored procedure to do the job.

Spicate answered 21/8, 2010 at 2:57 Comment(3)

The fillfactor is definitely a possibility, although again it's going to bloat your index. Having GUID keys in your clustered index is not ideal, but since you're already stuck with the GUIDs and they are used for the majority of your joins, you may have to just go with it. I think bottom line is you need to do some testing to figure out the best way. Do you have a test system you can use to try out the different methods? – Paternoster 21/8, 2010 at 4:52

I agree it sucks to use those huge GUIDs, and lowering the number of rows per page by reducing the fillfactor is unfortunate, but if he wants improved update performance at the cost of read performance, that's the way to go. It's not even as straight a proposition as that because page splits already reduce page row density... – Spicate 21/8, 2010 at 7:57

Yeah, I hear you, it's definitely not an easy question, but his stated goal is actually to improve read performance: "The goal is to improve the identity query and joined table query performance." Unfortunately, it's tough to make general recommendations without looking at actual execution plans. In this case, I think the best idea is to test a couple different strategies and see what works best. – Paternoster 21/8, 2010 at 14:43

You don't indicate what your performance issues are. If the worst performing action is an INSERT, then maybe your solution is right. If it's something else, then I'd look at how the clustered index can help that.

You might look at existing indexes on the table and the queries that use them. You may be able to select an index that, while degrades INSERTs slightly, provides a greater benefit to the current performance-problem areas.

Videogenic answered 20/8, 2010 at 21:44 Comment(1)

You're right bob I edited the post. The issues are when we query the more transactional tables (i.e. the ones that change the most not the static tables.) This is compounded if we join multiple transactional tables together. – Phobia 20/8, 2010 at 21:47

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags