Are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?
Summary: I've seen a lot of advice against using SELECT * in MS SQL, due to both performance and maintainability concerns. However, many of these posts are very old - 5 to 10 years! From many of these posts, it seems the performance concerns may have actually been quite small, even in their time, and as for the maintainability concerns ("oh no, what if someone changes the columns, and you were getting data by indexing an array! your SELECT * would get you in trouble!"), modern coding practices and ORMs (such as Dapper) seem - at least in my experience - to eliminate such concerns.

And so: are there concerns with SELECT * that are still relevant today?


Greater context: I've started working at a place with a lot of old MS code (ASP scripts and the like), and I've been helping to modernize a lot of it. However, most of my SQL experience is actually from MySQL and PHP frameworks and ORMs - this is my first time working with MS SQL - and I know there are subtle differences between the two. Also, my co-workers are a little older than I am, and have some concerns that - to me - seem "older" ("nullable fields are slow! avoid them!"). But again: in this particular field, they definitely have more experience than I do.

For this reason, I'd also like to ask: whether SELECT * with modern ORMs is or isn't safe and sane to do today, are there recent online resources that say so?

Thanks! :)

Tungstic answered 2/9, 2016 at 15:29 Comment(15)
Doesn't pretty much any ORM explicitly define the column names anyway? If you're using an ORM to access the data, wouldn't that make SELECT * kind of a moot point?Kab
I'd consider it less of a performance issue and more a risk that your query outputs could change if your database schema changes. You've written a query to pull out certain data; why risk this changing without your knowledge?Oenomel
@RichBenner when using an ORM, this is not an issue. Dapper and Doctrine (two ORMs I happen to have experience with; I'm relatively certain other ORMs work similarly) map query output to an object. You build a class whose structure matches the table definition (field names and types), and the ORM handles the rest. It's also smart enough to ignore mismatches, so you CAN select only the fields you need, and leave your object with default values on the other properties (I'm not sure when I would ever choose to do that, though, unless the table is truly enormous and I only want a couple of columns).Tungstic
ORMs execute SQL statements. The same concerns apply. If you don't want a column, don't load it. If you load the full mapped objects when you only want a couple of fields, you are harming performance just as if you used *. That's why the standard advice, even with ORMs, is to load only what you need.Nordgren
@Tungstic there is a lot of extremely bad data access code in PHP - for example, generating SQL through string concatenation resulting in SQL injection attacks. As for nulls, they aren't just slow (how do you match against a null?), they are a sign that the schema design is bad. What does NULL mean? Missing value? Not applicable? MySQL allows or even encourages practices that appear convenient yet result in horrible performanceNordgren
@PanagiotisKanavos different ORMs seem to give you different levels of access to SQL. In Doctrine, for example, you almost NEVER write SQL; whether it writes "SELECT field1, field2, field3" behind the scenes (on every field in the entity class) or "SELECT *", I don't know, but I definitely CAN'T select "only the fields I need." Meanwhile, Dapper has you write simple SELECT statements by hand, which it then maps the results of to an object, so you COULD select fewer fields. Is one of the two "better"? I don't know. It is interesting to see different ORMs taking different approaches here.Tungstic
@Tungstic not really - ORMs still generate and execute SQL. That's what they do: they map objects to SQL statements. You don't need Dapper to load specific fields either - just use a proper LINQ query with EF or NH. When using ORMs you have to be more careful to avoid performance problems - no matter what language, environment, OS or DBMS you use.Nordgren
@PanagiotisKanavos we'll probably always get the fastest SQL by hand-writing (hand-string-concat'ing?) each query for each individual situation, but we accept slightly-slower performance for easier-to-maintain code via ORMs (which often don't let you write SQL at all). I wonder specifically if there's any concern about using SELECT * in an ORM which DOES have you write SQL for it; ex: do I spend my time adding a "fields" parameter to a utility method which finds an entity by a non-ID field, so that the caller can name only the fields needed, or do I just SELECT * into the ORM and not care? etcTungstic
^ Stack Overflow has very limited comment space. I can imagine spending a ton of time trying to remove all SELECT *s from our code in the name of performance, or security, or whatever, but when an ORM is handling most of the logic, again: does it matter? is my time better spent elsewhere? is EVERYONE'S time better spent by having a cleaner interface that isn't worried about which fields are exactly being selected?Tungstic
@Tungstic you tell me. Once you call the execute method what do you get? All columns or only those you need? If you get 100 columns instead of 2, where did these values come from? How much data did you transfer over the network? How many data pages did you have to load from disk instead from the cached index pages? The performance difference is not slight. It's plain old bad data access.Nordgren
@PanagiotisKanavos perhaps I should do some performance measuring myself; since this is an issue people have strong opinions on, I'd expect there to be more research like this already done. ex: if the difference is not slight, do you know of any resources that have documented this?Tungstic
@Tungstic you should really, really take time to understand what is going on. An interface that loads everything is actually dirtier and far more expensive than one that does a proper job. You have to buy more web servers and a larger database server to load useless data. Not good, not scaleable.Nordgren
@Tungstic there are tons of research and advice. It's just not what people want to hear, which typically results in people ignoring the research. Which is why you hear about scalability disasters so frequently.Nordgren
@PanagiotisKanavos I understand how ORMs work. I've gone from hand-writing SQL in PHP 12-ish years ago to rarely writing it today. I understand there are performance hits, both in getting more data than you need and in using ORMs. But I also understand the importance of clean, self-explanatory, easy-to-read, and easy-to-maintain code. And there is often a trade-off there in performance, too. But the question isn't "is selecting more data than you need EVER worse?" - of course it is - the question is "is it worse to such a degree that it should never be the default approach?", esp. with an ORM.Tungstic
(and as mentioned in the OP, most advice/research I've found is 5-10 years old, without the aid of ORMs. I'd love to see something more-modern!)Tungstic

I will not touch maintainability in this answer, only the performance part.

Performance in this context has little to do with ORMs.

It doesn't matter to the server how the query that it is running was generated, whether it was written by hand or generated by the ORM.

It is still a bad idea to select columns that you don't need.

It doesn't really matter from a performance point of view whether the query looks like:

SELECT * FROM Table

or all columns are listed there explicitly, like:

SELECT Col1, Col2, Col3 FROM Table

If you need just Col1, then make sure that you select only Col1. Whether it is achieved by writing the query by hand or by fine-tuning your ORM, it doesn't matter.


Why selecting unnecessary columns is a bad idea:

  • extra bytes to read from disk

  • extra bytes to transfer over the network

  • extra bytes to parse on the client

  • But, the most important reason is that the optimiser may not be able to generate a good plan. For example, if there is a covering index that includes all requested columns, the server will usually read just this index, but if you request more columns, it will do extra lookups, use some other index, or just scan the whole table. The final impact can vary from negligible to seconds vs hours of run time. The larger and more complicated the database, the more likely you are to see a noticeable difference (see the sketch below).
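
To make the covering-index point concrete, here is a minimal T-SQL sketch; the table and index are hypothetical, invented purely for illustration:

-- Hypothetical table with a narrow covering index
CREATE TABLE dbo.Orders (
    Id INT IDENTITY PRIMARY KEY,
    CustomerId INT NOT NULL,
    Total MONEY NOT NULL,
    Notes NVARCHAR(MAX) NULL
);
CREATE INDEX IX_Orders_Customer ON dbo.Orders (CustomerId) INCLUDE (Total);

-- Covered: the server can answer this from the small index alone
SELECT CustomerId, Total FROM dbo.Orders WHERE CustomerId = 42;

-- Not covered: the Notes column forces key lookups into the clustered
-- index (or a scan of the whole table), even though the WHERE is identical
SELECT * FROM dbo.Orders WHERE CustomerId = 42;

Comparing the actual execution plans of the last two statements shows the difference immediately.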

There is a detailed article on this topic, "Myth: Select * is bad", on the Use The Index, Luke web-site.

Now that we have established a common understanding of why selecting everything is bad for performance, you may ask why it is listed as a myth? It's because many people think the star is the bad thing. Further, they believe they are not committing this crime because their ORM lists all columns by name anyway. In fact, the crime is to select all columns without thinking about it - and most ORMs readily commit this crime on behalf of their users.


I'll add answers to your comments here.

I have no idea how to approach an ORM that doesn't give me an option of which fields to select. I personally would try not to use it. In general, an ORM adds a layer of abstraction that leaks badly. https://en.wikipedia.org/wiki/Leaky_abstraction

It means that you still need to know how to write SQL code and how the DBMS runs this code, but you also need to know how the ORM works and generates this code. If you choose not to know what's going on behind the ORM, you'll have unexplainable performance problems when your system grows beyond trivial.

You said that at your previous job you used an ORM for a large system without problems. It worked for you. Good. I have a feeling, though, that your database was not really large (did you have billions of rows?) and that the nature of the system allowed performance questions to be hidden behind the cache (which is not always possible). The system may never grow beyond the hardware capacity. If your data fits in cache, it will usually be reasonably fast in any case. It begins to matter only when you cross a certain threshold, after which suddenly everything becomes slow and it is hard to fix.

It is common for a business/project manager to ignore possible future problems which may never happen. The business always has more pressing, urgent issues to deal with. If the business/system grows enough that performance becomes a problem, it will either have accumulated enough resources to refactor the whole system, or it will continue working with increasing inefficiency, or, if the system happens to be really critical to the business, it will just fail and give a chance to another company to overtake it.

Answering your question "whether to use ORMs in applications where performance is a large concern": of course you can use an ORM. But you may find it more difficult than not using one. With an ORM and performance in mind, you have to manually inspect the SQL code that the ORM generates and make sure that it is good code from a performance point of view. So, you still need to know SQL and the specific DBMS that you use very well, and you need to know your ORM very well to make sure it generates the code that you want. Why not just write the code that you want directly?

You may think that this situation of ORM vs raw SQL somewhat resembles a highly optimising C++ compiler vs writing your code in assembler manually. Well, it does not. A modern C++ compiler will indeed, in most cases, generate code that is better than what you could write manually in assembler. But the compiler knows the processor very well, and the nature of the optimisation task is much simpler than what you have in the database. An ORM has no idea about the volume of your data and knows nothing about your data distribution.

The simple classic example of top-n-per-group can be done in two ways, and the best method depends on the data distribution, which only the developer knows. If performance is important, even when you write SQL code by hand you have to know how the DBMS works and interprets this SQL code, and lay out your code in such a way that the DBMS accesses the data in an optimal way. SQL itself is a high-level abstraction that may require fine-tuning to get the best performance (for example, there are dozens of query hints in SQL Server). The DBMS has some statistics and its optimiser tries to use them, but they are often not enough.
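
For illustration, here is that top-1-per-group example sketched in T-SQL (the schema is hypothetical); both variants are correct, and which one is faster depends on the data distribution:

-- Variant 1: window function; often wins when most groups are needed
SELECT CustomerId, OrderDate, Total
FROM (
    SELECT CustomerId, OrderDate, Total,
           ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
) AS ranked
WHERE rn = 1;

-- Variant 2: TOP (1) per group via CROSS APPLY; often wins when there are
-- few customers with many orders each and a supporting index exists
SELECT c.CustomerId, o.OrderDate, o.Total
FROM dbo.Customers AS c
CROSS APPLY (
    SELECT TOP (1) OrderDate, Total
    FROM dbo.Orders
    WHERE CustomerId = c.CustomerId
    ORDER BY OrderDate DESC
) AS o;

Neither form is universally better - and this is exactly the kind of choice an ORM cannot make for you.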

And now on top of this you add another layer of ORM abstraction.

Having said all this, "performance" is a vague term. All these concerns become important only after a certain threshold. Since modern hardware is pretty good, this threshold has been pushed rather far, allowing a lot of projects to ignore these concerns.

Example: an optimal query over a table with a million rows returns in 10 milliseconds; a non-optimal query returns in 1 second - 100 times slower. Would the end user notice? Maybe, but likely it is not critical. Grow the table to a billion rows, or instead of one user have 1000 concurrent users: 1 second vs 100 seconds. The end user would definitely notice, even though the ratio (100 times slower) is the same. In practice the ratio would increase as the data grows, because various caches would become less and less useful.

Hoggish answered 13/9, 2016 at 2:53 Comment(6)
Agree. I would also add 'extra CPU', especially when the data are encrypted or compressed.Indifferent
Agree. Most application devs really don't understand scalable performance because in the application layer you can just add more boxes (unless you have a really large operation.) I appreciate the OP's desire to understand better because it doesn't take that big of a shop to get bogged down on data issues. I would suggest to the curious to pick up a reputable performance tuning book to get a better foundation to build a truly scalable system.Etherege
I like this answer; it's the best - in my mind - so far. I wonder - though it may be a bit tangential - if you have any suggestions on how to approach an ORM that - in most cases - doesn't even give you the option of how many fields to select, such as Symfony, Eloquent, etc. (Dapper is the first I've used that gives me this option almost all the time, and the reason this question has even come up for me.) do you see ORMs that don't give this option to be flawed and to be avoided?Tungstic
having gotten to swe's answer, they mentioned that - as a personal opinion - they do not use ORMs in applications where performance is a large concern. would that be your feeling, as well, @Vladimir Baranov?Tungstic
Also remember that if you use SELECT * and there is a join, you are by definition returning unneeded data as the data in the join fields will be repeated.Respectable
@Ben, I added my answer to your comments.Hoggish

From a SQL Server performance point of view, you should NEVER EVER use select *, because it tells SQL Server to read the complete row from disk or RAM. Even if you need all the fields, I would suggest not using select *, because you do not know who may be appending columns to the table that your application does NOT need. For details, see the answer of @sandip-patel.
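
A quick way to see the cost for yourself (the table and column names here are hypothetical; SET STATISTICS IO is a standard SQL Server diagnostic):

-- Compare the 'logical reads' reported for each statement
SET STATISTICS IO ON;

SELECT CustomerId FROM dbo.Orders WHERE CustomerId = 42; -- narrow column list
SELECT * FROM dbo.Orders WHERE CustomerId = 42;          -- full rows

SET STATISTICS IO OFF;

If a narrow index covers the first query, its logical reads will typically be a small fraction of the second's.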

From a DBA perspective: if you name exactly those columns you need, the DBA can better analyse and optimise the database.

From an ORM point of view with changing column names, I would suggest NOT using select *. You WANT to know if the table changes. How can you guarantee that your application runs and gives correct results if you do not get errors when the underlying tables change?

Personal opinion: I really do not use ORMs in applications that need to perform well...

Odrick answered 14/9, 2016 at 6:10 Comment(4)
"Personal Opinion: I really do not work with ORM in Applications needing to perform well..." <-- I wonder if this is really getting at the heart of the issue. do you see ORMs only intended for getting something small running quickly? at my previous job, we used them, as-is, (with SELECT *s likely under the hood) for large products, right up through release; optimizations usually involved making better use of caching than worrying about the performance of the ORM itself. does that seem crazy-wrong to you? (my co-workers were more-experienced than I, so I never thought to question this approach)Tungstic
(I want to clarify that we of course made sure that our ORM calls were as efficient as possible - that we didn't join foolishly, etc - but beyond that, we trusted in the ORM's performance, and never questioned how it wrote its queries (which I'm pretty sure always involved SELECT *s))Tungstic
I think even in larger applications (I usually write small customer AND task-specific apps) you HAVE TO use some kind of ORM, because without one you have no (or harder) tier separation and other things to think about. But if you have specialized applications working with large amounts of data to parse/merge/process and display, you are better off without. (As pointed out before: my personal opinion...)Odrick
And to come back a little to your question: on the database side, nothing has changed in recent years in how data is stored, read, or cached, so all advice given 5 years ago is still up to date and current.Odrick

This question has been out some time now, and no one seems to be able to find what Ben is looking for...

I think this is because the answer is "it depends".

There just is NOT ONE answer to this.

Examples

  • As I pointed out before, if a database is not yours and it may be altered often, you cannot guarantee performance, because with select * the amount of data per row may explode
  • If you write an application using ITS OWN database, no one alters your DB (hopefully) and you need all your columns, so what's wrong with select *?
  • If you build some kind of lazy loading, with "main properties" being loaded instantly and others being loaded later (of the same entity), you cannot go with select *, because you would get everything at once
  • If you use select *, other developers will wonder every time "did he think about that select *?" as they try to optimize. So you should add enough comments...
  • If you build a 3-tier application with large caches in the middle tier, and performance is handled by the cache, you may use select *
  • Expanding on 3-tier: if you have many, many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle tier for every byte being wasted (as someone pointed out in the comments before)
  • If you build a small app for 3 users and some thousands of records, the budget may not give you time to optimize speed/db-layout/anything
  • Speak to your DBA... they will advise you WHICH statements have to be changed/optimized/stripped down/...

I could go on. There just is not ONE answer. It just depends on too many factors.

Odrick answered 15/9, 2016 at 19:54 Comment(2)
some of these sound like concerns for the ORM. ex: I know Doctrine uses lazy-loading on joined data (unless you tell it to join immediately), and again, it's building the SQL behind the scenes, almost certainly with SELECT * (or equivalent), though I can't say for certain. my question started with Dapper, however, which does NOT offer lazy-loading (at least not out-of-the-box), and DOES give you control over the SQL query... which is all to say that you're probably right on "it depends". it's not the most thrilling answer, but I think it is the correct one. thanks for all your comments!Tungstic
also: my experience with Symfony/Doctrine has been "if there are performance inefficiencies in the ORM, they are a trivial concern compared to gains from caching", which you also touched on here. but again, adding Dapper to a small C# project is not the same environment. so again: it depends.Tungstic

It is generally a better idea to select the column names explicitly. Should a table receive an extra column, it would be loaded by a select * call even though the extra column is not needed.

This can have several implications:

  • More network traffic

  • More I/O (got to read more data from disk)

  • Possibly even more I/O (a covering index cannot be used - a table scan is performed to get the data)

  • Possibly even more CPU (a covering index cannot be used so data needs sorting)

EXCEPTION. The only place where Select * is OK is in the sub-query after an Exists or Not Exists predicate clause, as in the example below. The server only checks whether a matching row exists, so the select list of such a sub-query is never actually evaluated:

Select colA, colB
From table1 t1
Where Exists (Select * From Table2  Where column = t1.colA)


Gavrielle answered 13/9, 2016 at 10:50 Comment(0)

Maintainability point.

If you do a "Select * from Table"

Then I alter the Table and add a column.

Your old code may well break, as it now receives an additional column it never expected.

This creates a nightmare for future revisions, because you have to identify all the locations of the select *.
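
One concrete way this bites (tables hypothetical, for illustration): an INSERT ... SELECT * stops working the moment the source table grows a column:

-- Works while dbo.Orders and dbo.OrdersArchive have matching columns
INSERT INTO dbo.OrdersArchive
SELECT * FROM dbo.Orders WHERE OrderDate < '2015-01-01';

ALTER TABLE dbo.Orders ADD PromoCode VARCHAR(20) NULL;

-- The same INSERT now fails with:
-- "Column name or number of supplied values does not match table definition."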

The speed difference is so minimal that I would not be concerned about it. There is a speed difference between using Varchar and Char - Char is faster - but it is so minimal it is just about not worth talking about.

Select *'s biggest issue is with changes (additions) to the table structure.

A maintainability nightmare, and a sign of a junior programmer and poor project code. That being said, I still use select *, but I intend to remove it before I go to production with my code.

Bluegill answered 14/9, 2016 at 23:24 Comment(1)
table alterations won't affect any ORM I've used in this way (though it's possible there are more-sensitive ORMs out there)Tungstic
