select * vs select column
Asked Answered
R

12

138

If I just need 2/3 columns and I query SELECT * instead of providing those columns in select query, is there any performance degradation regarding more/less I/O or memory?

The network overhead might be present if I do select * without a need.

But in a select operation, does the database engine always pull atomic tuple from the disk, or does it pull only those columns requested in the select operation?

If it always pulls a tuple then I/O overhead is the same.

At the same time, there might be a memory consumption for stripping out the requested columns from the tuple, if it pulls a tuple.

So if that's the case, select someColumn will have more memory overhead than that of select *

Rectangular answered 5/7, 2010 at 14:45 Comment(8)
Is there a specific RDBMS you're asking about? It's possible that how SELECT queries are executed/processed is different from database to database.Toadflax
As an aside, in PostgreSQL, if you say CREATE VIEW foo_view AS SELECT * FROM foo;, then add columns to table foo later on, those columns won't automatically show up in foo_view as expected. In other words, the * in this context only expands once (at view creation time), not per SELECT. Because of complications arising from ALTER TABLE, I would say that (in practice) * is Considered Harmful.Octans
@JoeyAdams - not just PostgresQL, this is also the behaviour of Oracle.Vivianviviana
possible duplicate of Best to use * when calling a lot of fields in mysql?Phrensy
@OMG Ponies: I was not aware of similar post. However These arenot really similer. @Lèse majesté: I am talking about Generic RDBMS. not about any specific vendor @Joey Adams: Hmm I know that * is unsafe. just wanna discuss the performance issues regarding.Rectangular
For SQL Server see sqlblog.org/blogs/aaron_bertrand/archive/2009/10/10/…Fallfish
possible duplicate of Why is SELECT * considered harmful?Fallfish
@Vivianviviana - not just PostreSQL and Oracle, also in Microsoft SQLFerland
E
34

It always pulls a tuple (except in cases where the table has been vertically segmented - broken up into columns pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons, (below) you should always select specifically those columns you want, by name.

It always pulls a tuple, because (in every vendors RDBMS I am familiar with), the underlying on-disk storage structure for everything (including table data) is based on defined I/O Pages (in SQL Server for e.g., each Page is 8 kilobytes). And every I/O read or write is by Page.. I.e., every write or read is a complete Page of data.

Because of this underlying structural constraint, a consequence is that Each row of data in a database must always be on one and only one page. It cannot span multiple Pages of data (except for special things like blobs, where the actual blob data is stored in separate Page-chunks, and the actual table row column then only gets a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases ( for special types of data, or certain optimizations for special circumstances)
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual data for the Blob, or whatever), it must be stored on a single IO Page...

EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:

   Select colA, colB
   From table1 t1
   Where Exists (Select * From Table2
                 Where column = t1.colA)

EDIT: To address @Mike Sherer comment, Yes it is true, both technically, with a bit of definition for your special case, and aesthetically. First, even when the set of columns requested are a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reasons - ALL I/O must be done in pages, and index data is stored in IO Pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
and the statement is true aesthetically because the point is that it fetches data based on what is stored in the I/O page, not on what you ask for, and this true whether you are accessing the base table I/O Page or an index I/O Page.

For other reasons not to use Select *, see Why is SELECT * considered harmful? :

Edgeworth answered 5/7, 2010 at 14:50 Comment(9)
"It always pulls a tuple" are you sure ? Hmm Okay So I was right. if thats the case select * will have less memory overhead than select column but same I/O overhead. so If we leave network overhead. select * if less overhead than that of select columnRectangular
This is NOT true. One example off the top of my head is when you want only the value of an indexed column in MySQL (for example, just to check for row existence), and you're using MyISAM storage engine, it'll grab the data from the MYI file, which could be in memory, and not even go to disk!Wyly
Ya if the requested set of tuple is in memory there will be no I/O but thats special case. So What is the summery. If I select some indexed Column then entire tuple is not read ? otherwise entire tuple is read ?Rectangular
I'm not exactly sure how MySql does caching, but in SQL Server, and In Oracle, even when data is in in-memory cache, it still accesses it using the same Page structre as it would when accessing it from disk. meaning that it would require one memory I/O per page of data... exactly the same as it would from disk. (except memory I/Os are much faster than Disk I/Os of course). Indeed, that's a goal of caching design, to make the access process totally independant on location of the data.Edgeworth
@Charles Bretana: So If I invoke select ColumnName there will be a memory overhead of stripping out not requested Cells out of the tuple and transmitting only the requested Columns. agreed ?Rectangular
@user256007, it's not that much of a special case. I'm not sure of all the cases in which it doesn't read the entire tuple, I know enough exist to always specify. You have plenty of examples already :)Wyly
@user, Not necessarily, this is so insignificant as to not be a concern to DB engine design. In fact, it's more likely that all queries have to "process" each column they deliver from the Page, (Even when using Select *, so then, the fewer columns requested, the LESS the processing... But again, this is insginificant in-memory cpu processing load, so much more important are the potential maintenance and logic errors from using Select * DO NOT Use it for performance reasons..Edgeworth
In Oracle a row can span multiple blocks if it is too long to fit on a single block (known as "chaining"), or if due to an update it needs to grow more than the space available in the current block then the entire row is moved to a new block that does have sufficient space and a pointer is left in the original block to indicate where it has moved to (known as "migration", and this does not modify index entries with the new ROWID)Cymophane
Can you spell out more the "for many other reasons"? Because those were not clear to me. If performance does not matter, why care about requesting column names?Coaly
D
122

There are several reasons you should never (never ever) use SELECT * in production code:

  • since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time

  • if you need only 2/3 of the columns, you're selecting 1/3 too much data which needs to be retrieving from disk and sent across the network

  • if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)

  • in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.

Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there) - but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.

Dashboard answered 5/7, 2010 at 14:52 Comment(11)
I am only bothered about memory and I/O overhead. I've already mentioned that select * will have more network overhead. according to your second point. you meant select operation don't pull atomic tuples. rather it pulls only the requested columns from the disks. so there will be a memory overhead in select column to check which cell's data to pull. as far I know Data is always stored on disk as tuples. not sure how select pulls it. so select * will not require a through check through Data Structure of the TableRectangular
whilst agreeing with you in practice, you are certainly correct in all cases when fetching column data from the table, as this question addresses), yr emphasis on EVER nevertheless drives me to point out that this rules is not general to ALL Sql queries... specifically, it's use in a subquery after an EXISTS predicate, (as in Where Exists (Select * From ...) the use of Select * is certainly no issue, and in some circles is considered a best practice.Edgeworth
@Charles Bretana: yes, the IF EXISTS(SELECT *... is a special case - since there, no data is really retrieved, but it's just a check for existance, the SELECT * is not an issue there...Dashboard
Typically if we need to consistently access specific parts of a table, we will create a view containing only the columns we need. Of course, we then do SELECT * from my_view. From a performance POV, is this just as bad as selecting all from the table?Hervey
On the other hand, if you SELECT * in PostgreSQL you get a well-formed data type back you can actually do something with rather than a generic record that you can't pass to other stored procedures directly.Aposiopesis
What about if I'm developing an API that makes it possible to retrieve data from one of my tables. Since I wouldn't know which data the user is interested in, I suppose SELECT * would be acceptable?Pembrook
@SimonBengtsson: I would still argue against this - suppose you have some "administrative" data in specific columns in your table that you don't want to expose to the customer? I would always explicitly specify a list of columns to fetchDashboard
Thats true. What about when querying a view that was specifically setup to be used with the API?Pembrook
@Dashboard What about SELECT column1 FROM (SELECT * FROM table1). Is this also considered bad practice?Desertion
@Steve: yes - you should always avoid SELECT * in production / professional code situations - possibly except for the IF EXISTS (SELECT * FROM ... WHERE ....) situation - there, the * doesn't hurt, since no data is really being fetched, but only the existance of a specific row matching a WHERE clause is checkedDashboard
@Dashboard In my example, no column is fetched except for column1. Then why isn't it ok?Desertion
E
34

It always pulls a tuple (except in cases where the table has been vertically segmented - broken up into columns pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons, (below) you should always select specifically those columns you want, by name.

It always pulls a tuple, because (in every vendors RDBMS I am familiar with), the underlying on-disk storage structure for everything (including table data) is based on defined I/O Pages (in SQL Server for e.g., each Page is 8 kilobytes). And every I/O read or write is by Page.. I.e., every write or read is a complete Page of data.

Because of this underlying structural constraint, a consequence is that Each row of data in a database must always be on one and only one page. It cannot span multiple Pages of data (except for special things like blobs, where the actual blob data is stored in separate Page-chunks, and the actual table row column then only gets a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases ( for special types of data, or certain optimizations for special circumstances)
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual data for the Blob, or whatever), it must be stored on a single IO Page...

EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:

   Select colA, colB
   From table1 t1
   Where Exists (Select * From Table2
                 Where column = t1.colA)

EDIT: To address @Mike Sherer comment, Yes it is true, both technically, with a bit of definition for your special case, and aesthetically. First, even when the set of columns requested are a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reasons - ALL I/O must be done in pages, and index data is stored in IO Pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
and the statement is true aesthetically because the point is that it fetches data based on what is stored in the I/O page, not on what you ask for, and this true whether you are accessing the base table I/O Page or an index I/O Page.

For other reasons not to use Select *, see Why is SELECT * considered harmful? :

Edgeworth answered 5/7, 2010 at 14:50 Comment(9)
"It always pulls a tuple" are you sure ? Hmm Okay So I was right. if thats the case select * will have less memory overhead than select column but same I/O overhead. so If we leave network overhead. select * if less overhead than that of select columnRectangular
This is NOT true. One example off the top of my head is when you want only the value of an indexed column in MySQL (for example, just to check for row existence), and you're using MyISAM storage engine, it'll grab the data from the MYI file, which could be in memory, and not even go to disk!Wyly
Ya if the requested set of tuple is in memory there will be no I/O but thats special case. So What is the summery. If I select some indexed Column then entire tuple is not read ? otherwise entire tuple is read ?Rectangular
I'm not exactly sure how MySql does caching, but in SQL Server, and In Oracle, even when data is in in-memory cache, it still accesses it using the same Page structre as it would when accessing it from disk. meaning that it would require one memory I/O per page of data... exactly the same as it would from disk. (except memory I/Os are much faster than Disk I/Os of course). Indeed, that's a goal of caching design, to make the access process totally independant on location of the data.Edgeworth
@Charles Bretana: So If I invoke select ColumnName there will be a memory overhead of stripping out not requested Cells out of the tuple and transmitting only the requested Columns. agreed ?Rectangular
@user256007, it's not that much of a special case. I'm not sure of all the cases in which it doesn't read the entire tuple, I know enough exist to always specify. You have plenty of examples already :)Wyly
@user, Not necessarily, this is so insignificant as to not be a concern to DB engine design. In fact, it's more likely that all queries have to "process" each column they deliver from the Page, (Even when using Select *, so then, the fewer columns requested, the LESS the processing... But again, this is insginificant in-memory cpu processing load, so much more important are the potential maintenance and logic errors from using Select * DO NOT Use it for performance reasons..Edgeworth
In Oracle a row can span multiple blocks if it is too long to fit on a single block (known as "chaining"), or if due to an update it needs to grow more than the space available in the current block then the entire row is moved to a new block that does have sufficient space and a pointer is left in the original block to indicate where it has moved to (known as "migration", and this does not modify index entries with the new ROWID)Cymophane
Can you spell out more the "for many other reasons"? Because those were not clear to me. If performance does not matter, why care about requesting column names?Coaly
T
22

You should always only select the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects - like accessing your result columns on client side by index, then having those indexes become incorrect by adding a new column to the table.

Tetrabranchiate answered 5/7, 2010 at 14:48 Comment(5)
+1 for an edge case that I believe not many will think of at first glance - indexes on the client side and added/changed columns.Finnigan
Yea, but is use of numeric indices for columns that common? I've always accessed column data using string keys or property names if using ORM.Toadflax
saw this a long time ago, junior programmer selected * from a table and made assumptions about column order; all his code broke as soon as someone else changed the table. What fun we had.Hax
It's probably a bad idea to use column order in general just for code readability's sake, doubly bad to use SELECT * with it.Toadflax
Wow, accessing columns by index in client code seems like a phenomenally bad idea. For that matter, relying on the order in which columns appear in a result set in any way feels very dirty to me.Sanctum
B
7

Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.

On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.

Britannic answered 5/7, 2010 at 17:39 Comment(0)
C
6

This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few Mbs in size.

Needless to say I didn't SELECT that column unless I really needed it. Having that data floating around - especially when I selected mulitple rows - was just a hassle.

However, I will admit that I otherwise usually query for all the columns in a table.

Citizenship answered 5/7, 2010 at 14:49 Comment(8)
LOB columns are always my favourite example of the perils of SELECT *. So I was about to upvote you, until I read the third paragraph. Tsk, tsk. What happens if some other developer adds a BLOB to a table which currently doesn't have such a column?Vivianviviana
@Vivianviviana - This is true - but it's a confession; I know I should know better :'( Can you forgive me if I don't use * - I actually list the columns?Citizenship
@APC, I wish I could upvote your comment more. Think of your poor coworker who just wants to add a column without causing a huge performance meltdown! Think of how angry they'll be when they discover after a few hours your innocent looking select *.Wyly
Ya If the table contains a LOB field select * should always be avoided. But I am talking about the generic theoretic aspect.Rectangular
@user256007, yes, even without BLOB... BLOB just illustrates the extreme example. Check my response to Charles, there are times when selecting specific columns can enable you to grab the data from memory without even going to disk!Wyly
@Vivianviviana & @Mike Sherov - I'm curious - what is your opinion on frameworks which use App-Models or Active-Records or Domain-Models with Data-Mappers, since they usually query for all columns automatically?Citizenship
@Richard, I think they are great for when optimizing DB performance is not your main concern, which is 99% of the time. As with most frameworks, they tend to generalize things to enable faster development while sacrificing pure performance. As Knuth said: "Premature optimization is the root of all evil." When you get to the point where you need to worry about the performance of select columns vs. select *, (ask Twitter about RoR) you can worry about it and optimize it then. If the framework isn't robust enough to support that, then I'd say you're using the wrong framework.Wyly
@user256007 - the general rule is "don't use SELECT *'. The answer from marc_s has all the reasosn why this is the case.Vivianviviana
S
6

During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * for SELECT a, b, c... Why? Becuase that's where the information on the structure and layout of the table on the system is.

It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.

Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.

Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.

The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.

DBs rely heavily on page caching for there work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.

However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.

If you have:

CREATE TABLE customer (
    id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(150) NOT NULL,
    city VARCHAR(30),
    state VARCHAR(30),
    zip VARCHAR(10));

CREATE INDEX k1_customer ON customer(id, name);

Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that you DB will pull this data from the index, rather than from the tables.

Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.

Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.

This is a hand wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.

In the end, SELECT * is useful for dynamic queries you have to type by hand, I'd never use it for "real code". Identification of individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.

Silicic answered 5/7, 2010 at 15:49 Comment(1)
Will, I have downvoted your answer, just because you use NOT NULL along with the PRIMARY KEY. Is there a good reason for you to write this way?Mangrove
S
5

I think there is no exact answer for your question, because you have pondering performance and facility of maintain your apps. Select column is more performatic of select *, but if you is developing an oriented object system, then you will like use object.properties and you can need a properties in any part of apps, then you will need write more methods to get properties in special situations if you don't use select * and populate all properties. Your apps need have a good performance using select * and in some case you will need use select column to improve performance. Then you will have the better of two worlds, facility to write and maintain apps and performance when you need performance.

Surplice answered 5/7, 2010 at 16:43 Comment(0)
E
5

The accepted answer here is wrong. I came across this when another question was closed as a duplicate of this (while I was still writing my answer - grr - hence the SQL below references the other question).

You should always use SELECT attribute, attribute.... NOT SELECT *

It's primarily for performance issues.

SELECT name FROM users WHERE name='John';

Is not a very useful example. Consider instead:

SELECT telephone FROM users WHERE name='John';

If there's an index on (name, telephone) then the query can be resolved without having to look up the relevant values from the table - there is a covering index.

Further, suppose the table has a BLOB containing a picture of the user, and an uploaded CV, and a spreadsheet... using SELECT * will willpull all this information back into the DBMS buffers (forcing out other useful information from the cache). Then it will all be sent to client using up time on the network and memory on the client for data which is redundant.

It can also cause functional issues if the client retrieves the data as an enumerated array (such as PHP's mysql_fetch_array($x, MYSQL_NUM)). Maybe when the code was written 'telephone' was the third column to be returned by SELECT *, but then someone comes along and decides to add an email address to the table, positioned before 'telephone'. The desired field is now shifted to the 4th column.

Ethnomusicology answered 6/3, 2013 at 15:19 Comment(0)
A
2

There are reasons for doing things either way. I use SELECT * a lot on PostgreSQL because there are a lot of things you can do with SELECT * in PostgreSQL that you can't do with an explicit column list, particularly when in stored procedures. Similarly in Informix, SELECT * over an inherited table tree can give you jagged rows while an explicit column list cannot because additional columns in child tables are returned as well.

The main reason why I do this in PostgreSQL is that it ensures that I get a well-formed type specific to a table. This allows me to take the results and use them as the table type in PostgreSQL. This also allows for many more options in the query than a rigid column list would.

On the other hand, a rigid column list gives you an application-level check that db schemas haven't changed in certain ways and this can be helpful. (I do such checks on another level.)

As for performance, I tend to use VIEWs and stored procedures returning types (and then a column list inside the stored procedure). This gives me control over what types are returned.

But keep in mind I am using SELECT * usually against an abstraction layer rather than base tables.

Aposiopesis answered 7/3, 2013 at 2:43 Comment(0)
B
2

Reference taken from this article:

Without SELECT *: When you are using ” SELECT * ” at that time you are selecting more columns from the database and some of this column might not be used by your application. This will create extra cost and load on database system and more data travel across the network.

With SELECT *: If you have special requirements and created dynamic environment when add or delete column automatically handle by application code. In this special case you don’t require to change application and database code and this will automatically affect on production environment. In this case you can use “SELECT *”.

Bourbon answered 7/2, 2016 at 12:35 Comment(0)
D
0

Just to add a nuance to the discussion which I don't see here: In terms of I/O, if you're using a database with column-oriented storage you can do A LOT less I/O if you only query for certain columns. As we move to SSDs the benefits may be a bit smaller vs. row-oriented storage but there's a) only reading the blocks that contain columns you care about b) compression, which generally greatly reduces the size of the data on disk and therefore the volume of data read from disk.

If you're not familiar with column-oriented storage, one implementation for Postgres comes from Citus Data, another is Greenplum, another Paraccel, another (loosely speaking) is Amazon Redshift. For MySQL there's Infobright, the now-nigh-defunct InfiniDB. Other commercial offerings include Vertica from HP, Sybase IQ, Teradata...

Diehl answered 12/3, 2016 at 4:4 Comment(0)
A
-1
select * from table1 INTERSECT  select * from table2

equal

select distinct t1 from table1 where Exists (select t2 from table2 where table1.t1 = t2 )
Accordance answered 17/7, 2017 at 11:38 Comment(1)
Could you please format your code by highlighting it and hitting Ctrl+KAllonym

© 2022 - 2024 — McMap. All rights reserved.