Are there disadvantages to using a generic varchar(255) for all text-based fields?

I have a contacts table which contains fields such as postcode, first name, last name, town, country, phone number, etc., all of which are defined as VARCHAR(255) even though none of these fields will ever come close to having 255 characters. (If you're wondering, it's this way because Ruby on Rails migrations map String fields to VARCHAR(255) by default and I never bothered to override it.)

Since VARCHAR will only store the number of actual characters of the field (along with the field length), is there any distinct advantage (performance or otherwise) to using, say, VARCHAR(16) over VARCHAR(255)?

Additionally, most of these fields have indexes on them. Does a larger VARCHAR size on the field affect the size or performance of the index at all?

FYI I'm using MySQL 5.

Nebulosity answered 4/11, 2008 at 16:3 Comment(3)
@ceejayoz, stating that the accepted answer is incorrect without explaining why doesn't really help. What makes it even worse is that the accepted answer can change over time and your comment will confuse people into thinking the new accepted answer is incorrect.Maiden
@Maiden Deleted my comment as the OP apparently changed their acceptance. Good points, in the future I'll indicate which answer I'm talking about and why.Toneme
Some other answers at this duplicate question, #1262674Plasmo

In storage, VARCHAR(255) is smart enough to store only the length you need on a given row, unlike CHAR(255) which would always store 255 characters.

But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: as rows are copied from the storage engine layer to the SQL layer, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. So the strings in memory become padded out to the maximum length of your declared VARCHAR column.

When your query implicitly generates a temporary table, for instance during sorting or GROUP BY, this can use a lot of memory. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large.

You may also like to know that this "padding out" behavior means that a string declared with the utf8 character set pads out to three bytes per character, even for strings you store with single-byte content (e.g. ascii or latin1 characters). Likewise, the utf8mb4 character set causes the string to pad out to four bytes per character in memory.

So a VARCHAR(255) in utf8 storing a short string like "No opinion" takes 11 bytes on disk (ten lower-charset characters, plus one byte for length) but it takes 765 bytes in memory, and thus in temp tables or sorted results.

I have helped MySQL users who unknowingly created 1.5GB temp tables frequently and filled up their disk space. They had lots of VARCHAR(255) columns that in practice stored very short strings.
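
To make the arithmetic concrete, here's a minimal sketch (the table and column names are hypothetical) of the kind of query that materializes an implicit temporary table, plus the status counters you can compare before and after to see whether temp tables spilled to disk:

```sql
-- Hypothetical table where a short code was left as VARCHAR(255) in utf8;
-- each such column reserves up to 765 bytes per row in an in-memory temp table.
CREATE TABLE contacts (
  id      INT PRIMARY KEY,
  country VARCHAR(255) CHARACTER SET utf8
);

-- Grouping on a non-indexed column typically materializes an implicit
-- temporary table at the padded (fixed) row width:
SELECT country, COUNT(*) FROM contacts GROUP BY country;

-- Check these counters before and after the query; a jump in
-- Created_tmp_disk_tables means the temp table was forced to disk:
SHOW SESSION STATUS LIKE 'Created_tmp%tables';
```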

It's best to define the column based on the type of data that you intend to store. Enforcing application-related constraints has benefits, as other folks have mentioned. But it also has the physical benefit of avoiding the memory waste described above.

It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it is the maximum length of a VARCHAR for which the length can be encoded with one byte. It was also the maximum VARCHAR length in MySQL older than 5.0.

Kinnikinnick answered 4/11, 2008 at 16:56 Comment(12)
I always thought 255 was used so that the length of the string could fit into a single byteTrigger
@BlueRaja: That was probably true for databases whose internal file structure encoded the length of a string in a single byte, or if they encoded short strings in a single byte. But it's no longer true for most databases.Kinnikinnick
You're saying that even if I use VARCHAR(255), MySQL will still use more than one byte to store the length of the strings, even though it knows the length value will always fit in one byte?Trigger
@BlueRaja: InnoDB doesn't store the length of the following varchar, it stores a series of field offsets for all fields in the row. These field offsets may be 1 byte if the total row size is less than 127 bytes, or else 2 bytes. See forge.mysql.com/wiki/MySQL_Internals_InnoDBKinnikinnick
@BlueRaja: MyISAM (for those who still use it) does store varchar lengths, and these can be stored in 1 or 2 bytes. However: "When sending a key to the handler for index_read() or records_in_range, we always use a 2-byte length for the VARCHAR to make things simpler." See forge.mysql.com/wiki/MySQL_Internals_MyISAMKinnikinnick
one question - sorting and group by on any field or the varchar field itself?Engeddi
@iamrohitbanga: Any field. As the row is passed up out of the storage layer, it is converted to a storage-independent row format in the SQL layer. This is how it's handled as result sets are prepared, so this is the representation of the row in memory during joins, sorting, and grouping.Kinnikinnick
I think the links in the above comments are outdated. @BillKarwin what about InnoDB. Does that also convert to storage independent row format in the SQL layer? The link you posted seems to be pointing to MyISAM. PS: Is it possible to update the links?Engeddi
Yes, rows are converted in the SQL layer regardless of storage engine. Oracle has discontinued the MySQL Forge wiki. Much of the content can be found in their internals manual: dev.mysql.com/doc/internals/enKinnikinnick
@iamrohitbanga, yes, InnoDB content also pads out. This is behavior of the SQL layer. It applies this transformation to data coming from any storage engine.Kinnikinnick
If a VARCHAR(n) field holds NULL will it still be padded to occupy CHAR(n) amount of space in memory?Aggressive
@AirThomas, I'm not sure - I would think not, but I can't confirm it.Kinnikinnick

In addition to the size and performance considerations of setting the size of a varchar (and, possibly more importantly, as storage and processing get cheaper every second), the disadvantage of using varchar(255) "just because" is reduced data integrity.

Defining maximum lengths for strings is a good way to prevent longer-than-expected strings from entering the RDBMS and causing buffer overruns or exceptions/errors later, when retrieving and parsing values from the database that are longer (more bytes) than expected.

For example, if you have a field that accepts two-character strings for country abbreviations then you have no conceivable reason to expect your users (in this context, programmers) to input full country names. Since you don't want them to enter "Antigua and Barbuda" (AG) or "Heard Island and McDonald Islands" (HM), you don't allow it at the database layer. Also, it is likely some programmers have not yet RTFMed the design documentation (which surely exists) to know not to do this.

Set the field to accept two characters and let the RDBMS deal with it (either gracefully by truncating or ungracefully by rejecting their SQL with an error).
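
Here's a minimal sketch of what that looks like in MySQL (the names are hypothetical, and the outcome depends on the server's sql_mode):

```sql
-- Enforce the two-character limit at the database layer:
CREATE TABLE contacts (
  id           INT PRIMARY KEY,
  country_code CHAR(2) NOT NULL
);

-- With a strict sql_mode (STRICT_TRANS_TABLES or STRICT_ALL_TABLES) this
-- fails with "Data too long for column 'country_code'"; in non-strict mode
-- MySQL silently truncates the value to 'An' and issues a warning:
INSERT INTO contacts (id, country_code) VALUES (1, 'Antigua and Barbuda');
```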

Examples of real data that has no reason to exceed a certain length:

  • Canadian Postal Codes are of the format A1A1A1 and are always 6 characters in length, even for Santa Claus (6 characters excludes the space that can be specified for legibility).
  • email addresses - up to 64 bytes before the @, up to 255 bytes after. Never more, lest you break the Internet.
  • North American Phone Numbers are never more than 10 digits (excluding the country code).
  • Computers running (recent versions of) Windows cannot have computer names longer than 63 bytes, though more than 15 is not recommended and will break your Windows NT server farm.
  • State abbreviations are 2 characters (like the country codes exampled above)
  • UPS tracking numbers are either 18-, 12-, 11-, or 9-characters long. The 18-character numbers start with "1Z" and the 11-character numbers start with "T" which makes you wonder how they deliver all those packages if they don't know the difference between letters and numbers.

And so on...

Take the time to think about your data and its limits. If you're an architect, developer, or programmer, it's your job, after all.

By using a varchar(n) instead of varchar(255) you eliminate the problem where users (end-users, programmers, other programs) enter unexpectedly long data that will come back to haunt your code later.

And I didn't say you shouldn't also implement this restriction in the business logic code used by your application.

Freddyfredek answered 11/8, 2009 at 19:46 Comment(5)
Canadian postal codes actually have 7 characters, the space in the middle is important, and should be shown on mailing labels. North American phone numbers may have more than 10 digits if there is an extension. If you are OK not being able to store phone number extensions, then 10 digits is fine, but you will probably regret it.Biogenesis
There is definitely a case for being restrictive for data integrity. Though, it's still easy to be too restrictive. Impose restrictions for data you control, and impose sane restrictions for data requirements you can't control. Your phone number and e-mail restrictions are sane (assuming you never internationalize). Your requirement that says that truncating a two character country code is the "graceful" thing is insane. You know there was an error, don't truncate and accept. If you truncate there is an extremely high probability you'll end up with an incorrect country code.Collator
Most applications will have data validation done before sending it to the database...Thermionic
Sure. Most. But I feel that here you're assuming that a developer who is developing a new application for an existing database is aware of the restrictions on the data (we're not all experts on every type of data and how it is implemented in every database). Just because you can validate data in your application doesn't mean you did.Freddyfredek
the design documentation (which surely exists) Hah. :DDisintegration

I'm with you. Fussy attention to detail is a pain in the neck and has limited value.

Once upon a time, disk was a precious commodity and we used to sweat bullets to optimize it. The price of storage has fallen by a factor of 1,000, making the time spent on squeezing every byte less valuable.

If you use only CHAR fields, you can get fixed-length rows. This can save some disk real estate if you picked accurate sizes for fields. You might get more densely-packed data (fewer I/Os for table scans) and faster updates (easier to locate open spaces in a block for updates and inserts).

However, if you over-estimate your sizes, or your actual data sizes are variable, you'll wind up wasting space with CHAR fields. The data will wind up less densely packed (leading to more I/Os for big retrievals).

Generally, the performance benefits from attempting to put a size on variable fields are minor. You can easily benchmark by using VARCHAR(255) compared with CHAR(x) to see if you can measure the difference.
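
A rough sketch of such a benchmark (table names are hypothetical; load both tables with identical data first, then compare the client-reported timings of the same operation):

```sql
-- Two layouts holding the same data:
CREATE TABLE t_varchar (name VARCHAR(255));
CREATE TABLE t_char    (name CHAR(16));

-- ...populate both identically, then time the same sort on each:
SELECT name FROM t_varchar ORDER BY name LIMIT 10;
SELECT name FROM t_char    ORDER BY name LIMIT 10;
```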

However, sometimes, I need to provide a "small", "medium", "large" hint. So I use 16, 64, and 255 for the sizes.

Quartersaw answered 4/11, 2008 at 16:27 Comment(0)

Nowadays, I can't imagine it really matters any more.

There's a computational overhead to using variable-length fields, but with the excess of CPU power today, it's not even worth considering. The I/O system is so slow as to make the computational cost of handling varchars effectively non-existent. In fact, the computational price of a varchar is probably a net win compared to the amount of disk space saved by using variable-length fields over fixed-length fields. You most likely get greater row density.

Now, the complexity of varchar fields is that you can't easily locate a record via its record number. When you have a fixed-length row size (with fixed-length fields), it's trivial to compute the disk block that a row id points to (for example, with 128-byte rows, row #1000 starts at byte offset 128,000). With a variable-length row size, that kind of goes out the window.

So, now you need to maintain some kind of record number index, just like any other primary key, OR you need to make a robust row identifier that encodes details (such as the block, etc.) into the identifier. If you do that, though, the id would have to be recalculated if the row is ever moved on persistent storage. No big deal, you just need to rewrite all of the index entries and make sure that you either a) never expose it to the consumer or b) never assert that the number is reliable.

But since we have varchar fields today, the only value of varchar(16) over varchar(255) is that the DB will enforce the 16-character limit on the varchar(16). If the DB model is supposed to be actually representative of the physical data model, then having field lengths can be of value. If, however, it's simply "storage" rather than a "model AND storage", there's no need whatsoever.

Then you simply need to discern between a text field that is indexable (such as varchar) vs something that is not (like a text or CLOB field). The indexable fields tend to have a limit on size to facilitate the index, whereas the CLOB fields do not (within reason).
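
In MySQL specifically, that distinction shows up when declaring indexes: a VARCHAR within the key-size limit can be indexed in full, while a TEXT column can only be indexed through an explicit prefix length. A sketch (names hypothetical):

```sql
CREATE TABLE notes (
  title VARCHAR(100),
  body  TEXT,
  KEY idx_title (title),      -- full-column index works for VARCHAR
  KEY idx_body  (body(50))    -- TEXT requires a prefix length
);
```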

Klingel answered 4/11, 2008 at 16:50 Comment(0)

In my experience, if you allow a datatype of 255 characters, some stupid user (or some experienced tester) will actually fill that up.

Then you have all sorts of problems, including how much space you allow for those fields in reports and on-screen displays in your application. Not to mention the possibility of exceeding the per-row limit for data in your database (if you had more than a few of these 255 character fields).

Much easier to pick a reasonable limit at the beginning, then enforce that through the application and database.

Genoa answered 4/11, 2008 at 17:39 Comment(0)

It's good practice to allocate only a little over what you need. Phone numbers would never need to be this large.

One reason is that unless you validate against long entries, no doubt someone will eventually use all the space there is. Then you might run out of space in your row. I'm not sure of the MySQL limit, but 8060 bytes is the max row size in MS SQL.

A more sensible default would be 50, IMHO, increasing it only where a demonstrated need arises.

Wideeyed answered 4/11, 2008 at 16:7 Comment(1)
Thanks. I definitely agree about it being good practice. It's the performance aspect I'd really like clarification onNebulosity

In a MySQL context this can become important when working with indexes on such varchar columns, as MySQL has a limit of 767 bytes per index key (with InnoDB's older row formats; newer row formats raise this to 3072 bytes).

This means that when adding an index across several VARCHAR(255) columns you can hit this limit rather quickly, and even faster with utf8 or utf8mb4 columns, where each character counts as 3 or 4 bytes toward the limit, as pointed out in the answers above.
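
A sketch of hitting that limit, assuming a server or row format where the 767-byte cap applies (on newer InnoDB row formats this CREATE TABLE may succeed):

```sql
-- 255 characters × 4 bytes per utf8mb4 character = 1020 bytes > 767:
CREATE TABLE t (
  name VARCHAR(255) CHARACTER SET utf8mb4,
  KEY idx_name (name)
);
-- ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
```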

Reinaldo answered 15/12, 2017 at 8:57 Comment(0)
