Is it a good idea to use an integer column for storing US ZIP codes in a database?
Asked Answered
A

12

62

From first glance, it would appear I have two basic choices for storing ZIP codes in a database table:

  1. Text (probably most common), i.e. char(5) or varchar(9) to support +4 extension
  2. Numeric, i.e. 32-bit integer

Both would satisfy the requirements of the data, if we assume that there are no international concerns. In the past we've generally just gone the text route, but I was wondering if anyone does the opposite? Just from brief comparison it looks like the integer method has two clear advantages:

  • It is, by means of its nature, automatically limited to numerics only (whereas without validation the text style could store letters and such which are not, to my knowledge, ever valid in a ZIP code). This doesn't mean we could/would/should forgo validating user input as normal, though!
  • It takes less space, being 4 bytes (which should be plenty even for 9-digit ZIP codes) instead of 5 or 9 bytes.

Also, it seems like it wouldn't hurt display output much. It is trivial to slap a ToString() on a numeric value, use simple string manipulation to insert a hyphen or space or whatever for the +4 extension, and use string formatting to restore leading zeroes.

Is there anything that would discourage using int as a datatype for US-only ZIP codes?

Atwitter answered 21/5, 2009 at 15:10 Comment(3)
I could swear this is a many-times dupe, but I'm having trouble finding them...Skijoring
@rmeador: #311040 is very similar, while #748302 touches on the topic as well.Tullus
Slapping ToString on is a bug waiting to happen: what if 00001 becomes a zip code? Then you can't tell between 10001 and 00001-0001.Condyloma
W
142

A numeric ZIP code is—in a small way—misleading.

Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.

Granted, for ZIP codes, no one is confused. However, for other number-like fields, it can be confusing.

Since ZIP codes aren't numbers (they just happen to be coded with a restricted alphabet), I suggest avoiding a numeric field. Saving one byte isn't worth much, and I think meaning is more important than the byte.


Edit.

"As for leading zeroes..." is my point. Numbers don't have leading zeros. The presence of meaningful leading zeros on ZIP codes is yet another proof that they're not numeric.

Wrasse answered 21/5, 2009 at 15:15 Comment(3)
If you do internationalize your app with .uk or .nl your which are alphanumeric you are bitten in the ass...Edger
Well, actually, you could potentially want to do numeric operations on zip codes, like for instance selecing all zip codes in a range from 0..90210 for instance. (This could be done to divide a country into different segments of users, for instance). Having the zip codes as strings then could give you some problems (turns out converting strings to integers in standard SQL isn't as straightforward as you'd think), and inputting integers into range comparison (like a BETWEEN statement) will just silently fail... I'm not saying you shouldn't use strings, just that there may be some exceptions.Lafontaine
A Zipcode s not a number. It's a string of digits. The same goes for phone numbers and social security numbers. Storing them as numbers is misleading, sometimes harmful.Generatrix
S
29

Are you going to ever store non-US postal codes? Canada is 6 characters with some letters. I usually just use a 10 character field. Disk space is cheap, having to rework your data model is not.

Ski answered 21/5, 2009 at 15:12 Comment(3)
Not that Canada's the only other place n the world, just using it as an example.Ski
even if you only need US ZIP codes now, as soon as marketing/sales at your company realizes they can make money elsewhere, you'll need to support other ones :) It takes no extra effort to support it now, but will take a lot later.Skijoring
Plus the leading-zero issue means you're not storing the customer's real data, you're storing some adjusted version of it that will require mental gymnastics to remember you need to always look out for too-short codes. And what if they included the +4? There's no advantage to storing them in numeric format that I can see. The only operation I've ever done on zip codes is to use them in long/lat calculations and that requires you to treat them like strings anyway: you just grab the first 3 characters (leading zero problem again).Ski
C
18

Use a string with validation. Zip codes can begin with 0, so numeric is not a suitable type. Also, this applies neatly to international postal codes (e.g. UK, which is up to 8 characters). In the unlikely case that postal codes are a bottleneck, you could limit it to 10 characters, but check out your target formats first.

Here are validation regexes for UK, US and Canada.


Yes, you can pad to get the leading zeroes back. However, you're theoretically throwing away information that might help in case of errors. If someone finds 1235 in the database, is that originally 01235, or has another digit been missed?

Best practice says you should say what you mean. A zip code is a code, not a number. Are you going to add/subtract/multiply/divide zip codes? And from a practical perspective, it's far more important that you're excluding extended zips.

Condyloma answered 21/5, 2009 at 15:12 Comment(2)
It's cool --- we've got and used regexes in the past for validationg. I'd have to anyway, as users still would input via textbox string, whether it would be parsed and stored numerically or not, I'd still have to validate the textbox before submission.Atwitter
I work a lot with addresses, in particular address cleansing. Your point regarding removal of leading zeros is far more important in real terms than your the point regarding the semantics of whether or not it's a number. When it comes to data cleansing, needing to know if the data was entered incorrectly or whether it is missing leading zeros is far more time consuming than you might first imagine.Mantra
K
9

Normally you would use a non-numerical datatype such as a varchar which would allow for more zip code types. If you are dead set on only allowing 5 digit [XXXXX] or 9 digit [XXXXX-XXXX] zip codes, you could then use a char(5) or char(10), but I would not recommend it. Varchar is the safest and most sane choice.

Edit: It should also be noted that if you don't plan on doing numerical calculations on the field, you should not use a numerical data type. ZIP Code is a not a number in the sense that you add or subtract against it. It is just a string that happens to be made up typically of numbers, so you should refrain from using numerical data types for it.

Kirkwood answered 21/5, 2009 at 15:14 Comment(0)
M
7

From a technical standpoint, some points raised here are fairly trivial. I work with address data cleansing on a daily basis - in particular cleansing address data from all over the world. It's not a trivial task by any stretch of the imagination. When it comes to zip codes, you could store them as an integer although it may not be "semantically" correct. The fact is, the data is of a numeric form whether or not, strictly speaking it is considered numeric in value.

However, the very real drawback of storing them as numeric types is that you'll lose the ability to easily see if the data was entered incorrectly (i.e. has missing values) or if the system removed leading zeros leading to costly operations to validate potentially invalid zip codes that were otherwise correct.

It's also very hard to force the user to input correct data if one of the repercussions is a delay of business. Users often don't have the patience to enter correct data if it's not immediately obvious. Using a regex is one way of guaranteeing correct data, however if the user enters a value that doesn't conform and they're displayed an error, they may just omit this value altogether or enter something that conforms but is otherwise incorrect. One example [using Canadian postal codes] is that you often see A0A 0A0 entered which isn't valid but conforms to the regex for Canadian postal codes. More often than not, this is entered by users who are forced to provide a postal code, but they either don't know what it is or don't have all of it correct.

One suggestion is to validate the whole of the entry as a unit validating that the zip code is correct when compared with the rest of the address. If it is incorrect, then offering alternate valid zip codes for the address will make it easier for them to input valid data. Likewise, if the zip code is correct for the street address, but the street number falls outside the domain of that zip code, then offer alternate street numbers for that zip code/street combination.

Mantra answered 21/5, 2009 at 15:54 Comment(2)
Yes, I can't count the number of times I've put SW1A 1AA because of excessive validation. If you can subscribe to an address database, you'll enormously improve the quality of your data. You'll also be able to identify gaps in the database (and so continue improving it).Condyloma
We subscribe to the Canada Post database for ours and the basic database is only the beginning. It takes quite a bit of fuzzy logic and some creative algorithms to make it really useful.Mantra
S
7

No, because

  • You never do math functions on zip code
  • Could contain dashes
  • Could start with 0
  • NULL values sometimes interpreted as zero in case of scalar types like integer (e.g. when you export the data somehow)
  • Zip code, even if it's a number, is a designation of an area, meaning this is a name instead of a numeric quantity of anything
Smitten answered 21/3, 2016 at 18:54 Comment(1)
Should be the accepted answer because shows different scenarios where an integer is invalid. Other answers focus this question as a performance related issue, but the point is about the correctness of using an integer for storing zip codes!Osteotomy
Z
2

Unless you have a business requirement to perform mathematical calculations on ZIP code data, there's no point in using an INT. You're over engineering.

Hope this helps,

Bill

Zlatoust answered 21/5, 2009 at 16:10 Comment(0)
C
1

ZIP Codes are traditionally digits, as well as a hyphen for Zip+4, but there is at least one Zip+4 with a hyphen and capital letters:

10022-SHOE

https://www.prnewswire.com/news-releases/saks-fifth-avenue-celebrates-the-10th-birthday-of-its-famed-10022-shoe-salon-300504519.html

Realistically, a lot of business applications will not need to support this edge case, even if it is valid.

Cristacristabel answered 8/5, 2010 at 20:11 Comment(1)
(thx for edit... can't believe I wrote the answer 10+ years ago)Cristacristabel
B
0

Integer is nice, but it only works in the US, which is why most people don't do it. Usually I just use a varchar(20) or so. Probably overkill for any locale.

Beabeach answered 21/5, 2009 at 15:14 Comment(4)
If I am not mistaken, zip codes in the US can start with a 0, so that may make the integer out of the running.Kirkwood
@Kirkwood - true, but if you are assuming US only anyways, then you can always pad to 5 digits with leading zeros for display purposes.Beabeach
Maybe, although that loses redundancy and possible error-checking. Are there ZIP codes that start with four zeros (in Maine or somewhere)? If so, how do you tell the difference between, maybe, a Bangor ZIP+4 and a ZIP from somewhere that isn't New England?Silurid
I believe the smallest US zip code has two leading zeros - 00210, Portsmouth, NH. So, even with zip + 4, you can still know that a zip must be either 5 digits or 9 digits. If it were 7 or 8 digits you would know it was a zip + 4 with one or two leading zeros. So, it would of course be a PITA, but it's possible to do without ambiguity if you really wanted to.Beabeach
L
0

If you were to use an integer for US Zips, you would want to multiply the leading part by 10,000 and add the +4. The encoding in the database has nothing to do with input validation. You can always require the input to be valid or not, but the storage is matter of how much you think your requirements or the USPS will change. (Hint: your requirements will change.)

Luedtke answered 13/1, 2010 at 9:14 Comment(0)
H
0

I learned recently that in Ruby one reason you would want to avoid this is because there are some zip codes that begin with leading zeroes, which–if stored as in integer–will automatically be converted to octal.

From the docs:

You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o…

Heteroplasty answered 24/2, 2018 at 16:48 Comment(0)
C
0

I think the ZIP code in the int datatype can affect the ML-model. Probably, the higher the code can create outlier in the data for the calculation

Cephalad answered 4/3, 2022 at 12:15 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Younts

© 2022 - 2024 — McMap. All rights reserved.