Algorithm to get US zip codes from gis x,y coordinates
Asked Answered
B

3

6

I have a database of many tens of thousands of events that occurred at specific geographic locations within the United States. The data include x,y coordinates for each event, encoded using the NAD83 reference system. I want to write or use an algorithm to reliably get the US zip code associated with each NAD83 x,y coordinate.

I do not yet have zip code definitions using the NAD83 reference system. And I have never done this kind of programming before. But it just seems like it would be intuitively simple to find out whether a given x,y coordinate is located within a geometric shape of a US zip code defined using the same NAD83 reference system.

I am interested in the following:

  1. Where do I get reliable US Zip Code definitions in the NAD83 reference system format?
  2. Where can I find example code for an algorithm to find the zip code given an x,y coordinate?

I am interested in any links to instructional articles/tutorials, example code, and NAD83 zip code boundary definition data. I am doing web searches, but I figured that people on this site might be able to give me more of an expert's guide.

If the code you provide is not written in Java, I could take code written in another language and adapt it to Java for my purposes. I do not have database software installed in my computer because I just use csv or text files as inputs into my Java applications. If you have some database that you suggest I use, I would need links to instructions for how to get the data into a format that I can import into a programming language such as Java.

Finally, the street addresses in my dataset do not include zip codes, and the street addresses are written haphazardly, so that it would be very difficult to try to clean the address data up enough to try to get zip codes from the addresses. I can isolate the data to several adjacent cities, in perhaps a couple hundred zip codes, but I think that the NAD83 x,y coordinates are my best shot at deriving the zip code in which each event in my dataset occurred. I want to link my resulting zip code by zip code analyses with other data that I get about each zip code from sources like the US Census, etc.

Boxcar answered 7/1, 2012 at 2:35 Comment(2)
This may or may not be useful: #3565352Halvorsen
Given that zip code boundaries are basically polygons (albeit on a sphere), you would probably use an algorithm like the one at en.wikipedia.org/wiki/Point_in_polygon to determine if you are within a particular zip code, perhaps selecting a set of close candidate zip codes (ones whose vertices are close to the point in question) as an optimization.Tildie
U
1

i don't know where to get the ZIP code, but i think you can google it out, the ZIP code of each state.

and to question (2), first you'll need the geographic information, i.e. the boundary of each state. then you just enumerate all the points(x,y) and determine which polygon it's in.

Here is a sample code, it was written for SGU124.

#include <map>
#include <cstdio>
#include <cstring>
#include <algorithm>

#define MAXN 10005

using namespace std;

struct pnt{
    int x,y;
};
struct seg{
    pnt a,b;
}   s[MAXN];
int n;
pnt p;
int h[MAXN<<1];
int k[MAXN<<1];

void work(){
    int i,x,y,c = 0;
    memset(h,0,sizeof(h));
    memset(k,0,sizeof(k));
    for (i=0;i<n;i++){
        if (s[i].a.x<=p.x && p.x<=s[i].b.x && s[i].a.y<=p.y && p.y<=s[i].b.y){
            printf("BORDER\n");
            return;
        }
        if (s[i].a.x==s[i].b.x){
            x = s[i].a.x;
            y = p.y - p.x + x;
            if (x<=p.x && s[i].a.y<=y && y<=s[i].b.y){
                h[x+MAXN] = 1;
                if (y==s[i].a.y) k[x+MAXN] |= 1;
                    else if (y==s[i].b.y) k[x+MAXN] |= 2;
            }
        }
        else{
            y = s[i].a.y;
            x = p.x - p.y + y;
            if (x<=p.x && s[i].a.x<=x && x<=s[i].b.x){
                //printf("%d %d %d %d\n",s[i].a.x,s[i].a.y,s[i].b.x,s[i].b.y);
                h[x+MAXN] = 1;
                if (x==s[i].a.x) k[x+MAXN] |= 4;
                    else if (x==s[i].b.x) k[x+MAXN] |= 8;
            }
        }
    }
    for (i=p.x;i>=-10000;i--){
        //if (h[i+MAXN]>0) printf("@ %d %d\n",i,k[i+MAXN]);
        if (k[i+MAXN]!=9 && k[i+MAXN]!=6) c += h[i+MAXN];
    }
    //printf("p @ %d %d ",p.x,p.y);
    if (c%2) printf("INSIDE\n");
        else printf("OUTSIDE\n");
}

int main(){
    freopen("sgu124.in","r",stdin);
    int i;
    while (~scanf("%d",&n)){
        for (i=0;i<n;i++){
            scanf("%d%d",&s[i].a.x,&s[i].a.y);
            scanf("%d%d",&s[i].b.x,&s[i].b.y);
            if (s[i].a.x>s[i].b.x || s[i].a.y>s[i].b.y) swap(s[i].a,s[i].b);
        }
        scanf("%d%d",&p.x,&p.y);
        work();
        //break;
    }
    return 0;
}
Ursuline answered 7/1, 2012 at 3:9 Comment(3)
Thank you. What language is your sample code written in? Knowing the language would help enable me to port it to Java. Also, I do not know your location, but there are perhaps many thousands of zip codes within a state. If someone can help me figure out where to download the polygonal definitions of zip codes in NAD83 format, I guess you are suggesting that I simply iterate through the polygon definitions to see where there is a match. I am not really conceptuallizing the input format of the polygon data, but that might be easier if you tell me what language your code is in.Boxcar
@Boxcar the code is written in C++(though a lot of c stuff) and for each location with an unique ZIP code, you need the boundary which can be presented as a list of points. e.g. (0,0)->(0,1)->(1,1)->(1,0) just describes a rectangle. if it's really hard to get some well organized data, and you are patient enough, you may draw the boundary yourself.Ursuline
I am marking this as the answer because it is the closest to giving me the tools for what I asked. Here is a link to ascii text data for zip codes that could be connected to an algorithm such as you suggest: census.gov/geo/www/cob/ascii_info.html I may actually do it in a different way. And I may make another posting with a more focused question. This posting was kind of general, and I think it has been answered on the level at which it was asked. Thank you for your assistance. +1, and a check mark for answering it.Boxcar
D
4

You can use GeoTools in java. Here is a an example the searches for a point in a shapefile.

// projection/datum in SR-ORG:7169 (GCS NAD83)
File shapeFile = new File("zt08_d00.shp");
FileDataStore store = FileDataStoreFinder.getDataStore(shapeFile);
SimpleFeatureSource featureSource = store.getFeatureSource();
// Boulder, CO
Filter filter = CQL.toFilter("CONTAINS(the_geom, POINT(-105.292778 40.019444))");
SimpleFeatureCollection features = featureSource.getFeatures(filter);
for (SimpleFeature f : features) {
    System.out.println(f.getAttribute('NAME'));
}

I grabbed a shapefile from the U.S. Census Bureau's collection of 5-Digit ZIP Code Tabulation Areas from the 2000 Census. I just used a single file for the state of colorado. You would need merge these into a single FeatureSource. Running this outputs 80302 for Boulder, CO.

GeoTools also allow you to convert between projections if needed. Luckily these shapefiles are already in NAD83.

Deploy answered 7/1, 2012 at 6:29 Comment(3)
Thank you. +1. I was actually reading the GeoTools tutorial when you were writing your suggestion. And I noticed that it has some licensing restrictions. Currently, I do not intend to make commercial use of this application. However, I try to only use code that comes without licensing restrictions, so that I am free to make commercial use of software later. Also, GeoTools requires that dependencies be added to my code. I have found a shape file with zip code information for the region I am looking at. Do you know how to open a shape file in java code without adding a dependency?Boxcar
GeoTools is LGPL, so you can use it in a commercial product without releasing your code. All the other java shapefile libraries I'm familiar with are commercial.Deploy
I installed and used GeoTools to load a shape file of zip code boundaries. But it does not give me numerical data I need to determine the zip code within which each x,y coordinate in my data set falls. It shows this answer graphically, but not in terms of data I can manipulate in arrays. I need to tag each record in my dataset with a zip code so that I can do quantitative analysis of what happens within each zip code, using an algorithm like those proposed by other responses to this post. I might frame this as a different post if someone is not able to answer before I next login.Boxcar
U
1

i don't know where to get the ZIP code, but i think you can google it out, the ZIP code of each state.

and to question (2), first you'll need the geographic information, i.e. the boundary of each state. then you just enumerate all the points(x,y) and determine which polygon it's in.

Here is a sample code, it was written for SGU124.

#include <map>
#include <cstdio>
#include <cstring>
#include <algorithm>

#define MAXN 10005

using namespace std;

struct pnt{
    int x,y;
};
struct seg{
    pnt a,b;
}   s[MAXN];
int n;
pnt p;
int h[MAXN<<1];
int k[MAXN<<1];

void work(){
    int i,x,y,c = 0;
    memset(h,0,sizeof(h));
    memset(k,0,sizeof(k));
    for (i=0;i<n;i++){
        if (s[i].a.x<=p.x && p.x<=s[i].b.x && s[i].a.y<=p.y && p.y<=s[i].b.y){
            printf("BORDER\n");
            return;
        }
        if (s[i].a.x==s[i].b.x){
            x = s[i].a.x;
            y = p.y - p.x + x;
            if (x<=p.x && s[i].a.y<=y && y<=s[i].b.y){
                h[x+MAXN] = 1;
                if (y==s[i].a.y) k[x+MAXN] |= 1;
                    else if (y==s[i].b.y) k[x+MAXN] |= 2;
            }
        }
        else{
            y = s[i].a.y;
            x = p.x - p.y + y;
            if (x<=p.x && s[i].a.x<=x && x<=s[i].b.x){
                //printf("%d %d %d %d\n",s[i].a.x,s[i].a.y,s[i].b.x,s[i].b.y);
                h[x+MAXN] = 1;
                if (x==s[i].a.x) k[x+MAXN] |= 4;
                    else if (x==s[i].b.x) k[x+MAXN] |= 8;
            }
        }
    }
    for (i=p.x;i>=-10000;i--){
        //if (h[i+MAXN]>0) printf("@ %d %d\n",i,k[i+MAXN]);
        if (k[i+MAXN]!=9 && k[i+MAXN]!=6) c += h[i+MAXN];
    }
    //printf("p @ %d %d ",p.x,p.y);
    if (c%2) printf("INSIDE\n");
        else printf("OUTSIDE\n");
}

int main(){
    freopen("sgu124.in","r",stdin);
    int i;
    while (~scanf("%d",&n)){
        for (i=0;i<n;i++){
            scanf("%d%d",&s[i].a.x,&s[i].a.y);
            scanf("%d%d",&s[i].b.x,&s[i].b.y);
            if (s[i].a.x>s[i].b.x || s[i].a.y>s[i].b.y) swap(s[i].a,s[i].b);
        }
        scanf("%d%d",&p.x,&p.y);
        work();
        //break;
    }
    return 0;
}
Ursuline answered 7/1, 2012 at 3:9 Comment(3)
Thank you. What language is your sample code written in? Knowing the language would help enable me to port it to Java. Also, I do not know your location, but there are perhaps many thousands of zip codes within a state. If someone can help me figure out where to download the polygonal definitions of zip codes in NAD83 format, I guess you are suggesting that I simply iterate through the polygon definitions to see where there is a match. I am not really conceptuallizing the input format of the polygon data, but that might be easier if you tell me what language your code is in.Boxcar
@Boxcar the code is written in C++(though a lot of c stuff) and for each location with an unique ZIP code, you need the boundary which can be presented as a list of points. e.g. (0,0)->(0,1)->(1,1)->(1,0) just describes a rectangle. if it's really hard to get some well organized data, and you are patient enough, you may draw the boundary yourself.Ursuline
I am marking this as the answer because it is the closest to giving me the tools for what I asked. Here is a link to ascii text data for zip codes that could be connected to an algorithm such as you suggest: census.gov/geo/www/cob/ascii_info.html I may actually do it in a different way. And I may make another posting with a more focused question. This posting was kind of general, and I think it has been answered on the level at which it was asked. Thank you for your assistance. +1, and a check mark for answering it.Boxcar
G
0

You mentioned that you have addresses that you might be able to use. In that case, an address verification service will allow you to programmatically find the ZIP codes based on the the address and city/state. Even if poorly formatted, the address data could likely get you to 90 or 95% of your goal, leaving the remainder to either clean up and reprocess or try to use the coordinates to determine.

SmartyStreets will take an uploaded CSV file with your data and perform address validation (correct and standardize the address) and then verify the addresses using data from the USPS. One unique feature of SmartyStreets is that they don't charge anything for bad addresses. This would allow you to format and process various permutations of each address (to try to account for the haphazard data) and only pay for it if a positive match is resolved.

In the interest of full disclosure, I am the founder of SmartyStreets. We provide street address verification.

Gabriellegabrielli answered 17/1, 2012 at 23:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.