Querying Open Data Communities Data with SPARQL
Asked Answered
S

2

5

I'm trying to get some information from the Lower Layer Super Output Areas (LSOAs) and UK Postcodes datasets.

I need the postal code and lsoa information in a data dump for excel use.

Notation and Label of type 'Lower Layer Super Output Area'. http://opendatacommunities.org/doc/geography/lsoa/E01009437

E.g. 'lsoa' per each type 'Postcode Unit' http://opendatacommunities.org/resource?uri=http%3A%2F%2Fdata.ordnancesurvey.co.uk%2Fid%2Fpostcodeunit%2FB721NB

I have no idea how to use the SPARQL engine on the site to get this information, or how to extract the information from the N-Triples file I downloaded…

Saltwort answered 17/5, 2013 at 11:48 Comment(1)
I added an answer, but aside from the "postal code and lsoa information", I wasn't sure whether you were trying to get some other bits of information out too. If you can clarify what information you need, I can update my answer.Caller
C
14

There are two main options for retrieving the data you want. In some cases, it is possible to query the data using a publicly available SPARQL endpoint. This is probably the most convenient approach, and the one to take unless there's some definite reason that you need the data locally. There are limitations to this approach, however, and in those cases, it makes sense to download the dataset and query against it locally. I'll describe the remote endpoint solution first, and then the solution using local queries. The limitations on the SPARQL endpoint (e.g., hard timeouts) mean that the first approach isn't sufficient for this particular task, so the specific answer to this question is the second approach.

I wasn't familiar with these particular datasets and ontologies before this question, so the first approach also walks though the "getting familiar with the data" process.

Using the SPARQL endpoint

There is a Open Data Communities SPARQL endpoint against which you can run queries and get some data out. I haven't looked at this data before, so rather than just posting the final answer, I'll walk through the process that I used to figure out what sort of query to run.

One of the pages you linked to, B72 1NB, mentions that the resource has type PostcodeUnit, which has the URI

http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit

Based on this, the first thing I tried was a SPARQL query to try to retrieve some postcode units, so I used the following query in the endpoint above. (If you copy and paste it in there, you'll need to remove any leading space before SELECT. I had to do that, anyhow.)

SELECT * WHERE { 
  ?postcodeUnit a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit>
}
LIMIT 10

SPARQL results

in the endpoint linked above. (The LIMIT helps ensure that the results come back in a timely manner, and that we're not asking the server to do too much.) This produces results like

--------------------------------------------------------------
| postcodeUnit                                               |
==============================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY256SA> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY209DR> |
--------------------------------------------------------------

The B72 1NB page shows its lsoa as Birmingham 006C. The IRI for the lsoa property is (and you can see this in the data you downloaded)

http://opendatacommunities.org/def/geography#lsoa

so we extend the SPARQL query to

SELECT * WHERE { 
  ?postcodeUnit
    a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit> ;
    <http://opendatacommunities.org/def/geography#lsoa> ?lsoa .
}
LIMIT 10

SPARQL results

The results are like this:

-----------------------------------------------------------------------------------------------------------------------------
| postcodeUnit                                               | lsoa                                                         |
=============================================================================================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> | <http://opendatacommunities.org/id/geography/lsoa/E01029309> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> | <http://opendatacommunities.org/id/geography/lsoa/E01029706> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> | <http://opendatacommunities.org/id/geography/lsoa/E01018373> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> | <http://opendatacommunities.org/id/geography/lsoa/E01014172> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> | <http://opendatacommunities.org/id/geography/lsoa/E01018514> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> | <http://opendatacommunities.org/id/geography/lsoa/E01029175> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  | <http://opendatacommunities.org/id/geography/lsoa/E01014204> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> | <http://opendatacommunities.org/id/geography/lsoa/E01029225> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SW65TP>  | <http://opendatacommunities.org/id/geography/lsoa/E01001950> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF15AX>  | <http://opendatacommunities.org/id/geography/lsoa/E01014155> |
-----------------------------------------------------------------------------------------------------------------------------

You can use prefixes in your query if you want to make it a bit more readable and concise:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit ;
    geo:lsoa ?lsoa .
}
LIMIT 10

SPARQL results

The results will be the same, of course. At the bottom of each of those results pages, you can download the results in a number of other formats. One of the formats is CSV, and you might have luck importing that directly into a spreadsheet (you said you wanted to use the data in Excel).

Discussion in the comments pointed out that the sheer number of PostcodeUnits makes the result set very large. The UK Postcodes dataset contains four types of resources, in order of increasing size: Postcode Units, Postcode Sectors, Postcode Districts, and Postcode Areas. There are 1686911, 10833, 2087, and 120 resources of these types, respectively. As I understand the clarification in the comments, the idea is to associate these with Lower Layer Super Output Areas (LSOAs), e.g., Birmingham 006C. Individual Postcode Units are associated with LSOAs, but the higher level postcode regions are not. Each Postcode Unit is within its sector, district, and area. For instance, TA21 9HB is within TA, TA21 9, and TA21. Using this information, we can ask for postcode units and their corresponding district (or sector, or area), as well as their LSOA, and report just the district and the LSOA, ignoring the unit itself. For instance:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT DISTINCT ?district ?lsoa 
WHERE { 
  ?postcodeunit a pc:PostcodeUnit ;
                geo:lsoa ?lsoa ;
                sr:within ?district .
  ?district a pc:PostcodeDistrict .
}
LIMIT 10 

SPARQL results

Now, there are 34378 LSOAs in the dataset, so there's still lots of data to be selected, and trying to pull down the text results for all distinct losa/district mappings still results in a timeout. In fact, since every LSOA is associated (I expect) with some district, there are probably as many results in the output as there are LSOAs.

It looks like this is the point where we start to hit response size limits and timeouts for the SPARQL endpoint, and need to start accessing the data locally. The postcode data alone is 5.6 GB though, so this isn't a wonderful solution.

But, if you're willing to take a representative LSOA for each district, we can use SPARQL subqueries to pull these out, as in the following query which first retrieves all the postcode districts, and then for each one, finds a single LSOA that some postcode unit in the district has. I don't know whether this is an acceptable result, but you end up with an LSOA for each district, and the results are small enough (there are 2087 rows, the same as the number of districts) that they can be pulled down in any of the results formats (including CSV).

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT ?region ?lsoa 
WHERE { 
  {
    SELECT ?region WHERE { 
      ?region a pc:PostcodeDistrict .
    }
  }

  {
    SELECT ?lsoa WHERE { 
      ?postcodeunit a pc:PostcodeUnit ;
                    geo:lsoa ?lsoa ;
                    sr:within ?region .
    }
    LIMIT 1 
  }
}

SPARQL results

Using TDB locally

There are limitations to using the SPARQL endpoint such as the timeouts encountered above. In these situations, it's not too hard to download the data and get it into a Jena TDB store and to query using tdbquery. The UK postcodes page has the download link for zipped n-triples. After downloading this data, (and having Apache Jena 2.10 installed), I ran (on a Unix system):

$ tdbloader2 --loc tdb dataset_data_postcodes_20130506183000.nt

where tdb is a local directory I make to contain TDB's indexes. Loading the data takes a while (1125 seconds here), as does indexing. Once everything is loaded up, I stored the following query in a file named postcodes.sparql, and ran the query with

$ tdbquery --loc tdb --results CSV --query postcodes.sparql > unit_lsoa.csv

to generate results in CSV format, stored in the file unit_lsoa.csv. Here are the first few lines:

$ head -5 unit_lsoa.csv 
postcodeUnit,lsoa
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE,http://opendatacommunities.org/id/geography/lsoa/E01023667
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AG,http://opendatacommunities.org/id/geography/lsoa/E01023741
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AJ,http://opendatacommunities.org/id/geography/lsoa/E01023741
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AR,http://opendatacommunities.org/id/geography/lsoa/E01023684

Now, there were 1686911 defined postcode units, so I initially expected that there would be the same number of lines in unit_lsoa.csv. However, there are about 200,000 fewer. (wc -l prints the number of lines in a file.)

$ wc -l unit_lsoa.csv 
1440143 unit_lsoa.csv

As it turns out, some of the postcode units do not have associated LSOAs. I checked this by running the query

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit .
    FILTER NOT EXISTS { ?postcodeUnit geo:lsoa ?lsoa }
}

stored in the file postcodes_without_lsoa.sparql:

$ tdbquery --loc tdb \
    --results CSV \
    --query postcodes_without_lsoa.sparql > unit_without_lsoa.csv

Sure enough, there are about 200,000 lines in unit_without_lsoa.csv:

$ wc -l unit_without_lsoa.csv
246770 unit_without_lsoa.csv

The sum of 1440143 and 246770 is 1686913 which is exactly the number of postcodes (plus 2 lines for the headers in each CSV file). Mission accomplished!

Caller answered 17/5, 2013 at 13:46 Comment(9)
Awesome thank you, this would have taken me ages to figure out.Saltwort
@Saltwort Great! Glad I could help. It this answer meets your needs, do please say "thank you" by accepting (clicking the green arrow near the upvote/downvote buttons), too! Happy SPARQLing!Caller
I have a bit of an issue getting the data out now. It seems there are so many lines of data that the request times out when I try to download the CSV, I put a limit on 1686911. I don't actually need the full postcode Unit, Postcode District would be enough, but I can't see an lsoa tag on that level?Saltwort
@Saltwort I looked at the ontology a bit more, and found a way to write a query associating LSOAs with different postal regions (units, sectors, districts, and areas). It's still a bit too much data, and results in a timeout. Can you be any more specific about exactly what data you need? Maybe we can narrow it down a bit more?Caller
Thanks for following up, i'll need to look at my notes on Monday and reply back, I really appreciate your help!Saltwort
I have a list of 32844 lines in excel, each line representing a LSOA. These LSOAs are grouped into something called "CCG"s, where sometimes many different LSOAs are in the same CCG. What I'm trying to achieve is a list of postcodes per CCG. Since this information is not directly available, I would require a list of postcodes per LSOA in order to then link to CCG. Since it sounds like this dataset would be too large to extract from the sparql endpoint online, could you please tell me how I might extract it from the 5gb file I've downloaded? One column LSOA, one column Postcode units?Saltwort
@Saltwort OK, to process the data locally, I think there are two approaches: (i) thin down the data to contain only the triples that will actually be useful (so that the size of the data is much less than 5GB), and run the query locally on an in-memory model using Jena's ARQ; (ii) keep the data as is, load with Jena's TDB, and query against that. As a learning experience, I'm trying option (ii), and will update the answer once I've done it. I'll also take a look at (i), if I have time.Caller
@Saltwort OK, the answer is updated with a TDB-based approach. I think this produces the data you were trying to get. It took a while to load and query; I can send you the results via email (or some other electronic means) if you contact me through email (my address is on my profile page), so you don't have to repeat all that work. Once everything was up and loaded, though, the queries ran fairly quickly; if you have other queries to run, having the data locally would be very useful.Caller
Thanks Joshua, I've emailed you.Saltwort
S
0

you can use a web service to get this information you can comobie UK Postal Code(ex. ZE1 0AE) ,Sector, District, City, and Wards Boundaries

https://www.mashape.com/vanitysoft/uk-boundaries-io

here is example from a query of postal district TA2 returns collection of polygons(GeoJson) of sectors that make up TA2 district.TA2 District GeoJson

Scintilla answered 23/8, 2015 at 0:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.