Ordnance Survey - AddressBase Premium: processing multi-schema documents easily

  • 15 February 2022
  • 0 replies

Ordnance Survey (OS) produces a suite of products for the UK market. One of these is AddressBase Premium (ABP), a detailed dataset describing buildings, businesses, residential properties and other items you'd find on a map: lat/long data, how Royal Mail refers to each item, and how the local authorities/councils describe and classify it.

Sounds good, right? Well, it's not such a nice dataset to deal with. It is shipped in 5 km batches, each a single CSV file containing 10 different schema patterns. Nightmare? Nope! You know the secret of StreamSets!

StreamSets doesn't care about the schema on read, and that is the key to unlocking this. In the ABP files, the first column holds a record identifier, which tells us which schema and rules to apply to that row.

This pipeline makes what is normally a difficult problem clear and transparent.

  1. You can watch a record come in from the raw file, read with no header.
  2. Look at the first column and route the record to the matching lane.
  3. That lane applies the correct field names for that record type (as per the OS specification).
  4. Then we move into our reusable fragment.
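
Outside of StreamSets, the routing steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline itself: the record identifiers and the shortened field lists in FIELD_NAMES are placeholders I made up for the example, so take the real ones from the OS specification.

```python
import csv
import io
from collections import defaultdict

# Placeholder mapping: record identifier -> field names for that schema.
# These identifiers and (shortened) field lists are illustrative only;
# use the ones from the OS specification in a real pipeline.
FIELD_NAMES = {
    "21": ["RECORD_IDENTIFIER", "CHANGE_TYPE", "PRO_ORDER", "UPRN"],
    "24": ["RECORD_IDENTIFIER", "CHANGE_TYPE", "PRO_ORDER", "UPRN", "LPI_KEY"],
}


def route_records(text):
    """Read headerless CSV text and group rows into lanes by first column."""
    lanes = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        names = FIELD_NAMES.get(row[0])
        if names is None:
            # No schema known for this identifier: keep the raw row aside.
            lanes["unrouted"].append(row)
            continue
        # Apply the field names that belong to this record type.
        lanes[row[0]].append(dict(zip(names, row)))
    return lanes


# Made-up sample rows, one per schema plus one unknown identifier.
SAMPLE = '21,"I",1,100000000001\n24,"I",2,100000000001,"LPI-A"\n99,3\n'
lanes = route_records(SAMPLE)
```

Each key in lanes plays the role of one lane in the pipeline: rows with the same record identifier end up together, already carrying the field names for their schema.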

What does the fragment do?

  1. We need to give our fields the appropriate data types. The UPRN (Unique Property Reference Number), which is the key field, needs to be held as a LONG, as UPRNs are rather long numbers (one to watch out for!). Without a fragment we would have to customise this in every lane, and then repeat the work whenever another field needs the same treatment.

    So we put a Field Type Converter and the JDBC destination in a fragment, which lets us keep all the lanes in version control and means we configure it once and use it many times.
  2. Then we post the record into a database table, through a connection that is managed in the DataOps Control Hub.
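
To show why the LONG type matters, here is a rough Python sketch of the conversion step. It is not the Field Type Converter itself; the function names, the LONG_FIELDS tuple and the sample UPRN are all hypothetical and only there for illustration.

```python
# Largest values that fit in a signed 32-bit INTEGER and a 64-bit LONG.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# Hypothetical list of fields to convert; extend it as other fields need treatment.
LONG_FIELDS = ("UPRN",)


def convert_types(record, long_fields=LONG_FIELDS):
    """Return a copy of the record with the listed fields converted to integers."""
    converted = dict(record)
    for name in long_fields:
        raw = converted.get(name)
        if raw in (None, ""):
            converted[name] = None  # keep missing values as NULL
            continue
        value = int(raw)
        if not 0 <= value <= INT64_MAX:
            raise ValueError(f"{name}={raw!r} does not fit in a LONG column")
        converted[name] = value
    return converted


def fits_in_int32(value):
    """True if the value would survive in a 32-bit INTEGER column."""
    return -2**31 <= value <= INT32_MAX


# Made-up 12-digit UPRN: too large for a 32-bit integer, fine as a LONG.
row = convert_types({"UPRN": "100000000001", "CHANGE_TYPE": "I"})
```

Keeping the list of fields in one place mirrors the point of the fragment: the conversion is configured once and every lane reuses it.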

In no time at all, multiple 5 km squares are split into their components and put back together again in the right order in a database (which in my case is MySQL).

For more information on AddressBase Premium: https://www.ordnancesurvey.co.uk/business-government/products/addressbase-premium
(Not an ad, I just love that dataset.)
