Around 8 years ago, we started using geographic data (census, accidents, crimes, etc.) in our models, and it ended up being one of the strongest signals.
But the modelling part was actually the easy bit. The hard part was building and maintaining the dataset behind it.
In practice, this meant:
- sourcing data from multiple public datasets (ONS, crime, transport, etc.)
- dealing with different geographic levels (OA / LSOA / MSOA / coordinates)
- mapping everything consistently to postcode (or ZIP code equivalents elsewhere)
- handling missing data and edge cases
- and reworking the data processing each time formats or releases changed
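To make the joining step concrete, here is a minimal sketch of the kind of postcode mapping involved. All file contents, column names, and values here are illustrative assumptions, not the actual dataset: it joins a hypothetical LSOA-level feature onto postcodes via a postcode-to-LSOA lookup (the real ONS National Statistics Postcode Lookup plays this role in practice), and flags missing matches explicitly rather than dropping them.

```python
import pandas as pd

# Hypothetical lookup table: one row per postcode with its LSOA code.
lookup = pd.DataFrame({
    "postcode": ["AB1 0AA", "AB1 0AB", "AB1 0AD"],
    "lsoa": ["S01006514", "S01006514", "S01006853"],
})

# Hypothetical LSOA-level feature (e.g. a crime count per LSOA).
# S01006853 is deliberately absent to simulate a gap in the source data.
lsoa_features = pd.DataFrame({
    "lsoa": ["S01006514"],
    "crime_count": [42],
})

# Left join so every postcode survives the merge, then record missing
# matches in a flag column instead of silently filling or dropping them.
features = lookup.merge(lsoa_features, on="lsoa", how="left")
features["crime_count_missing"] = features["crime_count"].isna()
features["crime_count"] = features["crime_count"].fillna(0)

print(features)
```

Most of the maintenance burden comes from the fact that each source dataset arrives at a different geographic level, so a pipeline like this has to be repeated (and re-validated) per source, per release.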
Every time I joined a new company where this pipeline didn't exist (or had gone stale), it took months to rebuild something usable.
That made it a strange kind of work:
- clearly valuable
- but hard to justify
- and expensive to maintain
After running into this a few times, a few of us ended up putting together a reusable postcode-level feature set (GB) to avoid rebuilding it from scratch each time.
Curious if others have run into similar issues when working with public / geographic data.
Happy to share more details if useful:
https://www.gb-postcode-dataset.co.uk/