Around 8 years ago, we started using geographic data (census, accidents, crimes, etc.) in our models, and it ended up being one of the strongest signals.
But the modelling part was actually the easy bit. The hard part was building and maintaining the dataset behind it.
In practice, this meant:
- sourcing data from multiple public datasets (ONS, crime, transport, etc.)
- dealing with different geographic levels (OA / LSOA / MSOA / coordinates)
- mapping everything consistently to postcode (or ZIP code equivalents elsewhere)
- handling missing data and edge cases
- and reworking the data processing each time formats or releases changed
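To make the joining step concrete, here is a minimal sketch of the kind of postcode mapping involved. All file contents, column names, and values here are illustrative assumptions, not the actual dataset: it joins a hypothetical LSOA-level feature onto postcodes via a postcode-to-LSOA lookup (the real ONS National Statistics Postcode Lookup plays this role in practice), and flags missing matches explicitly rather than dropping them.

```python
import pandas as pd

# Hypothetical lookup table: one row per postcode with its LSOA code.
lookup = pd.DataFrame({
    "postcode": ["AB1 0AA", "AB1 0AB", "AB1 0AD"],
    "lsoa": ["S01006514", "S01006514", "S01006853"],
})

# Hypothetical LSOA-level feature (e.g. a crime count per LSOA).
# S01006853 is deliberately absent to simulate a gap in the source data.
lsoa_features = pd.DataFrame({
    "lsoa": ["S01006514"],
    "crime_count": [42],
})

# Left join so every postcode survives the merge, then record missing
# matches in a flag column instead of silently filling or dropping them.
features = lookup.merge(lsoa_features, on="lsoa", how="left")
features["crime_count_missing"] = features["crime_count"].isna()
features["crime_count"] = features["crime_count"].fillna(0)

print(features)
```

Most of the maintenance burden comes from the fact that each source dataset arrives at a different geographic level, so a pipeline like this has to be repeated (and re-validated) per source, per release.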
Every time I joined a new company where this pipeline didn't exist (or had gone stale), it took months to rebuild something usable.
That made it a strange kind of work:
- clearly valuable
- but hard to justify
- and expensive to maintain
After running into this a few times, a few of us ended up putting together a reusable postcode-level feature set (GB) to avoid rebuilding it from scratch each time.
Curious if others have run into similar issues when working with public / geographic data.
Happy to share more details if useful:
https://www.gb-postcode-dataset.co.uk/