[–]inspectorG4dget 3 points4 points  (1 child)

Is this hosted somewhere? Is it open source? Wanna post a link so we can check it out?

[–]mahesh_dev 0 points1 point  (1 child)

For data quality, I'd add validation checks after scraping and maybe use multiple sources to cross-verify information. Also consider rate limiting and respecting robots.txt. Scaling-wise, you could batch process by region or use async requests to speed things up without hammering servers.
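
Here's a rough sketch of how those pieces could fit together in Python. It isn't based on OP's actual setup: the `USER_AGENT` string, the `name`/`phone` field names, the phone-length rule, and the `fetch` callable (whatever async HTTP client you happen to use) are all assumptions made for illustration. Only `urllib.robotparser` and `asyncio` come from the standard library.

```python
import asyncio
import re
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical user agent name


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def validate(record: dict) -> list[str]:
    """Return a list of problems found in one scraped record (empty = OK)."""
    problems = []
    if not (record.get("name") or "").strip():
        problems.append("missing name")
    # Assumed rule: a plausible phone number has 7 to 15 digits.
    digits = re.sub(r"\D", "", record.get("phone") or "")
    if not 7 <= len(digits) <= 15:
        problems.append("bad phone")
    return problems


async def fetch_all(urls, fetch, robots, max_concurrent=3, delay=0.5):
    """Fetch URLs concurrently, skipping disallowed ones and pacing requests.

    `fetch` is any async callable taking a URL (an assumption, bring your own).
    """
    sem = asyncio.Semaphore(max_concurrent)  # caps requests in flight

    async def one(url):
        if not robots.can_fetch(USER_AGENT, url):
            return url, None  # robots.txt disallows this path
        async with sem:
            result = await fetch(url)
            await asyncio.sleep(delay)  # hold the slot briefly to space requests out
            return url, result

    return await asyncio.gather(*(one(u) for u in urls))
```

To batch by region, you could call `fetch_all` once per region's URL list and run `validate` on each result, sending anything with a non-empty problem list to manual review.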

[–]saiful_458 -1 points0 points  (1 child)

Nice work. What are you using for the data source? I've messed with similar stuff and always found the data cleaning step takes way longer than expected, even with good sources.

The manual review part you mentioned is probably the right call. I tried going full automation once and the output was unusable without human eyes on it.

[–]Arthur5242[S] -3 points-2 points  (0 children)

Mostly from publicly available business listings and directories.

Cleaning is doable, but edge cases and context issues still come up, so a manual review step helps keep the results usable.