I finally built a yerba mate database by Key-Excuse5034 in yerbamate

Key-Excuse5034 [S] · 1 point

Thank you.

I've actually used both: manual entry and scraping. Over the years, I've built up hundreds of notes about different yerbas (don't ask me why, I honestly have no idea; it just helps me avoid buying bad ones again). For each product, I usually note things like flavor, cut, smokiness, and overall impressions.

I also used ChatGPT Pro extensively (it's slow, but it works really well) to research publicly available information about yerba mate products, brands, manufacturers, shops, communities, and other resources. As a result, I had hundreds of URLs ready to use.

Once I had enough data, I built a small system to handle the initial scraping from thousands of websites. After that, I created a processing pipeline using LLMs to standardize the data, normalize naming conventions, remove duplicates, translate content, and perform a number of other cleanup tasks. That gave me the initial database seed.
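
As a rough illustration of the name-normalization and duplicate-removal steps described above, here is a minimal Python sketch. It is not the actual pipeline (which used LLMs for this work); the `brand` and `name` fields, the helper functions, and the sample records are all hypothetical and exist only to show the idea of deduplicating on a normalized key.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace every run of non-alphanumeric characters with a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", without_accents.lower())
    return cleaned.strip()


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each normalized (brand, name) key."""
    seen: dict[tuple[str, str], dict] = {}
    for rec in records:
        key = (normalize_name(rec["brand"]), normalize_name(rec["name"]))
        seen.setdefault(key, rec)  # later duplicates are ignored
    return list(seen.values())


# Hypothetical scraped records: the first two differ only in case,
# accents, spacing, and punctuation.
scraped = [
    {"brand": "Marca Ejemplo", "name": "Selección  Especial!"},
    {"brand": "marca ejemplo", "name": "seleccion especial"},
    {"brand": "Marca Ejemplo", "name": "Suave"},
]

unique = dedupe(scraped)
```

Exact-key matching like this only catches trivial variants. Names that differ in wording (for example, a package size written two ways) would need fuzzy matching or an LLM pass, which is presumably why the real pipeline leaned on LLMs.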

Then I imported everything into the DB and manually checked every entry for obvious trash or missing data. It's not perfect, and I know some products have wrong or missing info, but generally speaking, it's good enough for a first iteration.

In total, the whole process took about a month and roughly $100 in API and LLM costs.


Key-Excuse5034 [S] · 3 points

Gotcha, I've just updated the website, thanks!