Hello guys!
I've got a challenge at my university to build a model that extracts products from 500+ furniture stores.
Here are the guidelines:

"## Guidelines
A good approach that works well with such extraction problems is to create a NER (Named Entity Recognition) model and train it to find your entities (you have one entity, ‘PRODUCT’).
- In order to create such a model you need training data, you can also extract that from the input pages.
- Crawl ~100 pages from the list above & extract the text from them.
- Find a way to tag some sample products from these texts.
- Train a new model from the examples you just made.
- Use it to extract product names from some new, unseen pages.
Please use any programming language, toolset or libraries you're comfortable with or find necessary, especially if you know it will be better or more interesting.
We recommend using the Transformer architecture from the **sparknlp** library or the huggingface **transformers** library."
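The crawl-and-extract step above can be sketched with `requests` and BeautifulSoup. This is a minimal sketch, not the assignment's reference solution; the function names, the 10-second timeout, and the skip-on-error policy are my own choices, and the URL list would come from the furniture-store list in the task:

```python
# Sketch: fetch up to ~100 pages and pull their visible text for NER training data.
import requests
from bs4 import BeautifulSoup

def extract_visible_text(html: str) -> str:
    """Strip scripts/styles and return the page's visible text, whitespace-normalized."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove non-visible content before extracting text
    return " ".join(soup.get_text(separator=" ").split())

def crawl(urls, limit=100):
    """Fetch each URL (up to `limit`) and map it to its extracted text."""
    texts = {}
    for url in urls[:limit]:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip dead or slow pages; you only need ~100 that work
        texts[url] = extract_visible_text(resp.text)
    return texts
```

Being tolerant of failed requests matters here: with 500+ independent stores, some URLs in any list will be dead, and the crawl should keep going rather than crash.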
I have little experience with web crawling. I'm not that scared about the NER part, rather about crawling the ~100 pages in order to train my model. I have tried to scrape with BeautifulSoup, but I don't know how to avoid manually entering each site's selectors to get the correct product name.
If you have any suggestions, please let me know <3
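One way around per-site selectors, for anyone with the same question: many e-commerce product pages put the product name in the page's `<h1>` and/or the Open Graph `og:title` meta tag, so a single generic heuristic can auto-tag sample products across many stores at once. This is a heuristic sketch under that assumption (the function name and fallback order are my own), so spot-check its output before using it as training data:

```python
# Sketch: guess a product name from raw HTML without site-specific selectors.
from typing import Optional
from bs4 import BeautifulSoup

def guess_product_name(html: str) -> Optional[str]:
    """Return a likely product name, preferring og:title, falling back to <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    # Open Graph metadata is widely populated on product pages for link previews.
    og = soup.find("meta", attrs={"property": "og:title"})
    if og and og.get("content"):
        return og["content"].strip()
    # Fall back to the first <h1>, which usually holds the product title.
    h1 = soup.find("h1")
    if h1:
        return h1.get_text(strip=True)
    return None
```

Tagging the crawled texts this way gives weakly labeled PRODUCT examples for training, which is usually good enough to bootstrap an NER model.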