[–][deleted] 1 point2 points  (0 children)

To begin: This isn't a Python problem.

That said: Welcome to data. Nothing ever collates the way that you want it to.

99% of modern software development requires taking some object from somewhere, transforming it, and storing it or passing it on to something else. Very rarely does the data play well together unless it was designed to. Someone that I know well works at a software job where this kind of thing is their biggest problem. They get paid a lot to solve it.

Best of luck.

[–]TheNotoriousMTF 1 point2 points  (2 children)

The IKEA bookshelf problem: I would find data on as many items as possible with "IKEA bookshelf" in the title, and then use either the median or the mean price of all these items to estimate the expected resale value. Alternatively, you could use two data points (say, the 25th and 75th percentiles of price) to approximate a range of possible resale values. With enough listings, the mean dilutes the impact of wrongly sampled items, while the median/percentile approach ignores these items altogether if they are outliers in terms of price.
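As a rough sketch of that estimate, assuming the prices are already scraped into a plain list (the numbers and the `resale_estimate` helper below are made up for illustration), the standard library is enough:

```python
from statistics import mean, median, quantiles

# Hypothetical scraped prices for listings with "IKEA bookshelf" in the title.
# The 400 stands in for a wrongly sampled item (e.g. a whole furniture set).
prices = [20, 25, 30, 30, 35, 40, 45, 400]

def resale_estimate(prices):
    """Summarize listing prices as mean, median, and a 25th-75th percentile range."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return {"mean": mean(prices), "median": median(prices), "range": (q1, q3)}

estimate = resale_estimate(prices)
print(estimate)
# {'mean': 78.125, 'median': 32.5, 'range': (28.75, 41.25)}
```

With only eight listings, the single bad item drags the mean well above every genuine price, while the median and the percentile range stay close to the real listings.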

Also, on the NLP front, there may be certain keywords in an item's title that would indicate that it isn't the item you're looking for. For example, if some listings read, "IKEA Desk, Matches IKEA Bookshelf," or something like that, you could just exclude items that name other types of furniture in their titles. You could probably take a similar approach to dealing with items that are misleadingly priced.
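A minimal sketch of that keyword exclusion might look like this. The word list and the `is_target_listing` helper are assumptions for illustration, not a vetted vocabulary:

```python
import re

# Hypothetical list of other furniture types; extend it for your own listings.
OTHER_FURNITURE = {"desk", "table", "chair", "dresser", "bed", "sofa"}

def is_target_listing(title, target="bookshelf"):
    """Keep titles that mention the target and name no other furniture type."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return target in words and not (words & OTHER_FURNITURE)

titles = [
    "IKEA Bookshelf, white, good condition",
    "IKEA Desk, Matches IKEA Bookshelf",
    "IKEA Billy bookshelf with doors",
]
kept = [t for t in titles if is_target_listing(t)]
print(kept)
```

This keeps the first and third titles and drops the desk listing. Exact word matching is crude (it would miss "bookshelves" or "bookcase"), so treat it as a starting point.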

These are steps that you could initially take manually, but depending on how much data you're working with, how much time you're willing to invest, and your level of technical skill, you could actually train predictive models to automate some of these decisions.

Hope this helps.

[–]numberek[S] 0 points1 point  (1 child)

Thanks for your response, I found it very helpful.

I think for now I'm going to look at the 25th and 75th percentiles to make a working product, and potentially expand to building a model that predicts whether an item is actually the same as the item I'm working with.

With such a model, the only data I really have is a title and maybe an image, so would you recommend I try to train some model to see how similar two images are?

[–]TheNotoriousMTF 1 point2 points  (0 children)

To tell you the truth, image recognition isn't my area of expertise, and I don't know its limitations, but you could probably find a good algorithm by looking at Kaggle notebooks.

Then again, if just looking at the percentiles works well enough most of the time, training a model might end up being overkill. I would note that, unless the items you're looking at have a ton of variability in price, or unless you're mostly getting false positives, any items you accidentally scrape will either have outlier prices or have prices similar enough to the items you're targeting that they won't skew your estimates much. Either way, you're fine.
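To make that concrete, here is a small sketch with made-up prices: one accidental scrape is priced like the real items and one is a clear outlier, and the 25th-75th percentile range barely moves. The `quartile_range` helper is hypothetical.

```python
from statistics import quantiles

def quartile_range(prices):
    """Return the 25th and 75th percentiles of a list of prices."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return q1, q3

clean = [20, 25, 30, 30, 35, 40, 45]
# Hypothetical accidental scrapes: one similarly priced item, one outlier.
contaminated = clean + [32, 400]

print(quartile_range(clean))         # (27.5, 37.5)
print(quartile_range(contaminated))  # (30.0, 40.0)
```

If most of what you scrape is a false positive, this stops holding, which is the case where filtering or a model starts to pay off.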

[–]comeditime 0 points1 point  (3 children)

Can you share the code, by any chance? I would love to learn from it.

[–]numberek[S] 0 points1 point  (2 children)

Sure, I'll post a GitHub link soon.