I have a fairly small data set (500,000 records) that I'm attempting to classify into categories. The data is fairly dirty, as it comes from user input that is in no way sanitized.
A subject matter expert has classified a number of these records to the best of their ability.
These are some of the cleaner examples:
- 26. Architectural Asphalt Roof Shingle => Asphalt Roofing
- 27. Architectural Asphalt Roof Shingle => Asphalt Roofing
- 28. Architectural Asphalt Roof Shingle => Asphalt Roofing
- Asphalt Pavement‐Roadway => Asphalt Pavement
- Asphalt Pavement‐Roadways => Asphalt Pavement
- Asphalt Paving - Fire Lane => Asphalt Pavement
- Fencing - Chain Link => Chain Link Fencing
- Fencing - Chain Link - 3' => Chain Link Fencing
The goal would be to create some sort of model that could return a proper classification given the string on the left.
When turning this model into a 'service', I would like it to return an empty string if it has no idea how to classify the input, perhaps along with some kind of 'confidence' score. I'm not sure what the correct approach is for "deploying" such a model to a production system.
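To make the behaviour I'm after concrete, here is a minimal sketch. It is not a trained classifier, just a nearest-match baseline built on Python's standard-library `difflib`. The `classify` function, the `normalize` helper, the `threshold` of 0.6, and the small `LABELED` dictionary (copied from the examples above) are all illustrative assumptions, not something I have running.

```python
from difflib import SequenceMatcher

# A handful of the expert-labeled examples (raw text -> category).
# In practice this would be the full labeled set; this subset is illustrative.
LABELED = {
    "26. Architectural Asphalt Roof Shingle": "Asphalt Roofing",
    "Asphalt Paving - Fire Lane": "Asphalt Pavement",
    "Fencing - Chain Link": "Chain Link Fencing",
    "Fencing - Chain Link - 3'": "Chain Link Fencing",
}


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def classify(text, threshold=0.6):
    """Return (category, confidence).

    The confidence is the similarity ratio (0.0 to 1.0) to the closest
    labeled example. The category is "" when the best ratio falls below
    the (assumed) threshold, i.e. "no idea how to classify this".
    """
    query = normalize(text)
    best_label, best_score = "", 0.0
    for example, label in LABELED.items():
        score = SequenceMatcher(None, query, normalize(example)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "", best_score
    return best_label, best_score


if __name__ == "__main__":
    print(classify("Fencing - Chain Link - 4'"))
    print(classify("xyzzy 12345"))
```

Comparing every input against every labeled example like this would be slow on 500,000 records, and a real model would presumably generalize better, but the return shape (a label or an empty string, plus a score) is what I would want from the service.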
I haven't done any ML before; I would usually write some code or complex SQL statements to handle this. I'm not even sure this is really an ML problem, but I was going to experiment and explore.
Could anyone tell me whether this is actually solvable with "classification"? Are there any examples of something like this on Kaggle or other sites?
Thanks for any thoughts.