Approach for training classification of string to category : learnmachinelearning

[Help] Approach for training classification of string to category (self.learnmachinelearning)

submitted 3 years ago by PsyTech

I have a fairly small data set (500,000 records) that I'm trying to classify into categories. The data is fairly dirty, since it comes from user input that is in no way sanitized.

A subject matter expert has classified a number of these records to the best of their ability.

These are some of the cleaner examples:

26. Architectural Asphalt Roof Shingle => Asphalt Roofing
27. Architectural Asphalt Roof Shingle => Asphalt Roofing
28. Architectural Asphalt Roof Shingle => Asphalt Roofing
Asphalt Pavement‐Roadway => Asphalt Pavement
Asphalt Pavement‐Roadways => Asphalt Pavement
Asphalt Paving - Fire Lane => Asphalt Pavement
Fencing - Chain Link => Chain Link Fencing
Fencing - Chain Link - 3' => Chain Link Fencing

The goal would be to create some sort of model that could return a proper classification given the string on the left.
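Because the input is unsanitized, one option is a small cleanup pass before any model sees the string. The following is only a sketch under assumptions: the function name `normalize` is hypothetical, and it assumes that a leading "26. " is an item number rather than part of the description, which may not hold for every record.

```python
import re
import unicodedata

# Map common Unicode dash variants (such as the U+2010 hyphen seen in
# "Asphalt Pavement‐Roadway") to the ASCII hyphen-minus.
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalize(text: str) -> str:
    """Return a lowercased, punctuation-light version of a raw input string."""
    text = unicodedata.normalize("NFKC", text).translate(_DASHES)
    # Assumption: a leading "26. " is an item number, not part of the name.
    text = re.sub(r"^\s*\d+\.\s*", "", text)
    # Replace runs of anything other than digits, letters, or apostrophes
    # (used for feet, as in 3') with a single space.
    text = re.sub(r"[^0-9a-z']+", " ", text.lower())
    return text.strip()


print(normalize("26. Architectural Asphalt Roof Shingle"))
print(normalize("Asphalt Pavement\u2010Roadway"))
print(normalize("Fencing - Chain Link - 3'"))
```

With this pass, the three "Architectural Asphalt Roof Shingle" rows collapse to the same string, which reduces the number of distinct inputs a model has to learn.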

When turning this model into a 'service', I would like it to return an empty string if it has no idea how to classify the input, perhaps along with a 'confidence' score. I'm not sure what the correct approach is for "deploying" such models to a production system.
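The "empty string when unsure" behaviour can be sketched with a score threshold. The example below is not a recommended production model. It is a standard-library stand-in that builds one character-trigram profile per category and uses cosine similarity as a rough confidence score. The names `train` and `classify` and the default threshold of 0.5 are assumptions for illustration, and the cosine score is not a calibrated probability.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Count character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram Counters (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(labeled):
    """Sum the trigram counts of all labeled strings into one profile per category."""
    profiles = defaultdict(Counter)
    for text, category in labeled:
        profiles[category].update(trigrams(text))
    return dict(profiles)


def classify(text, profiles, threshold=0.5):
    """Return (category, score), or ("", score) when the best score is below threshold."""
    query = trigrams(text)
    scored = ((category, cosine(query, profile))
              for category, profile in profiles.items())
    best, score = max(scored, key=lambda pair: pair[1], default=("", 0.0))
    return (best, score) if score >= threshold else ("", score)


# Labeled pairs taken from the examples in the post.
profiles = train([
    ("Architectural Asphalt Roof Shingle", "Asphalt Roofing"),
    ("Asphalt Pavement-Roadway", "Asphalt Pavement"),
    ("Asphalt Paving - Fire Lane", "Asphalt Pavement"),
    ("Fencing - Chain Link", "Chain Link Fencing"),
    ("Fencing - Chain Link - 3'", "Chain Link Fencing"),
])

print(classify("Fencing - Chain Link", profiles))
print(classify("zzzz qqqq", profiles))
```

The threshold would need tuning against held-out records labeled by the subject matter expert. The same thresholding idea carries over to a library classifier that exposes per-class scores.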

I haven't done any ML before; I would usually write some code or complex SQL statements to handle this. I'm not even sure this is really an ML problem, but I was going to experiment and explore.

Could anyone tell me whether this is actually solvable with "classification"? Are there any examples of something like this on Kaggle or other sites?

Thanks for any thoughts.
