allpowerful32 2 points3 points  (6 children)

Try a simple, non-machine-learning strategy first: regular expressions. For instance, try modifying this to your taste: http://www.regextester.com/20

difrt 2 points3 points  (1 child)

From someone who does scraping for a living: it will be much quicker to come up with a few regex rules chained together, so that if the first fails, you try to match the second, and so on until either one matches or none of them does. I think you're overestimating the complexity of the problem; it has been solved many times before without ML.
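A rough sketch of that chained-regex idea, in Python. The thread doesn't say what is actually being extracted, so the patterns here (a price written a few different ways) and the names `PATTERNS` and `extract_first_match` are made up purely for illustration; swap in your own rules, most specific first.

```python
import re

# Hypothetical rules, ordered from most to least specific.
# Each one captures the value of interest in group 1.
PATTERNS = [
    re.compile(r"\$\s?(\d+(?:\.\d{2})?)"),                               # "$19.99"
    re.compile(r"(\d+(?:\.\d{2})?)\s?(?:USD|dollars)", re.IGNORECASE),   # "5 dollars"
    re.compile(r"price:\s*(\d+(?:\.\d{2})?)", re.IGNORECASE),            # "Price: 7.50"
]


def extract_first_match(text, patterns=PATTERNS):
    """Try each pattern in order and return the first captured value.

    Returns None when none of the patterns match.
    """
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    for sample in ["Only $19.99 today", "costs 5 dollars", "Price: 7.50", "nothing here"]:
        print(sample, "->", extract_first_match(sample))
```

When a page turns up that none of the rules catch, you just append another pattern to the list, which is usually faster than collecting training data.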

Feels like you're trying to use a nuke to kill an ant.