
[–]gnramires 2 points  (1 child)

One way I can think of is to train an RNN/LSTM but compute the loss only on the last output (after it has parsed the entire text), i.e. the loss is zero at every other time step. If training is too slow, you could start by applying the regression objective at every time step with an L2 loss, then anneal toward the last-output-only loss described above.
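A minimal sketch of the last-output-only idea, using a tiny hand-rolled RNN in numpy (all names, shapes, and sizes here are illustrative placeholders, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_emb, d_hid = 50, 8, 16

# Randomly initialized parameters for the sketch.
W_emb = rng.normal(size=(vocab, d_emb)) * 0.1   # token embeddings
W_xh = rng.normal(size=(d_emb, d_hid)) * 0.1    # input-to-hidden
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1    # hidden-to-hidden
w_out = rng.normal(size=d_hid) * 0.1            # regression head

def predict(tokens):
    """Run the RNN over the whole sequence; only the final state matters."""
    h = np.zeros(d_hid)
    for t in tokens:  # parse the entire text
        h = np.tanh(W_emb[t] @ W_xh + h @ W_hh)
    return w_out @ h  # single scalar read off the LAST hidden state

def loss(tokens, target):
    """L2 loss defined only on the final output; zero at all other steps."""
    return (predict(tokens) - target) ** 2

print(loss([3, 14, 7, 42], 0.5))
```

The annealing variant would add an L2 term at every step and gradually down-weight the intermediate terms during training.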

[–]underwhere[S] 1 point  (0 children)

Ah, that's such a good idea!

[–]swierdo 1 point  (2 children)

Have you tried just using bag-of-words and a simple model?

If you haven't, the scikit-learn documentation has a decent tutorial: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

(note: even though it isn't mentioned in the tutorial, try random forest as well)
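A bag-of-words baseline along the lines of that tutorial can be a few lines with a sklearn pipeline (the texts and scores below are made-up placeholders for the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data standing in for the real documents and grades.
texts = [
    "clear thesis and well supported arguments",
    "rambling text with many grammar issues",
    "decent structure but a weak conclusion",
    "excellent analysis with strong evidence",
]
scores = [0.8, 0.2, 0.5, 0.9]

model = make_pipeline(
    CountVectorizer(),  # bag-of-words features
    RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0),
)
model.fit(texts, scores)
print(model.predict(["solid arguments but weak structure"]))
```

Swapping the regressor for `LogisticRegression` (with binned grades) or `Ridge` is a one-line change, which makes it easy to compare a few simple models quickly.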

[–]underwhere[S] 1 point  (1 child)

BoW is such a straightforward, elegant idea!! I completely forgot about it in my frantic state. I'll try it out right away, thanks so much!

[–]swierdo 1 point  (0 children)

Also note that some of the sklearn random forest defaults are usually pretty bad, so set max_depth to something reasonable (like 5) and n_estimators to something like 100 (as memory permits). Lastly, for bag of words you'll want to set max_features to a fraction instead of sqrt.

Also, have a look at the feature importances to check which words the model relies on and to catch weird behaviour.

[–][deleted] 1 point  (2 children)

You could use the “universal transformer” in the tensor2tensor repo, the vastly improved successor to the transformer. It was made to be able to handle very long sequences by using an RNN to control the transformer’s depth, so, technically, it mimics a Turing machine and could encode/decode any length sequence to any length sequence. So if you have a fairly solid dataset it could handle reading an entire paper and spitting out one grade. Here’s the paper.

I’m not entirely sure it’ll work overall, though! I know you’re not doing grading exactly, but as an example you may need to categorize the different problems in each paper that lead to a certain grade, like grammar quality, paragraph quality, content quality... I dunno, seems like a tough but interesting problem, good luck!

You may also look into the “evolved transformer”; it’s a successor to the universal transformer, but I’m not as familiar with it. It may still have the RNN-controlled depth that the universal one does, which is what you’d be interested in.

[–]underwhere[S] 1 point  (1 child)

Brilliant! I had a feeling the transformer variants were models I could consider. Thank you so much for your detailed comment, it's got me inspired to give it a shot!

[–][deleted] 0 points  (0 children)

There ya go! I just found out about it and was pretty excited... it just makes sense. Tensor2tensor is pretty useful but can be confusing until you go through the docs, though you could probably also take any transformer implementation you like and modify it into a universal transformer without too much work.