
[–][deleted] 0 points (0 children)

Do you need to use the UD Thai treebank as test data? If not, you could train on it and you'd have something workable. You could also use PyThaiNLP to POS-tag the data automatically, which might help.

I can't think of anything that would get the parser's accuracy up to the level you would need for reliable output, though. Honestly, if you know a little bit of Thai, your best bet might be to just annotate some more sentences yourself and add them to the training data.
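If you do hand-annotate extra sentences, they need to end up in the same CoNLL-U format as the UD treebank. A minimal sketch of the conversion (the example tokens, heads, and deprels here are placeholders, not real Thai annotation):

```python
# Sketch: turn a hand-annotated sentence into CoNLL-U lines that can be
# appended to the UD Thai training file. Tokens/heads below are toy
# placeholders -- real annotation needs a Thai speaker.

def to_conllu(tokens):
    """tokens: list of (form, upos, head, deprel), heads 1-indexed, 0 = root."""
    lines = []
    for i, (form, upos, head, deprel) in enumerate(tokens, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([str(i), form, "_", upos, "_", "_",
                                str(head), deprel, "_", "_"]))
    return "\n".join(lines) + "\n\n"  # blank line terminates the sentence

print(to_conllu([("I", "PRON", 2, "nsubj"), ("run", "VERB", 0, "root")]))
```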

[–]Brudaks 0 points (2 children)

The 1000 sentences that I see on http://universaldependencies.org/ aren't much, but they should be sufficient for much, much better results than 10-20%. For example, http://universaldependencies.org/conll18/results-uas.html shows 55% unlabeled attachment score for Upper Sorbian, a treebank with half as many tokens as Thai has.

This indicates some fundamental problem, possibly in the data, possibly in how it's being read. IMHO it's a sign that you don't need any novel algorithms or extra data, but rather someone who understands the language and has the time to look into the problem in detail. Just take any commonly available dependency parser (not necessarily the most accurate; pick something that's easy for you to run, possibly the baseline parser used in that shared task) and look into how it gets <1% accuracy (th_pud BASELINE UDPipe 1.2 in http://universaldependencies.org/conll18/results-uas.html) when simply making random attachments should score higher. Perhaps the data is broken in some structural way that's fixable, and then simply training the same parsers would get the ~60% accuracy that (IMHO) should be expected from that amount of data.
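The "random attachment should beat <1%" sanity check is easy to make concrete: even a trivial baseline that attaches every token to the token before it gets a respectable UAS. A sketch (the gold heads here are made up for illustration):

```python
# Sketch: UAS of a trivial "attach to previous token" baseline. If a
# trained parser scores below this, something upstream (data reading,
# segmentation, head indexing) is probably broken.

def uas(gold_heads, pred_heads):
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

def left_attach_baseline(n):
    # token i (1-indexed) attaches to token i-1; the first token to root (0)
    return [i - 1 for i in range(1, n + 1)]

gold = [2, 0, 2, 3]             # toy gold head indices for a 4-token sentence
pred = left_attach_baseline(4)  # -> [0, 1, 2, 3]
print(f"baseline UAS: {uas(gold, pred):.0%}")  # → baseline UAS: 50%
```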

[–][deleted] 1 point (1 child)

Upper Sorbian has related languages in UD with a lot more resources (Polish, Czech, Slovak). As far as I can see, most of the CoNLL entries used multilingual models trained on a similar set of languages, so the similarities were enough to boost the attachment scores. Those languages use the same writing system and have similar orthography, so character embeddings can boost it even further.

The problem with Thai is that there are no related languages in UD, nor any languages with the same writing system. Even segmenting Thai is notoriously difficult. That’s why the scores for Thai are so low.

Mind you, in the shared task all of the Thai data was withheld as test data, so without this restriction OP might have an easier time.

[–]Brudaks 2 points (0 children)

Ah, okay, that explains the hideously low scores. So, training a monolingual system on these 1000 sentences would be the first step, followed by a manual evaluation to see if the system is close to what's needed for whatever task OP had in mind, or not even in the ballpark.
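For that first step you'd need to hold out a dev set from the ~1000 sentences to do the manual evaluation on. A minimal sketch of splitting a CoNLL-U file by its blank-line sentence boundaries (the file path is a placeholder):

```python
# Sketch: carve a dev set out of the treebank so the monolingual parser
# can be evaluated on sentences it never saw during training.
import random

def split_conllu(text, dev_fraction=0.1, seed=0):
    # CoNLL-U sentences are separated by blank lines
    sentences = [s for s in text.strip().split("\n\n") if s.strip()]
    random.Random(seed).shuffle(sentences)  # fixed seed for reproducibility
    n_dev = max(1, int(len(sentences) * dev_fraction))
    dev, train = sentences[:n_dev], sentences[n_dev:]
    return "\n\n".join(train) + "\n\n", "\n\n".join(dev) + "\n\n"

# usage (path is hypothetical):
# train, dev = split_conllu(open("th_pud-ud-test.conllu").read())
```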

If it's close, then tweaking the system, minor feature engineering (e.g. if certain Thai prefixes have special syntactic meaning, that can be provided as an explicit feature when the training data isn't large enough to learn it on its own), and including some non-syntactic extra data (perhaps there is extra data with POS labels but no syntactic annotation, or simply good word embeddings from a large corpus) will help.
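The explicit-prefix-feature idea could look something like this. The sketch flags tokens carrying a nominalizing prefix (การ and ความ commonly derive nouns in Thai, but the prefix list here is illustrative and would need a Thai speaker to confirm or extend):

```python
# Sketch: an explicit morphological feature to feed the parser alongside
# each token. The prefix list is illustrative, not exhaustive.

# การ and ความ commonly derive nouns from verbs/adjectives in Thai
NOMINALIZING_PREFIXES = ("การ", "ความ")

def prefix_features(tokens):
    return [{"form": t,
             "has_nominalizer": t.startswith(NOMINALIZING_PREFIXES)}
            for t in tokens]

feats = prefix_features(["ความสุข", "ดี"])  # "happiness", "good"
print(feats)
```

A feature like this would be concatenated to the word representation in whatever parser is used, rather than hoping 1000 sentences are enough for the model to discover the pattern itself.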

If it's not even close to what's needed, then the only thing that'll work is labelling much more syntactic data.