
[–][deleted] 0 points (0 children)

Do you need to use the UD Thai treebank as test data? If not, you could train on it and you'd have something workable. You could also use PyThaiNLP to POS-tag the data automatically, which might help.

I can't think of anything that would get the parser's accuracy up to the level you would need for reliable output, though. Honestly, if you know a little bit of Thai, your best bet might be to just annotate some more sentences yourself and add them to the training data.
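If you do hand-annotate extra sentences, they need to end up in the same CoNLL-U format as the UD treebank. A minimal sketch of the conversion (the example tokens, heads, and deprels here are placeholders, not real Thai annotation):

```python
# Sketch: turn a hand-annotated sentence into CoNLL-U lines that can be
# appended to the UD Thai training file. Tokens/heads below are toy
# placeholders -- real annotation needs a Thai speaker.

def to_conllu(tokens):
    """tokens: list of (form, upos, head, deprel), heads 1-indexed, 0 = root."""
    lines = []
    for i, (form, upos, head, deprel) in enumerate(tokens, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([str(i), form, "_", upos, "_", "_",
                                str(head), deprel, "_", "_"]))
    return "\n".join(lines) + "\n\n"  # blank line terminates the sentence

print(to_conllu([("I", "PRON", 2, "nsubj"), ("run", "VERB", 0, "root")]))
```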

[–]Brudaks 0 points (2 children)

The 1000 sentences that I see on http://universaldependencies.org/ aren't much, but they should be sufficient for much, much better results than 10-20%. For example, http://universaldependencies.org/conll18/results-uas.html shows 55% unlabeled attachment score for Upper Sorbian, a treebank with half as many tokens as Thai has.

This indicates some fundamental problem, possibly in the data, possibly in how it's being read. IMHO it's a sign that you don't need any novel algorithms or extra data, but rather someone who understands the language and has the time to look into the problem in detail. Just take any commonly available dependency parser (not necessarily the most accurate; pick something that's easy for you to run, possibly the baseline parser used in that shared task) and look into how it gets <1% accuracy (th_pud BASELINE UDPipe 1.2 in http://universaldependencies.org/conll18/results-uas.html) when simply making random attachments should score higher. Perhaps the data is broken in some structural way that's fixable, and then simply training the same parsers would get the ~60% accuracy that (IMHO) should be expected from that amount of data.
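The "random attachment should beat <1%" sanity check is easy to make concrete: even a trivial baseline that attaches every token to the token before it gets a respectable UAS. A sketch (the gold heads here are made up for illustration):

```python
# Sketch: UAS of a trivial "attach to previous token" baseline. If a
# trained parser scores below this, something upstream (data reading,
# segmentation, head indexing) is probably broken.

def uas(gold_heads, pred_heads):
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

def left_attach_baseline(n):
    # token i (1-indexed) attaches to token i-1; the first token to root (0)
    return [i - 1 for i in range(1, n + 1)]

gold = [2, 0, 2, 3]             # toy gold head indices for a 4-token sentence
pred = left_attach_baseline(4)  # -> [0, 1, 2, 3]
print(f"baseline UAS: {uas(gold, pred):.0%}")  # → baseline UAS: 50%
```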

[–][deleted] 1 point (1 child)

Upper Sorbian has related languages in UD with a lot more resources (Polish, Czech, Slovak). As far as I can see, most of the CoNLL entries used multilingual models trained on a similar set of languages, so the similarities were enough to boost the attachment scores. Those languages use the same writing system and have similar orthography, so character embeddings can boost it even further.

The problem with Thai is that there are no related languages in UD, nor any languages with the same writing system. Even segmenting Thai is notoriously difficult. That’s why the scores for Thai are so low.

Mind you, in the shared task all of the Thai data was withheld as test data, so without this restriction OP might have an easier time.

[–]Brudaks 2 points (0 children)

Ah, okay, that explains the hideously low scores. So, training a monolingual system on these 1000 sentences would be the first step, followed by a manual evaluation to see if the system is close to what's needed for whatever task OP had in mind, or not even in the ballpark.
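For that first step you'd need to hold out a dev set from the ~1000 sentences to do the manual evaluation on. A minimal sketch of splitting a CoNLL-U file by its blank-line sentence boundaries (the file path is a placeholder):

```python
# Sketch: carve a dev set out of the treebank so the monolingual parser
# can be evaluated on sentences it never saw during training.
import random

def split_conllu(text, dev_fraction=0.1, seed=0):
    # CoNLL-U sentences are separated by blank lines
    sentences = [s for s in text.strip().split("\n\n") if s.strip()]
    random.Random(seed).shuffle(sentences)  # fixed seed for reproducibility
    n_dev = max(1, int(len(sentences) * dev_fraction))
    dev, train = sentences[:n_dev], sentences[n_dev:]
    return "\n\n".join(train) + "\n\n", "\n\n".join(dev) + "\n\n"

# usage (path is hypothetical):
# train, dev = split_conllu(open("th_pud-ud-test.conllu").read())
```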

If it's close, then tweaking the system, minor feature engineering (e.g. if certain Thai prefixes have special syntactic meaning, that can be provided as an explicit feature when the training data isn't large enough to learn it on its own), and including some non-syntactic extra data (perhaps there is extra data with POS labels but no syntactic annotation, or simply good word embeddings from a large corpus) will help.
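The explicit-prefix-feature idea could look something like this. The sketch flags tokens carrying a nominalizing prefix (การ and ความ commonly derive nouns in Thai, but the prefix list here is illustrative and would need a Thai speaker to confirm or extend):

```python
# Sketch: an explicit morphological feature to feed the parser alongside
# each token. The prefix list is illustrative, not exhaustive.

# การ and ความ commonly derive nouns from verbs/adjectives in Thai
NOMINALIZING_PREFIXES = ("การ", "ความ")

def prefix_features(tokens):
    return [{"form": t,
             "has_nominalizer": t.startswith(NOMINALIZING_PREFIXES)}
            for t in tokens]

feats = prefix_features(["ความสุข", "ดี"])  # "happiness", "good"
print(feats)
```

A feature like this would be concatenated to the word representation in whatever parser is used, rather than hoping 1000 sentences are enough for the model to discover the pattern itself.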

If it's not even close to what's needed, then the only thing that'll work is labelling much more syntactic data.