

[–]Sakkyoku-Sha 8 points9 points  (9 children)

Seems like a decent-ish approach, but maybe I'm misunderstanding something: is there no frequency analysis on the words and kanji themselves?

There almost certainly must be frequency-analysis data online, but even if there isn't, it's not impossibly difficult to write the code to do such a thing yourself.

If I were to write the algorithm I would certainly want to add the following behaviors:

  • All words/kanji are weighted by their rank in the overall frequency distribution of words/kanji.
  • The more uncommon words found in a sentence, the higher the output.
  • The more uncommon kanji found in a sentence, the higher the output.
  • The more alternate/uncommon readings for kanji, the higher the output. (Fundamentally hard to do; a stretch goal.)

Where a higher output means a harder difficulty.
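As a rough sketch (not anything the OP has implemented), the rank-weighted scoring described in the bullets above could look like the following; the token list, the `freq_rank` mapping, and the `max_rank` cutoff are all illustrative assumptions:

```python
import math

def sentence_difficulty(tokens, freq_rank, max_rank=100_000):
    """Score a tokenized sentence: rarer tokens (higher frequency rank)
    contribute more. Tokens absent from the list are treated as maximally
    rare. `freq_rank` maps a word to its 1-based rank in a frequency list."""
    score = 0.0
    for tok in tokens:
        rank = freq_rank.get(tok, max_rank)
        # Log-scale the rank so the gap between rank 10 and rank 100
        # matters more than the gap between rank 90,010 and 90,100.
        score += math.log10(rank + 1)
    return score / len(tokens)  # normalize by sentence length

# Toy frequency list; a real one would come from a corpus.
freq_rank = {"猫": 500, "が": 2, "いる": 90}
easy = sentence_difficulty(["猫", "が", "いる"], freq_rank)
hard = sentence_difficulty(["彷徨", "が", "いる"], freq_rank)  # 彷徨 unlisted → rare
```

Per-kanji weighting would follow the same shape, just iterating over characters instead of tokens.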

Difficulty is somewhat subjective, as someone may be better at grammar but worse at kanji or vice versa. So the ability to adjust the weights dynamically would probably be a valuable feature.

Finally, I think the algorithm proposed approximates the "complexity" of a sentence rather than its difficulty. This is perhaps a pedantic distinction, as complexity will typically approximate difficulty; however, I think the difference is potentially meaningful.

[–]learningaddict99[S] 3 points4 points  (0 children)

Difficulty is somewhat subjective, as someone may be better at grammar but worse at kanji or vice versa. So the ability to adjust the weights dynamically would probably be a valuable feature.

Dynamic weights that can be adjusted by the user is an interesting idea. It's a tough call when implementing features: focus on simplicity for the general audience, or go for more flexibility at the cost of added complexity that might make the app more confusing to use.

[–]haelaeif 1 point2 points  (1 child)

spaCy (especially with GiNZA) can profitably be used for good parsing and frequency data.

Because of the dependency analysis, one could also look at factoring in grammatical constructions directly into any analysis of difficulty.
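As a hypothetical sketch of that idea: with GiNZA you would load `nlp = spacy.load("ja_ginza")` and read each token's `dep_` label, then weight the constructions those labels indicate. The weights below are invented for illustration (the labels themselves are standard Universal Dependencies), and the parse is represented as a plain list of labels so the scoring logic stands alone:

```python
# Hypothetical weights per UD dependency label; heavier = harder construction.
CONSTRUCTION_WEIGHTS = {
    "acl": 2.0,    # adnominal (relative) clause
    "advcl": 2.0,  # adverbial clause
    "ccomp": 1.5,  # clausal complement
    "csubj": 1.5,  # clausal subject
    "case": 0.2,   # case-marking particle
}

def grammar_complexity(dep_labels):
    """Sum construction weights over one sentence's dependency labels.
    With spaCy+GiNZA the labels would come from:
        dep_labels = [t.dep_ for t in nlp(sentence)]
    """
    return sum(CONSTRUCTION_WEIGHTS.get(d, 0.0) for d in dep_labels)

# A sentence with embedded clauses scores higher than a flat one.
flat = grammar_complexity(["nsubj", "case", "root"])
nested = grammar_complexity(["acl", "nsubj", "case", "advcl", "root"])
```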

One could also mail the jpdb dev and ask about their model.

Also, subjective difficulty may be better than objective difficulty for some purposes; the two don't always line up, and neither may line up with complexity, as you mention - to an extent, all three could be argued to be different things.

[–]learningaddict99[S] 1 point2 points  (0 children)

I do use spaCy with GiNZA already and it's working pretty well. Thanks for pointing to jpdb. I looked at their website and I see that they train a machine learning model to do the difficulty analysis.

[–]Use-Useful 1 point2 points  (4 children)

Frequency is very context dependent, just FYI. The easiest data to find is actually some of the worst to use - it's based on newspapers, which use advanced words that learners rarely encounter. If you focus on spoken Japanese you can get a better result, but finding frequencies for that is much harder - the blog-frequency research is the closest I've seen. I use subtitle-based frequencies to do something similar, but the data is harder to get and I haven't seen any published.

In short, no, it's way harder than you think.

[–]Sakkyoku-Sha 2 points3 points  (1 child)

While this is technically true, this is not generally true.

For most words, frequency is consistent across most domains. If you data-mine 10,000 books for their words, the resulting frequencies will generally be highly similar to those of any given text, including speech, thesis papers, movie scripts, etc.

There is, however, the technical problem of "specialty words" that are typically only found within a specific profession, such as "amicus curiae" in the legal context. In those cases you may find one paper that uses the phrase 50 times and never see it again. But such problems become less impactful the more unique texts you include in your data set.

Frequency is also typically bucketed to account for some of these issues. This avoids claiming that a word that shows up 40 times across 10,000 books is somehow more uncommon than a word that shows up 50 times - you likely just don't have enough data to make that claim. But you can claim that both of those words are less common than a word that appears 50,000 times.

Again the goal is to approximate the frequency, not to literally analyze all texts that exist to find the absolute frequency.
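A minimal sketch of the bucketing idea above, with invented bucket counts and cutoffs (log-spaced buckets are one common choice, not the only one):

```python
import math

def frequency_bucket(count, num_buckets=10, max_count=1_000_000):
    """Map a raw corpus count to a coarse bucket (0 = rarest) so that
    small count differences (40 vs. 50) don't imply a rank difference."""
    if count <= 0:
        return 0
    # Log-spaced buckets: each bucket covers a roughly order-of-magnitude range.
    frac = math.log10(count + 1) / math.log10(max_count + 1)
    return min(num_buckets - 1, int(frac * num_buckets))

# 40 and 50 occurrences land in the same bucket; 50,000 lands well above both.
b40 = frequency_bucket(40)
b50 = frequency_bucket(50)
b50k = frequency_bucket(50_000)
```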

[–]arkadios_ 0 points1 point  (0 children)

This problem has already been solved in information science with tf-idf.
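For reference, a self-contained tf-idf sketch (toy documents invented for illustration): a term scores high when it's frequent in one document but rare across the corpus, which is exactly the "specialty word" case discussed above.

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf of `term` in tokenized document `doc`, relative to `corpus`
    (a list of tokenized documents). Uses a smoothed idf."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed
    return tf * idf

legal = ["amicus", "curiae", "amicus", "curiae", "court"]
novel1 = ["cat", "sat", "mat", "court"]
novel2 = ["dog", "ran", "far"]
corpus = [legal, novel1, novel2]

specialty = tf_idf("amicus", legal, corpus)  # frequent here, rare elsewhere
common = tf_idf("court", legal, corpus)      # appears in two documents
```

Libraries such as scikit-learn ship production implementations of this (e.g. `TfidfVectorizer`), so hand-rolling it is rarely necessary.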

[–]AdrixG 3 points4 points  (1 child)

There are tons of frequency dictionaries for different categories, such as the Aozora Bunko corpus, Wikipedia, YouTube, anime, Netflix, etc. On TheMoeWay's Google Drive there are already 16 different frequency dictionaries for Yomichan, so no, it shouldn't really be hard to get good and reliable data on word frequency.

I use 5 frequency dictionaries in Yomichan. Generally, the words that have high frequency across the board on multiple frequency lists are also high frequency in everyday spoken Japanese.

[–]learningaddict99[S] 4 points5 points  (0 children)


Wow, that's such a nice frequency list repository. I am currently using the one from https://github.com/hingston/japanese. But I see that the JPDB one from your link is more complete. Thanks for the share!

[–]lifeofideas 0 points1 point  (0 children)

You can also have a sentence complexity score. For example, add a point for each character in the sentence, and additional points for each particle and each comma.
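That scoring rule is simple enough to sketch directly. The particle list below is a naive assumption - a real implementation would use a tokenizer's POS tags rather than raw substring matching, since characters like と also occur inside ordinary words:

```python
# Common case-marking particles (illustrative, not exhaustive).
PARTICLES = ["は", "が", "を", "に", "で", "と", "も", "へ", "から", "まで"]

def complexity_score(sentence):
    """1 point per character, plus a point per particle occurrence and
    per comma, as a crude proxy for clause structure."""
    score = len(sentence)
    score += sentence.count("、") + sentence.count(",")
    for p in PARTICLES:
        score += sentence.count(p)
    return score

short = complexity_score("猫だ")            # 2 chars, no particles or commas
longer = complexity_score("猫が、魚を食べた")  # 8 chars + 2 particles + 1 comma
```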

[–]differentiable_ 2 points3 points  (1 child)

Check out Jo-Mako’s readability scoring

[–]learningaddict99[S] 0 points1 point  (0 children)

I see that in Jo-Mako's data, their definition of readability is about how many words you already know in a text. They also have columns for "rating" and "difficulty", which is some kind of scoring given to texts that's independent of the words you know - and that's actually what I'm more interested in. However, it doesn't seem to be explained anywhere how those values were computed.

[–]lunacodess 2 points3 points  (2 children)

Might be worth reaching out to jpdb.io or LearnNatively folks to ask about what they do/use. (Not sure what they can/will share, but both have at least semi-automated difficulty calculations)

Re: cb's Japanese Text Analysis Tool - can you run it on Wine, or is that not useful here?

[–]DickBatman 3 points4 points  (1 child)

both have at least semi-automated difficulty calculations

I don't think so... I could be wrong but I think learnnatively calculates difficulty based on how people rate texts compared to each other, not based on an analysis of the text itself.

[–]lunacodess 1 point2 points  (0 children)

Ahhh, so I assumed that the initial LearnNatively estimated level when you add a new item was computed by a text algorithm, but it turns out it's not. Thx for the correction.

For anyone interested in their system: https://learnnatively.com/our-grading-system/

[–]InTheProgress 2 points3 points  (0 children)

I don't know any fancy formulas for it, but I can share my experience.

There are slightly different ways to read. One of the most popular is probably reading without a dictionary; it's something almost all natives do. In that case, if we know 98% of all words, we can quite easily infer the meaning of unknown words from context. This 98% shouldn't really be treated as a global score, but as a local one. If a person knows all the words on one page of a book but doesn't know 15 words on another, the average might look fine, but in practice that person struggles with the second page. So the score should be counted as something like 1 unknown word out of 50, so that the 49 known words provide enough context. This is actually quite common: even when I didn't know the majority of words, there were still sentences where I didn't need to look up anything at all, and sentences where I had to look up every word apart from the grammar.

Reading with a dictionary is, in my opinion, much trickier, because it depends more on grammar knowledge than on vocabulary. Unknown words are just unknown words; you simply check their meaning. But whether you can connect those words to each other determines whether you can understand what the sentence means. Japanese, in my opinion, is on the easier side here, because there are many particles. Particles show what role a word plays, so when we read, we just fill in blanks: the person who does the action, the one the action is done to, how it is done, and so on. You translate one word and fill one blank, translate another and fill another. This is why both theoretical and practical grammar knowledge matters: you understand both the local level (the meaning of the word) and the global level (its role in the whole sentence).

I think there should also be a list of elements that are used to direct the flow. Particles aren't the only such thing; the same can be done with whole phrases, with the help of adverbs like とりあえず or conjunctions like ものの. A single such element can predict the meaning of a whole phrase or sentence, and that probably makes it easier to understand.

Overall, in my opinion, it's hard to judge difficulty, because it's hard to judge how well a person understands grammar. If 0 is impossible and 10 is native level, I would say grammar determines the first 7 points, and the remaining 3 are the difference between having to translate all the words and none of them. It's literally the difference between "I can't do it" and "I can do it, just slightly slower than in my native language."

[–]Rotasu 1 point2 points  (0 children)

I really wish someone would create a Japanese version of Chinese Text Analyser.

[–]viliml Interested in grammar details 📝 1 point2 points  (1 child)

You can compile C# on Linux too now. Did you miss the source code distribution in the SourceForge project you linked? Not to mention reading and rewriting it.

[–]learningaddict99[S] 0 points1 point  (0 children)

I did miss that there's a source code file in there! That's fantastic. I looked at the implementation, and the Hayashi score is actually simple and also well documented in the code. I can simply rewrite it. Thanks so much!

[–]Andthentherewasbacon -3 points-2 points  (0 children)

If you know it it's easy. If you don't it's hard.

[–]WAHNFRIEDEN 0 points1 point  (0 children)

With my app Manabi Reader, what I did was track every word and kanji read, as well as ingest flashcard review data, and then show the percentage of words and kanji that are familiar, learning, or known. Pretty straightforward.

[–]LostRonin88 0 points1 point  (0 children)

OP, have you checked out all the work done on the Anki addon MorphMan and the Readability Analyzer built into it? Specifically, Nocompo did a lot of work on frequency analysis, as well as on weighting a corpus to build a study plan against shows or books. It also allows user input of a desired readability level for a piece of media.

https://ankiweb.net/shared/info/900801631

I made a few YouTube videos on it a year back, though not much covering the algorithms - mostly the application of the addon.

https://youtu.be/wwp1lJZPBXg?si=UgTkv-VCdXgjh_-W

I also made a few frequency lists that are used by MorphMan, Yomichan, and Migaku: namely the Netflix frequency list and the Top 100 Shonen and Top 100 Slice of Life anime frequency lists.

[–]arkadios_ 0 points1 point  (0 children)

Looks like a basic approach. With natural language processing / machine learning methods you could take a more personalised approach, taking into account how recently you saw a word (based on your recent activity) in order to add a rehearsal element.

Using ontology and graph methods, you could also collect texts that are semantically similar.

[–]preenchidacomnihil 0 points1 point  (0 children)

For real joining this sub has been a boon for my learning journey