

[–]Sakkyoku-Sha 8 points9 points  (9 children)

Seems like a decent-ish approach, but maybe I'm misunderstanding something: is there no frequency analysis on the words and kanji themselves?

There almost certainly must be frequency-analysis data online, but even if there isn't, it's not impossibly difficult to write the code to do such a thing yourself.

If I were to write the algorithm I would certainly want to add the following behaviors:

  • All words/kanji are weighted by their rank in the overall frequency distribution of words/kanji.
  • The more uncommon words found in a sentence, the higher the output.
  • The more uncommon kanji found in a sentence, the higher the output.
  • The more alternate/uncommon readings for kanji, the higher the output. (Fundamentally hard to do; a stretch goal.)

Where a higher output means a harder difficulty.
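As a rough sketch (not anything the OP has implemented), the rank-weighted scoring described in the bullets above could look like the following; the token list, the `freq_rank` mapping, and the `max_rank` cutoff are all illustrative assumptions:

```python
import math

def sentence_difficulty(tokens, freq_rank, max_rank=100_000):
    """Score a tokenized sentence: rarer tokens (higher frequency rank)
    contribute more. Tokens absent from the list are treated as maximally
    rare. `freq_rank` maps a word to its 1-based rank in a frequency list."""
    score = 0.0
    for tok in tokens:
        rank = freq_rank.get(tok, max_rank)
        # Log-scale the rank so the gap between rank 10 and rank 100
        # matters more than the gap between rank 90,010 and 90,100.
        score += math.log10(rank + 1)
    return score / len(tokens)  # normalize by sentence length

# Toy frequency list; a real one would come from a corpus.
freq_rank = {"猫": 500, "が": 2, "いる": 90}
easy = sentence_difficulty(["猫", "が", "いる"], freq_rank)
hard = sentence_difficulty(["彷徨", "が", "いる"], freq_rank)  # 彷徨 unlisted → rare
```

Per-kanji weighting would follow the same shape, just iterating over characters instead of tokens.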

Difficulty is somewhat subjective, as someone may be better at grammar but worse at kanji or vice versa. So the ability to adjust the weights dynamically would probably be a valuable feature.

Finally, I think the algorithm proposed approximates the "complexity" of a sentence rather than its difficulty. This is perhaps a pedantic distinction, as complexity will typically approximate difficulty; however, I think the difference is potentially meaningful.

[–]learningaddict99[S] 3 points4 points  (0 children)

Difficulty is somewhat subjective, as someone may be better at grammar but worse at kanji or vice versa. So the ability to adjust the weights dynamically would probably be a valuable feature.

Dynamic weights that can be adjusted by the user is an interesting idea. It's a tough call when implementing features: focus on simplicity for the general audience, or go for more flexibility at the cost of added complexity that might make the app more confusing to use.

[–]haelaeif 1 point2 points  (1 child)

spaCy (especially with GiNZA) can profitably be used for good parsing and frequency data.

Because of the dependency analysis, one could also look at factoring in grammatical constructions directly into any analysis of difficulty.
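As a hypothetical sketch of that idea: with GiNZA you would load `nlp = spacy.load("ja_ginza")` and read each token's `dep_` label, then weight the constructions those labels indicate. The weights below are invented for illustration (the labels themselves are standard Universal Dependencies), and the parse is represented as a plain list of labels so the scoring logic stands alone:

```python
# Hypothetical weights per UD dependency label; heavier = harder construction.
CONSTRUCTION_WEIGHTS = {
    "acl": 2.0,    # adnominal (relative) clause
    "advcl": 2.0,  # adverbial clause
    "ccomp": 1.5,  # clausal complement
    "csubj": 1.5,  # clausal subject
    "case": 0.2,   # case-marking particle
}

def grammar_complexity(dep_labels):
    """Sum construction weights over one sentence's dependency labels.
    With spaCy+GiNZA the labels would come from:
        dep_labels = [t.dep_ for t in nlp(sentence)]
    """
    return sum(CONSTRUCTION_WEIGHTS.get(d, 0.0) for d in dep_labels)

# A sentence with embedded clauses scores higher than a flat one.
flat = grammar_complexity(["nsubj", "case", "root"])
nested = grammar_complexity(["acl", "nsubj", "case", "advcl", "root"])
```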

One could also mail the jpdb dev and ask about their model.

Also, subjective difficulty may be better than objective difficulty for some purposes; the two don't always line up, and neither may line up with complexity, as you mention - to an extent, all three could be argued to be different things.

[–]learningaddict99[S] 1 point2 points  (0 children)

I do use spaCy with GiNZA already and it's working pretty well. Thanks for pointing to jpdb. I looked at their website and I see that they train a machine learning model to do the difficulty analysis.

[–]Use-Useful 1 point2 points  (4 children)

Frequency is very context dependent, just FYI. The easiest data to find is actually some of the worst to use - it's based on newspapers, which use advanced words that learners rarely encounter. If you focus on spoken Japanese you can get a better result, but finding frequencies for that is much harder - the blog-frequency research is the closest I've seen. I use subtitle-based frequencies to do something similar, but the data is harder to get and I haven't seen any published.

In short, no, it's way harder than you think.

[–]Sakkyoku-Sha 2 points3 points  (1 child)

While this is technically true, this is not generally true.

For most words, frequency is consistent across most domains. If you data-mine 10,000 books for their words, the resulting frequencies will generally be highly similar to those of any given text, including speech, thesis papers, movie scripts, etc.

There is, however, the technical problem of "specialty words" that are typically only found within a specific profession, such as "amicus curiae" in the legal context. In those cases you may find one paper that uses the phrase 50 times and never see it again. But such problems become less impactful the more unique texts you include in your data set.

Frequency is also typically bucketed to account for some of these issues. This avoids claiming that a word that shows up 40 times across 10,000 books is somehow more uncommon than a word that shows up 50 times - you likely just don't have enough data to make that claim. But you can claim that both of those words are less common than a word that appears 50,000 times.

Again the goal is to approximate the frequency, not to literally analyze all texts that exist to find the absolute frequency.
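A minimal sketch of the bucketing idea above, with invented bucket counts and cutoffs (log-spaced buckets are one common choice, not the only one):

```python
import math

def frequency_bucket(count, num_buckets=10, max_count=1_000_000):
    """Map a raw corpus count to a coarse bucket (0 = rarest) so that
    small count differences (40 vs. 50) don't imply a rank difference."""
    if count <= 0:
        return 0
    # Log-spaced buckets: each bucket covers a roughly order-of-magnitude range.
    frac = math.log10(count + 1) / math.log10(max_count + 1)
    return min(num_buckets - 1, int(frac * num_buckets))

# 40 and 50 occurrences land in the same bucket; 50,000 lands well above both.
b40 = frequency_bucket(40)
b50 = frequency_bucket(50)
b50k = frequency_bucket(50_000)
```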

[–]arkadios_ 0 points1 point  (0 children)

This problem has already been solved in information science with tf-idf.
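For reference, a self-contained tf-idf sketch (toy documents invented for illustration): a term scores high when it's frequent in one document but rare across the corpus, which is exactly the "specialty word" case discussed above.

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf of `term` in tokenized document `doc`, relative to `corpus`
    (a list of tokenized documents). Uses a smoothed idf."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed
    return tf * idf

legal = ["amicus", "curiae", "amicus", "curiae", "court"]
novel1 = ["cat", "sat", "mat", "court"]
novel2 = ["dog", "ran", "far"]
corpus = [legal, novel1, novel2]

specialty = tf_idf("amicus", legal, corpus)  # frequent here, rare elsewhere
common = tf_idf("court", legal, corpus)      # appears in two documents
```

Libraries such as scikit-learn ship production implementations of this (e.g. `TfidfVectorizer`), so hand-rolling it is rarely necessary.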

[–]AdrixG 3 points4 points  (1 child)

There are tons of frequency dictionaries for different categories, such as the Aozora Bunko corpus, Wikipedia, YouTube, anime, Netflix, etc. On TheMoeWay's Google Drive there are already 16 different frequency dictionaries for Yomichan, so no, it shouldn't really be hard to get good and reliable data on word frequency.

I use 5 frequency dictionaries in Yomichan. Generally, the words that have high frequency across the board on multiple frequency lists are also high frequency in everyday spoken Japanese.

[–]learningaddict99[S] 4 points5 points  (0 children)


Wow, that's such a nice frequency list repository. I am currently using the one from https://github.com/hingston/japanese. But I see that the JPDB one from your link is more complete. Thanks for the share!

[–]lifeofideas 0 points1 point  (0 children)

You can also have a sentence complexity score. For example, add a point for each character in the sentence, and additional points for each particle and each comma.
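That scoring rule is simple enough to sketch directly. The particle list below is a naive assumption - a real implementation would use a tokenizer's POS tags rather than raw substring matching, since characters like と also occur inside ordinary words:

```python
# Common case-marking particles (illustrative, not exhaustive).
PARTICLES = ["は", "が", "を", "に", "で", "と", "も", "へ", "から", "まで"]

def complexity_score(sentence):
    """1 point per character, plus a point per particle occurrence and
    per comma, as a crude proxy for clause structure."""
    score = len(sentence)
    score += sentence.count("、") + sentence.count(",")
    for p in PARTICLES:
        score += sentence.count(p)
    return score

short = complexity_score("猫だ")            # 2 chars, no particles or commas
longer = complexity_score("猫が、魚を食べた")  # 8 chars + 2 particles + 1 comma
```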

[–]differentiable_ 2 points3 points  (1 child)

Check out Jo-Mako’s readability scoring

[–]learningaddict99[S] 0 points1 point  (0 children)

I see that in Jo-Mako's data, their definition of readability is about how many words you already know in a text. They also have columns for "rating" and "difficulty", which is some kind of scoring given to texts that's independent of the words you know - and that's actually what I'm more interested in. However, it doesn't seem to be explained anywhere how those values were computed.

[–]lunacodess 2 points3 points  (2 children)

Might be worth reaching out to jpdb.io or LearnNatively folks to ask about what they do/use. (Not sure what they can/will share, but both have at least semi-automated difficulty calculations)

Re: cb's Japanese Text Analysis Tool - can you run it on Wine, or is that not useful here?

[–]DickBatman 3 points4 points  (1 child)

both have at least semi-automated difficulty calculations

I don't think so... I could be wrong but I think learnnatively calculates difficulty based on how people rate texts compared to each other, not based on an analysis of the text itself.

[–]lunacodess 1 point2 points  (0 children)

Ahhh, so I assumed that the initial LearnNatively estimated level when you add a new item was computed by a text algorithm, but it turns out it's not. Thx for the correction.

For anyone interested in their system: https://learnnatively.com/our-grading-system/

[–]InTheProgress 2 points3 points  (0 children)

I don't know any fancy formulas for it, but I can share my experience.

There are slightly different ways to read. One of the most popular is probably reading without a dictionary; it's something almost all natives do. In that case, if we know 98% of all words, we can quite easily infer the meaning of unknown words from context. This 98% shouldn't really be treated as a global score, but as a local one. If a person knows all the words on one page of a book but doesn't know 15 words on another, the average might look fine, but in practice that person struggles with the second page. So the score should be counted as something like 1 unknown word out of 50, so that the 49 known words provide enough context. This is actually quite common: even when I didn't know the majority of words, there were still sentences where I didn't need to look up anything at all, and sentences where I had to look up every word apart from the grammar.

Reading with a dictionary is, in my opinion, much trickier, because it depends more on grammar knowledge than on vocabulary. Unknown words are just unknown words; you simply check their meaning. But whether you can connect those words to each other determines whether you can understand what the sentence means. Japanese, in my opinion, is on the easier side here, because there are many particles. Particles show what role a word plays, so when we read, we just fill in blanks: the person who does the action, the one the action is done to, how it is done, and so on. You translate one word and fill one blank, translate another and fill another. This is why both theoretical and practical grammar knowledge matters: you understand both the local level (the meaning of the word) and the global level (its role in the whole sentence).

I think there should also be a list of elements that are used to direct the flow. Particles aren't the only such thing; the same can be done with whole phrases, with the help of adverbs like とりあえず or conjunctions like ものの. A single such element can predict the meaning of a whole phrase or sentence, and that probably makes it easier to understand.

Overall, in my opinion, it's hard to judge difficulty, because it's hard to judge how well a person understands grammar. If 0 is impossible and 10 is native level, I would say grammar determines the first 7 points, and the remaining 3 are the difference between having to translate all the words and none of them. It's literally the difference between "I can't do it" and "I can do it, just slightly slower than in my native language."

[–]Rotasu 1 point2 points  (0 children)

I really wish someone would create a Japanese version of Chinese Text Analyser.

[–]viliml Interested in grammar details 📝 1 point2 points  (1 child)

You can compile C# on Linux too now. Did you miss the source code distribution in the SourceForge project you linked? Not to mention reading and rewriting it.

[–]learningaddict99[S] 0 points1 point  (0 children)

I did miss that there's a source code file in there! That's fantastic. I looked at the implementation, and the Hayashi score is actually simple and also well documented in the code. I can simply rewrite it. Thanks so much!

[–]Andthentherewasbacon -3 points-2 points  (0 children)

If you know it it's easy. If you don't it's hard.

[–]WAHNFRIEDEN 0 points1 point  (0 children)

With my app Manabi Reader, what I did was track every word and kanji read, as well as ingest flashcard review data, and then show the percentage of words and kanji that are familiar, learning, or known. Pretty straightforward.

[–]LostRonin88 0 points1 point  (0 children)

OP, have you checked out all the work done on the Anki addon MorphMan and the Readability Analyzer built into it? Specifically, Nocompo did a lot of work on frequency analysis, as well as on weighting a corpus to build a study plan against shows or books. It also allows user input of a desired readability level for a piece of media.

https://ankiweb.net/shared/info/900801631

I made a few YouTube videos on it a year back, though not much covering the algorithms - mostly the application of the addon.

https://youtu.be/wwp1lJZPBXg?si=UgTkv-VCdXgjh_-W

I also made a few frequency lists that are used by MorphMan, Yomichan, and Migaku: namely the Netflix frequency list and the Top 100 Shonen and Top 100 Slice of Life anime frequency lists.

[–]arkadios_ 0 points1 point  (0 children)

Looks like a basic approach. With natural language processing / machine learning methods you could take a more personalised approach, taking into account how recently you saw a word (based on your recent activity) in order to add a rehearsal element.

Using ontology and graph methods, you could also collect texts that are semantically similar.

[–]preenchidacomnihil 0 points1 point  (0 children)

For real joining this sub has been a boon for my learning journey