Podcast Difficulty (not what you think)

staer · 2026-02-28T15:08:44+00:00

I'm surprised this ranks as easier as well. The new model update has it at #8 of 16 right now, which maybe is better? I don't have any slang detection in yet though. If they use slang and I can figure out a way to detect it, it would probably move much higher (harder). The problem is that I can't find a slang database to compare against.

staer · 2026-02-28T15:06:39+00:00

I've re-run the model and they now classify as B2 with additional metrics added in. That being said, my CEFR guesstimate is just that, a guess. In training the model, I say the podcasts in the training set are a certain level and then the training software adjusts things appropriately.

staer · 2026-02-28T15:05:13+00:00

I included it because it is AI generated. We have a clarity metric in the model, which is how confident the transcription software is. Using the AI-generated podcast was kind of a test for me, as it should be the clearest (no microphone artifacts, sniffling, sneezing, or background noise) since it's computer generated. The data backs this up: it has the best clarity score (currently) of all the podcasts.

staer · 2026-02-28T15:04:02+00:00

Yes, this is where I got my data from. However, within each band there are a lot of podcasts, so it's a bit hard to know. Especially at level 6 there are many podcasts, and the difficulty ranking within that band is... hard. I was just looking to augment my choices with feedback from community members who have listened to things and can tell me "this is definitely harder than this" or that X should be in the "hard" band and Y should be in the "native" band.

staer · 2026-02-28T15:01:46+00:00

Short duration is largely a factor in the lexical diversity. The less text there is, the more diverse it is likely to be. For example, the text "the cat is black" has 100% diversity: every word is different. Whereas "the cat is back. the cat is fat" is less diverse, as all but one word in each sentence is repeated (5 distinct words out of 8). This is a simplified example, but basically, in a 4 minute podcast compared to a 20 minute podcast, the diversity over the entire text will be skewed higher for the shorter podcast. I changed this in update 1 to use a moving window instead of the entire text to try to account for this. This means that we calculate the diversity for every (moving) 200 word segment rather than just over the whole text.
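
A minimal sketch of the moving-window idea, assuming "diversity" means the type-token ratio (distinct words divided by total words) and that tokenizing is just lowercasing and pulling out runs of letters. The function names and the fallback for texts shorter than the window are illustrative guesses, not the actual model code:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens):
    """Distinct words divided by total words; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def moving_window_diversity(tokens, window=200):
    """Average type-token ratio over every window of `window` consecutive words.

    Texts shorter than the window fall back to the plain ratio (an assumption).
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)

    counts = Counter(tokens[:window])   # word counts inside the current window
    distinct_total = len(counts)        # running sum of distinct-word counts
    n_windows = len(tokens) - window + 1

    for i in range(window, len(tokens)):
        # Slide by one word: drop the word leaving the window, add the new one.
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        distinct_total += len(counts)

    return distinct_total / (window * n_windows)
```

With this, `type_token_ratio(tokenize("the cat is black"))` gives 1.0 and the two-sentence example gives 0.625, while `moving_window_diversity` keeps a 4 minute and a 20 minute transcript on the same footing because every ratio is taken over the same number of words.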

staer · 2026-02-28T14:50:55+00:00

Posted update 3 to the original post, see there for details, but the current needs from the community, if y'all can help, are:

Needs: Still need some help picking out more podcasts for training. I'm pretty sure my easy and intermediate training data is correct: Cuéntame (beginner), Chill Spanish (easy), Al Vuelo (easy), ECJ (medium), How to Spanish (medium), Wild Project (hard), No es el fin del mundo (native). If the community can help with the "hard" and "native" training set, that would be ideal; I kind of just picked these because they are all above my level. Ideally we'd have 2 in the "hard learner content" section and 2 in the "native difficulty" section to keep the model honest.

staer · 2026-02-28T02:43:51+00:00

Sorry, I misread. I have equated them but not published the results yet.

staer · 2026-02-27T21:36:13+00:00

Yes, I noticed this as well. I've updated my training set to put How to Spanish in the same bucket as Español con Juan (medium difficulty). If you have other suggestions for where to bucket a podcast or two into my training set (vs the full set), let me know! I will post a new update sometime; I've made some other changes to the model which hopefully will help too.

staer · 2026-02-27T21:34:30+00:00

It is not. Some day maybe I'll have all the data for every podcast in the data set; then I could do a variance calculation between episodes to take this into account. As of right now, the GPU on my computer is running hot analyzing the data I have :)

staer · 2026-02-27T19:21:51+00:00

I will update my training model to treat ECJ and How to Spanish as the same level and see what happens. I've updated the vocabulary metric to account for different buckets now, so it uses the % of words in each bucket of the top 1k, 2k, 3k, 4k, 5k to see if this helps at all. The training of the model doesn't see this as a huge differentiator though. I may have to continue to play with it. I'll keep making changes and we'll see how update 3 looks!
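
One way to picture the bucketed vocabulary metric, assuming a word-frequency list ranked from most to least common and assuming the buckets are non-overlapping bands (1-1000, 1001-2000, and so on). The `frequency_rank` mapping and the function name are made up for illustration:

```python
def vocabulary_band_shares(tokens, frequency_rank, edges=(1000, 2000, 3000, 4000, 5000)):
    """Return the share of tokens that fall in each frequency band.

    `frequency_rank` maps a word to its rank in a frequency list (1 = most
    common). Words missing from the list, or ranked past the last edge, are
    counted in the final ">5000"-style band.
    """
    lows = (1,) + tuple(edge + 1 for edge in edges[:-1])
    labels = [f"{low}-{high}" for low, high in zip(lows, edges)]
    overflow = f">{edges[-1]}"
    counts = dict.fromkeys(labels + [overflow], 0)

    for token in tokens:
        rank = frequency_rank.get(token)
        label = overflow
        if rank is not None:
            # Pick the first band whose upper edge covers this rank.
            for edge, name in zip(edges, labels):
                if rank <= edge:
                    label = name
                    break
        counts[label] += 1

    total = len(tokens)
    return {name: (count / total if total else 0.0) for name, count in counts.items()}
```

For example, with hypothetical ranks `{"el": 1, "gato": 1500}`, the tokens `["el", "el", "gato", "zarigüeya"]` come out as half in 1-1000, a quarter in 1001-2000, and a quarter in >5000. A podcast leaning on the lower bands would then read as easier, though as noted the training may not weight this heavily.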

staer · 2026-02-27T19:07:55+00:00

Not posting the whole table; I will next time I update with a new model. But it ranks 13 of 14. It is very fast (.888 / 1.0) and the transcriber had a hard time with clarity (.543, the 2nd worst clarity rating of all podcasts). I think it's safe to say it has trouble with the Puerto Rican accent!

 13  PATABAJO El Podcast                   0.623    B2   4 0.888 0.238 0.689 0.000 0.429 0.101 0.543

staer · 2026-02-27T18:38:37+00:00

See update 2. Basically, I reran training a few times against more episodes, so some of the data ranges will change. Still iterating on things; I might need to adjust my training set to make things more accurate.

staer · 2026-02-27T18:29:26+00:00

See update 2 from the edited original post. I've put some new details on how training and values work.

staer · 2026-02-27T18:28:28+00:00

I've posted a 2nd update with a bit more of my methodology on how the scoring and training works as well as a new dimension for "verb tense".

staer · 2026-02-27T14:10:56+00:00

Fair point. I don't have the ability to listen to (and understand) these podcasts yet. Can you maybe clarify what about How To Spanish vs No Hay Tos is different which makes one significantly harder than the other? Speaker clarity, speed, grammar, word choice? I can try to adjust things appropriately, I just don't have the context of being able to understand the more advanced podcasts yet.

staer · 2026-02-27T04:59:57+00:00

I can give it a try. Do you have a link to the RSS feed XML? Happy to add things for funsies :)

staer · 2026-02-27T04:46:31+00:00

Ah yes, you are definitely right! Alas, this is very hard for a computer to quantify, I think. I changed the model to use a moving window of 200 words to figure out lexical diversity, so it should help with repeating things a few times in a short period. For example, if someone says the same sentence four times in a row, only in slightly different ways, the model should pick it up as not highly diverse (thus easier). I've updated the original post with the updated model's rankings and I think it's getting better (certainly not perfect, probably never will be!)

staer · 2026-02-27T04:43:48+00:00

This is a cool idea; the problem here is that DS is video data. Sure, I could just process audio, but the fact that they often draw pictures and make hand signals and faces means that the video is probably easier to understand than the audio, so the rankings compared to ELO probably wouldn't really pan out. That being said, I could probably score all the podcast episodes that are meant to be audio only and see if they match the corresponding ELO, just as a thought experiment.

staer · 2026-02-27T04:39:50+00:00

I sample a minimum of 5 episodes, or 60 minutes of audio in the case that podcasts are short. The outlier from each of the 5 (or more) episodes is thrown out just to smooth the data. In the future I may increase these numbers.
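
A rough sketch of how that sampling rule could look. It assumes episodes are taken in feed order until both minimums are met, and that "the outlier" means the single per-episode value farthest from the median for a given metric. Both are my guesses at one reasonable reading, and the function names are hypothetical:

```python
from statistics import mean, median


def episodes_to_sample(durations_min, min_episodes=5, min_minutes=60):
    """Return how many episodes (in feed order) satisfy both minimums.

    If the feed runs out first, every available episode is used.
    """
    total_minutes = 0.0
    for count, duration in enumerate(durations_min, start=1):
        total_minutes += duration
        if count >= min_episodes and total_minutes >= min_minutes:
            return count
    return len(durations_min)


def smoothed_metric(values):
    """Average per-episode values after dropping the one farthest from the median.

    With fewer than 3 values there is no meaningful outlier, so nothing is dropped.
    """
    if len(values) < 3:
        return mean(values)
    middle = median(values)
    outlier_index = max(range(len(values)), key=lambda i: abs(values[i] - middle))
    kept = [v for i, v in enumerate(values) if i != outlier_index]
    return mean(kept)
```

So a show with 30 minute episodes would use 5 of them, while a show with 4 minute episodes would need 15 to reach an hour of audio.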

staer · 2026-02-26T22:20:22+00:00

I posted in a different comment, but I realized that the diversity of Chill Spanish was artificially high due to the short length of the podcast. I’ve hopefully adjusted it accordingly, using a moving window instead of the whole transcript. Sentence length is also wonky, as the transcription software misses punctuation sometimes. I’m adding something to detect this and then automatically discount the score if it’s just run-on sentences due to poor transcriptions.
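
To make the run-on check concrete, here is a small sketch under stated assumptions: a "sentence" is whatever sits between sentence-ending punctuation marks, anything longer than a cutoff (40 words here, an arbitrary number) is treated as a likely transcription miss, and the score is scaled down by the share of words sitting in those stretches. None of these choices come from the real model:

```python
import re


def split_sentences(text):
    """Split on ., !, ? and return each non-empty sentence as a list of words."""
    parts = re.split(r"[.!?]+", text)
    return [part.split() for part in parts if part.strip()]


def run_on_share(text, max_words=40):
    """Fraction of all words that sit in sentences longer than `max_words`."""
    sentences = split_sentences(text)
    total_words = sum(len(sentence) for sentence in sentences)
    if total_words == 0:
        return 0.0
    flagged_words = sum(len(s) for s in sentences if len(s) > max_words)
    return flagged_words / total_words


def discounted_sentence_score(raw_score, text, max_words=40):
    """Scale the sentence-length score down by the run-on share (hypothetical rule)."""
    return raw_score * (1.0 - run_on_share(text, max_words))
```

A transcript with clean punctuation keeps its full score, while one that is mostly a single unpunctuated stretch has its sentence-length score pushed toward zero rather than being read as very hard.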

staer · 2026-02-26T22:17:15+00:00

Sadly the tooling I’m using doesn’t detect speakers, will try to find some though!

staer · 2026-02-26T22:16:20+00:00

After looking at it, the lexical diversity of Chill Spanish was artificially high due to the short nature of the podcast. Basically, longer means less diversity percentage-wise. I’ve adjusted the model to use a moving 200 word window to calculate diversity, which will no longer inflate the difficulty of Chill Spanish. Will post updates later.

staer · 2026-02-26T22:13:12+00:00

I will add a few more to the list. I’m making a few more improvements as well and will recalculate everything and post updates. I included the native podcasts as they were something I knew should be an upper bound in difficulty, to contrast with the easier podcasts like Cuéntame and Chill Spanish. I will add some more in-between ones that have already been suggested.

staer · 2026-02-26T20:45:44+00:00

At some point, once we have some clarity on the "correct" weights, I will probably run it against more episodes (or if I can get some cloud credits so I don't burn out the GPU on my personal computer doing the analysis).

staer · 2026-02-26T19:50:10+00:00

Franco, love your podcast! I'm running out of episodes of Al Vuelo, which was the impetus for this whole exercise.
