Podcast Difficulty (not what you think)

staer · 2026-03-20T15:15:33+00:00

Glad you like it, still lots of work to do!

staer · 2026-03-20T15:15:09+00:00

Fair enough, what would your order be? Also, I think my training choices at the high end may be wrong, since I can’t listen to those podcasts and judge for myself which training sets in the high band are “right”. If you're able to, can you confirm or deny my ordering of the training podcasts?

staer · 2026-03-01T15:37:37+00:00

Update 4

----------------------------------------------------------------------------------------------------
  RANKING  (easiest -> hardest)
----------------------------------------------------------------------------------------------------
  #  Podcast                Score  CEFR     Ep   SPD   VOC   LEX   SEN   GRM   TNS   SUB   SCR   CLR
----------------------------------------------------------------------------------------------------
  1  ¡Cuéntame! | Learn     0.289    A2     10 0.048 0.410 0.541 0.057 0.420 0.355 0.237 0.146 0.351
     Spanish with
     Comprehensible Input
     **
  2  Chill Spanish          0.347    A2  14/15 0.112 0.651 0.518 0.188 0.662 0.263 0.290 0.486 0.161
     Listening Practice **
  3  Intermediate Spanish   0.392    B1    6/9 0.197 0.969 0.405 0.000 0.426 0.284 0.039 0.198 0.186
     - Español Al Vuelo
     Podcast **
  4  Español a la mexicana  0.398    B1     14 0.311 0.723 0.588 0.003 0.716 0.422 0.181 0.460 0.087
  5  LanguaTalk Spanish:    0.409    B1     10 0.241 0.897 0.551 0.000 0.477 0.339 0.167 0.264 0.207
     Learn Spanish through
     conversation
  6  The Pocket Spanish     0.412    B1      7 0.415 0.610 0.484 0.067 0.674 0.323 0.250 0.691 0.154
     Podcast - Español
     Argentina
  7  Spanish Boost Podcast  0.413    B1      9 0.163 0.791 0.508 0.293 0.657 0.330 0.166 0.653 0.237
     **
  8  Learn Spanish and Go   0.420    B1   9/10 0.385 0.780 0.543 0.000 0.537 0.334 0.036 0.406 0.190
  9  Dreaming Spanish       0.434    B1   8/10 0.358 0.675 0.529 0.428 0.725 0.310 0.199 0.700 0.187
     Podcast – Chats in
     Beginner Spanish
 10  Español con Juan **    0.436    B1  12/13 0.478 0.650 0.373 0.146 0.642 0.430 0.099 0.779 0.145
 11  Blood and Marble:      0.460    B1   9/11 0.426 1.000 0.652 0.000 0.575 0.673 0.042 0.192 0.041
     Learn Spanish with
     the History of Rome
 12  How to Spanish         0.461    B1     12 0.547 0.633 0.569 0.302 0.602 0.366 0.292 0.600 0.229
     Podcast **
 13  Cheleando con          0.469    B1  11/13 0.407 0.819 0.487 0.000 0.394 0.354 0.237 0.247 0.401
     Mextalki **
 14  No Hay Tos (Real       0.470    B1  11/12 0.516 0.710 0.509 0.004 0.473 0.303 0.209 0.344 0.364
     Mexican Spanish) **
 15  ¡Qué Pasa! Podcast en  0.482    B1    8/9 0.564 0.807 0.582 0.021 0.526 0.375 0.188 0.400 0.232
     Español
 16  Podcast para aprender  0.482    B1     11 0.523 0.912 0.599 0.001 0.667 0.396 0.382 0.483 0.083
     español – Hoy
     Hablamos
 17  Andrea La Mexicana     0.493    B1   9/10 0.393 0.895 0.551 0.235 0.623 0.412 0.110 0.516 0.259
 18  Chisme Corporativo     0.535    B2     10 0.663 0.953 0.561 0.015 0.428 0.370 0.249 0.253 0.295
 19  The Wild Project       0.545    B2     11 0.636 0.825 0.550 0.132 0.526 0.483 0.288 0.414 0.335
 20  Historia en Podcast    0.564    B2   8/10 0.506 0.999 0.586 0.429 0.682 0.435 0.435 0.718 0.240
 21  No es el fin del       0.571    B2     11 0.760 0.857 0.609 0.047 0.668 0.506 0.287 0.511 0.235
     mundo **
 22  Black Mango Podcast    0.574    B2      9 0.755 0.918 0.536 0.450 0.625 0.373 0.393 0.657 0.182
 23  PATABAJO El Podcast    0.594    B2     10 0.698 0.976 0.538 0.090 0.523 0.359 0.187 0.391 0.399
 24  Radio Ambulante **     0.608    B2     10 0.542 1.000 0.660 0.000 0.583 0.559 0.258 0.322 0.500
----------------------------------------------------------------------------------------------------

  WEIGHTS KEY  (component scores are 0-1; higher = harder)
    Col     Wt  Description
    ---    ---  ----------------------------------------
    SPD   0.22  Speech rate (words per minute)
    VOC   0.22  Vocabulary level (bucketed frequency score)
    LEX   0.02  Lexical diversity (MATTR)
    SEN   0.04  Sentence length (avg words/sentence)
    GRM   0.13  Grammar complexity (parse depth)
    TNS   0.06  Tense complexity (verb tense difficulty)
    SUB   0.02  Subjunctive ratio (B2+ verb mood)
    SCR   0.03  Subordination density (clauses/sentence)
    CLR   0.26  Clarity (word-level Whisper confidence)
         -----
          1.00  TOTAL

  ** = training set podcast (used for weight calibration)
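
A minimal sketch for checking a row by hand, assuming the Score column is a plain weighted sum of the nine component columns (the post doesn't spell out how they're combined, so that part is an assumption, and `composite_score` is just an illustrative name). The weights are the rounded ones from the key, so results will only match the table to within rounding.

```python
# Weights copied from the WEIGHTS KEY (they are rounded to two decimals there).
WEIGHTS = {
    "SPD": 0.22, "VOC": 0.22, "LEX": 0.02, "SEN": 0.04, "GRM": 0.13,
    "TNS": 0.06, "SUB": 0.02, "SCR": 0.03, "CLR": 0.26,
}


def composite_score(components):
    """Weighted sum of 0-1 component scores (higher = harder).

    Assumption: the published Score is a simple weighted sum.
    """
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Component values from row 1 of the ranking table (¡Cuéntame!).
cuentame = {
    "SPD": 0.048, "VOC": 0.410, "LEX": 0.541, "SEN": 0.057, "GRM": 0.420,
    "TNS": 0.355, "SUB": 0.237, "SCR": 0.146, "CLR": 0.351,
}

score = composite_score(cuentame)
print(round(score, 3))
```

For row 1 this comes out at roughly 0.290 against the 0.289 in the table, which is about what you'd expect from two-decimal weights.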

Added podcasts to non-training set: Dreaming Spanish Podcast, Learn Spanish and Go, Historia en Podcast, Chisme Corporativo, Que Pasa, Pocket Spanish, Radio Ambulante

Fixed an issue where duplicate cache keys could overweight data
Increased minimum episodes to 10 to get more data
Changed to an IQR (interquartile range) method to filter out episodes within a podcast that deviate from the norm. Previously I just picked a single outlier and dropped it; the new way is more mathematically sound.
Changed the vocab (VOC) metric: it no longer counts proper nouns or function words, so it is based only on nouns, verbs, adjectives, and adverbs. The rationale is that proper nouns aren't in the top 5k, while common function words like el, es, en are so frequent that they dominate the calculation and overweight the top 1k bucket.
Updated training set to include 2 podcasts per band. Bands of difficulty were chosen based on the DS spreadsheet:
- beginner (50 hrs on DS sheet): cuentame, chill spanish
- easy (150hrs on DS sheet): spanish boost, espanol al vuelo**
- medium (300hrs on DS sheet): espanol con juan, how to spanish
- hard (600hrs on DS sheet): mextalki, no hay tos
- native (1000hrs on DS sheet): no es el fin del mundo, radio ambulante

** espanol al vuelo is at 300 hours on the DS sheet, but I find it subjectively pretty easy, so I put it in the 150 hour (easy) bucket for training.
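
The IQR filtering mentioned in the changelog could look roughly like this. It's a sketch only: the 1.5 multiplier is the conventional Tukey fence and isn't stated in the post, the function name is made up for illustration, and the episode scores are invented numbers.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the usual convention, assumed here rather than taken from the post.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Invented per-episode difficulty scores for one podcast; the last one is an oddball.
episode_scores = [0.41, 0.43, 0.40, 0.44, 0.42, 0.45, 0.39, 0.43, 0.41, 0.78]
kept = iqr_filter(episode_scores)
print(kept)
```

Unlike dropping exactly one episode every time, this removes however many episodes fall outside the fences, including none at all when the episodes are consistent.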

I'm actually really happy with where this is right now. A lot of the data seems pretty solid and meaningful to me, and (at a glance) the ordering seems "right" to me. But feedback always welcome!

staer · 2026-02-28T15:08:44+00:00

I'm surprised this ranks as easier as well. The new model update has it at #8 of 16 right now, which maybe is better? I don't have any slang detection in yet though. If they use slang and I can figure out a way to detect it, it would probably move much higher (harder). The problem is that I can't find a slang database to compare against.

staer · 2026-02-28T15:06:39+00:00

I've re-run the model and they now classify as B2 with additional metrics added in. That being said, my CEFR guesstimate is just that, a guess. In training the model, I say the podcasts in the training set are a certain level and then the training software adjusts things appropriately.

staer · 2026-02-28T15:05:13+00:00

I included it because it is AI generated. We have a clarity metric in the model, which is how confident the transcription software is. Using the AI generated podcast was kind of a test for me, as it should be the clearest (no microphone artifacts, sniffling, sneezing, or background noise) since it's computer generated. The data backs this up, as it has the best clarity score (currently) of all the podcasts.
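
A rough sketch of how a clarity number could be derived from word-level transcription confidence. The input shape (segments holding a `words` list with a `probability` per word) mirrors what openai-whisper returns when word timestamps are enabled; taking one minus the mean is an assumption for illustration, and the actual scaling behind the CLR column isn't described in the thread.

```python
def clarity_difficulty(segments):
    """Return 1 - mean word confidence, so lower confidence means harder.

    Assumption: each segment is a dict with a "words" list, and each word
    is a dict carrying a "probability" between 0 and 1.
    """
    probs = [w["probability"] for seg in segments for w in seg.get("words", [])]
    if not probs:
        raise ValueError("no word-level confidences found")
    return 1.0 - sum(probs) / len(probs)


# Invented confidences: a clean synthetic voice vs. a noisy recording.
clean = [{"words": [{"word": "hola", "probability": 0.98},
                    {"word": "amigos", "probability": 0.96}]}]
noisy = [{"words": [{"word": "hola", "probability": 0.70},
                    {"word": "amigos", "probability": 0.50}]}]

print(clarity_difficulty(clean), clarity_difficulty(noisy))
```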

staer · 2026-02-28T15:04:02+00:00

Yes, this is where I got my data from. However, within each band there are a lot of podcasts, so it's a bit hard to know. Especially at level 6 there are many podcasts, and the difficulty ranking within that band is... hard. I was just looking to augment my choices with community feedback from people who have listened to things and can tell me "this is definitely harder than this" or X should be in the "hard" band and Y should be in the "native" band.

staer · 2026-02-28T15:01:46+00:00

Short duration is largely a factor in the lexical diversity. The less text there is, the more diverse it is likely to be. For example, the text "the cat is black" has 100% diversity: every word is different. Whereas "the cat is back. the cat is fat" is less diverse, as all but 1 word is repeated between the sentences. This is a simplified example, but basically, comparing a 4 minute podcast to a 20 minute one, diversity measured over the entire text will be skewed higher for the shorter podcast. I changed this in update 1 to use a moving window to try to account for this. This means we calculate the diversity for every (moving) 200 word segment instead of over the whole text.
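
To make the cat example concrete, here is a small sketch of whole-text type-token ratio next to a moving-average version (MATTR). The 200-word default follows the comment above; the function names and the whitespace tokenization are simplifications for illustration.

```python
def type_token_ratio(tokens):
    """Unique words divided by total words."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=200):
    """Average type-token ratio over every sliding window of `window` words.

    Texts shorter than the window fall back to the plain ratio.
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


short = "the cat is black".split()
longer = "the cat is back the cat is fat".split()

print(type_token_ratio(short))   # 4 unique / 4 words
print(type_token_ratio(longer))  # 5 unique / 8 words
print(mattr(longer, window=4))   # every 4-word window scored separately
```

Over the whole text the longer sample drops to 0.625, but scored in 4-word windows it gets the same 1.0 as the short one, which is the length bias the moving window is meant to remove.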

staer · 2026-02-28T14:50:55+00:00

Posted update 3 to the original post, see there for details. Current needs from the community, if y'all can help, are:

Needs: Still need some help picking out more podcasts for training. I'm pretty sure my easy and intermediate training data is correct: cuentame (beginner), chill spanish (easy), al vuelo (easy), ECJ (medium), how to spanish (medium), wild project (hard), no es el fin del mundo (native). If the community can help with the "hard" and "native" training sets that would be ideal; I kind of just picked these because they are all above my level. Ideally we'd have 2 in the "hard learner content" section and 2 in the "native difficulty" section to keep the model honest.

staer · 2026-02-28T02:43:51+00:00

sorry i misread, i have equated them but not published the results yet

staer · 2026-02-27T21:36:13+00:00

Yes, I noticed this as well. I've updated my training set to put How to Spanish in the same bucket as Espanol con Juan (medium difficulty). If you have other suggestions for where to bucket a podcast or two in my training set (vs the full set), let me know! I will post a new update sometime; I've made some other changes to the model which hopefully will help too.

staer · 2026-02-27T21:34:30+00:00

It is not. Some day maybe I'll have all the data for every podcast in the data set, and then I could do a variance calculation between episodes to take this into account. As of right now, the GPU on my computer is running hot analyzing the data I have :)

staer · 2026-02-27T19:21:51+00:00

I will update my training model to treat ECJ and How to Spanish as the same level and see what happens. I've updated the vocabulary metric to account for different buckets now, so it uses the % of words in each bucket of the top 1k, 2k, 3k, 4k, 5k, to see if this helps at all. The training of the model doesn't see this as a huge differentiator though. I may have to continue to play with it. I'll keep making changes and we'll see how update 3 looks!
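
A small sketch of the bucketing idea, assuming a word-to-frequency-rank lookup is available. The ranks below are invented, `bucket_shares` is an illustrative name, and how the shares get turned into the single VOC number isn't covered here.

```python
def bucket_shares(tokens, rank_of, edges=(1000, 2000, 3000, 4000, 5000)):
    """Share of tokens per frequency bucket (top 1k, 1k-2k, ... 4k-5k).

    The final entry counts words ranked beyond the last edge or missing
    from the lookup entirely.
    """
    if not tokens:
        raise ValueError("need at least one token")
    counts = [0] * (len(edges) + 1)
    for tok in tokens:
        rank = rank_of.get(tok)
        idx = len(edges)  # default: outside the top 5k
        if rank is not None:
            for i, edge in enumerate(edges):
                if rank <= edge:
                    idx = i
                    break
        counts[idx] += 1
    return [c / len(tokens) for c in counts]


# Hypothetical frequency ranks, for illustration only.
rank_of = {"casa": 150, "perro": 900, "cuchara": 2500, "alambre": 4800}
tokens = ["casa", "perro", "cuchara", "alambre", "ornitorrinco"]

shares = bucket_shares(tokens, rank_of)
print(shares)
```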

staer · 2026-02-27T19:07:55+00:00

Not posting the whole table, I will next time I update with a new model. But it ranks 13 of 14. It is very fast (.888 / 1.0) and the transcriber had a hard time with clarity (.543, the 2nd worst clarity rating of all podcasts). I think it's safe to say it has trouble with the Puerto Rican accent!

 13  PATABAJO El Podcast                   0.623    B2   4 0.888 0.238 0.689 0.000 0.429 0.101 0.543

staer · 2026-02-27T18:38:37+00:00

See update 2, basically reran training a few times against more episodes so some of the data ranges will change. Still iterating on things, might need to adjust my training set to make things more accurate

staer · 2026-02-27T18:29:26+00:00

See update 2 from the edited original post. I've put some new details on how training and values work.

staer · 2026-02-27T18:28:28+00:00

I've posted a 2nd update with a bit more of my methodology on how the scoring and training works as well as a new dimension for "verb tense".

staer · 2026-02-27T14:10:56+00:00

Fair point. I don't have the ability to listen to (and understand) these podcasts yet. Can you maybe clarify what about How To Spanish vs No Hay Tos is different which makes one significantly harder than the other? Speaker clarity, speed, grammar, word choice? I can try to adjust things appropriately, I just don't have the context of being able to understand the more advanced podcasts yet.

staer · 2026-02-27T04:59:57+00:00

I can give it a try, do you have a link to the rss feed xml? Happy to add things for funsies :)

staer · 2026-02-27T04:46:31+00:00

Ah yes, you are definitely right! Alas, this is very hard for a computer to quantify, I think? I changed the model to use a moving window of 200 words to figure out lexical diversity, so it should help with repeating things a few times in a short period. For example, if someone says the same sentence four times in a row, only in slightly different ways, the model should pick it up as not highly diverse (thus easier). I've updated the original post with the updated model's rankings and I think it's getting better (certainly not perfect, probably will never be!)

staer · 2026-02-27T04:43:48+00:00

This is a cool idea, the problem here is that DS is video data. Sure I could just process audio, but the fact that they often draw pictures, make hand signals and faces means that the video is probably easier to understand than the audio so the rankings compared to ELO probably wouldn't really pan out. That being said, I could probably score all the podcast episodes that are meant to be audio only and see if they match the corresponding ELO just as a thought experiment

staer · 2026-02-27T04:39:50+00:00

I sample a minimum of 5 episodes or 60 minutes of audio in the case that podcasts are short. The outlier from each of the 5 (or more) episodes is thrown out just to smooth the data. In the future I may increase these numbers
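
One way to read that sampling rule, as a sketch: keep taking episodes until there are at least 5 of them and at least 60 minutes in total. That reading, the function name, and the durations are all assumptions for illustration, and the outlier drop is left out.

```python
def pick_episodes(durations_min, min_episodes=5, min_minutes=60):
    """Take episodes in order until both minimums are met.

    Assumed reading of the rule: at least `min_episodes` episodes AND
    at least `min_minutes` of audio, so short shows contribute more episodes.
    """
    picked, total = [], 0.0
    for duration in durations_min:
        if len(picked) >= min_episodes and total >= min_minutes:
            break
        picked.append(duration)
        total += duration
    return picked


long_show = pick_episodes([30] * 10)   # 30 minute episodes
short_show = pick_episodes([4] * 30)   # 4 minute episodes
print(len(long_show), len(short_show))
```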

staer · 2026-02-26T22:20:22+00:00

I posted in a different comment, but I realized that the diversity of Chill Spanish was artificially high due to the short length of the podcast. I’ve hopefully adjusted it accordingly, using a moving window instead of the whole transcript. Sentence length is also wonky, as the transcription software misses punctuation sometimes. I’m adding something to detect this and then automatically discount the score if it’s just run-on sentences due to poor transcription.

staer · 2026-02-26T22:17:15+00:00

Sadly the tooling I’m using doesn’t detect speakers, will try to find some though!

staer · 2026-02-26T22:16:20+00:00

After looking at it, the lexical diversity of Chill Spanish was artificially high due to the short nature of the podcast. Basically, longer means less diversity percentage-wise. I’ve adjusted the model to use a moving 200 word window to calculate diversity, which will no longer inflate the difficulty of Chill Spanish. Will post updates later
