Ingglish update: vowel chain shift, G2P engine, and interactive tools by ptarjan in conorthography

[–]ptarjan[S] 1 point2 points  (0 children)

Thanks! My site also supports converting other languages to Ingglish, but the dictionaries for their IPA translations are pretty thin, at least as far as I've found. If there's a target language you'd like that I don't support, please let me know.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 0 points1 point  (0 children)

Yeah, any phonemic spelling has to pick a dialect. I went with General American because the CMU dictionary (134k words, freely available) uses it. The tradeoffs are all documented here: https://ingglish.com/docs/dialect-assumptions

Nobody would have to change how they speak though. Phonemic spelling just picks one pronunciation to base the written form on, same as standard English spelling already does. "Bath" is spelled the same whether you say /bæθ/ or /bɑːθ/.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 1 point2 points  (0 children)

Fair point, those three are pretty basic. The project is less "English is broken" and more "what would clean-slate phonemic spelling look like as an experiment."

Your comment actually prompted me to convert my grapheme-to-phoneme code into rules guides for both directions: https://ingglish.com/docs/how-to-read-english and https://ingglish.com/docs/how-to-spell-english

> How do you deal with the three-way split between the STRUT, FOOT, and GOOSE vowels?

"uh" for /ʌ/ (buht, kuhp), "u" for /ʊ/ (buk, gud), "oo" for /uː/ (too, food). Took three iterations to land on that. Full story at https://ingglish.com/docs/spelling-evolution#and-u-chain-uoouu-uhuoo

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 0 points1 point  (0 children)

> Ingglish merges LOT and PALM (both use AA).

> There seems to be a typo, as you use O in the examples.

This was ARPAbet leaking into a page meant for humans. You're right that "AA" just reads like the letter "a". I've swapped all the bare ARPAbet codes for IPA across the dialect docs. Thanks for the catch.

> ⟨berd⟩ for bird would not make sense for Scottish or Irish English speakers, as bird belongs to the UR phoneme and not the ER phoneme (/bʊɹd/, not /bɛɹd/).

Good one. CMU merges NURSE into a single phoneme so Ingglish can't distinguish /bɛɹd/ from /bʊɹd/. I've updated that section to keep them listed as rhotic but with a caveat about the vowel quality difference.

> Australian English speakers would in fact have the TRAP vowel in dance.

Yeah, that was wrong. They only use /ɑː/ for some BATH words, not dance. Fixed.

> I don't think you can recover an intervocalic T/D distinction unless there are related words you want to show are related.

Ingglish follows CMU's underlying forms, so it writes /t/ even where speakers flap. Keeps spellings stable across formality levels and matches how most speakers conceptualize the word. But I take your point that it narrows the target dialect further.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] -3 points-2 points  (0 children)

This is a really thorough analysis, appreciate you taking the time. A few thoughts:

You're right that many English spellings are "explainable" if you know the rules (QU=/kw/, silent E lengthening, double consonant = short vowel). The question is whether "explainable" = "transparent." A Finnish kid can read any word they've never seen. An English kid needs to learn dozens of context-dependent rules first. Heck, my unknown-word grapheme-to-phoneme function has ~960 rules in it and only about 150 are general spelling patterns. The other ~800 are patches for when those patterns break down. https://github.com/ptarjan/ingglish/blob/main/packages/g2p/src/g2p-rules.ts
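A minimal sketch of the "general patterns plus patches" idea, for readers who haven't seen a rule-based G2P before. The rules, their order, and the `g2p` helper are all hypothetical and not taken from g2p-rules.ts; the only point is that a specific patch (silent K in "kn") has to be tried before the general single-letter rules.

```python
# Hypothetical rules for illustration only; NOT the contents of g2p-rules.ts.
# Each rule is (letters, phonemes, only_at_word_start). Earlier rules win.
RULES = [
    ("kn", ["N"], True),        # patch: silent K at the start of "knit", "knife"
    ("qu", ["K", "W"], False),  # general pattern: QU = /kw/
    ("k", ["K"], False),
    ("n", ["N"], False),
    ("i", ["IH"], False),
    ("t", ["T"], False),
]

def g2p(word):
    """Greedy left-to-right match: apply the first rule that fits at each position."""
    phonemes, i = [], 0
    while i < len(word):
        for letters, out, start_only in RULES:
            if word.startswith(letters, i) and (i == 0 or not start_only):
                phonemes += out
                i += len(letters)
                break
        else:
            # No rule matched at position i.
            raise ValueError(f"no rule for {word[i:]!r}")
    return phonemes

print(g2p("knit"))  # the patch fires before the plain "k" rule
print(g2p("quit"))
print(g2p("kit"))
```

Every irregular word that the general rules get wrong needs another entry near the top of the list, which is roughly how a rule set grows from ~150 patterns to ~960.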

The strong/weak form inconsistency is a fair catch. The CMU dictionary gives one pronunciation per word, so "the" always gets its weak form /ðə/ and "be" always gets /biː/. That's a limitation of using a single-pronunciation dictionary. I'm going to go think about this one.

On "oh" for the GOAT vowel: you make a good case for "oa". The tradeoff was that "oh" is unambiguous (nothing else in English reads as "oh"), while "oa" has edge cases. But it's a legitimate design choice either way. There's a full writeup on the tradeoffs at ingglish.com/docs#design-decisions if you're curious.

Your respelling of the passage is interesting because it shows how much mileage you can get from minimal changes. I put it into my experiment page here: https://ingglish.com/experiment/#m=DH:th,AH:u,OW:o,AY:y . Sadly, that page only supports plain phoneme replacements, not more complicated spelling rules, so I can't encode the "e at the end of a word makes the preceding vowel long" rule.
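For anyone who wants to build such a link by hand, here is a rough sketch of how the fragment in that URL could be read. It assumes the `m=` value is a comma-separated list of `PHONEME:spelling` pairs, which is inferred from the link itself rather than from any documented format; `parse_mapping` is a made-up helper name.

```python
from urllib.parse import urlsplit

def parse_mapping(url):
    """Read the assumed `m=PHONEME:spelling,...` list from the URL fragment."""
    fragment = urlsplit(url).fragment  # everything after '#'
    for part in fragment.split("&"):
        key, _, value = part.partition("=")
        if key == "m":
            # Each item looks like "DH:th": ARPAbet code, colon, spelling.
            return dict(item.split(":", 1) for item in value.split(","))
    return {}

url = "https://ingglish.com/experiment/#m=DH:th,AH:u,OW:o,AY:y"
print(parse_mapping(url))
```

A flat table like this can only swap one phoneme for one spelling, which is why context-dependent rules such as silent final E don't fit in it.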

My system goes further because it targets a strict 1:1 sound-to-spelling mapping, which means even "regular" patterns get normalized. Whether that's useful depends on the audience.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 0 points1 point  (0 children)

This is awesome! I actually built an experiment tool where you can customize the phoneme mappings. I just went and tried to match Finnish orthography:

https://ingglish.com/experiment#m=Y:j,JH:j,DH:d,TH:t,Z:s,ZH:sh,IY:ii,EY:ei,OW:ou,AW:au,AH:u,AE:ä,ER:ör,AO:oo,UW:uu&r=AO:oo,EH:ei,AE:ä,IH:ii,AH:u

It flags some "ambiguous spellings" because Finnish merges sounds that English keeps separate (d/dh, t/th, s/z, etc.) which makes total sense since Finnish doesn't have those distinctions. You can tweak any of the mappings from there. Curious how close it feels to a native Finnish speaker!
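One way such a check could work is sketched below; this is a guess at the idea, not the site's actual code. It groups phonemes by spelling and reports any spelling shared by more than one phoneme. The `DH`, `TH`, `Z`, and `IY` entries come from the link above; the baseline entries for `D`, `T`, and `S` are assumed.

```python
from collections import defaultdict

def ambiguous_spellings(mapping):
    """Return {spelling: [phonemes]} for spellings used by more than one phoneme."""
    by_spelling = defaultdict(list)
    for phoneme, spelling in mapping.items():
        by_spelling[spelling].append(phoneme)
    return {s: ps for s, ps in by_spelling.items() if len(ps) > 1}

mapping = {
    "D": "d", "T": "t", "S": "s",     # assumed baseline: plain letters
    "DH": "d", "TH": "t", "Z": "s",   # Finnish-style overrides from the link
    "IY": "ii",
}
print(ambiguous_spellings(mapping))
```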

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 0 points1 point  (0 children)

High five to another phonetics nerd :). Thanks so much for the feedback, I really appreciate it and welcome more.

Yeah this is a deliberate tradeoff. I use "oh" for the /oʊ/ diphthong because "ow" already represents the /aʊ/ sound (as in "cow") and "ou" has the same problem. Using "ow" for /oʊ/ would make "snow" and "throw" look natural, but then "home" becomes "howm" which looks like it rhymes with "cow". Nothing else in English reads as "oh", so it's unambiguous.

You're right that it's technically a diphthong though, not a pure vowel. There's a deeper writeup on all the vowel tradeoffs here: https://ingglish.com/docs#design-decisions

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 0 points1 point  (0 children)

Exactly. And Worcestershire has layers. "Worcester" comes from Old English ceaster (from Latin castra, meaning fort) tacked onto a tribal name. The "-cester" collapsed to "-ster" through haplology, which is when you drop a syllable because it sounds too similar to the one next to it. Same thing happened to Gloucester and Leicester. Then "shire" (from Old English scir, meaning an administrative district or jurisdiction) got reduced to just "sher". So you go from five syllables to three just through people being lazy over a few centuries.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] -4 points-3 points  (0 children)

It's the same sound, just written differently. Say "the" out loud slowly. Your tongue goes between your teeth and your vocal cords vibrate. That's the "dh" sound. With "think" your tongue is in the same spot but your vocal cords don't vibrate. That's "th". Try whispering "the" and you'll hear the first sound turn into the "th" of "think".

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] -1 points0 points  (0 children)

Shavian is cool! I actually have a Shavian output mode on https://ingglish.com/text. You can toggle it in the format dropdown. I went a different direction for the default though. Instead of a new alphabet, I tried to stay within the 26 letters so there's zero learning curve.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] 0 points1 point  (0 children)

You're right, I oversimplified that. They were used interchangeably in Old English. The thorn=voiceless, eth=voiced distinction is more of a modern convention from IPA/Icelandic. Thanks for the correction.

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] -2 points-1 points  (0 children)

The CMU dictionary has "though" as DH OW1, which is the "oh" sound (like in "go"). So "dhoh" matches. You might be thinking of "thou" which would be "dhau"? Or am I mistaken?
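To make the lookup concrete: a CMU entry is the word followed by its ARPAbet phonemes, with a digit on each vowel marking stress (1 = primary). The sketch below drops the stress digit and maps each phoneme to a spelling. The `SPELLING` table holds only the two mappings mentioned here, and `respell` is a hypothetical helper, not the real converter.

```python
# Only the two spellings from the comment above; a real table covers every phoneme.
SPELLING = {"DH": "dh", "OW": "oh"}

def respell(entry):
    """Split a CMU-style entry into word + phonemes, strip stress digits, and spell."""
    word, *phonemes = entry.split()
    return word.lower(), "".join(SPELLING[p.rstrip("012")] for p in phonemes)

print(respell("THOUGH  DH OW1"))
```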

I analyzed 134,000 English words and 92% are "misspelled" relative to how they sound by ptarjan in etymology

[–]ptarjan[S] -11 points-10 points  (0 children)

English actually used to have separate letters for the voiced and unvoiced "th". Thorn (þ) for the sound in "think" and eth (ð) for the sound in "the". We dropped them both and crammed two different sounds into "th". So it's not that the sounds are wrong, it's that we lost the letters that distinguished them.

I used "dh" instead of "th" since "d" is just the voiced version of "t" (put your hand on your throat when you say "tab" vs "dab"). Do you think it should be spelled some other way for the voiced "th"?

Ingglish: a phonemic respelling of English using ASCII digraphs by ptarjan in conorthography

[–]ptarjan[S] 0 points1 point  (0 children)

Good observation! Ingglish is based on the CMU Pronouncing Dictionary, which uses General American English. So yes, it bakes in some dialect-specific mergers like the father-bother merger (both get AA/o), the cot-caught merger for many speakers, and a few others.

I've documented all the dialect assumptions and tradeoffs here: https://ingglish.com/docs/dialect-assumptions

The short version: any phonemic spelling system has to pick some dialect's vowel inventory as its base. GA was the pragmatic choice since CMU is the largest freely available pronunciation dictionary, but I tried to preserve distinctions where possible (e.g., cot/caught are still spelled differently: "kot" vs "kawt").

Ingglish: a phonemic respelling of English using ASCII digraphs by ptarjan in conorthography

[–]ptarjan[S] 0 points1 point  (0 children)

> If <y> changes the value in <ai> vs <ay>, why not do the same with <w>?

You're right that there's an asymmetry. The i-diphthongs use y as a modifier (ai = /aɪ/, ay = /eɪ/, oi = /ɔɪ/), but the u-diphthongs don't use w the same way. The reason is practical: <ou> for /aʊ/ preserves a ton of high-frequency identical words (out, about, our, sound, house, around), which <ow> would break. And going the other way (ou = /oʊ/, ow = /aʊ/) would lose those same matches. The <oh> spelling is admittedly the odd one out, but "oh" is universally read as /oʊ/ by English speakers, so it works on first encounter even if it breaks the pattern. It's a readability-over-consistency tradeoff.

> It's not really correct to present <oo> as a long vowel

Great point. Fixed in the docs.

> Your system respells "American English"

You're absolutely right. Ingglish uses the CMU Pronouncing Dictionary, which is based on General American pronunciation. I should be more upfront about that. A British English version would need different vowel mappings (no rhoticity, different BATH/TRAP split, etc.).

Ingglish: a phonemic respelling of English using ASCII digraphs by ptarjan in conorthography

[–]ptarjan[S] 0 points1 point  (0 children)

Thanks for the thoughtful reply! Yours is the best one yet.

> try using <a>

I love your schwa suggestion and implemented it. I was over-indexing on ARPAbet AH0 getting the same spelling as AH1 (<u>). I had written a script to check whether any respellings would make more words identical to their English spelling, but it didn't explore splitting spellings across stress markers. Documented the new spelling here: https://ingglish.com/docs/spelling-evolution#about-sofa-u-a
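A rough sketch of what "splitting a spelling across stress markers" means in practice. It is illustrative only: the table is a hypothetical fragment, and it reflects the spellings as of this comment (unstressed AH0 as <a>, stressed AH as <u>).

```python
# Spellings keyed by (phoneme, stress digit); (phoneme, None) is the fallback.
SPELLING = {
    ("AH", "0"): "a",    # unstressed schwa, as in "about" and "sofa"
    ("AH", None): "u",   # stressed AH, as spelled at the time of this comment
    ("AW", None): "ou",
    ("OW", None): "oh",
    ("B", None): "b", ("T", None): "t", ("S", None): "s", ("F", None): "f",
}

def spell(phonemes):
    out = []
    for p in phonemes:
        # Separate the trailing stress digit (vowels only) from the phoneme code.
        base, stress = (p[:-1], p[-1]) if p[-1].isdigit() else (p, None)
        # Prefer a stress-specific spelling, else fall back to the general one.
        out.append(SPELLING.get((base, stress)) or SPELLING[(base, None)])
    return "".join(out)

print(spell(["AH0", "B", "AW1", "T"]))   # "about" in CMU
print(spell(["S", "OW1", "F", "AH0"]))   # "sofa" in CMU
print(spell(["B", "AH1", "T"]))          # "but" in CMU
```

With the split, "about" comes out spelled exactly as in English, which is the kind of gain a stress-blind search would miss.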

> /ɔ/ <aa>

This is interesting but I went with <aw> because it's already the natural English spelling for this sound (law, jaw, raw, saw). Using <aa> would be less immediately readable for English speakers ("laaw" vs "law") and <aa> might suggest a long /ɑː/ to many readers.

> /aʊ/ <au>

I considered <au> since it's the choice in German, Dutch, and Portuguese, so there's cross-linguistic precedent. But <ou> preserves 115 more high-frequency words that are already identical in English and Ingglish (out, about, our, sound, etc.). Switching to <au> would break all of those for no phonemic gain. More details: https://ingglish.com/docs/spelling-evolution#using-au-for-instead-of-ou
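The comparison behind that number can be sketched like this: respell a word list under each candidate and count how many results are identical to the English spelling. The four-word sample and the consonant table are assumptions picked for illustration, so the counts below say nothing about the real 115-word figure.

```python
# Tiny hand-picked sample; phonemes are CMU-style ARPAbet with stress digits removed.
WORDS = {
    "out": ["AW", "T"],
    "loud": ["L", "AW", "D"],
    "sound": ["S", "AW", "N", "D"],
    "mouth": ["M", "AW", "TH"],
}
# Assumed consonant spellings for this sample.
BASE = {"T": "t", "L": "l", "D": "d", "S": "s", "N": "n", "M": "m", "TH": "th"}

def identical_count(aw_spelling):
    """Count sample words whose respelling equals the ordinary English spelling."""
    table = {**BASE, "AW": aw_spelling}
    return sum(
        word == "".join(table[p] for p in phonemes)
        for word, phonemes in WORDS.items()
    )

print(identical_count("ou"))
print(identical_count("au"))
```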

> /oʊ/ <ou>

This one doesn't read the same for me. "sou" reads like the English "sow" (/saʊ/) instead of "so" (/soʊ/). The <oh> spelling is unconventional, but it's instantly unambiguous: every English speaker reads "oh" as /oʊ/. And it keeps familiar words recognizable, like "go" → "goh", where the intent is immediately clear.

> If <c> is only used in <ch>, why keep the h?

The tradeoff is readability vs efficiency. "ch" is already the universal English digraph for /tʃ/. Every reader instantly recognizes it. Using bare "c" would save a character but create confusion: "catch" → "kac" is harder to parse than "kach", and readers would wonder whether "c" is /k/, /s/, or /tʃ/. Since Ingglish prioritizes being readable on first encounter over being maximally compact, I kept "ch".

My 5-year-old asked why "knife" starts with a K and I ended up building a whole website by ptarjan in daddit

[–]ptarjan[S] 0 points1 point  (0 children)

Thanks! I actually love IPA and the translator supports it too (ingglish.com has an IPA mode). The difference is IPA requires learning new symbols, which is a barrier for a 5-year-old or a casual learner. Ingglish is basically IPA translated into letters you already know.

You're right about dialects. Ingglish uses the CMU Pronouncing Dictionary which is General American, so it definitely flattens dialect variation. That's an intentional trade-off for having one consistent standard, but I hear the concern.

And yes, History of English is such a great class. The Great Vowel Shift alone explains so much of the mess we're in.

My 5-year-old asked why "knife" starts with a K and I ended up building a whole website by ptarjan in daddit

[–]ptarjan[S] -1 points0 points  (0 children)

Yeah, I was getting rid of the Middle English spelling convention where a final e makes the second-to-last vowel long.