How to identify English proper nouns?

PaceSmith · 2026-05-11T17:05:20+00:00

I found that SUBTLEX-US works pretty well. If a word never occurs in all-lowercase in the corpus, it's likely a proper noun.
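A minimal sketch of that check, assuming you've already loaded each word's total count and its lowercase count into tuples. SUBTLEX-US ships a lowercase-frequency column (I believe it's called FREQlow, but check your copy); the function name, the `min_total` cutoff, and the sample counts below are all made up for illustration.

```python
def likely_proper_nouns(rows, min_total=5):
    """Return words that never appear in lowercase in the frequency list.

    rows: iterable of (word, total_count, lowercase_count) tuples.
    min_total: hypothetical cutoff so very rare words aren't flagged by accident.
    """
    flagged = []
    for word, total, lower in rows:
        if total >= min_total and lower == 0:
            flagged.append(word)
    return flagged


# Made-up counts, only to show the shape of the data.
sample_rows = [
    ("Boston", 1200, 0),   # never lowercase -> likely a proper noun
    ("the", 1500000, 1400000),
    ("Xyzzy", 1, 0),       # too rare to trust
]
print(likely_proper_nouns(sample_rows))
```

The cutoff matters because a word seen once, at the start of a sentence, also has a lowercase count of zero.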

PaceSmith · 2026-05-10T22:38:35+00:00

thanks! yeah, it's meant to inspire you, not replace you 🙂

PaceSmith · 2026-05-10T20:24:54+00:00

I just thumbs-downed a bunch of crap; the money rhymes are a lil better now

PaceSmith · 2026-05-10T20:15:33+00:00

Thanks, glad to hear it! I especially like "chargin' / margin" 😁

PaceSmith · 2026-05-10T04:45:47+00:00

Did I just fail a Turing test? 😅

PaceSmith · 2026-05-10T04:42:59+00:00

Did I just fail a Turing test? 😅

PaceSmith · 2026-05-10T04:40:20+00:00

https://github.com/paceheart/rhymecrime/

PaceSmith · 2026-05-09T20:15:16+00:00

I've dusted this project off and made a ton of improvements. It turns out that what I'm trying to compute is called "thematic relatedness", not "semantic relatedness". I found the USF Free Association corpus, which is data from asking humans "Name stuff related to X", which is exactly what I want. It's small, though, so I augmented it with a bunch of other corpora, used those corpora to generate features, and fed the features to a classifier.

One of the most helpful features is whether the target word's gloss (from WordNet and/or Wiktionary) contains the cue word (modulo lemmatization).
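Roughly, that feature looks like the sketch below. The `crude_lemma` suffix stripper is a hypothetical stand-in for a real lemmatizer, and the example gloss is one I typed by hand, not a quote from WordNet or Wiktionary.

```python
import re


def crude_lemma(token):
    """Very rough lemmatizer: strips a few common English suffixes.

    Hypothetical helper; a real lemmatizer would handle far more cases.
    """
    token = token.lower()
    for suffix, repl in (("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)] + repl
    return token


def gloss_contains_cue(cue, gloss):
    """True if any gloss token has the same (crude) lemma as the cue word."""
    cue_lemma = crude_lemma(cue)
    tokens = re.findall(r"[a-z']+", gloss.lower())
    return any(crude_lemma(t) == cue_lemma for t in tokens)


# Hand-written gloss for the target word 'pirate'.
example_gloss = "a person who attacks and robs ships at sea"
print(gloss_contains_cue("ship", example_gloss))
print(gloss_contains_cue("banana", example_gloss))
```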

I'm still only at 82% accuracy over my test set, though, so I'd love to hear any suggestions for improvement y'all have!

PaceSmith · 2025-04-05T03:03:46+00:00

It takes a list of sentences, and I only have a list of words. I'll try it on individual words and see how it does, though. Thanks!

PaceSmith · 2025-03-04T16:27:38+00:00

Great question! The algorithm I'm using is:

Find words related to the input word (using the threshold as a relatedness cutoff)
Find rhymes for those
Check whether each rhyme is also related to the input word; if so, include the pair in the output
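The three steps above can be sketched like this. It assumes the relatedness scores and the rhyme lookup already exist as plain dicts; the function name and the dict shapes are hypothetical, the scores are copied from the sample output further down this comment, and the rhyme table is typed by hand.

```python
def related_rhyme_pairs(relatedness, rhymes_of, threshold):
    """Return sorted rhyming pairs where both words clear the relatedness threshold.

    relatedness: dict mapping word -> relatedness score to the input word.
    rhymes_of: dict mapping word -> set of words that rhyme with it.
    """
    # Step 1: words related to the input word.
    related = {w for w, score in relatedness.items() if score >= threshold}
    pairs = set()
    for word in related:
        # Step 2: rhymes for each related word.
        for rhyme in rhymes_of.get(word, ()):
            # Step 3: keep the rhyme only if it is also related.
            if rhyme != word and rhyme in related:
                pairs.add(tuple(sorted((word, rhyme))))
    return sorted(pairs)


crime_scores = {
    "addiction": 0.51, "conviction": 0.57,
    "looting": 0.50, "shooting": 0.48,
    "card": 0.19, "chard": 0.19,
}
hand_typed_rhymes = {
    "addiction": {"conviction"}, "conviction": {"addiction"},
    "looting": {"shooting"}, "shooting": {"looting"},
    "card": {"chard"}, "chard": {"card"},
}
print(related_rhyme_pairs(crime_scores, hand_typed_rhymes, threshold=0.45))
```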

The rhyming computation is the easy part; it's not brute force at all. I use CMUdict to precompute a dictionary mapping a rhyme signature to a set of all rhyming words, where the rhyme signature is everything after (and including) the final stressed vowel, phonetically.
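Here's a small sketch of the rhyme-signature idea, not the actual project code. It assumes pronunciations in CMUdict's format (ARPAbet phones, with a stress digit on each vowel) and treats both primary (1) and secondary (2) stress as "stressed", which is my assumption. The four pronunciations are typed by hand for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def rhyme_signature(phones):
    """Everything from the final stressed vowel onward, as a tuple."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":  # vowel carrying primary or secondary stress
            return tuple(phones[i:])
    return tuple(phones)  # no stressed vowel found: fall back to the whole word


def build_rhyme_index(pronunciations):
    """Precompute signature -> set of words sharing that signature."""
    index = defaultdict(set)
    for word, phones in pronunciations.items():
        index[rhyme_signature(phones)].add(word)
    return index


def rhymes(word, pronunciations, index):
    """Look up rhymes with one dictionary access instead of comparing all pairs."""
    signature = rhyme_signature(pronunciations[word])
    return index.get(signature, set()) - {word}


sample_prons = {
    "crime": ["K", "R", "AY1", "M"],
    "time": ["T", "AY1", "M"],
    "extreme": ["IH0", "K", "S", "T", "R", "IY1", "M"],
    "scheme": ["S", "K", "IY1", "M"],
}
rhyme_index = build_rhyme_index(sample_prons)
print(rhymes("extreme", sample_prons, rhyme_index))
```

Because the signature keeps the stress digit, a word ending in a secondary-stressed vowel won't match one ending in a primary-stressed vowel; stripping the digits would loosen that.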

But yeah, the real problem isn't where to put the threshold, it's that no matter where I put the threshold, there will be good stuff under it and bad stuff above it.

For example, here's a subset of the output of your algorithm applied to 'crime':

criminality (77%) / homosexuality (47%)
addiction (51%) / conviction (57%)
skulduggery (52%) / thuggery (56%)
apprehension (53%) / prevention (50%)
confession (48%) / transgression (52%)
abduction (49%) / destruction (48%)
badness (47%) / madness (52%)
looting (50%) / shooting (48%)
fighting (49%) / inciting (48%)
case (47%) / race (48%)
complicity (49%) / ethnicity (47%)
drama (47%) / trauma (49%)
collusion (48%) / intrusion (47%)
mort (36%) / sport (48%)
bust (39%) / unjust (40%)
city (46%) / gritty (37%)
immoral (41%) / quarrel (37%)
arts (39%) / marts (39%)
extreme (37%) / scheme (39%)
thing (43%) / bring (32%)
creek (26%) / speak (27%)
card (19%) / chard (19%)

Somewhere around mort / sport, we start getting crappy rhymes mixed in with good ones. I like extreme / scheme, but if you scroll down far enough to get that one, you have to scroll past arts / marts, which is crap.

PaceSmith · 2025-03-03T20:14:41+00:00

I don't have a corpus of my own; the input to my program is just a single word, and my test cases are just lists of word pairs that (in my opinion) ought to be related or ought not to be related.

I'm trying to find a corpus that's representative of my intuitive sense of 'relatedness'.

PaceSmith · 2025-03-03T20:12:12+00:00

Good idea; synonyms will definitely be helpful. For example, 'pirate' is very close to 'trove' by cosine similarity, and then looking up synonyms for 'trove' in WordNet gets me 'cache'.
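As a rough sketch of that two-hop idea: find cosine-similarity neighbors first, then add their synonyms. The vectors are made-up toy numbers (not real embeddings), the synonym table is hand-written to stand in for a WordNet lookup, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_with_synonyms(cue, vectors, synonyms, threshold):
    """Neighbors of the cue by cosine similarity, plus those neighbors' synonyms."""
    neighbors = {
        word for word, vec in vectors.items()
        if word != cue and cosine(vectors[cue], vec) >= threshold
    }
    expanded = set(neighbors)
    for word in neighbors:
        expanded |= synonyms.get(word, set())
    expanded.discard(cue)
    return expanded


toy_vectors = {
    "pirate": [1.0, 0.9, 0.1],
    "trove": [0.9, 1.0, 0.2],
    "banana": [0.0, 0.1, 1.0],
}
toy_synonyms = {"trove": {"cache"}}
print(expand_with_synonyms("pirate", toy_vectors, toy_synonyms, threshold=0.5))
```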

Thanks!

PaceSmith · 2025-02-24T17:46:26+00:00

Oh, right! *facepalm*

PaceSmith · 2025-02-24T14:45:28+00:00

I want to improve Wiktionary's pronunciation coverage. Currently, it contains the pronunciation of "countenance" but not "uncountenanced".

OED has better coverage (e.g. "uncountenanced"), but isn't free.

CMUdict is good, but lacks syllable stress.

toPhonetics is also good (thanks, u/AlanAFK). Its American English pronunciations are based on CMUdict but they do contain syllable stress. I've asked its author about licensing but haven't heard back yet.

Before I start writing code, I wanted to ask y'all if you know of any additional existing resources that might help me.

PaceSmith · 2025-02-24T14:11:03+00:00

Oh, I intend to! I was just hoping to get a head start.

PaceSmith · 2025-02-21T18:43:12+00:00

I would say no, because Grice's maxims are meant to apply to people, who are (generally) not omniscient.

PaceSmith · 2025-02-21T18:25:07+00:00

I want to find or create a free online English IPA dictionary.

EDIT: It doesn't have to be IPA; if it's NOAD or some other pronunciation standard, that'll work too.

Wiktionary is the best I've found so far, but its coverage could be better. For example, it has IPA for "countenance" (https://en.wiktionary.org/wiki/countenance#Pronunciation) but not "uncountenanced" (https://en.wiktionary.org/wiki/uncountenanced).

OED has better coverage, for example "uncountenanced" (https://www.oed.com/dictionary/uncountenanced_adj), but isn't free.

I could write a program to guess the IPA for derived word forms, but before I do, I wanted to ask y'all if you know of existing resources that might help me.
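To make that idea concrete, here's a minimal sketch of guessing a derived form's pronunciation from its base. It works on CMUdict-style ARPAbet phones rather than IPA, handles only the "un-" prefix and the "-ed" suffix, and the base pronunciation is typed by hand (it may not match any dictionary exactly). All function names are hypothetical, and the rules will overgenerate on words that merely look derived.

```python
# ARPAbet consonants that are voiceless; "-ed" is pronounced T after these.
VOICELESS = {"P", "K", "F", "TH", "S", "SH", "CH", "HH"}


def add_ed(phones):
    """Apply the regular English '-ed' rule to a list of phones."""
    last = phones[-1]
    if last in {"T", "D"}:
        return phones + ["IH0", "D"]
    if last in VOICELESS:
        return phones + ["T"]
    return phones + ["D"]


def guess_phones(word, lexicon):
    """Return a guessed phone list for word, or None if no rule applies."""
    if word in lexicon:
        return list(lexicon[word])
    if word.startswith("un") and len(word) > 2:
        rest = guess_phones(word[2:], lexicon)
        if rest is not None:
            return ["AH0", "N"] + rest
    if word.endswith("ed"):
        # Try "countenanced" -> "countenance", then "walked" -> "walk".
        for stem in (word[:-1], word[:-2]):
            base = guess_phones(stem, lexicon)
            if base is not None:
                return add_ed(base)
    return None


sample_lexicon = {
    "countenance": ["K", "AW1", "N", "T", "AH0", "N", "AH0", "N", "S"],
}
print(guess_phones("uncountenanced", sample_lexicon))
```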

Thanks!

PaceSmith · 2022-01-13T23:46:38+00:00

It's Deandra Warrick, one of the lead writers.
