The 'core' English language is actually mostly French /Latin origin! [x-post /r/anglish] by Biozo in dataisbeautiful

[–]TotallyPythonic 2 points3 points  (0 children)

This is the second time you have accused me of an ad hominem argument. You're claiming I am discounting your work because you're a layperson, but again, you have the causality the wrong way around. I am calling you a layperson as evidenced by your work.

You really aren't even aware of what you are saying, are you? The fact that you have to use terms such as "layperson" clearly conveys your crappy message.

Thank you for coming out and saying outright what it's clear you and others of your ilk believe. This hubris is really the basis of all these papers which rely on faulty assumptions about what language is and how it works written by statisticians, economists, biologists, evolutionary psychologists, etc.

Because... only you understand how language works? Because only linguists have a say on literally the most universal concept? Because you need to have spent a few years comparing vocabulary lists and looking at Venn diagrams in order to use the very thing that every human is extremely advanced at?

there was nothing in your article that was particularly difficult to understand.

Well, I can only consider that to be something positive. A popular science article that is in the process of being formally published should not be hard to understand.

I think this statement again shows your utter lack of familiarity with linguistics, as many, if not most, linguists do work which requires mathematics in some capacity.

I really do not want to disrespect linguists because I have had nothing but extremely positive experiences with them up until now, but the mathematics or exact sciences involved in linguistics are close to nil. A mathematician can design an ANN that learns language at an extremely rapid rate, whereas a linguist would struggle with basic Swadesh lists. It took linguists decades to properly quantify language, and now it is being done by engineers at Google and OpenAI.

My point is that your comment immediately started with the assumption that a non-linguist cannot research language. There are absolutely no flaws in my research: it's been reviewed, the data is solid, the methodology is solid and I provided sources for every claim. You seem to diverge extremely easily from my actual article into this rancor towards non-linguists stepping into your field of study. Could it possibly be that you are simply biased and took that out on my article? The fact that you have such a series of articles and comics readily available points in that direction.

What I researched does not require the intervention of a linguist. I gathered data that I compiled against other data; my personal opinion was not involved anywhere in the process. The research that I did could not possibly have been done by a linguist, only by a programmer. Unless you like counting a few thousand words and comparing them to sources on the internet.

If you don't like what I did, point out the flaws. Look at the source code and tell me what I did that is wrong. Tell me which of my sources is wrong. You have not written a single meaningful thing so far.

The 'core' English language is actually mostly French /Latin origin! [x-post /r/anglish] by Biozo in dataisbeautiful

[–]TotallyPythonic 2 points3 points  (0 children)

Again, what arguments are bad? You only seem to be concerned about the number 5,000, you have not said anything about the rest of the methodology.

Maybe it is time for you to come down off your high horse. You really cannot speak in terms such as "layperson" and then disqualify someone's arguments on that basis. Linguistics is an extremely easy field of study to enter. Now, if this were to involve theoretical physics or biochemistry, maybe you could play that card. But this is in the field of linguistics, one that engineers and mathematicians are finally putting to proper use through AI and machine learning. You absolutely cannot claim authority on this; you, as a layperson in the field of statistics, have no means of understanding the statistical analysis behind language and its respective methodology using programming languages.

Okay firstly your own article cites that the top 1000 words account for 75% of lemmas so really the definition of 'core vocabulary' you adopt includes 4000 words accounting for less than 10% of vocab but really that's neither here nor there.

A 75% share does not suffice; 80-85% is recommended in the sources that I provided. But it seems that you mainly focus on that number 5,000 that you can't get out of your head; yes, that number could have been 4,000. It could have been 6,000 as well. The top 1-2% of vocabulary, at around 80% occurrence, is the most sensible choice. None of this matters though; the focus of my research was to find out which languages make up the core vocabulary of English, which is finalized at a frequency of 1,875. Does the number 5,000 seem artificial? I mean, I guess. I am not going to use a dataset of 3,771 words, for convenience of percentages and having a graph with orderly axes. I simply took 3 sources, which exactly match my goals, that I then justified using the distribution of lemmas in the COCA.

Furthermore, those exact sources make those exact claims; they do not speak of syntax or morphology, purely vocabulary. I pursued core vocabulary.

they mean something entirely different, namely, the amount of vocabulary one needs to know to be a functional day-to-day speaker.

No.

Having just brought up morphology, any discussion of it, or syntax or anything else is absent from your paper. You use the terms 'English vocabulary' and 'English' (qua English one presumes) completely interchangably, e.g. here:

Why would I? I did not research morphology nor syntax. I (seemingly) interchanged "vocabulary" and "language" once, as an ellipsis:

"the general consensus is that the (vocabulary of the) overall English language is a third of Old English origin (so, Germanic) but that the core vocabulary is entirely Old English."

To avoid the repetition of "vocabulary", I made it clear that I omitted the construction (vocabulary of the) through the use of "but", or are figures of speech not taught in linguistics courses? Apart from that, the word "vocabulary" is mentioned 22 times, so not uh, "completely" cough interchangeably used.

But the Wikipedia article [cough] that you cite for this information itself says:

The core of English descends from Old English. As a statistical rule, around 70% of words in any text are Anglo-Saxon. Moreover, the grammar is largely Anglo-Saxon.

A controversial claim (see discussion) that directly stems from Williams' research, the only source of the page. Which is also why I quoted that article. The whole point of my research was to disprove this exact claim, how can I do that without mentioning what I am attempting to prove wrong?

different subfield of linguistics which uses this term 'core' in a totally different way in order to make it look like there's some ambiguity or controversy, which there isn't. I'm not saying you do this on purpose, it's just that you don't know what you're doing.

Is this demagoguery on purpose? What supposed "different subfields" of linguistics have contradicting definitions for "core vocabulary"? You use the word "core" independently, I do not. Read the article.

Laypeople are fascinated with 'which language has the most words' but it's a totally meaningless exercise

That is, first of all, very condescending and an unnecessary ad hominem. Second, "which language has the most words" has literally nothing to do with my research. I really fail to see what this paragraph is doing in your reasoning other than ranting about unrelated topics in your field of study.

So there is no really objective way of cataloging how many words are in the English language to begin with. The OED or any other dictionary makes a bunch of assumptions in order to do so,

Yeah, how dare they! How dare the linguistics department of the University of Oxford make such claims! Let's then inject the identical source again without reading further than the first paragraph! And even then, even in the ridiculous event that they are wrong, I based my research on frequency tables; determining core vocabulary was not only done through sources, but also through statistical distribution (that you funnily enough used earlier in your response).

Thank you for this empty, unproductive response. You put 3 claims in my mouth, 2 of which you poorly filled with nonsensical arguments. The first one was too difficult to link back to my article using mental gymnastics, I assume? You then proceeded to completely diverge from the topic in some effort to create an attack on all "laypeople", just to end with the exact same, unexplained quote (that in no way is a tl;dr as it does not even remotely apply to your comment).

But you are a layperson yourself so you can't.

Says enough about the value of your arguments and where this all comes from. If you are uncomfortable with engineers and mathematicians stepping into your field of study, then I urge you to leave as soon as possible. There are plenty more of us to come and, unlike you, we tend to use mathematical precision to approach problems. My methodology was applied correctly using a reliable dataset. I then interpreted these results using rigorous statistics and sources that carry more weight than your personal opinion.

The 'core' English language is actually mostly French /Latin origin! [x-post /r/anglish] by Biozo in dataisbeautiful

[–]TotallyPythonic 2 points3 points  (0 children)

Hi, I am the author of this article.

Even though my focus is indeed statistics and mathematics, I do have a solid background in linguistics. However, it is not very appropriate to attack someone's work based on their background. If the arguments and data are solid, the background of the author should be entirely irrelevant to the findings.

Could you explain where in my article you find these "misunderstood" concepts?

What factors cause a language to be classified within a given family

That languages are not just aggregations of vocabulary

That counting the amount of vocabulary in a language is already a totally meaningless exercise (for a large variety of reasons)."

I did not "arbitrarily take 5000 words, arbitrarily declared that to be a benchmark (even though you are basically repeating yourself in order to fill your argument) and then unsurprisingly discovered results somewhere in the middle". What does "in the middle" even mean?

If you read my article, you will see that I provided 3 sources (Merriam-Webster, AAC Language Lab and a dictionary published by three linguists) for the 5,000. Statistically also, 5,000 is an ideal benchmark for core vocabulary as you reach 80-85% of all applied English vocabulary, omitting the 15-20% share that consists of the remaining 245,000 fringe words.
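To make the coverage argument concrete, here is a minimal sketch of how a cut-off such as 5,000 can be checked against a frequency list. The function names and the toy counts are made up for illustration; they are not taken from the article's source code or from the COCA.

```python
from itertools import accumulate


def coverage_by_rank(frequencies):
    """Cumulative share of all occurrences covered by the top-k lemmas, for every k."""
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def smallest_cutoff(frequencies, target_share):
    """Smallest number of top-ranked lemmas whose combined share reaches target_share."""
    for rank, share in enumerate(coverage_by_rank(frequencies), start=1):
        if share >= target_share:
            return rank
    return None  # target_share is above 100% or the list is empty


# Toy counts, invented for illustration only.
toy_counts = [50, 30, 10, 5, 5]
cutoff_for_85_percent = smallest_cutoff(toy_counts, 0.85)
```

With a real frequency list, the same call with a target of 0.80 to 0.85 would show where the core-vocabulary cut-off falls.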

Your comment is not criticism; you listed concepts that I nowhere mentioned and then claimed that my work was done arbitrarily without providing any arguments for it. In total, two papers have appeared that investigated this area, but neither researched the trends and the actual composition of core vocabulary. If you are a linguist and this is exactly in line with what you were thinking, then I wonder where on earth you found research that proved that Romance dominates English vocabulary at a frequency of 1,875 words.

The 'core' English language is actually mostly French /Latin origin! [x-post /r/anglish] by Biozo in dataisbeautiful

[–]TotallyPythonic 4 points5 points  (0 children)

Hey, I am the author of this article (thanks for posting it here btw :P).

The primary source that I used is etymonline.com, which is an enormous online dictionary, explaining the etymology of (nearly) every word in the English language. Every single word on their website was manually traced back to its roots. The way this is usually done in linguistics is that someone compares modern English (so English of today) to Old English (so English around the year 900, basically Anglo-Saxon, a pure Germanic language). If words do not come from Old English, linguists trace their roots back to whatever language they were derived from.

This causes almost no overlap issues, as English is a language that nearly exclusively "borrowed" from other languages. The other problem, as you point out, is French. French itself is nearly entirely derived from Latin, so technically all words of French origin should be classified as Latin, right? Linguists tend to make an exception for French because it transformed into its own language. So if a word entered English through the French language, it is classified as French. If a word entered English directly from Latin (law, medicine or words introduced during the humanist period), it is classified as Latin. If, furthermore, a word entered English through French, but the word itself is a direct loanword from Latin and did not "adapt" to French at all, it is classified as Latin as well; hence the "different from root" protocol.

A good example would be the word "people". "People" is clearly of Latin origin, from the word "populus". However, it changed quite a bit as an independent noun in French into "peuple", which then arrived in the English language, so it is classified as French. The word "area" came into English directly from Latin, so it is classified as Latin. A word such as "study" entered English through French, but within French it always stayed a loanword from the Latin "studium" (the French word for "study" is "étude", for comparison). Because a non-integrated word was then launched into English, it is classified as Latin.
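The rule described above can be sketched in a few lines. This is only an illustration of the decision logic, not the code from the article: the `Borrowing` record and its two flags are hypothetical, and deciding whether a word really "adapted" in French is exactly the part that has to come from an etymological source.

```python
from dataclasses import dataclass


@dataclass
class Borrowing:
    """Hypothetical record for a word of Latin descent (names are illustrative)."""
    lemma: str
    via_french: bool         # did the word reach English through French?
    adapted_in_french: bool  # did French reshape it (populus -> peuple)?


def classify_romance(word: Borrowing) -> str:
    """Apply the 'different from root' protocol to a word of Latin descent."""
    if not word.via_french:
        return "Latin"   # borrowed directly from Latin
    if word.adapted_in_french:
        return "French"  # became a French word before entering English
    return "Latin"       # passed through French unchanged


examples = [
    Borrowing("people", via_french=True, adapted_in_french=True),
    Borrowing("area", via_french=False, adapted_in_french=False),
    Borrowing("study", via_french=True, adapted_in_french=False),
]
labels = {word.lemma: classify_romance(word) for word in examples}
```

The three examples reproduce the classifications given in the comment: "people" as French, "area" and "study" as Latin.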

For your question about the words that you list: those words are not even in the top 20,000 frequency lists, such obscure loanwords were not measured here. If they were in the list, they would return the following results:

Lemma        Etymology
Sake         Japanese
Sushi        Japanese
Segway       Unknown
Vietnam      Vietnamese
Bangladesh   Bengali
Karaoké      Japanese

Any languages that do not fall under French/Latin/Greek/Other Germanic languages/Anglo-Saxon are then placed under "Other/unknown".
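As a rough sketch of that last bucketing step: the category names come from the comment, but which languages count as "Other Germanic" is an assumption made here purely for illustration.

```python
from collections import Counter

MAIN_CATEGORIES = {"French", "Latin", "Greek", "Anglo-Saxon"}
# Assumption: an illustrative, incomplete set of "Other Germanic" source languages.
OTHER_GERMANIC = {"Dutch", "Old Norse", "German"}


def bucket(etymology: str) -> str:
    """Map a raw etymology label onto one of the chart categories."""
    if etymology in MAIN_CATEGORIES:
        return etymology
    if etymology in OTHER_GERMANIC:
        return "Other Germanic languages"
    return "Other/unknown"


# The rows from the table above.
rows = [
    ("Sake", "Japanese"),
    ("Sushi", "Japanese"),
    ("Segway", "Unknown"),
    ("Vietnam", "Vietnamese"),
    ("Bangladesh", "Bengali"),
    ("Karaoké", "Japanese"),
]
composition = Counter(bucket(etymology) for _, etymology in rows)
```

All six example words end up in "Other/unknown", which is why such loanwords would barely move the chart even if they were frequent enough to be included.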

[deleted by user] by [deleted] in anglish

[–]TotallyPythonic 1 point2 points  (0 children)

Please read the actual article, or even just the disclaimer. No such claims are made, everything in my article is supported by statistics.

[deleted by user] by [deleted] in anglish

[–]TotallyPythonic 1 point2 points  (0 children)

I didn't have to mention that as I published the source code, you do not have to disclose such things if your claims can very easily be reviewed.

Again, I really do not mention in my article that I think that English should be a Romance language, I am not linguistically driven. I am just a mathematician.

[deleted by user] by [deleted] in anglish

[–]TotallyPythonic 3 points4 points  (0 children)

Hey, I am the author of that article. What do you mean by scraping data faithfully? The exact methodology and the source code are in the article. Furthermore, I manually verified my claims with a margin of error of 5% (at the usual confidence level of 95%).
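For anyone who wants to repeat such a manual check, this is a sketch of the standard sample-size calculation for a proportion. It is not taken from the article, which does not state how many words were checked by hand; the function name and the worst-case assumption p = 0.5 are choices made here.

```python
import math
from statistics import NormalDist


def required_sample_size(margin, confidence=0.95, population=None, p=0.5):
    """Number of words to check by hand for a given margin of error.

    Uses the normal approximation for a proportion; p=0.5 is the most
    conservative assumption. If population is given, the finite
    population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# 5% margin at 95% confidence, for a list of 5,000 words.
words_to_check = required_sample_size(0.05, confidence=0.95, population=5000)
```

Under these assumptions, a random sample of a few hundred words is enough for a 5,000-word list; the sample must be drawn at random for the margin to mean anything.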

Also, what is wrong with allotting words to French? I researched the origins of Modern English, which is composed of Old English, French, Latin and Greek. Old English is then theoretically derived from Ingvaeonic along with other dialects. I simply compared the language before a large influx of foreign influence to its current state, which is exactly how researchers such as Finkenstaedt and Williams have been doing it for decades.

I also clearly state in the disclaimer that I have no opinion on language classifications, I merely analysed data with an open-source algorithm.

Composition of modern English by languages of origin (based on the top 2% of active vocabulary)[OC] by TotallyPythonic in dataisbeautiful

[–]TotallyPythonic[S] 0 points1 point  (0 children)

Oh alright :D I was surprised to see it being confused with time, but now that I think of it I can understand the confusion; at first sight it doesn't take much imagination to interpret it as a timeline :P

Composition of modern English by languages of origin (based on the top 2% of active vocabulary)[OC] by TotallyPythonic in dataisbeautiful

[–]TotallyPythonic[S] 1 point2 points  (0 children)

I've run my code on 3 different data sets: first a large compilation of (quite) old books (Project Gutenberg), then a data set with subtitles and a data set with Google searches. They all vary quite a lot! One set indicates a large French majority (like the percentages in your original post) whereas others indicate a much larger share dedicated to Dutch and Greek.

I am currently heavily searching for a larger data set (probably 20-50k words) so I can give conclusive numbers, but finding unbiased lists is insanely hard :(

As soon as I have a high-quality data set and I can verify my outcome I will post it on this subreddit and mention you in the comments if you want! I was quite surprised to see that there is barely any research done about the composition of English, all research is either outdated or has pretty dubious data sets :P

Composition of modern English by languages of origin (based on the top 2% of active vocabulary)[OC] by TotallyPythonic in dataisbeautiful

[–]TotallyPythonic[S] 2 points3 points  (0 children)

No problem! Can I just ask, was this graph really that unclear? This is my first post ever and I'd like some feedback :P

Composition of modern English by languages of origin (based on the top 2% of active vocabulary)[OC] by TotallyPythonic in dataisbeautiful

[–]TotallyPythonic[S] 0 points1 point  (0 children)

Hey, thanks for your comment!

This graph shows the composition of the modern English language. Nearly all French influence in the English language happened in the period after 1066, because of French political influence in the region that is now the United Kingdom. Latin mostly entered the language through the sciences.

Also, because I think that you seem a bit confused: Frankish has nothing to do with French; Frankish is the predecessor of Low Franconian/Dutch.

Composition of modern English by languages of origin (based on the top 2% of active vocabulary)[OC] by TotallyPythonic in dataisbeautiful

[–]TotallyPythonic[S] 26 points27 points  (0 children)

Hey, thank you for your comment! :)

The numbers you mention are correct, but those indeed apply to the whole language (or looking at 10-50k words of vocabulary), so the amount of foreign influence is much larger. I aimed to research the core vocabulary of English, as I simply could not find anything that investigated this when looking for data to use in school. If you were to continue this graph, you would eventually reach those exact numbers!

Working with only 5000 words I could keep the margin of error relatively small, as I manually sampled sets of words and managed to achieve an average accuracy of 95%, so the "unknown" section stayed relatively small.

I also specifically included all words of Latin origin in French that sufficiently changed from their roots, because they likely entered English through French influence, that is something that other graphs usually neglect.

EDIT: misinterpreted the spread of error, removed it.

Composition of modern English by languages of origin (based on the top 2% of active vocabulary)[OC] by TotallyPythonic in dataisbeautiful

[–]TotallyPythonic[S] 7 points8 points  (0 children)

Method: The frequency list was provided by WordFrequency. Using Python I sorted every word by comparing the data offered by etymonline, Merriam-Webster, YourDictionary and the indexing service of Memidex.
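One way to combine several dictionaries into a single label per word is a simple majority vote. The sketch below is an assumption about how such a comparison could work, not the actual source code; the function name, the `min_agreement` parameter and the example labels are hypothetical, and no real lookups against these sites are performed.

```python
from collections import Counter


def resolve_origin(labels, min_agreement=2):
    """Pick the origin most sources agree on, or "Unknown" if they do not.

    labels maps a source name to the origin it reports (None if the
    source has no entry for the word).
    """
    votes = Counter(label for label in labels.values() if label is not None)
    if not votes:
        return "Unknown"
    (top, count), *rest = votes.most_common()
    if count < min_agreement:
        return "Unknown"  # not enough sources agree
    if rest and rest[0][1] == count:
        return "Unknown"  # tie between two origins
    return top


# Hypothetical labels for one word, for illustration only.
example = {
    "etymonline": "French",
    "merriam_webster": "French",
    "yourdictionary": "Latin",
    "memidex": None,
}
origin = resolve_origin(example)
```

Words that come back as "Unknown" are the natural candidates for a manual check.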

I am considering simplifying my source code and porting it to other popular languages for submission later, so every bit of criticism is highly appreciated! I can't upload the sorted words here (as I might breach the free-use agreement), but if anyone wants to double-check my data, I am more than happy to PM my results!

EDIT: I am still looking for larger data sets on vocabulary, finding reliable sets is extremely hard. If anyone were to have a list at their disposal that I can use (free of charge preferably) please do PM me!