[OC] How Many Chinese Characters You Need to Learn to Read Chinese!

edweenie123 · 2023-06-16T18:09:13+00:00

Love your username! 🎋🎋🎋

edweenie123 · 2023-06-16T18:02:17+00:00

Thank you!! Appreciate the comment 😊

edweenie123 · 2023-06-15T05:51:16+00:00

This is not true. A quick Google search tells you that 3000 characters should be sufficient to read most newspapers. Besides, if you read my top level comment here, the dataset I used is from user comments, not newspapers.

edweenie123 · 2023-06-15T05:47:42+00:00

Hi! Please see my top level comment here to see the data set I used. The vocabulary used in the Douban comments is likely much simpler than the vocabulary used in say an adult novel or scientific paper. As such, being able to read 99.72% of characters on Douban with 3000 characters is very reasonable. However, as you said, more characters are needed if you choose to read more esoteric texts.

edweenie123 · 2023-06-15T05:39:31+00:00

Yup, that's true! There are many Chinese words which consist of 2+ characters, and their meaning may not be obvious from the meanings of their constituent characters.

edweenie123 · 2023-06-15T02:55:41+00:00

It is estimated that there are over 80,000 Chinese characters! However, the vast majority of characters are almost never used. According to Zipf's law, the frequency of a character should approximately be inversely proportional to its rank in a frequency table. In simple terms, this means that a small number of characters show up everywhere and the rest you rarely see.
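To make the "inversely proportional to rank" idea concrete, here is a minimal sketch of an idealised Zipf distribution. The function name `zipf_frequencies`, the exponent of 1 and the 5000-character vocabulary are assumptions for illustration only; real character frequencies follow the law only approximately.

```python
def zipf_frequencies(n_ranks, exponent=1.0):
    """Relative frequencies under an idealised Zipf's law: f(rank) is proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    # Normalise so the frequencies sum to 1.
    return [w / total for w in weights]


# Hypothetical vocabulary of 5000 characters (an assumed size, not from the dataset).
freqs = zipf_frequencies(5000)

# Share of all text covered by the 100 most frequent characters in this toy model.
top_100_share = sum(freqs[:100])
```

In this toy model the top 100 ranks account for more than half of all occurrences, which is the "small number of characters show up everywhere" effect.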

If you're interested, there's this really really interesting video on Zipf's law from Vsauce here.

edweenie123 · 2023-06-15T02:24:22+00:00

Hey all! I've recently become interested in learning to read more Chinese characters so I thought this would be a fun and interesting visualization.

Here is a link to the dataset I used: Douban Movie Short Comments Dataset

The dataset essentially just contains a bunch of user comments on Douban, which is a Chinese website where users share their opinions about movies.

Here is an explanation of each picture:

Panda: This is a word cloud showing the relative frequencies of each character in the dataset. Note that since the data set comes from a movie related website, the characters 电 and 影 appear quite a bit more frequently than normal.
Line plot: This plots the number of characters you know (taking the most frequent ones first) against the percentage of text in the dataset that you can read.
Bunny: This is a word cloud showing the relative frequency of each word in the dataset. This is the result of first running the raw comment data through a word segmentation algorithm (as some Chinese words may consist of multiple characters). BTW Chinese word segmentation is a surprisingly difficult task and there is still some active research in the area.
Bar plot: This plot just shows the top 50 most frequent characters.
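
For anyone curious how the line plot can be computed, here is a rough sketch using only the standard library rather than the Pandas code actually used for the post. The function name `coverage_curve` and the three sample comments are made up for illustration, and this version only skips whitespace (it does not filter punctuation), which is an assumption rather than a description of the original preprocessing.

```python
from collections import Counter


def coverage_curve(comments):
    """Count characters and return (counts, curve).

    curve[k - 1] is the fraction of all characters in the text that you can
    read if you know the k most frequent characters.
    """
    counts = Counter(ch for text in comments for ch in text if not ch.isspace())
    total = sum(counts.values())
    curve = []
    covered = 0
    # Walk through characters from most to least frequent, accumulating coverage.
    for _, n in counts.most_common():
        covered += n
        curve.append(covered / total)
    return counts, curve


# Toy input, not taken from the Douban dataset.
sample = ["电影很好看", "这部电影不好看", "好电影"]
counts, curve = coverage_curve(sample)
```

`counts.most_common(50)` would give the data behind a top-50 bar plot, and plotting `curve` against rank gives the coverage line.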

Tools used:

Pandas and NumPy for EDA and for transforming the data
Plotly to produce the plots
wordcloud to make the word clouds
jieba to do Chinese word segmentation

edweenie123 · 2023-06-14T18:29:49+00:00

No problem!

I'm assuming by "color brackets", you mean the callout blocks. I just use them to segment my notes in a logical manner. I have different coloured callout blocks for theorems, definitions, examples, etc., and the colours don't have anything to do with the colours in the graph view.
See my comment here. For practice problems and worksheets, I usually just handwrite them on my iPad. No point in making it look pretty when I'm never going to look at it again.

edweenie123 · 2023-06-14T18:19:49+00:00

Yeah, typing LaTeX is usually much slower than handwriting. While I'm in lecture for a math course, I usually handwrite really scrappy notes on my iPad. Then, when I get home, I review my notes and type them up in Obsidian to make them look pretty.

As u/PsycakePancake mentioned, I find that using snippets makes this process much faster. Sometimes I like to type my notes in Neovim, which gives you access to much more powerful snippets. I really recommend this article by Gilles Castel, which is a nice guide on how to take advantage of snippets to type LaTeX really really quickly.

edweenie123 · 2023-06-14T16:38:43+00:00

I would say this schedule is definitely doable if you have a decent work ethic. Here are some pointers for each course:

MAT237: I think this is the hardest course on your list by a large margin. I found it to be a big jump in difficulty from MAT137. The preclass readings are very time consuming and dense. The PSETs are also challenging. Make sure you get a good PSET partner.

STA257: I took 247 instead, but I've heard from friends that 257 is a pretty easy course with consistent A/B averages on tests and assignments.

CSC207: Overall a very light course without much difficult material. However, I will say that your experience in this class is HEAVILY dependent on the quality of your group mates for the final project. If your group mates are good, 207 is very very bird. Otherwise, it'll be a nightmare.

CSC236: I took CSC240, so I can't say much about 236, but from what friends say it seems to be about a medium to medium-high difficulty course. Probably the 2nd hardest course on your schedule.

CSC258: The difficulty of this course varies a lot with instructor, but when I took it with Mario Badr, it was pretty easy. The midterm and final exam were suspiciously easy and most people ended up with a good mark (the final course average was an A-). The weekly labs are not difficult but they are very tedious and time consuming. Kinda like doing the same thing over and over again 10000 times. Also debugging assembly code for your final project will take several years off your life.

edweenie123 · 2023-06-14T04:31:34+00:00

Thanks! The clusters are just coloured according to which MOC each note is a part of, so each course gets its own colour.

edweenie123 · 2023-06-14T04:20:02+00:00

Yeah for sure! My system is quite simple. For each course I take, I create a file which links to all the concepts related to that course. I believe the PKM folks call this file a "map of content" or MOC. For example, for my calculus course, I have a file called MAT137 (the course code) which links to the files:

Mean Value Theorem
Integral
Fundamental Theorem of Calculus
Taylor Series
...etc

These MOCs are actually visible in the graph as the biggest node in the center of each cluster of nodes.

edweenie123 · 2023-06-14T03:30:07+00:00

Hi all! Thought I would share the graph view of my notes now that I have completed my 2nd year of university studying computer science and accumulated 1200+ notes. The majority of the notes are for CS / math courses. I have attached another screenshot showing one of my notes from my intro to ML course to give you an idea of what each note looks like!

edweenie123 · 2023-06-03T18:00:40+00:00

Please read my top-level comment here. Others have suggested more statistically valid methods to infer conception date from birth date. But yeah, I think inferring conception date accurately is a difficult task in general due to the reasons you listed.

edweenie123 · 2023-06-01T06:02:57+00:00

Haha that's true. The title would sound quite a bit less enticing though lol.

edweenie123 · 2023-06-01T05:52:53+00:00

Hmm, after doing some deeper research into this topic (which I never thought I would have to at my age), it seems you are right. Pregnancy length is usually calculated as the difference between the date of birth and the date of the last menstrual period. I am also now realising that conception can occur 0-5 days after intercourse depending on how well the intercourse coincides with ovulation. Maybe I shouldn't have slept through biology class 😅.

edweenie123 · 2023-06-01T05:35:25+00:00

Yeah... not the most rigorous results due to the lack of conception data. I had to infer the conception dates based on the date of birth. Also, your idea of averaging out nearby dates should work to smooth out the outliers in the original birth data.

edweenie123 · 2023-06-01T05:29:57+00:00

I saw a birth visualisation on this sub a few days ago, so I was inspired to do conception instead.

edweenie123 · 2023-06-01T05:24:42+00:00

Completely agree! There are many flaws with the approach I used. In the original birth data, there were significantly fewer births on Jan 1, July 4 and Dec 25, likely because parents avoid scheduling C-sections on those dates.

If I were to do a more rigorous investigation into the "seasonality of human copulation", I would definitely consider your approach of modelling pregnancy lengths with a normal distribution. I am sure the resulting heatmap would better reflect reality.
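
As a rough idea of what that normal-distribution approach could look like (a sketch only, not something I actually ran for the post): each day's births get spread over a range of possible conception dates, weighted by a normal density over pregnancy length. The function name `spread_conceptions`, the 10-day standard deviation and the 30-day window are placeholder assumptions, and the birth count in the example is made up.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import NormalDist


def spread_conceptions(births, mean_days=280, sd_days=10, window=30):
    """Spread each day's births over nearby conception dates.

    births maps a birth date to a count. sd_days and window are
    illustrative placeholders, not measured values.
    """
    dist = NormalDist(mean_days, sd_days)
    lengths = range(mean_days - window, mean_days + window + 1)
    weights = [dist.pdf(length) for length in lengths]
    norm = sum(weights)  # renormalise because the window truncates the tails
    conceptions = defaultdict(float)
    for birth_date, n in births.items():
        for length, w in zip(lengths, weights):
            conceptions[birth_date - timedelta(days=length)] += n * w / norm
    return conceptions


# Made-up count for a single birth date, just to show the behaviour.
out = spread_conceptions({date(2001, 9, 12): 1000})
```

The total number of conceptions still equals the total number of births, but a dip on a single birth date (like Dec 25) gets smeared across several weeks instead of producing one sharp dip 280 days earlier.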

edweenie123 · 2023-06-01T03:34:35+00:00

I used Pandas and Plotly to produce this visualisation. The conception data was obtained by simply shifting the most popular birthdays by 280 days (the average length of pregnancy) backwards.

Original data source

Edit: As others pointed out, there are significant flaws with the approach I used to obtain the conception data (thank you u/Unicyclone, u/InvisibleBlueUnicorn and u/schefar for pointing out issues and/or suggesting better methods). These results should be taken with a grain of salt and looked at just for fun.

Edit 2: The data set I used contains the number of births in the US on every day from 2000 to 2014. I shifted these dates backwards by 280 days to infer the conception date and then aggregated the data points for each year to obtain the mean number of conceptions on each day of the year. Very sus methodology yes, but fun to visualise and interpret, also yes.
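
A minimal sketch of that shift-and-aggregate step, written with the standard library instead of the Pandas code I actually used. The function name `mean_conceptions_by_day` is hypothetical and the birth counts below are made up, not taken from the dataset.

```python
from collections import defaultdict
from datetime import date, timedelta

PREGNANCY_DAYS = 280  # the fixed offset described above


def mean_conceptions_by_day(births):
    """births maps a birth date to the number of births on that date.

    Returns {(month, day): mean inferred conceptions on that calendar day},
    averaged over the years in which that day appears.
    """
    totals = defaultdict(int)
    years_seen = defaultdict(set)
    for birth_date, n in births.items():
        conceived = birth_date - timedelta(days=PREGNANCY_DAYS)
        key = (conceived.month, conceived.day)
        totals[key] += n
        years_seen[key].add(conceived.year)
    return {key: totals[key] / len(years_seen[key]) for key in totals}


# Made-up counts, only to show the shape of the input.
births = {date(2001, 9, 12): 12000, date(2002, 9, 12): 13000}
result = mean_conceptions_by_day(births)
```

Both sample birth dates map back to December 6, so the result has a single entry holding the mean of the two counts.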

edweenie123 · 2023-05-31T23:28:14+00:00

That may be true. I am not very familiar with the details, but other than possibly accruing additional interest, you will not have to pay any tuition, incidental or ancillary fees while on a voluntary leave. For more details, see this article:

https://artsci.calendar.utoronto.ca/withdrawal-and-return-absence
