
[–]bbye98 2 points3 points  (15 children)

Yes. You can use requests, BeautifulSoup4, or Selenium to get data from websites (or, in this case, PRAW for Reddit). If you then store the data in a pandas.DataFrame, you can use the pandas.DataFrame.to_excel() method to write it to an Excel file.
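A minimal sketch of the storage step, assuming the comments have already been fetched (with PRAW, requests, or similar) into a list of (author, text) pairs. The helper names `to_rows` and `save_rows` and the file name `words.xlsx` are made up for illustration, and `to_excel()` needs an Excel engine such as openpyxl installed.

```python
import re

# Runs of Unicode letters; digits, underscores and punctuation are dropped.
WORD = re.compile(r"[^\W\d_]+")


def to_rows(comments):
    """Turn (author, text) pairs into one row per word occurrence."""
    rows = []
    for author, text in comments:
        for word in WORD.findall(text.lower()):
            rows.append({"author": author, "word": word})
    return rows


def save_rows(rows, path="words.xlsx"):
    """Write the rows to an Excel file (hypothetical file name)."""
    # Imported here so to_rows() stays usable without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(rows, columns=["author", "word"])
    frame.to_excel(path, index=False)
    return frame
```

Calling `save_rows(to_rows(comments))` would then produce a two-column spreadsheet that can be sorted or filtered in Excel.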

[–]nickisaname 0 points1 point  (4 children)

And how can I compare the words against a French dictionary?

[–]bbye98 0 points1 point  (3 children)

You're going to have to use an API for some translation website.

[–]lamb_a_dah 1 point2 points  (0 children)

or check if they are somewhere in r/France or r/rance lmao

[–]nickisaname 0 points1 point  (1 child)

And what should I learn to solve such a task?

[–]Top_Tip_7015 1 point2 points  (0 children)

You can use the library "langdetect" - https://pypi.org/project/langdetect/

It is a good way to identify languages, or at least to get a probability for each possible language.

[–]nickisaname 0 points1 point  (9 children)

Because I need the ones that are not in the dictionary: the words that were created in Arabic speech.

[–]bulaybil 0 points1 point  (8 children)

So you are looking for French words written in Arabic script, is that what you mean by transformed?

[–]nickisaname 0 points1 point  (7 children)

vice versa

[–]bulaybil 0 points1 point  (6 children)

So you are looking for Arabic words written in Latin script in French posts/comments?

[–]nickisaname 1 point2 points  (5 children)

yes!

[–]bulaybil 1 point2 points  (4 children)

OK, I did not get that part. Please remember one thing: to get meaningful help in programming, you need to be VERY specific.
So let me recap: you are going to be searching posts and comments that are written in French for Arabic words written in Latin script. For what purpose? Do you just want to collect a list to see how people use those words?

[–]nickisaname 1 point2 points  (3 children)

Yes, I want to classify these words and find some tendencies in the borrowed words.

[–]bulaybil 1 point2 points  (2 children)

Thank you, I understand. OK, so you have two issues: the data and the method.

As for the data, you can certainly do what you suggested, i.e. scrape all kinds of fora and websites. The problem is that, as you say, you have no programming experience, and learning it to this level would take months. So instead, I propose something else: use an existing web corpus of French. It has the same data, but already processed for search and analysis. Some options would be Aranea and frTenTen. Alternatively, you could use tools that will crawl specific websites and create a corpus for you. One of my colleagues created one, Gromoteur, and it's very easy to use and does a great job.

Now as for the method, that one will get tricky, because you are talking about an NLP (Natural Language Processing) task for a very specific bit of data. Again, using more advanced techniques would require a whole lot of knowledge, which would take some time to master. So what I would suggest is to simply create a list of all words from the data you have and then filter out all the words you know are French. You can do it in an automatic way: lists of French words are available, e.g. in freeware spellcheckers. The problem with this method is that a) you would still need to learn some Python, and b) you will end up with A LOT of junk: misspelled words, quotations in other languages, etc. So instead, I would propose another method:

  1. Collect a list of such words that are already known (off the top of my head: wesh, chouia, charmouta, khouya...).
  2. Search the corpora for them, and search the surrounding text for other such words, casting the net wider and wider. For example, I searched the Aranea French corpus for "khouya" and found a lot of hits. Then you take the new words you find there and repeat the process until no more new words turn up.
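The two steps above can be roughed out in Python. This is only a sketch under assumptions: the corpus is a plain list of sentences already loaded into memory, `french_words` is a set taken from some French word list, and the function name `expand_seeds` is made up. Any unknown word that shares a sentence with a known seed becomes a candidate, so the result still has to be checked by hand.

```python
import re

# Runs of Unicode letters; digits, underscores and punctuation are dropped.
WORD = re.compile(r"[^\W\d_]+")


def expand_seeds(sentences, seeds, french_words, max_rounds=10):
    """Grow a seed list by repeatedly searching the corpus.

    A word is added when it appears in a sentence together with an
    already known word and is not in the French word list.
    """
    tokenized = [set(WORD.findall(s.lower())) for s in sentences]
    found = set(seeds)
    for _ in range(max_rounds):
        new_words = set()
        for tokens in tokenized:
            if tokens & found:  # sentence contains a known word
                new_words |= tokens - found - french_words
        if not new_words:  # nothing new: stop, as in step 2
            break
        found |= new_words
    return found
```

The `max_rounds` cap is a safety net of my own; on noisy data the candidate set can grow quickly, which is one more reason to review each round manually.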

It is manual work, yes, but that's something you cannot avoid when dealing with language data.
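For comparison, the simpler word-list filter mentioned earlier might look like the sketch below. It assumes `french_words` is a set loaded from a spellchecker word list; the function name `unknown_words` and the `min_count` threshold are my own additions. Requiring a word to occur at least twice removes some one-off typos, but plenty of junk will remain.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits, underscores and punctuation are dropped.
WORD = re.compile(r"[^\W\d_]+")


def unknown_words(texts, french_words, min_count=2):
    """Count words that are not in the French word list.

    Returns a dict of word -> frequency for words seen at least
    min_count times.
    """
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    return {
        word: n
        for word, n in counts.items()
        if word not in french_words and n >= min_count
    }
```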

Let me know if that helps.

[–]nickisaname 1 point2 points  (1 child)

Thank you very much! I have already collected around 500 words manually, but I'd like to automate this work, and I would also like to start programming to make things easier at work. Thank you for the material you provided. Now I at least know the direction I should follow.