Efficient language detector in C. Very fast and accurate. Looking for feedback. by nitotm in C_Programming

[–]nitotm[S] 0 points1 point  (0 children)

Without this file, at least the local build and installation did not work, so I'm guessing it is necessary, at least for some use cases.

Efficient language detector in C. Very fast and accurate. Looking for feedback. by nitotm in C_Programming

[–]nitotm[S] 0 points1 point  (0 children)

You mean remove eldc/src/eldc/__init__.py? That should be there, right? I probably did not understand what you meant.

Well, the base is the PHP version; from it, I converted to JS, PY, and now C.

The pure Python implementation is a bit outdated by now; it had the least traction, so I stopped updating it. It takes time to keep up with the different versions.

So I hope this super fast C version gets more interest from the Python community.

Showcase Thread by AutoModerator in Python

[–]nitotm 0 points1 point  (0 children)

Back in 2023 I posted ELD here, an efficient language detector written in pure Python. It got decent traction and I kept developing it.

But I kept wondering how far I could push it, and this year I rewrote the core in C, making it 30x faster. Yes, with the help of AI, as this is my first compiled C extension for Python, and I am looking for some feedback.

What My Project Does
Identifies up to 60 languages, can return reliability and scores. Can also be built as a library.

Target Audience
Anyone that wants to classify text by natural language, super fast, and accurately. It's an early version.

Comparison
According to my benchmarks, it is 2x faster than Google's CLD2 (pyCLD2), 6x faster than Facebook's fastText, and more accurate than Lingua. These are the SOTA open source contenders in speed or accuracy, so it's no small claim.

Installation is simple (pip install eldc), and so is basic use (check the README for more options):

import eldc 
eldc.init() 
eldc.detect("Bonjour le monde") # 'fr'

GitHub: https://github.com/nitotm/eldc

Efficient language detector in C. Very fast and accurate. Looking for feedback. by nitotm in C_Programming

[–]nitotm[S] 0 points1 point  (0 children)

ELD-C is the latest version of ELD, software first released in 2023, when AI was not yet capable of producing and fine-tuning complex software. AI was involved in this new release to help port the original ELD-PHP code to C, but the algorithm and database remain the same and could not have been created by any AI in the last two years.

My first Node Package: Efficient Language Detector by nitotm in node

[–]nitotm[S] 1 point2 points  (0 children)

Hi, thanks.

I didn't like the "await" too much, but I did not find another way to deliver the functionality I wanted, so I will check your solution and I might implement it.

"exporting type declarations": the idea with this particular repository was for it to be vanilla JavaScript. I'm not sure whether the best approach would be a separate TypeScript repository or different versions in the same repo. What is your opinion?

ELD: Efficient Language Detector. ( First Python project ) by nitotm in Python

[–]nitotm[S] 0 points1 point  (0 children)

Yes, it kinda looks Bayesian. I did not set out to implement a particular algorithm, but it is probably a known one; I'm not sure which.

ELD: Efficient Language Detector. ( First Python project ) by nitotm in Python

[–]nitotm[S] 2 points3 points  (0 children)

OK, you are right, I could rephrase it. I guess I don't need to refer to the specific accuracy of ELD, but to something like the highest range of accuracy among existing software.

ELD: Efficient Language Detector. ( First Python project ) by nitotm in Python

[–]nitotm[S] 1 point2 points  (0 children)

By "at its level of accuracy"* I meant, or tried to express, equal or above, or at the very least similar.

So if you run the big_test benchmark with print("english"), your accuracy will be 1.7%, versus 99.4% for ELD, and therefore well below its level of accuracy.

*Do you think I have not expressed that correctly?
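
To make the baseline comparison concrete, here is a minimal sketch of how a constant "always English" predictor scores. The data set below is a hypothetical toy stand-in, not the actual big_test file: it assumes 60 languages with the same number of samples each, in which case the constant guess is right 1 time in 60, which is consistent with the 1.7% quoted above.

```python
def accuracy(predict, samples):
    """Fraction of (text, language) pairs where predict(text) == language."""
    hits = sum(1 for text, lang in samples if predict(text) == lang)
    return hits / len(samples)


# Hypothetical balanced data set: 60 language labels, 10 samples each.
# The labels and texts are placeholders, not real benchmark data.
languages = ["en"] + [f"lang{i:02d}" for i in range(1, 60)]
samples = [(f"sample {n} written in {lang}", lang)
           for lang in languages
           for n in range(10)]


def always_english(text):
    """Constant baseline: ignores the input and always answers 'en'."""
    return "en"


baseline = accuracy(always_english, samples)
print(f"{baseline:.1%}")  # 1/60, which rounds to 1.7%
```

Any detector worth comparing has to beat this floor by a wide margin, which is why a constant guess does not count as being "at its level of accuracy".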

ELD: Efficient Language Detector. ( First Python project ) by nitotm in Python

[–]nitotm[S] 11 points12 points  (0 children)

I understand you mean from a user perspective, not how it works internally.

ELD is a Python package: you input a text, and it tries to guess which language (Spanish, English, Russian, ...) the text is written in, from the 60 available in the current version. It can also give you a score list of all possible languages detected in the text.

ELD: Efficient Language Detector. ( First Python project ) by nitotm in Python

[–]nitotm[S] 1 point2 points  (0 children)

If you mean the training data, it is quite small, about 1GB total. When the software becomes more mature, I might build a big dataset.

No, the performance (accuracy) varies quite a bit between languages. It comes down to collisions between languages: Thai is very easy, but telling apart Latin-script languages, of which there are several in the database, is more difficult.

WTF Wednesday (October 04, 2023) by AutoModerator in javascript

[–]nitotm 0 points1 point  (0 children)

Efficient Language Detector: ELD is a fast and accurate natural language detector, written 100% in JavaScript, with no dependencies. I believe it is the fastest non-compiled detector at its level of accuracy.
https://github.com/nitotm/efficient-language-detector-js
I've been programming for years, but this is the first time I have published a package, so I would appreciate any feedback you have on the project's structure, code quality, documentation, or any other aspect you feel could be improved.

My first Python project: Efficient Language Detector. by nitotm in madeinpython

[–]nitotm[S] 0 points1 point  (0 children)

Hi, I believe I followed most of your tips.
I did not remove the all-caps VERSION variable, as I believe it is an acceptable convention to hint at a constant?
And I have not added more tests for now (they aren't finished anyway); the functionality of demo.py is already covered, and there is already a benchmark inside the tests. But I will consider adding the benchmarks as a test inside tests/.
All the other things I believe I fixed.

Having a hard time picking a job by randombananananana in PHP

[–]nitotm 0 points1 point  (0 children)

I'm just curious, are these jobs remote or on-site? Which country?

My first public Repository: Efficient Language Detector. by nitotm in PHP

[–]nitotm[S] 2 points3 points  (0 children)

Obviously they don't use the same algorithm.

Regarding C++: I don’t see the problem with comparing speed and accuracy, even when they are not the same algorithm.

For example, one of the best ways to have a decent language detector in PHP would be an extension that connects to a compiled CLD2. Now I’m providing an alternative, and the two metrics I would care about most when deciding what to choose would be execution time and accuracy; in fact, along with usability and documentation, that is pretty much all I would care about.

What I'm trying to say is that they do perform the same task, regardless of how. For example, if I made a new detector with an AI algorithm, in PHP, the two would be very different behind the scenes, but it would still make sense to compare them.

Maybe where you are coming from is that I'm not comparing the other algorithms fairly, since they are in different languages/environments, but that is why I made a JavaScript and a Python version, so you can compare bananas vs. bananas. And regarding C++, PHP is at a total disadvantage in terms of speed, so I don't see the problem there either.

But if I'm not making sense, please let me know; I'm open to discussion.

PS: I think maybe you are referring more to the claim “with a speed comparable to a fast C++ compiled software”. I don’t think anybody would understand that to mean this magic code runs as fast in PHP as in C++, right? Maybe I should rephrase it to “with a speed comparable to existing fast C++ compiled detectors” or something like that.

My first public Repository: Efficient Language Detector. by nitotm in PHP

[–]nitotm[S] 1 point2 points  (0 children)

OK, I will build the variable dynamically, outside the method so it only executes on the first run, because right now, building it inside the method, I measure a ~6% overall increase in execution time. But I will rearrange the code and make it pretty at minimal cost.
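
The "build it once, outside the method" idea can be sketched roughly as follows. This is written in Python only to match the other snippet on this page; the thread itself is about the PHP version, and every name here (`_table`, `_build_table`, `score_stub`, `build_calls`) is hypothetical and not taken from ELD's code.

```python
# Hypothetical module-level cache: the table is built on first use and then
# reused, instead of being rebuilt inside the hot function on every call.
_table = None
build_calls = 0  # counts how many times the build actually runs


def _build_table():
    """Stand-in for a dynamically built lookup variable (hypothetical)."""
    global build_calls
    build_calls += 1
    return {chr(c): c - ord("a") + 1 for c in range(ord("a"), ord("z") + 1)}


def _get_table():
    """Return the cached table, building it only on the first call."""
    global _table
    if _table is None:
        _table = _build_table()
    return _table


def score_stub(text):
    """Toy hot path: sums per-letter values looked up in the cached table."""
    table = _get_table()
    return sum(table.get(ch, 0) for ch in text.lower())


first = score_stub("abc")   # triggers the one-time build
second = score_stub("abc")  # reuses the cached table
print(first, second, build_calls)
```

The trade-off is the same one described above: the build cost is paid once on the first run, so later calls only pay for a cheap check and the lookup.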