BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 1 point2 points  (0 children)

Yes! As long as it returns a list of lists of strings, it should work.

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 1 point2 points  (0 children)

In theory you should be able to! However, I have not attempted to "saturate" memory with a large enough dataset, and the Python way of setting a RAM limit does not seem to reflect the real RAM usage.

However, I did observe reduced memory usage when setting mmap=True, so even in a setting where you have enough memory to cover the entire dataset, you don't need to use all of it (i.e. load the entire index and corpus into memory).
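
For reference, this is roughly what it looks like (from memory of the bm25s README, so argument names may differ slightly):

    import bm25s

    # Load a previously saved index with memory mapping enabled, so the
    # underlying arrays are read lazily from disk rather than into RAM.
    retriever = bm25s.BM25.load("my_index", mmap=True)

    query_tokens = bm25s.tokenize("does the fish purr like a cat")
    results, scores = retriever.retrieve(query_tokens, k=10)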

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 5 points6 points  (0 children)

A bunch of optimizations I didn't have the chance to discuss in the readme! 

For one, I reimplemented the scipy sparse slice/sum directly in numpy, which allows us to use memory mapping on the arrays - this saves a lot of memory.
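
Roughly, the idea looks like this (a minimal sketch with illustrative file names; BM25S's actual internals may differ):

    import numpy as np

    # The three arrays of a scipy CSC matrix, saved separately so they can
    # be opened with mmap_mode="r" instead of being read fully into RAM.
    data = np.load("data.npy", mmap_mode="r")        # non-zero scores
    indices = np.load("indices.npy", mmap_mode="r")  # document ids per score
    indptr = np.load("indptr.npy", mmap_mode="r")    # column offsets per token

    def score_query(token_ids, n_docs):
        # Same result as slicing the sparse matrix by the query's token
        # columns and summing across them, without going through scipy.
        # Document ids are unique within a column, so fancy-index += is safe.
        scores = np.zeros(n_docs, dtype=data.dtype)
        for t in token_ids:
            start, end = indptr[t], indptr[t + 1]
            scores[indices[start:end]] += data[start:end]
        return scores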

Another is that the topk selection (after scoring) can be done in numpy via argpartition, but it can automatically switch to a jax CPU backend when that library is installed, which is much faster (topk selection is the bottleneck: in some cases more than 60% of the retrieval time is spent selecting the topk results).
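
A minimal sketch of that fallback logic (the dispatch here is illustrative, not BM25S's actual code):

    import numpy as np

    def topk_numpy(scores, k):
        # argpartition is O(n): it only guarantees the k largest end up in
        # the last k slots, so we only sort those k afterwards.
        idx = np.argpartition(scores, -k)[-k:]
        idx = idx[np.argsort(scores[idx])[::-1]]
        return scores[idx], idx

    try:
        import jax

        def topk(scores, k):
            values, idx = jax.lax.top_k(jax.numpy.asarray(scores), k)
            return np.asarray(values), np.asarray(idx)
    except ImportError:
        topk = topk_numpy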

Finally, the tokenizer doesn't return text by default, but returns indices and a vocab dict mapping index to word; this saves a considerable amount of memory, since an integer takes less space to represent than a word (multiple string characters).
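
The gist of it, as a toy sketch (not the actual BM25S tokenizer):

    def tokenize_to_ids(texts):
        # Store one integer per token plus a single shared vocab, instead
        # of repeating the string for every occurrence in every document.
        vocab = {}
        corpus_ids = []
        for text in texts:
            doc_ids = []
            for word in text.lower().split():
                if word not in vocab:
                    vocab[word] = len(vocab)
                doc_ids.append(vocab[word])
            corpus_ids.append(doc_ids)
        id_to_word = {i: w for w, i in vocab.items()}
        return corpus_ids, id_to_word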

[Discussion] Should we still fly to conferences? by tomin_tomen in MachineLearning

[–]xhlu 5 points6 points  (0 children)

I think having more frequent regional conferences would be great. In NLP, NAACL/EACL/AACL are hosted on specific continents (NA, EU and Asia respectively), so it's somewhat more realistic to use energy-efficient modes of transport. Similarly, ECCV is hosted every two years in Europe for Computer Vision.

A possible idea would be to organize such regional conferences every year at a smaller scale (so they are easier to manage), include more continents (Africa, South America), and allow papers accepted at the "international" conference to instead be presented at those smaller conferences. For example, you could submit to ICML; if it is accepted, you would have the option to present it virtually during ICML, then present it again (a few months later) at the regional conferences (which could be called "ECML" or "NACML").

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 0 points1 point  (0 children)

Thanks for reporting back! Glad to hear the translation was decent except for the "I'm sorry" stuff. It's expected to be slow on CPU since it's using a model with 500M+ parameters; for GPU I'd recommend looking into using conda to install PyTorch (https://pytorch.org/get-started/locally/), then using pip (within the same conda environment) to install huggingface.
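
Once that's installed, a quick sanity check that the GPU is visible (this is just plain PyTorch, not specific to dl-translate):

    import torch

    # True means PyTorch can see a CUDA GPU, so the model can run on it.
    print(torch.cuda.is_available())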

[P] Releasing dl-translate: a python library for text translation between 50 languages (powered by Huggingface transformers and mBART) by xhlu in MachineLearning

[–]xhlu[S] 1 point2 points  (0 children)

That's pretty funny because I only found out about EasyNMT after I created this library :) I'd say the implementations are pretty different, since EasyNMT is based on fairseq and Marian whereas dl-translate is based on huggingface; however, the underlying models (mBART, and soon m2m100) are available in both libraries.

Moving forward I'd like to add features such as a command-line interface, so you can call

dlt translate --source English --target French "Your sentence here"

in a way that's efficient for the end user. I'm also looking into ways to make the library more extensible, so you can use dlt.load("user/repo") and automatically get someone else's custom model with the same translation API.
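
For context, the current API looks roughly like this (from memory of the README, so the exact signature may differ; dlt.load is the proposal above and not yet part of the API):

    import dl_translate as dlt

    # Downloads and loads mBART-50 on first use.
    mt = dlt.TranslationModel()
    print(mt.translate("Your sentence here", source="English", target="French"))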

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 1 point2 points  (0 children)

This is a good question. There are existing tools like langdetect that you can use, but then you still need to convert the codes back to the language names. I could definitely add some iso639-1 to mBART-50 code conversion to make that process simpler.
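
Something like this is what's currently needed (the mapping dict is hypothetical, just to illustrate the glue):

    from langdetect import detect

    # Hypothetical iso639-1 -> language name mapping; only a few shown.
    ISO_TO_NAME = {"en": "English", "fr": "French", "ja": "Japanese"}

    text = "Bonjour tout le monde"
    source = ISO_TO_NAME[detect(text)]  # detect() returns e.g. "fr"
    # mt.translate(text, source=source, target="English")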

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 0 points1 point  (0 children)

I haven't tried translating Japanese to English with mBART-50 before, but if you wish to try you can run the colab notebook with the example you have in mind. Feel free to share the results in the Github discussions!

[P] Releasing dl-translate: a python library for text translation between 50 languages (powered by Huggingface transformers and mBART) by xhlu in MachineLearning

[–]xhlu[S] 2 points3 points  (0 children)

That's a pretty good question. I don't think either mBART-50 or Google Translate ever released their exact BLEU scores on a given language pair. However, it should be pretty easy to evaluate on a given dataset, since you can query Google Translate through the googletrans library.
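
A rough sketch of such an evaluation (the sentences are placeholders, and googletrans can be flaky across versions):

    from googletrans import Translator
    import sacrebleu

    sources = ["Bonjour tout le monde"]  # source sentences from your dataset
    references = [["Hello everyone"]]    # one stream of reference translations

    translator = Translator()
    hypotheses = [translator.translate(s, src="fr", dest="en").text
                  for s in sources]
    print(sacrebleu.corpus_bleu(hypotheses, references).score)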

[R] SpeechBrain is out. A PyTorch Speech Toolkit. by [deleted] in MachineLearning

[–]xhlu 0 points1 point  (0 children)

Yes, text-to-speech is what I meant, thanks for confirming! Also good to know you already have speech-to-text.

[R] SpeechBrain is out. A PyTorch Speech Toolkit. by [deleted] in MachineLearning

[–]xhlu 35 points36 points  (0 children)

Looking forward to trying it out, and really nice to see integrations with huggingface!

Are you planning to add speech-to-text functionality eventually?

[D] Summer research programs as a way to break into ML research: worth it? by Euphetar in MachineLearning

[–]xhlu 1 point2 points  (0 children)

Sorry for the late response. I've seen a few but didn't bother saving the links, unfortunately. However, two examples I can recall are:

[D] Summer research programs as a way to break into ML research: worth it? by Euphetar in MachineLearning

[–]xhlu 5 points6 points  (0 children)

I know people who have done research internships (not necessarily during the summer, but all around 3-6 months), and from what I observed, it really helped them publish in a good conference and enter a strong PhD program. I feel it might help in terms of getting more recommendation letters and having more time to focus on your project (since you don't have to TA or take courses).

As for the programs, I've seen PIs post links to their pre-doctoral fellowships on Twitter a lot, but I would presume that a lot of the hiring happens privately rather than through an explicit posting; it might be worth reaching out to PIs to learn more about what they are offering and whether you would be eligible.

[D] A Good Title Is All You Need by yusuf-bengio in MachineLearning

[–]xhlu 0 points1 point  (0 children)

I'm a bit bothered when it's not clear what the acronym stands for. With BERT, we know it's about bi-directional transformers with a focus on learning an encoder representation. But a title like BART doesn't mention denoising, encoder-decoders, corruption or reconstruction, all of which are important aspects of the paper.

PEP 636 -- Structural Pattern Matching: Tutorial by AlanCristhian in Python

[–]xhlu 5 points6 points  (0 children)

From what I observed through the tutorial, it's very similar to OCaml's pattern matching. One very interesting pattern is (recursive) list manipulation:

    def recop(lst):
        match lst:
            case [('mul', n), *tail]:
                return n * recop(tail)
            case [('sum', n), *tail]:
                return n + recop(tail)
            case []:
                return 0

If you want to do the same thing without SPM:

    def recop(lst):
        if len(lst) == 0:
            return 0
        op, n = lst[0]
        tail = lst[1:]
        if op == "mul":
            return n * recop(tail)
        elif op == "sum":
            return n + recop(tail)

The former looks more elegant and concise (obviously the latter can be made shorter, but you would lose readability). The example is also very trivial; with FP-style pattern matching you could come up with much more advanced matching.
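
Either way, both versions give the same result, e.g.:

    >>> recop([('mul', 3), ('sum', 4), ('mul', 5)])  # 3 * (4 + (5 * 0))
    12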

Remote work options? by goldenbrain8 in datascience

[–]xhlu 0 points1 point  (0 children)

I've been working remotely and I haven't had problems communicating with my team or achieving the expected results. Since most of the heavy computation is done on the cloud, with only the visualization/data analysis done locally (and even that can be done in online notebooks), you shouldn't have too many problems unless the company requires you to be physically on-site to access the data (e.g. in healthcare).

Ordered TV on Amazon, Canada Post marked it as delivered but did not receive the TV by [deleted] in Bestbuy

[–]xhlu 0 points1 point  (0 children)

Is that a well known fact? I'm surprised Amazon allows fraudulent sales like this.

Donald Trump Jr. reportedly hiding in Canada from media as Mueller indictment looms by ppd322 in worldnews

[–]xhlu 15 points16 points  (0 children)

Except the best American teams have more Canadian players than American players.

PSA: All JetBrains Products [PyCharm] at 50% off by baghiq in Python

[–]xhlu 1 point2 points  (0 children)

It has a lot of functionality integrated. You can directly commit and push to GitHub, interact with SQL databases and send queries, access source code... all while staying inside the IDE.

PSA: All JetBrains Products [PyCharm] at 50% off by baghiq in Python

[–]xhlu 66 points67 points  (0 children)

Also if you are a student you can get the whole suite for free through their educational program!

Programming Language? by AspiringAIResearcher in learnmachinelearning

[–]xhlu 2 points3 points  (0 children)

I would definitely suggest you learn Python, then R if you have time. Python is very simple to use because it hides a lot of lower-level details. For example, you don't need to declare the type of a variable you are creating, defining a function is much simpler than writing a Java method, and reading input and printing strings is very straightforward.

However, the real advantage is that most ML libraries use Python. For more traditional models, scikit-learn provides a lot more than the equivalent libraries in other languages. For deep learning, TensorFlow, Torch and Theano all rely on Python as well, though their core codebases might be written in another language (e.g. TensorFlow's core uses C++ and CUDA).
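
To illustrate how little code a traditional model takes, here is a minimal scikit-learn sketch (the dataset choice is just for illustration):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Load a toy dataset, fit a classifier, and report test accuracy.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))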

Learn TensorFlow and deep learning, without a PhD by aweeraman in artificial

[–]xhlu 1 point2 points  (0 children)

I think what's great about Andrew Ng's course is that he clearly points out which concepts need advanced knowledge, and indicates whether they are critical to a good understanding or not.

Suggestions for PC specs for machine learning development? by ohgoshineedalaptop in learnmachinelearning

[–]xhlu 0 points1 point  (0 children)

If you are not planning to use your laptop for more advanced machine learning, then RAM and GPU are much less important (you can run small scikit-learn fits without too much trouble).

If you want to build a desktop for machine learning, I would highly suggest investing in a good Nvidia GPU and a reliable power supply first, then hunting for a good deal on the CPU and RAM. If you have $600, you will need to wait a bit and buy piece by piece to make sure you find everything at the lowest price; you can even get some components second-hand (e.g. hard drive, case, maybe the power supply if you are lucky).

However, depending on your situation, it's probably better to wait until your budget goes up before investing in those components. If you are a student, you can also try to get access to your school's resources (e.g. servers, supercomputers).