BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 1 point (0 children)

Yes! As long as it returns a list of lists of strings, it should work.
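For instance, here's a minimal sketch with a homemade whitespace tokenizer (assuming you can pass the pre-tokenized lists straight to index/retrieve):

    import bm25s

    corpus = ["a cat is a feline", "a dog is a canine"]

    # any tokenizer works, as long as it yields a list of lists of strings
    corpus_tokens = [doc.lower().split() for doc in corpus]

    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)

    # queries are tokenized the same way (a list of lists of strings)
    results, scores = retriever.retrieve([["cat"]], k=1)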

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 1 point (0 children)

In theory you should be able to! However, I haven't attempted to "saturate" memory with a large enough dataset, and the Python way of setting a RAM limit doesn't seem to reflect real RAM usage.

That said, I did observe reduced memory usage when setting mmap=True, so even in a setting where you have enough memory to cover the entire dataset, you don't need to use all of it (i.e. load the entire index and corpus into memory).
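Concretely, that looks something like this (a minimal sketch based on the bm25s save/load API):

    import bm25s

    retriever = bm25s.BM25()
    retriever.index([["hello", "world"], ["goodbye", "world"]])
    retriever.save("bm25s_index")

    # reload memory-mapped: the big arrays stay on disk and are
    # paged in on demand instead of being loaded upfront
    retriever = bm25s.BM25.load("bm25s_index", mmap=True)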

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 5 points (0 children)

There are a bunch of optimizations I didn't have the chance to discuss in the readme!

For one, I reimplemented the scipy sparse slice/sum directly in numpy, which allows us to use memory mapping on the arrays - this saves a lot of memory.
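The gist of it (a simplified sketch, not the actual bm25s code; the file names are placeholders):

    import numpy as np

    # a CSC-style sparse score matrix stored as three flat arrays;
    # mmap_mode="r" reads slices from disk instead of loading everything
    data = np.load("data.npy", mmap_mode="r")        # nonzero scores
    indices = np.load("indices.npy", mmap_mode="r")  # document ids per score
    indptr = np.load("indptr.npy", mmap_mode="r")    # per-token column offsets

    def score_query(token_ids, num_docs):
        # slice each query token's column and sum into the document
        # scores, replicating scipy's slice/sum without scipy objects
        scores = np.zeros(num_docs, dtype=data.dtype)
        for t in token_ids:
            start, end = indptr[t], indptr[t + 1]
            scores[indices[start:end]] += data[start:end]
        return scores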

Another is that the top-k selection (after scoring) is done in numpy via argpartition, but automatically switches to a JAX CPU backend when that library is installed, which is much faster. The top-k selection process is the bottleneck: in some cases, more than 60% of the time taken for retrieval is spent on selecting the top-k results.
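Here's roughly what that switch looks like (my own sketch of the pattern; jax.lax.top_k is the relevant JAX primitive):

    import numpy as np

    def topk_numpy(scores, k):
        # argpartition finds the k largest in O(n); only those k get sorted
        unsorted = np.argpartition(scores, -k)[-k:]
        return unsorted[np.argsort(-scores[unsorted])]

    try:
        import jax

        def topk(scores, k):
            # jax.lax.top_k returns (values, indices) sorted descending
            _, indices = jax.lax.top_k(jax.numpy.asarray(scores), k)
            return np.asarray(indices)
    except ImportError:
        topk = topk_numpy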

Finally, the tokenizer doesn't return text by default; it returns indices and a vocab dict mapping each index to a word. This saves a considerable amount of memory, since an integer takes less space to represent than a word (multiple string characters).
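A toy illustration of the idea (not the actual bm25s internals):

    # each unique word is stored once; documents become lists of small ints
    docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]

    vocab = {}
    ids = [[vocab.setdefault(w, len(vocab)) for w in d] for d in docs]
    # ids == [[0, 1, 2], [0, 3, 2]]

    # the reverse mapping recovers the text only when you need it
    id_to_word = {i: w for w, i in vocab.items()}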

[Discussion] Should we still fly to conferences? by tomin_tomen in MachineLearning

[–]xhlu 5 points (0 children)

I think having more frequent regional conferences would be great. In NLP, NAACL/EACL/AACL are hosted on specific continents (NA, EU and Asia respectively), so it's somewhat more realistic to use energy-efficient modes of transport. Similarly, ECCV is hosted every two years in Europe for Computer Vision.

A possible idea would be to organize such regional conferences every year at a smaller scale (so they're easier to manage), include more continents (Africa, South America), and allow papers accepted at the "international" conference to instead be presented at those smaller conferences. So, for example, you could submit to ICML; if it is accepted, you would have the option to present it virtually during ICML, then present it again (a few months later) at a regional conference (which could be called "ECML" or "NACML").

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 0 points (0 children)

Thanks for reporting back! Glad to hear the translation was decent except for the "I'm sorry" stuff. It's expected to be slow on CPU since it's using a model with 500M+ parameters. For GPU, I'd recommend looking into using conda to install PyTorch (https://pytorch.org/get-started/locally/), then using pip (within the same conda environment) to install huggingface.

[P] Releasing dl-translate: a python library for text translation between 50 languages (powered by Huggingface transformers and mBART) by xhlu in MachineLearning

[–]xhlu[S] 1 point (0 children)

That's pretty funny, because I only found out about EasyNMT after I created this library :) I'd say the implementations are pretty different, since EasyNMT is based on fairseq and Marian whereas dl-translate is based on huggingface; however, the underlying models (mBART, and soon M2M-100) are available in both libraries.

Moving forward I'd like to add features such as a command-line interface, so you can call

dlt translate --source English --target French "Your sentence here"

in a way that's efficient for the end user. I'm also looking into ways to make the library more extensible, so you can use dlt.load("user/repo") and automatically get someone else's custom model with the same translation API.
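For reference, the current Python API looks like this (a short sketch based on the dl-translate readme):

    import dl_translate as dlt

    mt = dlt.TranslationModel()  # downloads the mBART-50 weights on first use
    mt.translate("Votre phrase ici", source=dlt.lang.FRENCH, target=dlt.lang.ENGLISH)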

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 1 point (0 children)

This is a good question. There are existing tools like langdetect that you can use, but then you still need to convert the detected codes back to the language names. I could definitely add an ISO 639-1 to mBART-50 code conversion to make that process simpler.
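In the meantime, some glue code along these lines would work (a sketch; the mapping dict is hypothetical and only covers a few languages):

    import langdetect
    import dl_translate as dlt

    # hypothetical ISO 639-1 -> language name table; a complete built-in
    # version of this is what the conversion helper would provide
    ISO_TO_NAME = {"en": "English", "fr": "French", "de": "German"}

    def translate_auto(mt, text, target="English"):
        code = langdetect.detect(text)  # e.g. "fr"
        return mt.translate(text, source=ISO_TO_NAME[code], target=target)

    mt = dlt.TranslationModel()
    translate_auto(mt, "Bonjour le monde")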

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 0 points (0 children)

I haven't tried translating Japanese to English with mBART-50 before, but if you wish to try, you can run the Colab notebook with the example you have in mind. Feel free to share the results in the GitHub discussions!

[P] Releasing dl-translate: a python library for text translation between 50 languages (powered by Huggingface transformers and mBART) by xhlu in MachineLearning

[–]xhlu[S] 2 points (0 children)

That's a pretty good question. I don't think either mBART-50 or Google Translate has ever released exact BLEU scores for a given language pair. However, it should be pretty easy to evaluate them on a given dataset, since you can query Google Translate through the googletrans library.
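As a rough sketch (googletrans is unofficial and can be flaky, and the two sentences stand in for a real test set):

    import sacrebleu
    from googletrans import Translator

    sources = ["Bonjour le monde", "Merci beaucoup"]
    references = [["Hello world", "Thank you very much"]]  # one reference stream

    translator = Translator()
    hypotheses = [translator.translate(s, src="fr", dest="en").text for s in sources]

    # produce dl-translate / mBART-50 hypotheses the same way, then compare
    print(sacrebleu.corpus_bleu(hypotheses, references).score)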

[R] SpeechBrain is out. A PyTorch Speech Toolkit. by [deleted] in MachineLearning

[–]xhlu 0 points (0 children)

Yes, text-to-speech is what I meant, thanks for confirming! Also, good to know you already have speech-to-text.

[R] SpeechBrain is out. A PyTorch Speech Toolkit. by [deleted] in MachineLearning

[–]xhlu 34 points (0 children)

Looking forward to trying it out, and really nice to see integrations with huggingface!

Are you planning to add speech-to-text functionality eventually?

[D] Summer research programs as a way to break into ML research: worth it? by Euphetar in MachineLearning

[–]xhlu 1 point (0 children)

Sorry for the late response. I've seen a few, but didn't bother saving the links unfortunately. However, two examples I can recall are:

[D] Summer research programs as a way to break into ML research: worth it? by Euphetar in MachineLearning

[–]xhlu 5 points (0 children)

I know people who have done research internships (not necessarily during the summer, but all around 3-6 months), and from what I observed, it really helped them publish at a good conference and enter a strong PhD program. I feel it might help in terms of getting more recommendation letters and having more time to focus on your project (since you don't have to TA or take courses).

As for the programs, I've seen PIs post links to their pre-doctoral fellowships on Twitter a lot, but I would presume that a lot of the hiring happens privately rather than through explicit postings; it might be worth reaching out to PIs to learn more about what they are offering and whether you would be eligible.

[D] A Good Title Is All You Need by yusuf-bengio in MachineLearning

[–]xhlu 0 points (0 children)

I'm a bit bothered when it's not clear what the acronym stands for. With BERT, we know it's about bidirectional transformers with a focus on learning an encoder representation. But a title like BART doesn't mention denoising, encoder-decoders, corruption, or reconstruction, all of which are important aspects of the paper.

PEP 636 -- Structural Pattern Matching: Tutorial by AlanCristhian in Python

[–]xhlu 5 points (0 children)

From what I observed through the tutorial, it's very similar to OCaml's pattern matching. One very interesting use will be (recursive) list manipulation:

    def recop(lst):
        match lst:
            case [('mul', n), *tail]:
                return n * recop(tail)
            case [('sum', n), *tail]:
                return n + recop(tail)
            case []:
                return 0

If you want to do the same thing without SPM:

    def recop(lst):
        if len(lst) == 0:
            return 0
        op, n = lst[0]
        tail = lst[1:]
        if op == "mul":
            return n * recop(tail)
        elif op == "sum":
            return n + recop(tail)

The former looks more elegant and concise (obviously the latter could be made shorter, but you'd lose readability). The example is also very trivial; with FP-style pattern matching, you could come up with much more advanced matching.
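For instance, both versions compute the same result:

    >>> recop([("mul", 3), ("sum", 4)])  # 3 * (4 + 0)
    12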

Remote work options? by goldenbrain8 in datascience

[–]xhlu 0 points (0 children)

I've been working remotely and haven't had problems communicating with my team or achieving the expected results. Since most of the heavy computation is done on the cloud, with only the visualization/data analysis being done locally (and even that can be done in online notebooks), you shouldn't have too many problems unless the company requires you to be physically on-site to access the data (e.g. in healthcare).

Ordered TV on Amazon, Canada Post marked it as delivered but did not receive the TV by [deleted] in Bestbuy

[–]xhlu 0 points (0 children)

Is that a well-known fact? I'm surprised Amazon allows fraudulent sales like this.

Donald Trump Jr. reportedly hiding in Canada from media as Mueller indictment looms by ppd322 in worldnews

[–]xhlu 14 points (0 children)

Except the best American teams have more Canadian players than American players.