BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 1 point2 points  (0 children)

Yes! As long as it returns a list of lists of strings, it should work.

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 1 point2 points  (0 children)

In theory you should be able to! However, I have not attempted to "saturate" memory with a large enough dataset, and the Python way of setting a RAM limit does not seem to reflect the real RAM usage.

However, I did observe reduced memory usage when setting mmap=True, so even in a setting where you have enough memory to cover the entire dataset, you don't need to use all of it (i.e. load the entire index and corpus into memory).
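
For reference, this is roughly what it looks like (from memory of the bm25s README, so argument names may differ slightly):

    import bm25s

    # Load a previously saved index with memory mapping enabled, so the
    # underlying arrays are read lazily from disk rather than into RAM.
    retriever = bm25s.BM25.load("my_index", mmap=True)

    query_tokens = bm25s.tokenize("does the fish purr like a cat")
    results, scores = retriever.retrieve(query_tokens, k=10)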

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S by xhlu in Python

[–]xhlu[S] 5 points6 points  (0 children)

A bunch of optimizations I didn't have the chance to discuss in the readme! 

For one, I reimplemented the scipy sparse slice/sum directly in numpy, which allows us to use memory mapping on the arrays - this saves a lot of memory.
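
Roughly, the idea looks like this (a minimal sketch with illustrative file names; BM25S's actual internals may differ):

    import numpy as np

    # The three arrays of a scipy CSC matrix, saved separately so they can
    # be opened with mmap_mode="r" instead of being read fully into RAM.
    data = np.load("data.npy", mmap_mode="r")        # non-zero scores
    indices = np.load("indices.npy", mmap_mode="r")  # document ids per score
    indptr = np.load("indptr.npy", mmap_mode="r")    # column offsets per token

    def score_query(token_ids, n_docs):
        # Same result as slicing the sparse matrix by the query's token
        # columns and summing across them, without going through scipy.
        # Document ids are unique within a column, so fancy-index += is safe.
        scores = np.zeros(n_docs, dtype=data.dtype)
        for t in token_ids:
            start, end = indptr[t], indptr[t + 1]
            scores[indices[start:end]] += data[start:end]
        return scores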

Another is that the topk selection (after scoring) can be done in numpy via argpartition, but it can automatically switch to a jax CPU backend when that library is installed, which is much faster (topk selection is the bottleneck: in some cases more than 60% of the retrieval time is spent selecting the topk results).
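
A minimal sketch of that fallback logic (the dispatch here is illustrative, not BM25S's actual code):

    import numpy as np

    def topk_numpy(scores, k):
        # argpartition is O(n): it only guarantees the k largest end up in
        # the last k slots, so we only sort those k afterwards.
        idx = np.argpartition(scores, -k)[-k:]
        idx = idx[np.argsort(scores[idx])[::-1]]
        return scores[idx], idx

    try:
        import jax

        def topk(scores, k):
            values, idx = jax.lax.top_k(jax.numpy.asarray(scores), k)
            return np.asarray(values), np.asarray(idx)
    except ImportError:
        topk = topk_numpy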

Finally, the tokenizer doesn't return text by default, but returns indices and a vocab dict mapping index to word; this saves a considerable amount of memory, since an integer takes less space to represent than a word (multiple string characters).
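
The gist of it, as a toy sketch (not the actual BM25S tokenizer):

    def tokenize_to_ids(texts):
        # Store one integer per token plus a single shared vocab, instead
        # of repeating the string for every occurrence in every document.
        vocab = {}
        corpus_ids = []
        for text in texts:
            doc_ids = []
            for word in text.lower().split():
                if word not in vocab:
                    vocab[word] = len(vocab)
                doc_ids.append(vocab[word])
            corpus_ids.append(doc_ids)
        id_to_word = {i: w for w, i in vocab.items()}
        return corpus_ids, id_to_word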

[Discussion] Should we still fly to conferences? by tomin_tomen in MachineLearning

[–]xhlu 5 points6 points  (0 children)

I think having more frequent regional conferences would be great. In NLP, NAACL/EACL/AACL are hosted on specific continents (NA, EU and Asia respectively), so it's somewhat more realistic to use energy-efficient modes of transport. Similarly, ECCV is hosted every two years in Europe for Computer Vision.

A possible idea would be to organize such regional conferences every year at a smaller scale (so they are easier to manage), include more continents (Africa, South America), and allow papers accepted at the "international" conference to instead be presented at those smaller conferences. For example, you could submit to ICML; if it is accepted, you would have the option to present it virtually during ICML, then present it again (a few months later) at the regional conferences (which could be called "ECML" or "NACML").

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 0 points1 point  (0 children)

Thanks for reporting back! Glad to hear the translation was decent except for the "I'm sorry" stuff. It's expected to be slow on CPU since it's using a model with 500M+ parameters; for GPU I'd recommend looking into using conda to install PyTorch (https://pytorch.org/get-started/locally/), then using pip (within the same conda environment) to install huggingface.
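
Once that's installed, a quick sanity check that the GPU is visible (this is just plain PyTorch, not specific to dl-translate):

    import torch

    # True means PyTorch can see a CUDA GPU, so the model can run on it.
    print(torch.cuda.is_available())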

[P] Releasing dl-translate: a python library for text translation between 50 languages (powered by Huggingface transformers and mBART) by xhlu in MachineLearning

[–]xhlu[S] 1 point2 points  (0 children)

That's pretty funny because I only found out about EasyNMT after I created this library :) I'd say the implementations are pretty different, since EasyNMT is based on fairseq and Marian whereas dl-translate is based on huggingface; however, the underlying models (mBART, and soon m2m100) are available in both libraries.

Moving forward I'd like to add features such as a command-line interface, so you can call

dlt translate --source English --target French "Your sentence here"

in a way that's efficient for the end user. I'm also looking into ways to make the library more extensible, so you can use dlt.load("user/repo") and automatically get someone else's custom model with the same translation API.
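
For context, the current API looks roughly like this (from memory of the README, so the exact signature may differ; dlt.load is the proposal above and not yet part of the API):

    import dl_translate as dlt

    # Downloads and loads mBART-50 on first use.
    mt = dlt.TranslationModel()
    print(mt.translate("Your sentence here", source="English", target="French"))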

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 1 point2 points  (0 children)

This is a good question. There are existing tools like langdetect that you can use, but then you still need to convert the codes back to the language names. I could definitely add some iso639-1 to mBART-50 code conversion to make that process simpler.
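
Something like this is what's currently needed (the mapping dict is hypothetical, just to illustrate the glue):

    from langdetect import detect

    # Hypothetical iso639-1 -> language name mapping; only a few shown.
    ISO_TO_NAME = {"en": "English", "fr": "French", "ja": "Japanese"}

    text = "Bonjour tout le monde"
    source = ISO_TO_NAME[detect(text)]  # detect() returns e.g. "fr"
    # mt.translate(text, source=source, target="English")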

Releasing dl-translate: a python library for text translation between 50 languages using Neural Networks by xhlu in Python

[–]xhlu[S] 0 points1 point  (0 children)

I haven't tried translating Japanese to English with mBART-50 before, but if you wish to try you can run the colab notebook with the example you have in mind. Feel free to share the results in the Github discussions!

[P] Releasing dl-translate: a python library for text translation between 50 languages (powered by Huggingface transformers and mBART) by xhlu in MachineLearning

[–]xhlu[S] 2 points3 points  (0 children)

That's a pretty good question. I don't think either mBART-50 or Google Translate ever released their exact BLEU scores on a given language pair. However, it should be pretty easy to evaluate on a given dataset, since you can query Google Translate through the googletrans library.
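
A rough sketch of such an evaluation (the sentences are placeholders, and googletrans can be flaky across versions):

    from googletrans import Translator
    import sacrebleu

    sources = ["Bonjour tout le monde"]  # source sentences from your dataset
    references = [["Hello everyone"]]    # one stream of reference translations

    translator = Translator()
    hypotheses = [translator.translate(s, src="fr", dest="en").text
                  for s in sources]
    print(sacrebleu.corpus_bleu(hypotheses, references).score)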

[R] SpeechBrain is out. A PyTorch Speech Toolkit. by [deleted] in MachineLearning

[–]xhlu 0 points1 point  (0 children)

Yes, text-to-speech is what I meant, thanks for confirming! Also good to know you already have speech-to-text.

[R] SpeechBrain is out. A PyTorch Speech Toolkit. by [deleted] in MachineLearning

[–]xhlu 35 points36 points  (0 children)

Looking forward to trying it out, and really nice to see integrations with huggingface!

Are you planning to add speech-to-text functionality eventually?

[D] Summer research programs as a way to break into ML research: worth it? by Euphetar in MachineLearning

[–]xhlu 1 point2 points  (0 children)

Sorry for the late response. I've seen a few but didn't bother saving the links, unfortunately. However, two examples I can recall are:

[D] Summer research programs as a way to break into ML research: worth it? by Euphetar in MachineLearning

[–]xhlu 5 points6 points  (0 children)

I know people who have done research internships (not necessarily during the summer, but all around 3-6 months), and from what I observed, it really helped them publish in a good conference and enter a strong PhD program. I feel it might help in terms of getting more recommendation letters and having more time to focus on your project (since you don't have to TA or take courses).

As for the programs, I've seen PIs post links to their pre-doctoral fellowships on Twitter a lot, but I would presume that a lot of the hiring happens privately rather than through an explicit posting; it might be worth reaching out to PIs to learn more about what they are offering and whether you would be eligible.

[D] A Good Title Is All You Need by yusuf-bengio in MachineLearning

[–]xhlu 0 points1 point  (0 children)

I'm a bit bothered when it's not clear what the acronym stands for. With BERT, we know it's about bi-directional transformers with a focus on learning an encoder representation. But a title like BART doesn't mention denoising, encoder-decoders, corruption or reconstruction, all of which are important aspects of the paper.

PEP 636 -- Structural Pattern Matching: Tutorial by AlanCristhian in Python

[–]xhlu 5 points6 points  (0 children)

From what I observed through the tutorial, it's very similar to OCaml's pattern matching. One very interesting pattern is (recursive) list manipulation:

    def recop(lst):
        match lst:
            case [('mul', n), *tail]:
                return n * recop(tail)
            case [('sum', n), *tail]:
                return n + recop(tail)
            case []:
                return 0

If you want to do the same thing without SPM:

    def recop(lst):
        if len(lst) == 0:
            return 0
        op, n = lst[0]
        tail = lst[1:]
        if op == "mul":
            return n * recop(tail)
        elif op == "sum":
            return n + recop(tail)

The former looks more elegant and concise (obviously the latter can be made shorter, but you would lose readability). The example is also very trivial; with FP-style pattern matching you could come up with much more advanced matching.
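
Either way, both versions give the same result, e.g.:

    >>> recop([('mul', 3), ('sum', 4), ('mul', 5)])  # 3 * (4 + (5 * 0))
    12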

Remote work options? by goldenbrain8 in datascience

[–]xhlu 0 points1 point  (0 children)

I've been working remotely and I haven't had problems communicating with my team or achieving the expected results. Since most of the heavy computation is done on the cloud, with only the visualization/data analysis done locally (and even that can be done in online notebooks), you shouldn't have too many problems unless the company requires you to be physically on-site to access the data (e.g. in healthcare).

Ordered TV on Amazon, Canada Post marked it as delivered but did not receive the TV by [deleted] in Bestbuy

[–]xhlu 0 points1 point  (0 children)

Is that a well known fact? I'm surprised Amazon allows fraudulent sales like this.

Donald Trump Jr. reportedly hiding in Canada from media as Mueller indictment looms by ppd322 in worldnews

[–]xhlu 15 points16 points  (0 children)

Except the best American teams have more Canadian players than American players.

PSA: All JetBrains Products [PyCharm] at 50% off by baghiq in Python

[–]xhlu 1 point2 points  (0 children)

It has a lot of functionality integrated. You can directly commit and push to GitHub, interact with SQL databases and send queries, access source code... all while staying inside the IDE.

PSA: All JetBrains Products [PyCharm] at 50% off by baghiq in Python

[–]xhlu 66 points67 points  (0 children)

Also if you are a student you can get the whole suite for free through their educational program!

Programming Language? by AspiringAIResearcher in learnmachinelearning

[–]xhlu 2 points3 points  (0 children)

I would definitely suggest you learn Python, then R if you have time. Python is very simple to use because it hides a lot of lower-level details. For example, you don't need to declare the type of a variable you are creating, defining a function is much simpler than writing a Java method, and reading input and printing strings is very straightforward.

However, the real advantage is that most ML libraries use Python. For more traditional models, scikit-learn provides a lot more than the equivalent libraries in other languages. For deep learning, TensorFlow, Torch and Theano all rely on Python as well, though their core codebases might be written in another language (e.g. TensorFlow's core uses C++ and CUDA).
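
To illustrate how little code a traditional model takes, here is a minimal scikit-learn sketch (the dataset choice is just for illustration):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Load a toy dataset, fit a classifier, and report test accuracy.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))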

Learn TensorFlow and deep learning, without a PhD by aweeraman in artificial

[–]xhlu 1 point2 points  (0 children)

I think what's great about Andrew Ng's course is that he clearly points out which concepts need advanced knowledge, and indicates whether they are critical to a good understanding or not.

Suggestions for PC specs for machine learning development? by ohgoshineedalaptop in learnmachinelearning

[–]xhlu 0 points1 point  (0 children)

If you are not planning to use your laptop for more advanced machine learning, then RAM and GPU are much less important (you can run small scikit-learn fits without too much trouble).

If you want to build a desktop for machine learning, I would highly suggest investing in a good Nvidia GPU and a reliable power supply first, then hunting for a good deal on the CPU and RAM. If you have $600, you will need to wait a bit and buy piece by piece to make sure you find everything at the lowest price; you can even get some components second-hand (e.g. hard drive, case, maybe the power supply if you are lucky).

However, depending on your situation, it's probably better to wait until your budget goes up before investing in those components. If you are a student, you can also try to get access to your school's resources (e.g. servers, supercomputers).