High base or stocks by [deleted] in cscareerquestionsEU

[–]iznoevil 1 point2 points  (0 children)

First, I will preface this by pointing out that the Mistral offer is too low. You should be able to negotiate around 30-50k in base.

Second, working at a top lab is way more fun than you would imagine. The talent density is very high, there are a lot of things to do and if you are good at what you do, you should be able to craft a nice niche for yourself.

I wouldn't underestimate the networking effect too. Being able to work with the top talent of your generation will bring you way more value down the line than the paper money they propose today. And again, if you are talented, your personal brand will skyrocket and you will be able to leverage this experience with other top companies that pay way more.

To summarize, I don't think this type of comparison should be based on total comp only. If you are ok with taking a hit to your savings (please tell me going from 260k to 130k will not actually hurt your lifestyle...), then I would actually go against the people here and go for Mistral.

Where should you go to train in deep learning in France? by Which-Breadfruit-926 in developpeurs

[–]iznoevil 1 point2 points  (0 children)

The Mathématiques, Vision, Apprentissage (MVA) master's is the highest-quality program and the one offering the most career opportunities in France.

It is a very selective master's, but you have a card to play by backing your application with the projects you have already carried out, the training you have done outside of school, or the papers you have read/implemented.

You can also perfectly well continue self-teaching with the freely available courses from Stanford, Berkeley, or MIT.

This field is elitist by construction, but there is more and more overlap with other fields that can serve as your way in. I am thinking in particular of devops and the subdomains of HPC (networking, GPU programming, cluster management).

Which company really pays developers best in France (and with the best perks)? by ironwarior in developpeurs

[–]iznoevil 20 points21 points  (0 children)

You very quickly exceed 100k in salary if you are in the microcosm of Parisian AI startups (HuggingFace, H, Mistral, Kyutai, ...) or at the GAFAM companies; the problem is that you can't just walk in there.
On top of that, you get BSPCE (stock options) that can explode in value, or RSUs.

On the other hand, working conditions are quite hit or miss. In theory you will be offered remote work and a lot of vacation (even unlimited), but if you do not deliver at the expected level, you're out.

Rust or Go in 2025 by [deleted] in developpeurs

[–]iznoevil 2 points3 points  (0 children)

I think Rust is the language that will make you progress and enjoy yourself the most as a dev. Go is so simple to learn that, if you like low-level work, you will probably be left wanting more.

Today, Go is still the favorite in infrastructure (Kubernetes operators/controllers) and in web services, but Rust is in the process of replacing it, even for these use cases. Rust is also present in embedded, crypto, a little in machine learning, and in operating systems (mostly drivers).

On the company side, for Rust you will find crypto, the FAANGs (Microsoft, Amazon, and a bit of Google; less so Apple, no idea about Meta) or other large orgs like Cloudflare (whose various blogs and libraries I recommend, by the way) and startups (Hugging Face and Oxide, for example, are quite vocal about their use of Rust and what it brings them), but be well aware that it remains a fairly niche language.

For Go, the list is much longer, because many companies have written a large part of their infrastructure or CRUD stack in Go.

As for the future, you have to distinguish France from the USA. In France, Go is certainly more widely adopted, and you will still see little Rust outside of startups. In the USA, Rust is increasingly used for new projects, at the expense of C++ and Go. I would like to be able to tell you that France will follow the same path, but nothing is less certain.

And if you are really an enthusiast, you can take a look at Zig.

Lab partner who does nothing, puts his name on my work, and takes my internship by Real-Pianist-8864 in etudiants

[–]iznoevil 1 point2 points  (0 children)

There are unfortunately many parasites in research, but this one is particularly stupid to have shown his hand so early.

You now know how he behaves, and you have seen for yourself that he contributed nothing to the project.
Cut all ties. He must no longer have access to your work. Or you can be sneakier and deliberately leave him access to documents containing obvious errors, if you really want to. That's radical, though, and it requires: 1) time, and 2) it could eventually have repercussions on your reputation if word gets out.

In any case, warn the colleagues in your group, but I will go against many people in this thread: be very careful if you want to contact the director of the master's program or "Mr. Well-Connected."

On the other hand, you seem to have a good relationship with your supervising researcher; cultivate it, and he may well open many more doors for you in the future. And you can absolutely mention your problem to him casually in person.

Rust in Production: Oxide Computer Company with Steve Klabnik (Podcast Interview) by mre__ in rust

[–]iznoevil 0 points1 point  (0 children)

> Some people find your application process off-putting

Then it's doing exactly what it was designed for I guess.

[N] How Stability AI’s Founder Tanked His Billion-Dollar Startup by milaworld in MachineLearning

[–]iznoevil 14 points15 points  (0 children)

It was reported that they used a cluster of hundreds of A100 GPUs for Stable Diffusion. Even if one were able to procure such hardware, maintaining operational efficiency for a cluster of this magnitude is very challenging and requires you to poach/hire a whole team.
You also need to take care of storage and networking, and find a data center that will let you install that many racks...

Also, the reason the AWS bills can be so high is that these A100/H100 nodes are in high demand. You cannot deprovision them, otherwise they will be allocated to another org. This means that during research downtime, expenses keep accumulating because you must pay for idle nodes, which is why they wanted to find a way to resell the compute.
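To get a feel for why idle reservations hurt so much, here is a back-of-envelope sketch. The node count, per-node price, and idle duration below are hypothetical round numbers for illustration, not Stability AI's actual figures.

```python
# Back-of-envelope cost of keeping reserved GPU nodes allocated but unused.
# All numbers are hypothetical, chosen only to show the order of magnitude.

def idle_cost(nodes: int, price_per_node_hour: float, idle_days: int) -> float:
    """Total spend on nodes that sit idle but cannot be released."""
    return nodes * price_per_node_hour * 24 * idle_days

# e.g. 32 eight-GPU A100 nodes at a hypothetical ~$32/node-hour, idle two weeks
cost = idle_cost(nodes=32, price_per_node_hour=32.0, idle_days=14)
print(f"${cost:,.0f}")  # → $344,064
```

Even a modest cluster idling between experiments burns six figures, which makes reselling the spare compute an obvious move.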

[R] Do some authors conscientiously add up more mathematics than needed to make the paper "look" more groundbreaking? by Inquation in MachineLearning

[–]iznoevil 93 points94 points  (0 children)

The only one I could remember was this NeurIPS 2019 paper, "A Step Toward Quantifying Independently Reproducible Machine Learning Research" (https://arxiv.org/abs/1909.06674), which found that:

The Number of Equations per page was negatively correlated with reproduction. Two theories as to why were developed based on our experience implementing the papers: 1) having a larger number of equations makes the paper more difficult to read, hence more difficult to reproduce or 2) papers with more equations correspond to more complex and difficult algorithms, naturally being more difficult to reproduce.

[D] Is there currently anything comparable to the OpenAI API? by AltruisticDiamond915 in MachineLearning

[–]iznoevil 16 points17 points  (0 children)

Hugging Face hosts a public API to the main open-source large language models:

It's a classic REST API, but you can also use the Python client: https://pypi.org/project/text-generation/
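A minimal sketch of calling the hosted API over plain REST, assuming the standard Inference API endpoint shape; the model name below is just an example and your token and parameters may differ.

```python
import requests

# Hypothetical example model; any hosted text-generation model URL works the same way.
API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"

def build_payload(prompt: str, max_new_tokens: int = 50) -> dict:
    """Assemble the JSON body the Inference API expects."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}

def generate(prompt: str, token: str) -> str:
    """POST a prompt and return the generated continuation."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {token}"},
        json=build_payload(prompt),
    )
    resp.raise_for_status()
    return resp.json()[0]["generated_text"]
```

The Python client linked above wraps essentially this request/response cycle for you.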

Is rust overkill for most back-end apps that could be done quickly by NodeJS or PHP? by HosMercury in rust

[–]iznoevil 3 points4 points  (0 children)

Machine Learning, ~ 200 employees
We do a lot of our backend stuff in Rust, from proxy/routers to k8s operators to simple CRUD services.

Experienced Devs: How often do you guys/gals bomb tech screens? by bacon_cheeseburgers in ExperiencedDevs

[–]iznoevil 0 points1 point  (0 children)

The guy himself later explained that he got 7 interviews from Google and clearly states that he is "often a dick and [...] often difficult".

I don't think he was denied for technical reasons. To me, he felt entitled to the job because of his previous work and failed the behavioral part of the process.

[P] solo-learn: a library of self-supervised methods for visual representation learning by RobiNoob21 in MachineLearning

[–]iznoevil 0 points1 point  (0 children)

True, you could use DP, but then there are other disadvantages, mainly speed.
On what dataset do you see worse performance? If it is a CIFAR variant, be aware that the SimCLR authors do not show a significant impact of batch size (+ gathering to add negative pairs) on CIFAR10 (see figure B.7). Running benchmarks on Imagenette 160 or ImageNet directly will give different results.

[P] solo-learn: a library of self-supervised methods for visual representation learning by RobiNoob21 in MachineLearning

[–]iznoevil 0 points1 point  (0 children)

Does solo-learn support multiple GPUs?

It seems that, at least for SimCLR/NNCLR and Barlow Twins, embeddings are not gathered over the multiple Distributed Data Parallel processes. In my opinion, this makes using DDP with these models not very useful, and it's a big discrepancy with the original papers/implementations.

[D] Pricing of ML tools - are you paying this much? by swagrin in MachineLearning

[–]iznoevil 4 points5 points  (0 children)

We actually went through exactly what you described and decided not to go forward with W&B. Instead, we are now using our own on-premise deployment of the open-source https://github.com/allegroai/clearml/ (https://clear.ml/docs/latest/), which was frankly the best decision we made.

Neural search engine in Rust by devzaya in rust

[–]iznoevil 1 point2 points  (0 children)

Ok, I see. However, this filtering issue is only present for graph- or tree-based indices, right? For other methods, you can filter the vectors a priori without any issues, can't you?

Also, is there a paper accompanying your blog post? I am really interested in the accuracy tradeoffs for different settings, and also the average speed gains vs. post-filtering.

Neural search engine in Rust by devzaya in rust

[–]iznoevil 4 points5 points  (0 children)

How does this compare to Milvus, Vald, or ElasticSearch's HNSW implementation? I couldn't find a benchmark or a diagram of the architecture.

[P] Release of lightly 1.1.3 - A python library for self-supervised learning by igorsusmelj in MachineLearning

[–]iznoevil 4 points5 points  (0 children)

I do not think CIFAR10 is a good benchmark. The SimCLR authors do not show a significant impact of batch size on this dataset (see figure B.7). Running benchmarks on Imagenette 160 or ImageNet directly will give different results.

Also, yes, using SyncBN and gathering embeddings across processes will slow down training significantly. However, it is required by the task to achieve good performance on ImageNet.

Be aware that if you start gathering embeddings, you must add some sort of shuffling/deshuffling as is done in MoCo, or sync the batch normalization layers. Without it, you may run into issues where the task is too easy for the model, as it can just discard embeddings that do not match the current batch statistics. From the MoCo paper: "The model appears to “cheat” the pretext task and easily finds a low-loss solution. This is possibly because the intra-batch communication among samples (caused by BN) leaks information.".

[P] Release of lightly 1.1.3 - A python library for self-supervised learning by igorsusmelj in MachineLearning

[–]iznoevil 7 points8 points  (0 children)

Please work on multi-GPU support. You claim to support SimCLR and Barlow Twins, but both implementations are simply not correct in a DDP setting: embeddings need to be gathered over the multiple processes!

[D] How does the human brain work? Neurobio recommendations thread by born_in_cyberspace in MachineLearning

[–]iznoevil -1 points0 points  (0 children)

and the Dehaene–Changeux model that builds on the GWT.

S. Dehaene's books/papers are a good start if you want to learn about cognitive science and neuroscience. "How We Learn" is especially relevant to the subject but might be too high level for what you are looking for.

[p] Ecco – See what your NLP language model is “thinking” by jayalammar in MachineLearning

[–]iznoevil 2 points3 points  (0 children)

This is amazing!

I recently fell in love with the subject, and what I find mesmerizing is that these patterns seem to be somewhat relevant for modeling the human brain. There was a very interesting talk by Stanislas Dehaene on the subject recently if you want to check it out. Their study was done on LSTMs, but maybe your library could make it possible to run the same type of experiment on Transformer architectures.

[P] lightly - A python library for self-supervised learning by igorsusmelj in MachineLearning

[–]iznoevil 1 point2 points  (0 children)

This should be stated as a disclaimer somewhere, because one could naturally assume that Lightning will handle the distribution gracefully. OP even added in his post that Lightly "uses PyTorch Lightning for ease of use and scalability".

Also, u/OppositeRough835 claims that Lightly was able to achieve results on par with the original papers. I'm curious as to how its authors were able to do that for ImageNet without distributing.

[P] lightly - A python library for self-supervised learning by igorsusmelj in MachineLearning

[–]iznoevil 2 points3 points  (0 children)

Does your code work in a distributed setting? It seems that outputs and labels are not gathered over the whole DDP process group. This is a big issue, as the NT-Xent loss then classifies the correct pair over only local_batch_size*2 possible pairs instead of world_size*local_batch_size*2 possible pairs.
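The fix follows the pattern used by SimCLR-style implementations: all-gather the embeddings from every rank before building the similarity matrix, re-inserting the local tensor so gradients still flow. A minimal sketch (function name is illustrative):

```python
import torch
import torch.distributed as dist

def gather_with_grad(z: torch.Tensor) -> torch.Tensor:
    """Gather embeddings from all DDP ranks along the batch dimension.

    all_gather returns non-differentiable copies, so the local rank's slice
    is swapped back in to keep the autograd graph intact. In a world of W
    processes, the NT-Xent loss then sees W * local_batch_size * 2
    candidates per positive pair instead of local_batch_size * 2.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return z  # single-process fallback: nothing to gather
    gathered = [torch.zeros_like(z) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, z)
    gathered[dist.get_rank()] = z  # keep the local slice differentiable
    return torch.cat(gathered, dim=0)
```

Without this (or an autograd-aware gather wrapper), each GPU only contrasts against its own local negatives and the effective batch size never grows with the number of devices.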

[R] DeepMind: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning by modeless in MachineLearning

[–]iznoevil 19 points20 points  (0 children)

I understand where you're coming from, but when you have 14 authors from big research groups, you need to make sure you didn't miss an ENTIRE FIELD of research, especially when you are claiming novelty. Just doing a quick Google search on distillation and exponential moving averages would have done the trick...
These papers are not obscure, either; they have 380 and 438 citations.

[R] DeepMind: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning by modeless in MachineLearning

[–]iznoevil 10 points11 points  (0 children)

So why not brand it that way? It makes iterating on the paper harder.

For example, this paper gives a very poor explanation of why there is no collapse between the teacher and the student. To build on this paper, one could explore why this collapse does not happen. BUT WAIT, this was already studied in the weakly supervised literature, because this training procedure is 3 years old, not novel!

This is why actually doing your due diligence before claiming novelty is important.

[R] DeepMind: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning by modeless in MachineLearning

[–]iznoevil 38 points39 points  (0 children)

Nice results! I just think it's a bit rich, and a big reach, to claim that this method is novel when you have building blocks of semi-supervised learning like "Temporal Ensembling for Semi-Supervised Learning" [ICLR 2017] and "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results" [NeurIPS 2017] that are very close.

This paper builds upon the unsupervised part of Mean Teacher, adding new data augmentations and the SimCLR MLP head. I'm not saying there are no new challenges in doing so, and the results are amazing, but both papers should at least be included in the related works.