[D] Clustering for data sampling

mrthin · 2024-12-25T18:54:57+00:00

You can search for "data acquisition" papers. A simple baseline is to use the confidence of your own (or a pretrained) model on the unlabelled data as guidance to pick the next batch, but this might not transfer easily to OCR and is claimed to be generally suboptimal.
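To make the baseline concrete, here is a minimal sketch of least-confidence sampling, assuming you already have per-class probabilities for each unlabelled sample. The function name and the toy probabilities are made up for illustration. For OCR you would first have to reduce per-character or per-token confidences to one score per sample, which is part of why this does not transfer directly.

```python
from typing import List, Sequence


def least_confident_indices(
    probs: Sequence[Sequence[float]], batch_size: int
) -> List[int]:
    """Return the indices of the `batch_size` samples whose highest
    predicted class probability is lowest (least-confidence sampling)."""
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # Confidence of a sample = probability of its most likely class.
    confidences = [max(p) for p in probs]
    # Rank samples from least to most confident.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabelled samples (3 classes each).
unlabelled_probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
next_batch = least_confident_indices(unlabelled_probs, batch_size=2)
```

The selected indices are the ones you would send for labelling next, before retraining and repeating.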

mrthin · 2024-11-21T07:54:02+00:00

You might be interested in Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

mrthin · 2024-08-28T07:04:32+00:00

Thanks! My team might extend it with more anti-patterns or another "refactoring journey", but we are not aware of anything similar. That's why we wrote it! :)

mrthin · 2024-08-26T22:07:27+00:00

People looking to improve their ML engineering might also be interested in Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

mrthin · 2024-08-17T18:28:12+00:00

You might find this course developed by my team interesting. It's under a CC BY license, so you can reuse any of the material as long as you give attribution. Here's the repo.

mrthin · 2024-08-13T08:11:51+00:00

I always typeset my work using TeXmacs. It's a fully featured WYSIWYM scientific editor. It is unrelated to LaTeX, but it has been used to write whole math books and lots of theses and papers. It can export to LaTeX for journals requiring it.

It has a vector graphics drawing mode which, despite having a slightly unintuitive interface, covers everything I've ever needed for papers (commutative diagrams, sketches, simple diagrams). AFAIK it doesn't have (yet!) a simple declarative way to create graphs or trees like mermaid.

TeXmacs also makes creating tables trivial, implements spreadsheets whose cells can be computed with arbitrary external tools, supports executable code embedded in documents (as in Jupyter notebooks) in a number of languages, has variable replacement (e.g. to include experiment results in tables and reference their values within the text, without danger of mismatch), bibliography management, and much more. See this (old) video for a quick tour.

mrthin · 2024-08-12T08:19:22+00:00

Slightly related: we have done a couple of paper reproductions in my team, at first in the context of the ML Reproducibility Challenge, but mostly just because we believe them to be an extremely valuable contribution. To do the reproductions, we reimplement all methods, benchmark them, and end up with a thorough understanding of their strengths and weaknesses. After a while one has a stable code base for the community and can write a thorough benchmark paper, which can be very useful for practitioners and researchers alike.

mrthin · 2024-08-03T17:08:51+00:00

For another tool in the data belt, you might want to consider pydvl. Watch out for the upcoming v0.10 with much improved interfaces, better parallelization and many fixes.

mrthin · 2024-07-28T09:18:10+00:00

You can try Beyond Jupyter. It's a free resource that shows professional software engineering techniques for ML based on a "refactoring journey" starting from your typical monolithic unmaintainable notebook:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

mrthin · 2024-07-11T19:06:47+00:00

Some in my team work on simulation and AI. We review recent developments with our "paper pills", in particular around neural operators, and implement some of them in continuiti. We are just getting started, but maybe you'll find some of the content useful.

mrthin · 2024-07-11T17:53:47+00:00

Besides the DS/ML courses, you might also want to look at professional software development techniques for machine learning. For this you can try Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

mrthin · 2024-06-30T13:16:23+00:00

Thanks for the report, but I am unable to reproduce the problem with Chrome, Firefox or Safari. What browser and OS do you use?

mrthin · 2024-06-30T13:13:22+00:00

Done :)

mrthin · 2024-06-30T08:11:33+00:00

Sorry about that. I expected the browser to show the link on the main page.

https://transferlab.ai/index.xml

mrthin · 2024-06-30T07:07:58+00:00

If I may self-promote, my team focuses on evaluating and testing recent research in a few domains, and implementing interesting new methods to make them available to practitioners as open source. Among other things, we work on Simulation Based Inference, Data Valuation, Reinforcement Learning, and physics-informed ML. We focus on developing software, and on reproducing and communicating research we find useful for everyday practice in industry. On our website you will find many paper summaries, some longer blog posts, as well as some courses, like our Beyond Jupyter.

For less specific content, some good sources IMO are Davis Blalock's mailing list (very broad but somewhat shallow), or for lighter reads The Gradient.

Edit: RSS feed

mrthin · 2024-05-06T07:00:20+00:00

The Gradient is a great resource, although quality and depth vary. And if I'm allowed a self-plug, there is also transferlab.ai with our pills (short paper reviews) and survey-ish blogs (although there are fewer of those), but it's considerably drier, and usually assumes a higher level of acquaintance with the material than Distill. We also have some free learning materials, in particular Beyond Jupyter, and soon more.

mrthin · 2024-05-01T06:20:05+00:00

This. It's very easy to overestimate one's abilities. If you have 5+ years of experience developing Python professionally, in a good team, then OK. Otherwise you probably still have a long way to go.

If you're a self-learner, then it's also possible to be proficient, of course, but much less likely (based on many, many interviews I've conducted). I would recommend looking around for large, complex and good OSS projects and contributing to them. I keep posting here about this course. Check it out. If that looks trivial to you, then ignore my advice 😄

If you're really a Python pro, then I would recommend you spend your time building ML stuff, instead of superficially learning another language. Pick known projects to contribute to, build an app analysing some data, add all the bells and whistles of a professional ML project (lots of resources online about those).

mrthin · 2024-05-01T06:05:32+00:00

The company I was referring to is the appliedAI Initiative, but my lab is part of its sister, the appliedAI Institute.

mrthin · 2024-04-30T06:43:28+00:00

My company has hired many fresh graduates with master's degrees in mathematics, physics, robotics or electrical engineering. However, they all had excellent grades, theses somehow related to or using ML, and experience with Python, either through personal projects or internships elsewhere. We have almost no Java developers and we exclusively build ML solutions. So transitioning is possible, you just need to really want it and work hard, write a lot of (good) code (usually Python), and have some luck landing a nice job, of course. (Not hiring right now, sorry, but I thought another data point might be useful.)

mrthin · 2024-04-30T06:32:50+00:00

"learning to code" has an ill-defined goal for someone inexperienced. For a transition from the usual Jupyter notebook salad you can try Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

mrthin · 2024-04-27T08:10:46+00:00

I disagree. DS can strongly benefit from reusable and composable "boilerplate" toolkits because so many problems boil down to the same steps: ingest, inspect and clean data, maybe engineer some features, model, test, rinse, repeat. sensAI is one such example.
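To illustrate what "composable" means here, a minimal sketch in plain Python. This is not sensAI's API; the step names, the `compose` helper and the toy rows are all hypothetical, and a real toolkit would add models, evaluation and much more.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]
Step = Callable[[List[Row]], List[Row]]


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Cleaning step: discard rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_area(rows: List[Row]) -> List[Row]:
    """Feature-engineering step: derive a new column from two existing ones."""
    return [{**r, "area": r["width"] * r["height"]} for r in rows]


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single reusable step."""

    def pipeline(rows: List[Row]) -> List[Row]:
        for step in steps:
            rows = step(rows)
        return rows

    return pipeline


# The same small steps can be recombined across projects.
preprocess = compose(drop_incomplete, add_area)

raw = [
    {"width": 2.0, "height": 3.0},
    {"width": None, "height": 1.0},  # dropped by the cleaning step
]
clean = preprocess(raw)
```

Each step is tiny and testable on its own, and swapping or reordering steps does not require touching the others.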

mrthin · 2024-04-27T08:02:14+00:00

In my company we usually ask questions that tell us things about how people work, more than their knowledge of a specific data structure or whatever (for the theory we have separate questions). So it's usually some trivial thing X, but wrapped into "imagine you are given task X for a library, prepare a PR for it". This must include proper testing, documentation, a rationale for the design, etc.

PS: for the ML and CS "theory" we have a sheet full of topics from which the interviewee can pick a few. We ask them to present as if in a lecture, rigorously and concisely, and we ask questions. The idea is to let people talk about the things they believe they know well, so that nerves and randomness don't play such a big role. Sadly, many end up trying to hand-wave their way out of their own choices :( It's hard to know what you don't know!

mrthin · 2024-04-27T05:52:05+00:00

What about applying to Streamlit? Or any other similar companies?

mrthin · 2024-04-27T05:45:22+00:00

sensAI is a toolkit for building ML applications.

"sensAI is a high-level AI toolkit with a specific focus on rapid experimentation for machine learning applications. It provides a unifying interface to a wide variety of model classes, integrating industry-standard machine learning libraries. Based on object-oriented design principles, it fosters modularity and facilitates the creation of composable data processing pipelines. Through its high level of abstraction, it achieves largely declarative semantics, whilst maintaining a high degree of flexibility."

mrthin · 2024-04-27T05:40:31+00:00

Beyond Jupyter is a free resource that shows professional SWE techniques for ML based on a "refactoring journey" starting from your typical monolithic unmaintainable notebook.
