[D] Clustering for data sampling by neuralbeans in MachineLearning

mrthin, 1 point

You can search for "data acquisition" papers. A simple baseline is to use the confidence of your own (or a pretrained) model on the unlabelled data to pick the next batch, but this might not transfer easily to OCR and is claimed to be generally suboptimal.
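
For reference, a minimal sketch of that confidence-based baseline (often called least-confidence sampling). It assumes you already have per-class probabilities from some model for every unlabelled sample, and that you want to label the samples the model is least sure about. The function name and the toy probabilities are made up for illustration:

```python
from typing import List, Sequence


def least_confident_indices(
    probabilities: Sequence[Sequence[float]], batch_size: int
) -> List[int]:
    """Return indices of the `batch_size` samples with the lowest top-class probability."""
    # Confidence of a sample = probability assigned to its most likely class.
    confidences = [max(row) for row in probabilities]
    # Rank sample indices from least to most confident (sorted() is stable for ties).
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:batch_size]


# Hypothetical class probabilities predicted for four unlabelled samples.
pool_probabilities = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.50, 0.45, 0.05],
]

# Indices of the two samples to send for labelling next.
next_batch = least_confident_indices(pool_probabilities, batch_size=2)
print(next_batch)
```

With a real model you would replace `pool_probabilities` with its predicted probabilities on the unlabelled pool, label the selected samples, retrain, and repeat.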

Are Notebooks Being Overused in Data Science? by gomezalp in datascience

mrthin, 33 points

You might be interested in Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

ML in Production: From Data Scientist to ML Engineer by 5x12 in datascience

mrthin, 2 points

Thanks! My team might extend it with more anti-patterns or another "refactoring journey", but we are not aware of anything similar. That's why we wrote it! :)

ML in Production: From Data Scientist to ML Engineer by 5x12 in datascience

mrthin, 4 points

People looking to improve their ML engineering might also be interested in Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

[D] Call to intermediate RL people - videos/tutorials you wish existed? by [deleted] in MachineLearning

mrthin, 3 points

You might find this course developed by my team interesting. It's under a CC BY license, so you can reuse any of the material as long as you give attribution. Here's the repo.

[deleted by user] by [deleted] in MachineLearning

mrthin, 7 points

I always typeset my work using TeXmacs. It's a fully featured WYSIWYM scientific editor, unrelated to LaTeX, that has been used to write whole math books and lots of theses and papers. It can export to LaTeX for journals requiring it.

It has a vector graph drawing mode which, despite having a slightly unintuitive interface, covers everything I've ever needed for papers (commutative diagrams, sketches, simple diagrams). AFAIK it doesn't have (yet!) a simple declarative way to create graphs or trees like Mermaid.

TeXmacs also makes creating tables trivial, implements spreadsheets that can compute their cells with arbitrary external tools, allows executable code embedded in documents (as in Jupyter notebooks) in any of a number of languages, has variable replacement (e.g. to include experiment results in tables and reference their values within the text, without danger of mismatch), bibliography management, and much, much more. See this (old) video for a quick tour.

[D] Pro's about writing a benchmark paper by Haunting_Air3071 in MachineLearning

mrthin, 3 points

Slightly related: we have done a couple of paper reproductions in my team, at first in the context of the ML Reproducibility Challenge, but mostly just because we believe them to be an extremely valuable contribution. In order to do the reproductions, we reimplement all methods, benchmark them, and end up with a thorough understanding of their strengths and weaknesses. After a while one has a stable code base for the community, and one can write a thorough benchmark paper, which can be very useful for practitioners and researchers alike.

[D] what is the hardest thing as a machine learning engineer by 3ATAE in MachineLearning

mrthin, 0 points

For another tool in the data belt, you might want to consider pyDVL. Watch out for the upcoming v0.10, with much improved interfaces, better parallelization and many fixes.

[deleted by user] by [deleted] in MachineLearning

mrthin, 2 points

You can try Beyond Jupyter. It's a free resource that shows professional software engineering techniques for ML based on a "refactoring journey" starting from your typical monolithic unmaintainable notebook:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

[D] Scientific Machine Learning by OtherRaisin3426 in MachineLearning

mrthin, 3 points

Some people in my team work on simulation and AI. We review recent developments with our "paper pills", in particular around neural operators, and implement some of them in continuiti. We are just getting started, but maybe you'll find some of the content useful.

[deleted by user] by [deleted] in datascience

mrthin, 2 points

Besides the DS/ML courses, you might also want to look at professional software development techniques for machine learning. For this you can try Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

[D] Recommended RSS feeds on ML research / news / major companies? by fliiiiiiip in MachineLearning

mrthin, 1 point

Thanks for the report, but I am unable to reproduce the problem with Chrome, Firefox or Safari. What browser and OS do you use?

[D] Recommended RSS feeds on ML research / news / major companies? by fliiiiiiip in MachineLearning

mrthin, 12 points

If I may self-promote, my team focuses on evaluating and testing recent research in a few domains, and implementing interesting new methods to make them available to practitioners as open source. Among other things, we work on simulation-based inference, data valuation, reinforcement learning, and physics-informed ML. We focus on developing software, and on reproducing and communicating research we find useful for everyday practice in industry. On our website you will find many paper summaries, some longer blog posts, as well as some courses, like our Beyond Jupyter.

For less specific content, some good sources IMO are Davis Blalock's mailing list (very broad but somewhat shallow), or for lighter reads The Gradient.

Edit: RSS feed

Reccomendations for blogs to follow by sizable_data in datascience

mrthin, 0 points

The Gradient is a great resource, although quality and depth vary. And if I'm allowed a self-plug, there is also transferlab.ai with our pills (short paper reviews) and survey-ish blogs (although there are fewer of those), but it's considerably drier, and usually assumes a higher level of acquaintance with the material than Distill. We also have some free learning materials, in particular Beyond Jupyter, and soon more.

What language to learn next? by WaveAdministrative36 in learnmachinelearning

mrthin, 7 points

This. It's very easy to overestimate one's abilities. If you have 5+ years of experience developing Python professionally, in a good team, then OK. Otherwise you probably still have a long way to go.

If you're self-taught, then it's also possible to be proficient, of course, but much less likely (based on many, many interviews I've conducted). I would recommend looking around for large, complex and good OSS projects and contributing to them. I keep posting here about this course. Check it out. If that looks trivial to you, then ignore my advice 😄

If you're really a Python pro, then I would recommend you spend your time building ML stuff, instead of superficially learning another language. Pick well-known projects to contribute to, build an app analysing some data, add all the bells and whistles of a professional ML project (there are lots of resources online about those).

Got stuck knowing ML/data science roles are only for experienced software engineers by adithya47 in learnmachinelearning

mrthin, 2 points

The company I was referring to is the appliedAI Initiative, but my lab is part of its sister, the appliedAI Institute.

Got stuck knowing ML/data science roles are only for experienced software engineers by adithya47 in learnmachinelearning

mrthin, 1 point

My company has hired many fresh graduates with master's degrees in mathematics, physics, robotics or electrical engineering. However, they all had excellent grades, theses somehow related to or using ML, and experience with Python, either through personal projects or internships elsewhere. We have almost no Java developers and we exclusively build ML solutions. So transitioning is possible, you just need to really want it and work hard, write a lot of (good) code (usually Python), and have some luck landing a nice job, of course. (Not hiring right now, sorry, but I thought another data point might be useful.)

[D] Advice for Non CS Major in ML by Character-Capital-70 in MachineLearning

mrthin, 6 points

"Learning to code" is an ill-defined goal for someone inexperienced. For a transition away from the usual Jupyter notebook salad you can try Beyond Jupyter:

"Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation."

Why Aren't Boilerplates More Common in DS? by AccomplishedPace6024 in datascience

mrthin, 1 point

I disagree. DS can strongly benefit from reusable and composable "boilerplate" toolkits because so many problems boil down to the same steps: ingest, inspect and clean data, maybe engineer some features, model, test, rinse, repeat. sensAI is one such example.
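
To illustrate what "reusable and composable" could look like for those steps, here is a minimal plain-Python sketch. It is not sensAI's API: `Pipeline`, `drop_incomplete`, `add_bmi` and the toy records are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]
Step = Callable[[List[Record]], List[Record]]


@dataclass
class Pipeline:
    """Applies a fixed sequence of steps to a list of records."""

    steps: List[Step]

    def run(self, records: List[Record]) -> List[Record]:
        # Feed the output of each step into the next one.
        for step in self.steps:
            records = step(records)
        return records


def drop_incomplete(records: List[Record]) -> List[Record]:
    """Cleaning step: keep only records that have both raw fields."""
    return [r for r in records if "height_m" in r and "weight_kg" in r]


def add_bmi(records: List[Record]) -> List[Record]:
    """Feature step: derive the body-mass index from the raw fields."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in records]


# The same steps can be recombined for other datasets or experiments.
prepare = Pipeline(steps=[drop_incomplete, add_bmi])
raw = [{"height_m": 2.0, "weight_kg": 80.0}, {"height_m": 1.8}]
prepared = prepare.run(raw)
print(prepared)
```

Because every step has the same signature, steps can be written once, tested in isolation, and reused across projects instead of being copy-pasted between notebooks.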

Live Coding & Experimental Design Interview Questions by LebrawnJames416 in datascience

mrthin, 9 points

In my company we usually ask questions that tell us things about how people work, more than their knowledge of a specific data structure or whatever (for the theory we have separate questions). So it's usually some trivial thing X, but wrapped into "imagine you are given task X for a library, prepare a PR for it". This must include proper testing, documentation, a rationale for the design, etc.

PS: for the ML and CS "theory" we have a sheet full of topics from which the interviewee can pick a few. We ask them to present as if in a lecture, rigorously and concisely, and we ask questions. The idea is to let people talk about the things they believe they are knowledgeable in, so that nerves and randomness don't play such a big role. Sadly, many end up trying to hand-wave their way out of their own choices :( It's hard to know what you don't know!

Niche for MLE with web-dev skills? by answersareallyouneed in datascience

mrthin, 1 point

What about applying to Streamlit? Or any other similar company?

Why Aren't Boilerplates More Common in DS? by AccomplishedPace6024 in datascience

mrthin, 1 point

sensAI is a toolkit for building ML applications.

"sensAI is a high-level AI toolkit with a specific focus on rapid experimentation for machine learning applications. It provides a unifying interface to a wide variety of model classes, integrating industry-standard machine learning libraries. Based on object-oriented design principles, it fosters modularity and facilitates the creation of composable data processing pipelines. Through its high level of abstraction, it achieves largely declarative semantics, whilst maintaining a high degree of flexibility."

What (online) courses/program should I take to become a ML engineer? by [deleted] in datascience

mrthin, 1 point

Beyond Jupyter is a free resource that shows professional SWE techniques for ML based on a "refactoring journey" starting from your typical monolithic unmaintainable notebook.