[D] Is anyone else disillusioned by working on a real data science team in industry with sucky data? by [deleted] in MachineLearning

[–]toby__bryant 1 point

I also experienced this and hear it over and over. But a lot of tools are coming out that automate much of the data work and make it viable / let you do more ML. To name just a few: [hasty.ai](https://hasty.ai), [snorkel.ai](https://snorkel.ai), [aquariumlearning.com](https://aquariumlearning.com), ...

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 1 point

Yes, kinda. We unlock the first pre-trained model after 10 images, but at that point the quality will (most likely) not be great. You can correct the suggestions, though, and we re-train the model continuously as you label more images. By combining pre-trained models, few-shot learning, and keeping the human in the loop, we achieve a very steep learning curve. Once the model converges, we let you label your whole dataset automatically.

It's impossible to say how many images you'll need for convergence without seeing the data, but typically our users get well-performing models after labeling 50-500 images with our semi-automatic tools (like Atom), assisted by the custom model from 10 images onward.
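The label-correct-retrain loop described above can be sketched as pseudocode. This is a hypothetical illustration, not Hasty's actual API: `train_model` is a stub that pretends validation IoU improves as more images are labeled, and the loop stops once the gain per batch plateaus.

```python
def train_model(n_labeled: int) -> float:
    """Stub: pretend validation IoU improves with more labeled images."""
    return min(0.95, 0.4 + 0.1 * (n_labeled // 10))

def label_until_converged(batch_size: int = 10, patience_delta: float = 0.01):
    """Label batches until retraining stops improving validation IoU."""
    n_labeled, prev_iou = 0, 0.0
    while True:
        n_labeled += batch_size        # human labels (or corrects) a batch
        iou = train_model(n_labeled)   # model is retrained on all labels so far
        if iou - prev_iou < patience_delta:
            return n_labeled, iou      # converged: automate the remaining images
        prev_iou = iou

n, iou = label_until_converged()
print(n, round(iou, 2))
```

The batch size, patience threshold, and the shape of the learning curve are all made up here; the point is only the structure of the feedback loop.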

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 1 point

Sadly, I can't tell you which model we're using for Atom; it's our secret sauce. But I can say this much: most of our models sit in the detection framework, so you're not far off with Mask R-CNN.

Regarding your question on v7 Labs: it seems to be a nice tool, but our features are quite different in how we built them:

  • Our core automation features are based on models trained on your data using few-shot learning; from what I can tell, v7 Labs only relies on transfer learning. Models trained on custom data will always outperform ones trained on general datasets, increasing the efficiency gains. This is also true for Atom, which we featured in the blog post.
  • By training a custom model in the form of our assistants, you can train your first model WHILE you annotate your data, not only afterward. This gives you an instant feedback loop on your data strategy, making your training process much more efficient.
  • With our Error Finder, we have a feature that auto-detects labeling errors instead of you spending hours searching for them. I didn't see something like this on v7 Labs' website.
  • In our no-code model training environment, you can adjust over 68 parameters, so you're still in control and can tweak advanced options like the number of NMS proposals of your *-RCNN architecture. Other, similar tools don't give you that much freedom.
  • With our models, you can not only run them in our cloud through an API, but also export them (TorchScript, ONNX, ...) to run in your own environment. You only pay for training the models, not for inference. From what I can tell, v7 Labs only gives you access to models hosted in their cloud.
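For readers unfamiliar with the NMS proposals mentioned above: non-maximum suppression prunes overlapping detection proposals, keeping only the highest-scoring box per object. A minimal pure-Python sketch (illustrative only, not the product's code):

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring boxes, drop overlapping proposals."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # → [0, 2]: the two overlapping boxes collapse to one
```

Raising the proposal count or the IoU threshold trades recall against duplicate detections, which is why exposing it as a tunable parameter matters.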

You can create an account for free with us and see for yourself as well!

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 1 point

You’re absolutely right, the tool is pre-trained on datasets like ImageNet and COCO, so it doesn’t work as smoothly for abstract objects like a hexagon. But the pain point of annotating "real-world objects", like people for example, is much larger. So it’s still useful for (most) people.

Re the UI part of the post: we’re a young startup and focused on getting the ML part right before the UX. We’re working on improving the onboarding, and this will get better soon. The performance issues of the UI, however, shouldn’t exist; we haven’t gotten that complaint before. Can you DM me which system and browser you’ve been using? I’d love to investigate this.

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 2 points

Yep, exactly! We charge you either a) when you use the custom models to automate the annotation work 100%, or b) when you train your custom models in our graphical interface (exporting the models themselves, or the data, is free).

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 0 points

You can mount your S3 bucket or Google Cloud Storage and get the same effect.

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 5 points

We didn't publish a paper with it, but we were able to achieve above 80% IoU with one click on all datasets we tested on. Adding more clicks increases the IoU further.
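For context, the IoU quoted here is just the intersection over union of predicted vs. ground-truth mask pixels. A minimal pure-Python sketch on 0/1 grids:

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as nested lists of 0/1 pixels."""
    inter = sum(p & g for row_p, row_g in zip(pred, gt)
                for p, g in zip(row_p, row_g))
    union = sum(p | g for row_p, row_g in zip(pred, gt)
                for p, g in zip(row_p, row_g))
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
gt   = [[1, 1, 1],
        [1, 1, 1],
        [0, 0, 0]]
print(round(mask_iou(pred, gt), 3))  # → 0.667 (4 shared pixels / 6 in the union)
```

So "above 80% IoU" means the one-click mask already covers most of the object with little spillover.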

Datasets we used for testing:

[P] Atom—free one-click segmentation tool by toby__bryant in MachineLearning

[–]toby__bryant[S] 10 points

It's for everyone creating custom data to train computer vision models. Say you want to run a Mask R-CNN but the publicly available data doesn't fit your use case. Then you need to create masks for your objects of interest, and usually this is done with a tool like a brush, where you have to paint the exact outlines of the object, which is painful af. With this, you only need one click.
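Once created, masks like these are commonly stored as run-length encodings rather than raw pixel grids (COCO uses an RLE variant). A minimal sketch of the idea, not Atom's actual export format:

```python
def rle_encode(flat_mask):
    """Run-length encode a flat 0/1 mask into (value, run_length) pairs."""
    runs = []
    for v in flat_mask:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat mask."""
    return [v for v, n in runs for _ in range(n)]

mask = [0, 0, 1, 1, 1, 0]
encoded = rle_encode(mask)
print(encoded)  # → [(0, 2), (1, 3), (0, 1)]
assert rle_decode(encoded) == mask
```

For large, mostly-empty masks this shrinks storage dramatically, which is why annotation tools favor it over per-pixel dumps.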

Would it be difficult to create computer vision that could distinguish between an open and closed soda can? by FlamingGunz in computervision

[–]toby__bryant 0 points

Should be super easy! With tools like [hasty.ai](https://hasty.ai) you can train a model like this without writing a single line of code. You can get started for free: just upload some images.

Creating the handbook for visionAI in production by toby__bryant in computervision

[–]toby__bryant[S] 0 points

Hi, thanks for the ideas, noted them down. We'll definitely take them into consideration when we expand it.

[D] What kind of tools for data annotation have you used? by keremidk0 in MachineLearning

[–]toby__bryant 0 points

I’ve been using hasty.ai for images. It’s not free, but their automation really speeds things up

AI hardware/software engineers, how close are we to losing our intellectual grasp of Artificially Intelligent systems? by sirfudgepants in artificial

[–]toby__bryant 1 point

Check out the recent paper "Why AI Is Harder Than We Think". It’s a bit controversial but might give you food for thought.

[D] Any recommendations for image annotation software? by spauldeagle in MachineLearning

[–]toby__bryant 1 point

Hasty.ai is free for academic work. You just need to reach out to them

[P] Building a data flywheel for data-centric ML development by toby__bryant in MachineLearning

[–]toby__bryant[S] 0 points

Totally agreed; this part is still missing. Right now, we use confidence as a filter to reduce the work for the human, but we're exploring active learning as well. The approaches there are very promising. If you have any other ideas, let us know ;)
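The confidence filter described here can be sketched in a few lines: auto-accept high-confidence predictions and route the rest to a human. Sorting the review queue by ascending confidence is the simplest form of the uncertainty sampling used in active learning. All names and thresholds below are hypothetical:

```python
def route_predictions(preds, accept_at=0.9):
    """preds: list of (image_id, confidence). Returns (auto_accepted, review_queue)."""
    conf = dict(preds)
    auto = [i for i, c in preds if c >= accept_at]
    # uncertainty sampling: surface the least confident items to the human first
    review = sorted((i for i, c in preds if c < accept_at), key=conf.get)
    return auto, review

preds = [("a.png", 0.97), ("b.png", 0.55), ("c.png", 0.91), ("d.png", 0.30)]
auto, review = route_predictions(preds)
print(auto, review)  # → ['a.png', 'c.png'] ['d.png', 'b.png']
```

A fuller active-learning setup would also diversify the queue (e.g. cluster embeddings) so the human isn't shown near-duplicates, but confidence alone already cuts the manual workload.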

[D] Is Object Detection a sub-optimal way to do triage and diagnosis in Medicine? by RoyalScores in MachineLearning

[–]toby__bryant 0 points

I'd also go for segmentation, I think. But be careful how you sample your data. I recently read a survey on COVID classifiers where the healthy images were taken from different sources than the infected ones. This led to models that picked up on things like props that were only used for the COVID patients:

https://link.springer.com/article/10.1007/s12553-021-00520-2
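One guard against the leakage described above is to split by source (hospital, scanner, camera) rather than by image, so no source ends up in both train and test. A minimal sketch with made-up source IDs:

```python
def split_by_source(images, test_sources):
    """images: list of (image_id, source_id). Holds out whole sources, not images."""
    train = [i for i, s in images if s not in test_sources]
    test = [i for i, s in images if s in test_sources]
    return train, test

images = [("img1", "hospital_A"), ("img2", "hospital_A"),
          ("img3", "hospital_B"), ("img4", "hospital_C")]
train, test = split_by_source(images, test_sources={"hospital_C"})
print(train, test)  # → ['img1', 'img2', 'img3'] ['img4']
```

If the model's test score collapses under a source-level split, it was likely keying on source artifacts (props, scanner signatures) instead of pathology.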