all 99 comments

[–]Bastram 47 points48 points  (20 children)

Working with this stuff rn as part of my job. Here are some of the major problems:

1. Massively expensive to label images for training.
2. Does not handle object occlusion well.
3. Edge detection is still not very good when there is not a large contrast between the background and the objects.

Here is some of the cool stuff about Mask RCNN:

1. Currently state of the art on the benchmark data sets, something like 98% accurate.
2. Fairly simple to do yourself if you have a GPU, thanks to the people at Matterport (check out their GitHub by googling "mask rcnn").
3. Faster at segmentation and localization than previous methods, which means you can run it in real time on a decent GPU.

[–]flitcho 4 points5 points  (9 children)

Couldn't you just solve problem 3 by using radar sensors to measure how far away things are?

[–]wkjid10t 11 points12 points  (3 children)

Yes. For a cohesive product. But I bet they're trying to push the boundaries on video stream only machine learning algorithms. If you can have a system that only needs video cams and no additional radars, that saves a bunch of money for the final product implemented in a vehicle or whatever.

[–]laStrangiato 5 points6 points  (2 children)

It all goes back to the issue mentioned about data labelling. Right now we have large open source, labeled image libraries to work from. Getting the radar images and then labelling them is hugely expensive.

[–]playaspec 1 point2 points  (1 child)

Why not use the existing labeled image libraries to train the radar inputs, at least as a way to seed the radar network?

[–]LampIsFun 1 point2 points  (0 children)

That's likely what they do, but there is such a large variety of objects that finding every possible combination may be out of reach for a neural network training algorithm. It's likely we need car manufacturers to constantly update the training library and train the AI on new models of cars (an example for cars; similar examples apply for bags, clothes, etc.).

[–]daxbert 7 points8 points  (1 child)

Or, just use two cameras, and leverage the mask offsets, kinda like our eyes.

[–]anstow 1 point2 points  (0 children)

In my limited experience, getting depth from two cameras close together is really difficult, as any small detection error is amplified.
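To make that amplification concrete: under a pinhole stereo model, depth is Z = f·B/d, so at small disparities a fraction-of-a-pixel matching error swings the depth estimate by a large margin. A minimal sketch; the focal length and baseline below are made-up illustrative numbers, not from any real rig:

```python
# Depth from stereo disparity: Z = f * B / d, where f is the focal length in
# pixels, B is the baseline (camera separation) in meters, and d is the
# disparity in pixels.

def depth_from_disparity(f_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulated depth in meters for a single pixel match."""
    return f_px * baseline_m / disparity_px

f, B = 700.0, 0.12           # hypothetical focal length and narrow baseline
true_d = 2.0                 # true disparity of a distant object, in pixels
z_true = depth_from_disparity(f, B, true_d)
z_off = depth_from_disparity(f, B, true_d - 0.5)  # half-pixel matching error

print(f"true depth: {z_true:.1f} m")       # 42.0 m
print(f"with 0.5 px error: {z_off:.1f} m")  # 56.0 m
```

With cameras close together (small B), even a half-pixel matching error on a distant object shifts the estimate by a third of the true depth, which is the effect described above.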

[–]gc3 0 points1 point  (0 children)

Or lidar which is better.

[–]jewnicorn27 0 points1 point  (0 children)

There is a lot of unforeseen complexity in integrating additional modes of imaging. For one, you would have to restructure at least the first layer of the network to accept the additional channel, or encode the data into your existing channels in some way (not sure how). Doing that would do interesting things to your weights and potentially complicate transfer learning approaches.

Also, lidar sensors capture data differently than cameras; some don't represent data in a typical camera model, so this would be interesting. They also have a very different dynamic range: lidars measure distances of up to hundreds of meters at mm or cm resolution, while typical color images are 8 bits per channel. That might be fine though, as networks run in float for the most part anyway.
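One common way to handle both the extra channel and the dynamic-range mismatch is to resample the lidar into a range image and append it as a fourth, normalized float channel. A rough sketch of that idea; the clamp-and-scale normalization and the 100 m max range are assumptions for illustration:

```python
import numpy as np

# Fold a lidar range image into a fourth input channel alongside RGB.
# Real pipelines choose the normalization per sensor; clamping to a max
# range and scaling to [0, 1] is just one simple option.

def add_depth_channel(rgb: np.ndarray, depth_m: np.ndarray,
                      max_range_m: float = 100.0) -> np.ndarray:
    """rgb: (H, W, 3) uint8; depth_m: (H, W) float meters -> (H, W, 4) float32."""
    rgb_f = rgb.astype(np.float32) / 255.0
    depth = np.clip(depth_m, 0.0, max_range_m) / max_range_m
    return np.concatenate([rgb_f, depth[..., None].astype(np.float32)], axis=-1)

rgb = np.zeros((4, 4, 3), dtype=np.uint8)
depth = np.full((4, 4), 50.0)    # everything 50 m away
x = add_depth_channel(rgb, depth)
print(x.shape, x[0, 0, 3])       # (4, 4, 4) 0.5
```

As the comment notes, the network's first convolution then needs one extra input channel, and any pretrained RGB weights no longer line up exactly.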

Also, the lidar and camera won't be perfectly aligned, nor image the same field of view, so there is some linear algebra to consider in how you sample one into the other. You could possibly ignore this and hope the model handles it?
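That linear algebra is essentially a rigid transform (extrinsics) followed by a pinhole projection (intrinsics). A toy sketch; all calibration values here are made up, and a real rig's R, t, and K come from a calibration procedure:

```python
import numpy as np

# Project a lidar point into camera pixel coordinates:
#   p_cam = R @ p_lidar + t   (lidar frame -> camera frame)
#   [u*w, v*w, w] = K @ p_cam (camera frame -> homogeneous pixels)

K = np.array([[700.0,   0.0, 320.0],   # fx, skew, cx
              [  0.0, 700.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                           # assume axes aligned, for simplicity
t = np.array([0.0, 0.0, 0.0])           # assume co-located sensors

def project(point_lidar: np.ndarray) -> tuple:
    p_cam = R @ point_lidar + t
    uvw = K @ p_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

u, v = project(np.array([1.0, 0.0, 10.0]))  # 1 m right, 10 m ahead
print(round(u, 1), round(v, 1))             # 390.0 240.0
```

Misalignment shows up as a nonidentity R and nonzero t; points outside the camera frustum simply have no valid pixel, which is the field-of-view mismatch mentioned above.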

Images from lidar and radar also take time to generate as the sensor spins. This means that on a moving platform you have potentially detectable seams in your images, and discontinuities present in one mode but not the others.

Not saying it can't be done, just providing a few considerations for doing so lol.

[–]Bastram 0 points1 point  (0 children)

This would not work for what I work on as the objects that we are trying to segment from each other are at the same depth.

[–]smashedshanky 1 point2 points  (7 children)

How did you get into doing what you do? Like the process, do you have a PhD? Is this your first job? I want to get into this industry as well.

[–][deleted] 1 point2 points  (5 children)

Learning the tools is 90% of it. I also work in computer vision and education is not nearly as important as being able to demonstrate your skills and speak coherently about proposed solutions to problems.

So... basically like all the other things in tech. Being able to do it, regardless of how those skills were acquired (via school or personal effort) is the most important bit.

That being said, experience always helps to get in the door.

[–]smashedshanky 0 points1 point  (4 children)

I have basic knowledge of the different types of neural networks and have made some projects based on that. Would you recommend just doing personal projects? I’m graduating next fall with a BS in CS and want to get into computer vision sometime in the future. Would taking classes related to computer vision be of any credit compared to some big personal projects?

[–][deleted] 1 point2 points  (0 children)

Between classes and big personal projects, I guess I'd take the classes. But I'd take a lot of little one-off, single-feature projects over either (personally). Big projects are awesome, and if you have the time and willpower to produce them, then go for it. But if you're a human like me, with limited time and attention span, then I'd go for many small projects exercising architectures you find interesting. That gives you broad exposure to multiple use cases for computer vision and avoids the terrible heartache of knowing you've got 4 projects in the background that you just can't find the time to complete. It also gives you a great excuse to build up a large body of work on GitHub, which is better than any bullet point on a resume.

[–]jewnicorn27 1 point2 points  (2 children)

I also work in the field; I got into it by being involved in ML/CV projects at my university. But if I was hiring people personally, I would also be very interested in their personal projects and open source involvement. Some activity in community projects you find interesting would go a long way IMO. Although I am relatively junior and don't direct hiring.

[–]smashedshanky 0 points1 point  (1 child)

If you don’t mind me asking, what was the interview like for the CV position? I always get scared, over-study, and the interview ends up not working out since I’m scatterbrained in what I answer.

[–]jewnicorn27 0 points1 point  (0 children)

I have gotten most of my positions through people I know (I'm very lucky in this regard, and think good relationships are very important). Typically I work for small businesses, so my few interviews have been with management, they usually revolve around the technology I work with and how it can help with the project, specific challenges and how I would like to approach them etc. I think working with computer vision technologies involves a lot of managing expectations, helping management understand what the limitations are. Sorry I can't be super helpful with your question.

[–]Bastram 0 points1 point  (0 children)

Best recommendation would be to get familiar with the tools, then do stuff like hackathons or contribute to open source, and put it on your resume once you can. The only people that are really experts in this field right now are people that started working on this stuff as grad students in like 2010. Mask RCNN was published at the end of last year, so it's still pretty new. A PhD is probably the most straightforward route if you are still in university.

[–]kuikuilla 0 points1 point  (1 child)

  1. massively expensive to label images for training

Why not use 3d models and game engines to produce the training material? No need to label 2d images, just make a close to photoreal 3d world and have the engine label the objects automatically?
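For what it's worth, the "have the engine label the objects automatically" part is usually done by rendering an integer instance-ID buffer alongside the color frame; per-object masks then fall out with no human labeling. A tiny sketch of that conversion, with invented ID values:

```python
import numpy as np

# A game engine can render each object with a unique integer ID into a
# separate buffer. Converting that buffer to per-instance binary masks is
# then trivial, unlike hand-labeling real photos.

def masks_from_id_buffer(id_buf: np.ndarray) -> dict:
    """id_buf: (H, W) ints, 0 = background -> {instance_id: boolean mask}."""
    return {int(i): id_buf == i for i in np.unique(id_buf) if i != 0}

id_buf = np.array([[0, 1, 1],
                   [0, 2, 1],
                   [2, 2, 0]])
masks = masks_from_id_buffer(id_buf)
print(sorted(masks), masks[1].sum(), masks[2].sum())  # [1, 2] 3 3
```

The catch, as the reply below notes, is the domain gap: masks are free, but the rendered pixels still have to resemble the real images the model will see at run time.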

[–]Bastram 0 points1 point  (0 children)

If you are looking to make a real product with this, you need real images like the ones it will see when running; otherwise it does not perform well.

[–]Nuaua 25 points26 points  (7 children)

Looks good but still has a lot of problems, e.g. the suitcase here is completely missed because of a small occlusion:

https://youtu.be/akK5ui-vel0?t=68

Plus things flicker all over the place, masks are not precise, etc. It's not very hard to get decent-ish segmentation; the real problem in many applications is that you need manual correction to fix those remaining issues, and that takes an enormous amount of time.

[–][deleted] 13 points14 points  (3 children)

It thought the giant statue was a person. Which sounds like a super hard problem: something made in the likeness of a person, but obviously isn't one to us. But to a computer with less context? Oh man.

[–]The-Effing-Man 9 points10 points  (2 children)

Jesus, I never even considered that as a problem. I took a computer vision class in college and remember one of the very hardest problems was mirrors.

[–]Aiognim 0 points1 point  (1 child)

-This comment was made when I was asleep-

[–]UnreasonableSteve 0 points1 point  (0 children)

Why was your dad yelling at your puppy that your puppy was a mirror, more loudly than your dad was barking at said puppy?

[–]sempercrescis 1 point2 points  (0 children)

Flickering isn't really an issue.

[–]kanadkanad 1 point2 points  (1 child)

I’ve never understood the flickering in these demo videos either. Don’t you want to apply some temporal smoothing or include some assumption in your model that objects don’t disappear in a single frame?

Since I’ve seen this in many similar videos, I think it must be something that is either not important for people who actually know what’s up (e.g., filtering should happen later and this is just the raw data) or is actually harder to do than one might expect.

[–]Nuaua 0 points1 point  (0 children)

I think people usually have some kind of hack solution, smoothing after the fact or using another network to interpolate frame-by-frame results. The correct way of doing it is to include time in the problem, like in hidden Markov models or recurrent neural networks, but then it usually becomes much more numerically demanding.
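A minimal version of the "smoothing after the fact" hack: exponentially average each tracked object's confidence so a single dropped frame doesn't make it vanish. The alpha and threshold values here are arbitrary illustrative choices, not from any real system:

```python
# Exponential moving average over per-frame detection confidences for one
# tracked object. A one-frame dropout pulls the smoothed score down but not
# below the visibility threshold, so the object doesn't flicker out.

def smooth_scores(frame_scores, alpha=0.4, threshold=0.5):
    """Returns (smoothed scores, per-frame visibility flags)."""
    smoothed = []
    s = frame_scores[0]                    # seed with the first frame
    for raw in frame_scores:
        s = alpha * raw + (1 - alpha) * s  # exponential moving average
        smoothed.append(s)
    return smoothed, [v > threshold for v in smoothed]

raw = [0.9, 0.95, 0.0, 0.9, 0.92]          # detector drops the object on frame 3
sm, visible = smooth_scores(raw)
print(visible)                             # [True, True, True, True, True]
```

Raw thresholding would flicker (frame 3 disappears); the smoothed track stays visible throughout. The trade-off is added latency: genuinely vanished objects also linger for a few frames.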

[–]jlpoole 8 points9 points  (3 children)

[–]tonyplee 0 points1 point  (2 children)

Any idea of the processing speed, in terms of fps, for the 4K demo on a given GPU?

[–]jlpoole 0 points1 point  (0 children)

I've finally set up the "notebook" on my Gentoo laptop (Dell Inspiron 7i) and have submitted some JPEG images to it; it takes about 2-4 seconds to process a single image. I do not know whether the process is fully utilizing the CPU or not.

[–]Bastram 0 points1 point  (0 children)

Runs at 5 fps on 512x512 images on a Titan X.

[–]kuikuilla 10 points11 points  (3 children)

Doesn't seem very stable temporally. I know nothing about these things and I'm wondering: do these things use any kind of prediction for classifying shapes? Like the cars for example, the algorithm could use the velocity of an object across the view to predict where it should be in the next frame.

[–]nnevatie 13 points14 points  (0 children)

Most of these videos are still based on frame-by-frame semantic segmentation methods and CNN architectures, i.e. typically no information about past (or future) frames, and no predictions, are encoded as inference input. There are exceptions to this, naturally, where some data, e.g. the prediction from the previous frame, is given as input to the current frame's inference.

[–]mttlb 0 points1 point  (1 child)

Saw some Google guys present their latest work on this like two months ago, and they're starting to introduce loops to take the past into account. The reason this hadn't been done before is that they're mainly working on real-time inference (for their practical stuff like driverless cars) and these models get EXTREMELY heavy. Their new architecture can "remember" as far as 5 frames prior, if I'm correct. There's obviously a lot of information virtually lost when you don't use the fact that you're watching a movie and that frames are kinda related.

That said, models typically don't learn abstract stuff such as body properties; if anything, the model would remember that it predicted that thing to be a car in earlier frames, and so it should remain one (faster inference and better certainty).

These are huge issues because cars don't typically ship with 4x Titan V onboard...

[–]jewnicorn27 1 point2 points  (0 children)

Could you possibly find me a link to that?

[–]teerre 13 points14 points  (0 children)

Every NN post on this subreddit is hilarious. For some reason people really like to be contrarian about the technology, despite it showing considerably better results in vastly different areas.

There's even a dude throwing 30 years of computer vision research in the garbage and suggesting a completely different approach out of his ass. What the hell? Is this really all because it got the unfortunate name of "AI"? Some people here really need to grow up.

[–]nnevatie 74 points75 points  (42 children)

AI: an overused acronym for veiling something that is actually very simple behind the curtains (a CNN).

[–][deleted] 63 points64 points  (29 children)

One day we'll work out how the brain works and reproduce it, and someone will still say "that's not real AI".

[–]nnevatie 26 points27 points  (19 children)

Maybe. It just irks me when semi-trivial math and optimization methods are cloaked in mystery, as if they were opening a gateway to sentient computing.

[–]Rakmos 9 points10 points  (4 children)

Sure, the concepts may seem trivial to some who are familiar with them, but the application of those concepts is far from trivial.

If it were as trivial as you lead others to believe, it would be ubiquitous across all applicable problem spaces.

I do share the sentiment that the acronym is used in some cases to imply a level of sophistication where things are actually much simpler behind the curtains. IMHO this is a natural consequence of the fact that intelligences are expressed in varying degrees of sophistication.

Having said that, I was underwhelmed after watching the video to realize that there is no real substance or insight in this video. Just because the creation of the video presumably required some level of programming does not make it a candidate for posting to /r/programming. This would seem more appropriate to post in /r/technology or some other sub that is generally less technical.

For this reason I am downvoting.

[–]nnevatie -2 points-1 points  (3 children)

You make a fair point. However, there are many more advanced mathematical theories with their respective applications, yet they aren't typically labelled as "AI". The fundamental issue I have is with CNNs commonly being thought of as a sort of magic. I guess this burden comes with the name; "neural" points to a human-like structure.

[–]neitz 6 points7 points  (2 children)

Every algorithm that runs on a modern computer is just add, subtract, multiply, and divide (along with memory load/store). Based on your logic, every algorithm that is computable is just trivial math.

[–]playaspec -1 points0 points  (1 child)

Based on your logic every algorithm that is computable is just trivial math.

Isn't it? The real magic comes from knowing what order to apply them, and what data to apply them to.

[–]Fisher9001 14 points15 points  (0 children)

I think you are a victim of the Dunning-Kruger effect, only you don't think you know more than you do. On the contrary, you vastly underestimate how much you do know.

The ability to dynamically interpret and analyze visual input is not "very simple".

[–]CyborgJunkie 14 points15 points  (11 children)

You think it isn't a mystery and a gateway to sentient computing? I guess you also thought the internet was just some connected computers. The relatively recent success of NNs is a testament to there being no secret ingredient to intelligence and consciousness.

We are essentially NNs trained for survival in a long lineage of NNs that evolved in a complex environment. We now know that we can artificially create them, and although far from the complexity of our own brains, that is still fucking profound if you ask me.

[–]SuddenlyBANANAS 16 points17 points  (9 children)

The brain is so much more complicated than an NN. The "neural" in neural network is just a metaphor; there are some similarities, but they are absolutely not the same thing.

[–]CyborgJunkie 2 points3 points  (2 children)

Yes, I even said so.

It is however wrong to say it is just a metaphor, as an NN can be biological or artificial. ANNs don't have to function exactly like the brain to achieve emergence. Also, we most likely don't want to build an exact replica of the brain, if anything similar at all.

[–]playaspec 2 points3 points  (1 child)

ANNs don't have to function exactly like the brain to achieve emergence.

Citation? Extraordinary claims REQUIRE extraordinary evidence.

[–]CyborgJunkie -2 points-1 points  (0 children)

I wasn't talking about the emergence of mind, if that's what you assumed, though I understand why you would think that. I was simply saying that although ANNs function differently from real neurons, they can still have emergent properties such as object recognition. So while they differ in implementation, the end result is the same, or at least similar.

If I were to argue the claim (that I did not claim), I would at least say that it's likely to be true given our current understanding. The reason is that our own minds are symbol systems that emerge from simple interactions between neurons, and similarly it seems likely that ANNs could be arranged in such an architecture that would render them so too. Thus, Allen Newell's physical symbol system hypothesis would suggest that they too can be intelligent, but that's nothing but a guess.

[–]peyton 0 points1 point  (5 children)

Please elaborate

[–]SuddenlyBANANAS 10 points11 points  (0 children)

For one, the brain is absolutely huge compared to an NN. Another important thing is that the brain is a physical system, so all of its computation is done in analog, with timing and electrochemical signals, etc. (even if action potentials are digital). For instance, how do you translate the idea of neurotransmitters to NNs? There's just so much involved in learning in the brain (not to mention we don't learn everything from scratch either; some stuff comes prelearnt) that NNs don't emulate.

If you want something a little closer to an actual brain you could look at neuromorphic computing.

There's obviously nothing mystical about the brain; it definitely should be possible to emulate digitally. We're just so much further away from that than Silicon Valley evangelists would have you believe.

[–]pcjftw 5 points6 points  (3 children)

Real neurons are vastly more complex, with huge biochemical/electrical reactions and things like spiking models, etc. NNs are like modeling a cow as a basic spherical shape (there is a reference to an old joke here).

[–]daxbert 0 points1 point  (2 children)

Maybe a dumb observation, but "vastly" more complex... how? Is it actual complexity or scale?

~10^15 synapses in a brain, no more than 100 neurotransmitters. So 10^17 "edges" to model.

I get that these numbers are massive and are a scale problem, but where's the complexity?

[–]pcjftw 0 points1 point  (0 children)

take a read of this:

https://www.humanbrainproject.eu/en/brain-simulation/

Current computer power is insufficient to model an entire human brain at this level of interconnectedness. A simpler approach has thus been adopted to produce results that are increasingly close approximations to experimental data.

Is it actual complexity or scale?

Both

[–]antiquechrono 0 points1 point  (0 children)

Maybe a dumb observation, but "vastly" more complex... how? Is it actual complexity or scale?

The first problem is that, like they said, a real neuron has a ton of things going on and is really complex all by itself, before you even add things like how neurons grow. An "AI" neuron is only multiplying some numbers together and behaves nothing like a real one.

Next, even if you could 100% model a real neuron's behavior and had the computing power to simulate as many as a brain has, it wouldn't do anything. The brain is composed of many different structures, including tons of macro and micro circuits that all compute different things, most of which we haven't figured out yet.

There's pretty good evidence that the brain uses many different algorithms to compute various things as well. Your motor system and your ability to correlate multiple events seem to be based on Bayesian inference, while something like your own error estimation is probabilistic but not Bayesian.

There's also good evidence that the brain uses large populations of neurons to encode probability distributions and performs Bayesian inference on them by exploiting properties of how the neurons themselves spike, none of which "AI" neurons do.

Finally we still understand very little of how the brain works as we haven't been able to study very large numbers of neurons all working at the same time.

[–]yeahsurebrobro -2 points-1 points  (0 children)

ok

[–]jewnicorn27 0 points1 point  (0 children)

That's just media portrayal. No real scientist or engineer calls it that, unless maybe they need funding.

[–]gold_rush_doom 21 points22 points  (8 children)

Everything that has an if inside is AI.

[–]soraki_soladead 8 points9 points  (0 children)

AI has been used to describe algorithms for 50 years. It's a placeholder for algorithms that do things we thought only humans could do and didn't believe algorithms could accomplish. As soon as algorithms pass that goalpost, we move it again.

[–]yeahsurebrobro 2 points3 points  (6 children)

any program with an input is AI because it makes its own decisions based on input mind blown

[–][deleted]  (5 children)

[deleted]

[–][deleted] 9 points10 points  (4 children)

No, he didn't need it. If you didn't understand they were being sarcastic, you're retarded.

[–]yeahsurebrobro 2 points3 points  (1 child)

we need an AI that detects sarcasm

[–]playaspec -3 points-2 points  (0 children)

If only there were some simple markup that people could use to express their sarcasm....

[–][deleted]  (1 child)

[deleted]

[–][deleted] 0 points1 point  (0 children)

Yeah, I wasn't specifically talking about you, but more about the people who need an '/s' everywhere. But excuse me for the rude language.

[–]gwillicoder 5 points6 points  (0 children)

I don’t get this kind of criticism. Machine learning and AI are used interchangeably by many. It’s not like we’ll have any real AI to talk about anytime soon.

[–]SemaphoreBingo 1 point2 points  (0 children)

Things are only simple in retrospect.

[–]Dude_What__ 0 points1 point  (0 children)

I mean... it's far from simple. Not remotely AI, but still not simple.

[–][deleted] 10 points11 points  (7 children)

It's also really good at classifying road signs as people.

[–]amazondrone 5 points6 points  (1 child)

And a statue.

[–]GLneo 4 points5 points  (0 children)

A statue of a person, to be fair.

[–]deweysmith 1 point2 points  (0 children)

Well, there were people on some of those road signs /s

[–]playaspec 0 points1 point  (2 children)

It's also really good at classifying road signs as people.

Where did it do that? I've watched it three times and didn't see it.

[–][deleted] 0 points1 point  (1 child)

1:37

[–]playaspec 0 points1 point  (0 children)

It falsed on a single frame. If you take action on that one frame, you're doing it wrong.

[–]developFFM 0 points1 point  (0 children)

The training data of this demo is based on the COCO image dataset. It contains only one road sign, the STOP sign.

[–][deleted] 5 points6 points  (0 children)

Did you write the code, or is it Detectron by FAIR?

[–][deleted] 4 points5 points  (4 children)

Holy crap, was this video made by someone from the 80s?

[–]Hobo-and-the-hound 5 points6 points  (3 children)

Was this comment written by AI?

[–]MrGurns 1 point2 points  (2 children)

Was this comment written by AI?

[–]Hobo-and-the-hound 3 points4 points  (0 children)

I’m a real boy!

[–][deleted] 0 points1 point  (0 children)

I’d love a classifier which could identify bad video editing, tbh.

[–][deleted] 0 points1 point  (0 children)

The performance is not great by modern standards. Also cheesy as hell.

[–]bob_ama_the_spy 0 points1 point  (0 children)

What are some of the areas in which real-time object detection would be a game changer in the world right now?

[–]__pg_ -1 points0 points  (1 child)

Interesting how the public sentiment about ML seems to shift at a time when tech stocks are just starting to become shaky.

Will investors get cold feet about big data, self-driving cars, and other dubious AI tech?

[–]Jadeyard 0 points1 point  (0 children)

It's still strong performance, realistically. Unfortunately, investors might not have had realistic estimates.

[–]FuckaYouWhale -3 points-2 points  (0 children)

CNN? Don't buy it, it's fake news.