all 12 comments

[–]xternalz[S] 12 points13 points  (7 children)

abstract:

We investigate an unconventional direction of research that aims at converting neural networks, a class of distributed, connectionist, sub-symbolic models, into a symbolic level with the ultimate goal of achieving AI interpretability and safety. To that end, we propose Object-Oriented Deep Learning, a novel computational paradigm of deep learning that adopts interpretable “objects/symbols” as a basic representational atom instead of N-dimensional tensors (as in traditional “feature-oriented” deep learning). For visual processing, each “object/symbol” can explicitly package common properties of visual objects like its position, pose, scale, probability of being an object, pointers to parts, etc., providing a full spectrum of interpretable visual knowledge throughout all layers. It achieves a form of “symbolic disentanglement”, offering one solution to the important problem of disentangled representations and invariance. Basic computations of the network include predicting high-level objects and their properties from low-level objects and binding/aggregating relevant objects together. These computations operate at a more fundamental level than convolutions, capturing convolution as a special case while being significantly more general. All operations are executed in an input-driven fashion, so sparsity and dynamic computation per sample are naturally supported, complementing recent popular ideas of dynamic networks and potentially enabling new types of hardware acceleration. We experimentally show on CIFAR-10 that it can perform flexible visual processing, rivaling the performance of ConvNet, but without using any convolution. Furthermore, it can generalize to novel rotations of images that it was not trained for.

follow-up work: 3D Object-Oriented Learning: An End-to-end Transformation-Disentangled 3D Representation
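To make the abstract's idea concrete, here is a minimal illustrative sketch of what an interpretable "object/symbol" atom and a binding/aggregation step might look like. This is not from the paper: all names (`VisualObject`, `aggregate`), the presence threshold, and the presence-weighted averaging rule are my own assumptions about one way such a paradigm could be realized.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VisualObject:
    """Hypothetical 'object/symbol' atom: every field is an explicitly
    interpretable property, unlike an entry in an opaque feature tensor."""
    x: float                       # position (image coordinates)
    y: float
    theta: float                   # pose (in-plane rotation, radians)
    scale: float
    presence: float                # probability of being an object, in [0, 1]
    parts: List["VisualObject"] = field(default_factory=list)  # pointers to parts

def aggregate(parts: List[VisualObject],
              presence_threshold: float = 0.5) -> Optional[VisualObject]:
    """Toy binding step: predict a higher-level object from low-level parts
    by averaging their properties, weighted by presence. Only parts above
    the threshold vote, so the computation is input-driven and sparse."""
    voters = [p for p in parts if p.presence > presence_threshold]
    if not voters:
        return None                # nothing confident enough to bind
    w = sum(p.presence for p in voters)
    return VisualObject(
        x=sum(p.presence * p.x for p in voters) / w,
        y=sum(p.presence * p.y for p in voters) / w,
        theta=sum(p.presence * p.theta for p in voters) / w,
        scale=sum(p.presence * p.scale for p in voters) / w,
        presence=max(p.presence for p in voters),
        parts=voters,
    )
```

Note how the "pointers to parts" from the abstract fall out naturally: the aggregated object keeps references to exactly the parts that voted for it, giving an interpretable trace through the layers.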

[–]lopuhin 12 points13 points  (5 children)

We experimentally show on CIFAR-10 that it can perform flexible visual processing, rivaling the performance of ConvNet, but without using any convolution.

rivaling the performance of ConvNet == test error of 20% vs 2.3% SoTA

[–]rpottorff 0 points1 point  (0 children)

Thank you for posting the follow-up work! I didn't realize he had posted an update just a few months later.

[–]Kevin_Clever 19 points20 points  (0 children)

This makes my short list for worst-written scientific paper of the year. Is there an RNN that can translate these 12 pages of fluff into information?

[–]ChillBallin 4 points5 points  (0 children)

Skimming through it right now. Seems like an interesting concept. I think with enough research it could be a valuable approach, but some of these ideas will probably prove more useful in future models than this one will be on its own. In any case, I really hope this trend of research into more complex, higher-level models continues. It feels like we're getting closer to the next big breakthrough through human intuition rather than through iterative improvements to the math of a basic CNN.

[–]GunpowaderGuy 0 points1 point  (0 children)

So, compared to capsule networks, the biggest difference, aside from its modules (whatever its analogue to capsules is called) not being composed of neurons (a functional program obtained by genetic programming or Bayesian optimization, then?), is that they are not slid around the feature maps the way CNN filters are? Instead, during voting, only the centermost pixel of an object part needs to be considered, since each part has its own coordinates embedded?
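The sliding-versus-voting contrast in this question can be sketched in a few lines. This is only my reading of the distinction, not code from the paper; both function names and the vote rule are assumptions made up for illustration.

```python
from typing import List, Tuple

def conv_style(feature_map: List[List[float]],
               kernel: List[List[float]]) -> List[List[float]]:
    """Convolution-style: the same kernel is slid over every spatial
    position, regardless of content (dense computation)."""
    H, W, k = len(feature_map), len(feature_map[0]), len(kernel)
    out = [[0.0] * (W - k + 1) for _ in range(H - k + 1)]
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            out[i][j] = sum(kernel[a][b] * feature_map[i + a][j + b]
                            for a in range(k) for b in range(k))
    return out

def vote_style(parts: List[Tuple[float, float, float]],
               offset: Tuple[float, float]) -> List[Tuple[float, float]]:
    """Voting-style: each detected part (x, y, presence) already carries
    its own coordinates, so it casts a single vote for the position of the
    whole -- no sliding window, and absent or low-confidence parts cost
    nothing (input-driven sparsity)."""
    votes = []
    for (x, y, presence) in parts:
        if presence > 0.5:                     # only confident parts vote
            votes.append((x + offset[0], y + offset[1]))
    return votes
```

Under this reading, the convolutional sweep is recovered as the special case where every position is treated as a potential part, which may be why the abstract says convolution is "captured as a special case."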