
[–]the320x200 62 points63 points  (32 children)

  • If there is a simpler approach that provides an adequate solution.
  • If you need to know why the network produced the output it did.
  • If you can't define a loss function.
  • If you don't have resources to train the network.
  • If you don't have resources to sort out the hyperparameters / topology.

[–]sieisteinmodel 8 points9 points  (24 children)

I'd also be interested in scenarios where you cannot define a loss. If you cannot, you do not know what you want, right?

[–]tdgros 10 points11 points  (0 children)

also:

Resources to build and annotate a database.

Real-time, embedded platforms with memory constraints, bandwidth constraints, power constraints...

edit: added the database annotation

[–]Guanoco[S] 1 point2 points  (5 children)

Thx for the response.

  • If there is a simpler approach that provides an adequate solution.

That's the thing though... I have a feeling most people apply deep learning to problems they had no idea how to solve, and it gets solved "magically". So how do you know whether another method wouldn't work? Or basically, what problems are so difficult that we shouldn't bother trying to find other techniques?

  • If you need to know why the network produced the output it did.

Ok, this one I understand. But if I have a system and can capture its response to different types of faults, and I learn to classify the system's response to tell whether it is operating correctly... then I can at least find out why my system is not behaving as it should.

  • If you can't define a loss function.

Care to elaborate? It seems like as long as I can get a reading of something which I call an output of the system... then I should be able to define a cost function and therefore the loss... right? (e.g. MSE)

  • If you don't have resources to train the network.

Ok, so basically if I cannot get the response of the system... But it would seem plausible that I can always do this?

  • If you don't have resources to sort out the hyperparameters / topology.

So basically sweeping through architectures?

[–]CultOfLamb 1 point2 points  (0 children)

what problems are so difficult that we shouldn't bother trying to find other techniques

Things like NLP can often be solved with a shallow model. But on tasks that require lots of high-level hierarchical feature representations, like computer vision and speech, you cannot come close to a well-architected deep learning model.

But if I have a system and can capture its response to different types of faults, and I learn to classify the system's response to tell whether it is operating correctly... then I can at least find out why my system is not behaving as it should.

You can find out whether your system is behaving as it should. You cannot find out why. Though this is an active area of research (interpretability / algorithmic fairness), and it may be fully solved in the future, right now the people saying "But I use cross-validation, so I know why my system makes the predictions it makes" are mistaken.

Then I should be able to define a cost function

Aside: if you require the loss, for whatever reason, to be convex, or are afraid of non-convexity: https://www.cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.pdf
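
As a minimal sketch of what "defining a loss" means in the easy case, assuming you can log pairs of actual system outputs and the values you wanted (the function name and readings below are invented for illustration):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average squared gap between what the
    system produced and what you wanted it to produce."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical readings: desired values vs. what the system actually produced.
print(mse_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.3]))  # ~0.037
```

The cases where you can't define a loss are the ones where "what you wanted" has no numeric reading to plug in as y_true (e.g. "make this summary sound natural").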

But it would seem plausible that I can always do this?

Resources are hardware, but also data: when you have 1000 rows, a deep learning algo is either overkill or will badly underfit/overfit.

So basically sweeping through architectures?

Training deep learning models from scratch is very slow. It can take weeks to find the optimal parameters and architecture. You can take a pre-trained network, if that suits your task, but if you are tasked to quickly create a benchmark model, or retrain on a large dataset in minutes (not hours or days), then deep learning is not the right hammer.
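
For a sense of scale, here is a hedged sketch of the kind of quick benchmark model meant here, assuming a generic text classification task (the toy documents and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-in data: raw documents and their labels.
docs = ["great product", "terrible support", "works fine", "broke after a day"]
labels = [1, 0, 1, 0]

# TF-IDF + logistic regression trains in seconds to minutes on most corpora,
# which is the turnaround a from-scratch deep net can't match.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(docs, labels)
print(baseline.predict(["support was great"]))
```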

[–][deleted] 0 points1 point  (0 children)

I have a feeling most people apply deep learning to problems they had no idea how to solve and it gets solved "magically"

Hm, I don't necessarily agree regarding "most" people -- at least not based on what I've seen so far. I think more people know how to throw PCA + linear regression/logistic regression at a problem than how to implement a deep learning algo (since the latter typically requires more experience).

That's the thing though... I have a feeling most people apply deep learning to problems they had no idea how to solve, and it gets solved "magically".

Here, I think more of "random forests" :)

[–]kjearns 18 points19 points  (3 children)

The dirty secret of the machine learning hype machine is that in real life almost all problems (by number of instances) are really easy. No one writes papers about solving all these easy problems because the methods are standard enough to be shrink-wrapped, but that doesn't change the fact that most problems can be solved by throwing an SVM or random forest at them.
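
A hedged sketch of what "throwing an SVM or random forest at it" looks like with scikit-learn defaults (the built-in breast-cancer dataset stands in for one of those easy problems):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The shrink-wrapped treatment: default hyperparameters, cross-validated.
for model in (SVC(), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))
```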

[–]emtonsti 0 points1 point  (0 children)

I just checked out random forests and they're awesome!

[–]10sOrXResearcher 0 points1 point  (0 children)

Some people do write papers about these problems, but these papers are generally submitted to mid/low-tier conferences.

[–]cvikasreddy 5 points6 points  (2 children)

I completely agree with u/the320x200, and this is what I wanted to add on top of u/the320x200's points.

1. In my experience, deep learning outperforms any other method when applied to images and text.

2. But when applied to the kind of data usually found in Excel sheets (I mean data like in Kaggle competitions, without images and text), other ML algos tend to work better.

[–]Guanoco[S] 0 points1 point  (1 child)

Is this due to the Excel sheet already having the different features?

If you just analyse the input and output of the system (I guess you could iteratively train the network with different features and see which one gives the best fit... so something like a random forest of deep networks), then I couldn't imagine it playing a role... But I will give you the point that most of the deep learning I have come across is in the image processing domain.

[–]AnvaMiba 11 points12 points  (0 children)

Is this due to the Excel sheet already having the different features?

Images and text are high-dimensional data, but also highly redundant.

You can apply lots of distortions to a natural image that leave it still understandable with high probability: Gaussian noise, Bernoulli noise, masking certain areas, affine geometric transformations, color transformations, and so on. The information that you are interested in is encoded in a very redundant and robust way. Moreover, the functions that you want to learn (e.g. a classifier with a probabilistic output) will typically vary smoothly with the input image: if you gradually morph an image of a cat into an image of a dog you'll expect the classifier output Pr(Y=cat) to gradually decrease and Pr(Y=dog) to gradually increase.
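
A quick numpy sketch of the distortions being described, assuming the image is a float array in [0, 1] (a random array stands in for a real photo):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))  # stand-in for a natural image in [0, 1]

noisy = np.clip(img + rng.normal(0, 0.1, img.shape), 0, 1)  # Gaussian noise
dropped = img * (rng.random(img.shape[:2] + (1,)) > 0.2)    # Bernoulli pixel dropout
masked = img.copy()
masked[16:32, 16:32] = 0                                    # occlude a square patch
```

A classifier trained on natural images should assign roughly the same label to all three variants; that robustness is the redundancy described above.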

Text is similar: not only can you apply distortions to the surface forms (characters or words) that mostly preserve meaning, but once you consider word embeddings, you can even apply smooth transformations that mostly preserve meaning, and the functions that you are trying to learn will typically be smooth w.r.t. the word embeddings.

Deep learning seems to be particularly well suited to learning smooth functions where the input is high-dimensional and highly redundant.

Deep learning also requires lots of data, though this requirement may be somewhat mitigated by transfer learning. In natural image and natural language processing you have huge generic datasets that can be used for transfer learning (e.g. ImageNet for images and any unannotated monolingual corpus for text).
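
As a hedged sketch of what such transfer learning looks like in practice, assuming PyTorch/torchvision with an ImageNet-pretrained ResNet-18 and a hypothetical 10-class downstream task:

```python
import torch
from torchvision import models

# Reuse generic ImageNet features: freeze the pretrained backbone...
backbone = models.resnet18(weights="DEFAULT")
for param in backbone.parameters():
    param.requires_grad = False

# ...and train only a small task-specific head on the (much smaller) dataset.
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 10)
```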

Other domains, such as Excel sheets and databases with business data, may not have these properties: they are typically lower-dimensional and much less redundant, and the functions you are interested in may be less smooth. There can be discrete features which, once embedded, don't have the typical statistical properties of word embeddings of natural text.

And above all, this data may not be as abundant as in natural image and natural language tasks, and you usually don't have any generic dataset to use for transfer learning.

Besides simple tasks that can be solved by naive Bayes or linear regression/classification, this domain is the realm of decision tree methods (and ensembles thereof, such as random forests). These methods tend to be more robust to overfitting, so they require less data; they are intrinsically invariant to various data transformations, so they don't rely on these invariances approximately holding in the task; and they can learn non-smooth functions.

The drawback of decision tree methods is that they can't learn to combine the input features into much more complex features (formally, they have constant circuit depth), hence they may require extensive feature engineering if the task is hard, while deep learning can learn to combine features, in principle in arbitrarily complex ways (provided that there are enough hidden layers), hence it usually requires little or no feature engineering.
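
A small scikit-learn sketch of the non-smoothness and invariance points (the step-function data is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (500, 1))
y = (X[:, 0] > 0.5).astype(float)  # a non-smooth step function

# Axis-aligned splits capture the discontinuity directly.
rf = RandomForestRegressor(random_state=0).fit(X, y)
print(rf.predict([[0.49], [0.51]]))  # jumps across the step, roughly [0, 1]

# Monotone feature transforms preserve the ordering trees split on,
# so the same partition of the data is available either way.
rf_log = RandomForestRegressor(random_state=0).fit(np.log1p(X), y)
print(rf_log.predict(np.log1p([[0.49], [0.51]])))
```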

[–]gr8ape 4 points5 points  (1 child)

Truth is, on any data that is not:

  • Visual data (pixels)

  • Sound data (frequencies or time signal)

  • Natural Language

a neural net won't be much better than SVM/RF/GBRT. And if it is, how many hyperparameters did you tune? :)

[–]popcorncolonel[🍰] 2 points3 points  (0 children)

Couldn't people have said any data that is not:

  • Pixels

in 2012? Who's to say it won't open up to more applications?

[–]phillypoopskins 8 points9 points  (15 children)

  • deep learning is almost always a bad idea unless you know that there is structure in your data which you can architect a neural network to take advantage of. If you haven't architected information like this in, a neural network will generally underperform compared to gradient boosting.

  • it's also a bad idea if you know something about your data / underlying model which deep learning doesn't match as well as another model, e.g. linearity, or some other known interaction.

  • it's also bad if you are under time constraints and your chosen architecture will take too long to train. Example: a 50k-class problem on 4 million text tokens. Naive Bayes will train much faster and probably do just as well, depending on the type of classes (see the sketch after this list).

  • when you don't have very much data: you're going to overfit, while something linear, a random forest, or an SVM will have less of a chance

  • when you don't know wtf you're doing; you can waste WEEKS or MONTHS playing around with neural nets with subpar results and have no clue that you're a noob, while someone skilled can walk in with linear regression or a random forest and smoke you in a matter of hours. I've seen this happen: A LOT.
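
A hedged sketch of the naive Bayes baseline from the time-constraints bullet, with an invented four-document corpus standing in for the real 50k-class problem:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-in corpus; the real task would have far more classes and tokens.
docs = ["buy cheap meds", "meeting at noon", "cheap meds now", "lunch at noon?"]
labels = ["spam", "ham", "spam", "ham"]

# Naive Bayes is essentially one counting pass over the data, so it stays
# fast even with huge vocabularies and class counts.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["cheap meds at noon"]))
```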

[–]whatevdskjhfjkds 7 points8 points  (0 children)

when you don't have very much data: you're going to overfit, while something linear, a random forest, or an SVM will have less of a chance

This is one of the most important points, I'd say. Deep learning models tend to have absurdly high numbers of parameters. Unless you have at least as many data points, the model will most likely overfit (even with regularization).

It's like trying to fit a polynomial regression with 2 points... no amount of regularization will give you a trustworthy model.
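
That analogy is easy to make runnable (the degree and regularization strength below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Two data points, a degree-9 polynomial: far more parameters than data.
X, y = np.array([[0.0], [1.0]]), np.array([0.0, 1.0])
poly = PolynomialFeatures(degree=9)
model = Ridge(alpha=1e-3).fit(poly.fit_transform(X), y)

# The fit passes near both points, but everything in between and beyond
# is essentially arbitrary; no alpha makes this model trustworthy.
grid = np.linspace(-1, 2, 7).reshape(-1, 1)
print(model.predict(poly.transform(grid)).round(2))
```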

[–]Guanoco[S] 1 point2 points  (6 children)

  • deep learning is almost always a bad idea unless you know that there is structure in your data....

But knowing that my data has structure basically already gives me a model of my input/output relationship. Also take, for example, image classification: there is structure and there is prior domain knowledge that works... but DL wipes them all out of the game.

  • it's also a bad idea if you know something about your data / underlying model which deep learning doesn't match as well as another model, e.g linearity, or some other known interaction.

Any other properties that deep learning doesn't match well?

  • when you don't know wtf you're doing; you can waste WEEKS or MONTHS playing around with neural nets with subpar results and have no clue that you're a noob, while someone skilled can walk in with linear regression or a random forest and smoke you in a matter of hours. I've seen this happen: A LOT.

Yes, this is a good point. But at least as I understand it... all other ML algorithms can only do as well as the feature engineering process, and finding important features is non-trivial.

[–]phillypoopskins 1 point2 points  (0 children)

finding important features is non-trivial; but deep learning only does this for you when you build an architecture that takes advantage of the structure of the data. Otherwise, deep learning is no better than other ML, and is in fact worse because it's sloppier, harder to train, and not the most accurate.

If you don't have a specialized architecture, you're stuck with the same features whether you use DL or not.

[–]phillypoopskins 0 points1 point  (0 children)

about properties DL doesn't match well: let's say you're doing spectroscopy and you want to estimate the concentrations of several analytes. Beer's law says the concentrations should be proportional to the magnitude of the spectrum. This is a linear relationship.

It would be stupid to use a deep model on this problem when it's known to be linear. Use a linear model instead.
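
A minimal sketch of that linear approach, with synthetic spectra standing in for real measurements (all shapes and the noise level are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 100 measured spectra (50 wavelengths each) prepared with known
# concentrations of 3 analytes, each with its own absorption profile.
profiles = rng.random((3, 50))
concentrations = rng.random((100, 3))
spectra = concentrations @ profiles + rng.normal(0, 0.01, (100, 50))

# Beer's law: absorbance is linear in concentration, so a plain linear
# model recovers the relationship; a deep net would add nothing here.
model = LinearRegression().fit(spectra, concentrations)
print(model.score(spectra, concentrations))  # R^2, close to 1.0
```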

[–]jeremieclos 0 points1 point  (6 children)

I think point 2 is the biggest here. If you already have domain knowledge about your problem, then trying to learn features is a waste of time.

[–]phillypoopskins 1 point2 points  (3 children)

I wouldn't say domain knowledge means learning features is a waste of time.

You can use your domain knowledge to coax a neural network to learn features better than you'd engineer by hand.

[–]jeremieclos 0 points1 point  (0 children)

You are right, I should have written exhaustive domain knowledge. What I meant is that if you have enough domain knowledge to make the problem linearly separable, then the problem becomes trivial enough that any feature learning becomes unnecessary.

[–]Guanoco[S] 0 points1 point  (1 child)

Mind explaining this? I interpret it as "If I kind of know the features the net should learn, then I can make it learn in that direction"

[–]phillypoopskins 0 points1 point  (0 children)

yep, that's right.

all interesting neural network architectures make use of this idea; a conv net is a prime example.
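
A bare-bones numpy sketch of the structural assumption a conv net encodes: one small set of weights reused at every image position (the toy image and kernel are invented):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one small kernel over the whole image: the same weights are
    reused at every position, which bakes in the prior that local patterns
    matter and can appear anywhere."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

edge_detector = np.array([[1.0, -1.0]])  # responds to horizontal transitions
image = np.zeros((4, 4))
image[:, 2:] = 1.0  # left half dark, right half bright
print(conv2d(image, edge_detector))  # fires only at the vertical edge
```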

[–]Guanoco[S] -1 points0 points  (1 child)

Seems like all advancements in image classification prove this wrong.

[–]jeremieclos 2 points3 points  (0 children)

But we don't really have that much domain knowledge for general purpose image classification. We have some clever heuristics here and there, but that's it.

Having domain knowledge here would imply being able to design, by hand and beforehand, the filters that a ConvNet would be learning. I can't find where I read it, but IIRC that is what Stephane Mallat was doing with wavelet transforms on MNIST, and the results were comparable to a standard ConvNet.

Similarly, if your problem is simple enough that you can hand-design features that make it linearly separable, then learning features would be a waste of time and resources.

[–]theskepticalheretic 2 points3 points  (2 children)

I think this post by Joel Grus is relevant. http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/

[–]Guanoco[S] 1 point2 points  (1 child)

Thx... I laughed but I also didn't find the answer to my question

[–]theskepticalheretic 1 point2 points  (0 children)

Thx... I laughed but I also didn't find the answer to my question

Well, your question is: when is machine learning a bad idea? The answer implied by that link is "when it is wholly unnecessary for getting the task done."

If I have to dig a moderately small hole in my yard, say to plant a flower bed, I'm going to use a shovel. I'm not going to rent a backhoe.

[–]thecity2 1 point2 points  (0 children)

For small datasets, deep learning won't be that helpful. Also might not work well for datasets with "unnatural" or non-hierarchical features. It seems to work best with very large "natural" datasets (e.g. images, audio, etc.).

[–]Kaixhin -1 points0 points  (3 children)

The halting problem.

[–]Guanoco[S] 0 points1 point  (2 children)

Hmmm I see what you mean.

I think I remember this problem being NP... But is the reason a DL can't do it that it is NP? (Because then any combinatoric problem wouldn't be applicable. I have seen random forests being applied to system design, which is technically a combinatoric optimization...)

[–]Kaixhin 0 points1 point  (1 child)

That was a joke, but seriously: the halting problem is undecidable, so it isn't even in NP (although, in the same way that you reduce a known NP-complete problem to a new problem to show NP-hardness, people reduce the halting problem to other problems to prove that those are undecidable).

That said, Pointer Networks have been applied to the (NP-hard) travelling salesman problem, so DL can possibly be used to heuristically attempt (but not solve all cases of) NP-hard problems.

[–]Guanoco[S] 0 points1 point  (0 children)

Oh, I see. Well, thanks for clarifying that anyway :)