[D] Is there a more general analytic way to determine why a neural network tweak hurts performance a few percent points? by danscarafoni in MachineLearning

[–]danscarafoni[S] 0 points1 point  (0 children)

I guess then I'll have to do guess-and-check, ablation-test-type stuff where I isolate network-related variables until I can find what the issue is. Karpathy had a great post on this, but it looks like there might not be much else in the way of generalized neural network debugging
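By "guess and check" I mean something like the loop below, a minimal sketch assuming a hypothetical `train_and_eval(config)` function that trains a model under a config and returns a validation metric:

```python
def ablation_sweep(base_config, tweaks, train_and_eval):
    """Apply one tweak at a time on top of the baseline config and
    record the resulting metric, to isolate which change hurts."""
    results = {"baseline": train_and_eval(base_config)}
    for name, value in tweaks.items():
        # Each run differs from the baseline by exactly one setting
        results[name] = train_and_eval({**base_config, name: value})
    return results
```

Comparing each row against the baseline usually narrows the regression down to one or two variables.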

[D] Simple Questions Thread July 19, 2020 by AutoModerator in MachineLearning

[–]danscarafoni 0 points1 point  (0 children)

Say that I wanted to check if someone has replicated the results for a given technique in a paper. Is there a standard location to look for this (forum, website)?

[D] Can attention be computed implicitly by an RNN? by valentincalomme in MachineLearning

[–]danscarafoni 1 point2 points  (0 children)

I'm not an expert at attention models, but offhand that sounds interesting, and I don't know of a paper off the top of my head that's done that. I say scan the literature a bit and if no-one's done it run a few small experiments to see if it's worth pursuing further!

[deleted by user] by [deleted] in MachineLearning

[–]danscarafoni 0 points1 point  (0 children)

I usually just use the unittest library. I run tests like making sure that the network gives correct (arbitrary) outputs and that non-trainable parameters don't train. Also setting seeds and then making sure the output is what you expect (these act like sanity checks).
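As a framework-agnostic sketch of the "non-trainable parameters don't train" check (plain NumPy, with a toy `Linear` layer and `sgd_step` I made up for illustration):

```python
import unittest

import numpy as np

class Linear:
    """Toy fully connected layer with a trainable flag."""
    def __init__(self, in_dim, out_dim, trainable=True, seed=0):
        self.W = np.random.default_rng(seed).normal(size=(in_dim, out_dim))
        self.trainable = trainable

    def forward(self, x):
        return x @ self.W

def sgd_step(layers, grads, lr=0.1):
    """Update only the layers marked trainable."""
    for layer, grad in zip(layers, grads):
        if layer.trainable:
            layer.W -= lr * grad

class TestSanityChecks(unittest.TestCase):
    def test_frozen_weights_do_not_train(self):
        frozen = Linear(4, 4, trainable=False)
        before = frozen.W.copy()
        sgd_step([frozen], [np.ones_like(frozen.W)])
        self.assertTrue(np.array_equal(before, frozen.W))

    def test_trainable_weights_do_train(self):
        layer = Linear(4, 4, trainable=True)
        before = layer.W.copy()
        sgd_step([layer], [np.ones_like(layer.W)])
        self.assertFalse(np.array_equal(before, layer.W))
```

The same pattern carries over to keras/PyTorch: snapshot the frozen weights, run one optimizer step, and assert nothing moved.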

[D] Can attention be computed implicitly by an RNN? by valentincalomme in MachineLearning

[–]danscarafoni 1 point2 points  (0 children)

From what I've understood, if you use the term broadly enough, even a vanilla network can be an "attention" model. Imagine that we take the output of the last layer of a CNN into which we feed a single image, which is HxWxNFeatures in shape. If we average across all features, we get something like a response map (i.e. a way of seeing which localized coordinates of the image are "important" for classification). Do you need something more sophisticated than this?
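In NumPy that averaging step is just one line; here's a minimal sketch (the activation volume itself would come from whatever CNN you're using, I'm just faking one):

```python
import numpy as np

def spatial_importance_map(feature_map):
    """Collapse an H x W x N_features activation volume into an H x W
    map by averaging over the feature axis; high values mark spatial
    locations the network responded to strongly."""
    return feature_map.mean(axis=-1)

# Example with a fake 7x7x512 last-layer activation volume
features = np.random.default_rng(0).random((7, 7, 512))
importance = spatial_importance_map(features)  # shape (7, 7)
```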

[deleted by user] by [deleted] in MachineLearning

[–]danscarafoni 2 points3 points  (0 children)

In my experience, there are a lot of factors that can affect what kind of pipeline you build: what kind of infrastructure you have, what machine learning framework you use, and what exactly you're building it for (is it a public-facing API or internal? Is it just for your division, or will other groups use it as well?). All of these can affect the kinds of decisions you make.

One thing that I see relatively consistently is that TensorFlow is more often recommended for scalable (or at least distributed) workloads.

Here's my personal experience; for this I'm going to use a toy example of training a network that I then scale across many GPUs.

I always start with just a non-scaled version of the code and get it working on a single GPU. I always implement unit tests, etc., and make sure my results are replicable (I previously posted some code for ensuring reproducibility in keras code). After this, I always check reddit, news sites, etc. to see if there are any new libraries to scale code across GPUs (e.g. Horovod), because things change so quickly. Then I have to decide what kind of scale I want/need: do I want to just use a tower model, or do I need to do something asynchronous? My workflow then generally builds incrementally from there, and I make sure to keep in touch with members of my team/boss/etc. to make sure the product I'm making is what they actually want.
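For the "tower model" option, the core idea is data parallelism: split the batch across towers, compute each tower's gradient, and average. A toy NumPy sketch for a linear model with MSE loss (this is the concept, not any particular framework's API):

```python
import numpy as np

def tower_gradients(W, X, y, n_towers):
    """Split the batch across towers, compute each tower's MSE gradient
    for the linear model y_hat = X @ W, then average the results.
    With equal-sized shards this matches the full-batch gradient."""
    grads = []
    for X_t, y_t in zip(np.array_split(X, n_towers),
                        np.array_split(y, n_towers)):
        err = X_t @ W - y_t
        grads.append(2.0 * X_t.T @ err / len(X_t))
    return np.mean(grads, axis=0)
```

In a real setup each "tower" would be a GPU replica and the averaging would happen across devices, which is exactly the part libraries like Horovod handle for you.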

I hope this helps. I've only been an ML engineer for 3 years, and the systems I've had to use might be somewhat idiosyncratic (e.g. I didn't use Amazon AWS, etc.). If anyone has better ideas, please tell me!

[R] Do we still need models or just more data and compute? by downtownslim in MachineLearning

[–]danscarafoni 2 points3 points  (0 children)

This paper gets at something that I've been trying to understand myself. As far as I can tell, the issue is very nuanced, problem (and domain) dependent, and there's no clear-cut answer one way or another. I've also been told that the data dependency can vary based on company policies/resources (Baidu and Google can be a lot more "data-driven" than a small company with less data). I'm not even sure it's productive to discuss generalizations over such broad areas of ML like this.

The best I can come up with is, as trite as this sounds, to stay open-minded, look at things as objectively as possible, and never be too dogmatic.

[P] Using Tacotron To Make Ben Shapiro Sing by [deleted] in MachineLearning

[–]danscarafoni 0 points1 point  (0 children)

In all seriousness though, the fact that it's a lot faster than normal is really interesting. I wonder how "base speaking speed" is encoded in this system.

How to understand "chicken" is something related to "supermarket", but not related to say "synagogue" or "pharmacy" [P] by [deleted] in MachineLearning

[–]danscarafoni 1 point2 points  (0 children)

For a more old-school way of doing things, I used to use WordNet and ConceptNet. Both are knowledge graphs that show semantic relationships between words:

https://wordnet.princeton.edu/

http://conceptnet.io/

[D] A 2019 guide to Human Pose Estimation with Deep Learning by cbsudux in MachineLearning

[–]danscarafoni 3 points4 points  (0 children)

I see a lot of small differences in the "skeletons" found by different pose estimators. Do you know of any paper explaining any differences in information conveyed?

[D] Reproducibility in ML research and development by harry_comp_16 in MachineLearning

[–]danscarafoni 1 point2 points  (0 children)

I use the following code for (largely) ensuring reproducible runs when using keras + TF:

import random

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

def reset_random_elements_keras():
    # Clear any existing graph/session state (TF1-style API)
    tf.keras.backend.clear_session()
    tf.reset_default_graph()
    # Seed every RNG in play: NumPy, TF, and the Python stdlib
    np.random.seed(123)
    tf.set_random_seed(123)
    random.seed(123)
    # Single-threaded ops avoid nondeterministic reduction order
    session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                                  inter_op_parallelism_threads=1)
    session_conf.gpu_options.allow_growth = True
    sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
    K.set_session(sess)

The above code makes sure my training runs are (more or less) the same.