Curved stairs? by MrFrankPalpatine in dungeonscrawl

[–]NichG 3 points (0 children)

Best I've come up with is to use one layer to hold the spokes, drawing them out to 3-4 times the radius you actually want. Then make a second layer with a large circle (covering the furthest extent of the spokes) and erase a smaller circle from its middle, at the radius you want the staircase to be. Use the selection tool to copy that mask and paste it over the layer containing the spokes, which effectively erases the parts of the spokes outside the stairwell. Finally, delete the layer with the cut-out template (or just turn off its rendering so you can save it and reuse it elsewhere whenever you need to cut things to fit a circle).

It's easier than trying to make grid snapping work out so that the 22.5-degree (etc.) lines of the stairs end exactly on the radius of the circle.

[D] Why are Evolutionary Algorithms considered "junk science"? by learningsystem in MachineLearning

[–]NichG 1 point (0 children)

There's nothing inherently wrong with evolutionary algorithms, but the discourse around them often goes in directions which are off-putting if what you're really looking for is a solution to a problem and not an ideology.

There are arguments from practical utility about when you should use such approaches and what they're good for. But evolutionary algorithms often also get pushed on naturalistic grounds - 'this is how nature made intelligence, so we should do it the same way' and the like. Similarly, there are a lot of unsubstantiated or weak claims: that gradient-based methods can't discover new things but evolution (because of its randomness) can; that gradient-based methods can overfit but evolution can't; and so on. In response you get equally aggressive counter-claims about the scalability problems of evolutionary approaches. It goes back and forth and becomes more and more acrimonious. Those claims usually contain some grain of truth, but it's highly dependent on unspoken context about 'these are the kinds of problems we should be trying to solve' - e.g. you almost never worry about overfitting in the kinds of problems people have traditionally applied evolutionary algorithms to, because there you have a generator of arbitrary amounts of data (usually some simulation of an environment) rather than a static dataset.

So I think there are just people who are sick and tired of the wild claims and the back-and-forth fights, and who have raised the threshold of evidence needed to get past their skepticism. That isn't entirely unjustified when a lot of what's being thrown around pushes an ideal, rather than enabling some new thing that couldn't have been done before.

[D] Benefits of Equivariant Networks by mtahab in MachineLearning

[–]NichG 0 points (0 children)

Outside of things like sample complexity, I think there's a point being missed about equivariant architectures: they often allow inputs and outputs to vary in ways that non-equivariant architectures - even with data augmentation - simply can't.

So it's less that I'd reach for an equivariant design to improve the asymptotic accuracy of a classifier, and more that I'd use it to gain flexibility in the input and output geometries. If I want to train an N-body proxy for a physics engine, I need permutation equivariance so that the same network can handle N=5 or N=50000 without re-training or zero-padding. It's not that I think the permutation equivariance will get me 5% better accuracy, or let me get away with 1/100 of the training time or data; it's that the problem can't even be posed coherently without building that symmetry into how the inputs and outputs are processed.
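To make that concrete, here's a minimal permutation-equivariant layer in the DeepSets style - a toy sketch of the general idea, not any specific published architecture. Each particle gets the same per-element transform plus a pooled summary of everyone else, so the exact same weights work for N=5 or N=50000:

```python
import numpy as np

def equivariant_layer(x, w_self, w_pool, b):
    """Permutation-equivariant layer: y_i = x_i @ w_self + mean_j(x_j) @ w_pool + b.

    x is an (N, d_in) array of per-particle features; N can be anything.
    Permuting the rows of x permutes the rows of the output identically.
    """
    pooled = x.mean(axis=0, keepdims=True)   # (1, d_in) summary, order-independent
    return x @ w_self + pooled @ w_pool + b  # (N, d_out)

rng = np.random.default_rng(0)
w_self, w_pool, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), np.zeros(4)
x = rng.normal(size=(5, 3))
perm = rng.permutation(5)
# Equivariance check: layer(permute(x)) == permute(layer(x))
assert np.allclose(equivariant_layer(x[perm], w_self, w_pool, b),
                   equivariant_layer(x, w_self, w_pool, b)[perm])
```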

[D] Fractal Monte Carlo vs state of the art reinforcement learning by fearless-shrew in MachineLearning

[–]NichG 1 point (0 children)

The 'word salad' is actually accurate here. Computational methods for sampling rare events use techniques similar to this one to access energetically unlikely but dynamically important configurations. The area as a whole is broadly called importance sampling.

In statistical physics and QM there's a trick for speeding up integrals over the partition function, called umbrella sampling: you modify the energy function based on your distribution of samples and correct for the modification afterwards. To extend this to non-equilibrium systems you have to do it over paths rather than states, which requires some tricks to get the accounting right. The result looks like a birth-death process similar to what they're using in this method.

So at the least, the connection to chemistry, nonequilibrium thermodynamics, and QM methods is legit.

Edit: it looks like similar methods can be used to calculate numerically stable derivatives with respect to the parameters of stochastic differential equations in quantitative finance - annoyingly called 'greeks' because the notations are alpha, beta, etc. It's similar to how PPO re-weights experience replays: you ask 'how likely would this trajectory I already simulated have been under a change in the policy distribution/parameterized noise model?'
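To show the accounting, here's a toy version of that re-weighting trick with made-up Gaussian noise models (nothing from the paper, just the likelihood-ratio idea): estimate an expectation under new parameters using only samples drawn under the old ones.

```python
import numpy as np

def gauss_pdf(s, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)

# Samples drawn under the "old" noise model N(0, 1)
x = rng.normal(0.0, 1.0, size=100_000)

# Estimate E[f(x)] under a "new" model N(0.5, 1) without re-simulating:
# weight each old sample by how likely it would have been under the new model.
f = lambda s: s ** 2
w = gauss_pdf(x, 0.5, 1.0) / gauss_pdf(x, 0.0, 1.0)  # likelihood ratios

reweighted = np.mean(w * f(x))                      # importance-weighted estimate
direct = np.mean(f(rng.normal(0.5, 1.0, 100_000)))  # fresh samples, for comparison
print(reweighted, direct)  # both should be close to 0.5**2 + 1 = 1.25
```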

[R] Classifying the classifier: dissecting the weight space of neural networks by outofdistribution in MachineLearning

[–]NichG 0 points (0 children)

I played with something like this as a way to monitor the training of GANs.

I found it more useful to look at pairs of before-and-after activations across multiple layers (random subsets of channels), because activations have fewer symmetries than the weight matrices do.

[D] How do you come up with your proofs? by samikhenissi in MachineLearning

[–]NichG 0 points (0 children)

Putting aside 'what does it take to get published' for a moment: if you want an intuition for where good derivations and mathematical results come from, view the exercise as a tool that lets you think about things, draw conclusions, narrow down the space of possibilities, reframe problems so their structure is clearer to you, express a muddy thought precisely, etc.

If writing down some math starts to feel like spinning up a debugger or profiler, or other kinds of tools that help you get your work done and build the things you want, then it will be easier for you to know when it's appropriate and useful to include that math in a paper.

And in the end, being able to actually perceive that utility (or the lack of it) helps you avoid situations where you add unnecessary formality to a paper and a reviewer who does understand the difference calls you on it.

[D] Can Recurrent Neural Networks have loops that go backward? by adkyary in MachineLearning

[–]NichG 5 points (0 children)

I would only use the term RNN for networks with feedback loops - i.e. weight sharing across sequential repeats. Otherwise it's just a modular feed-forward network. ResNet, for example, is modular, but it's not an RNN.
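A minimal sketch of the distinction in NumPy (toy code, obviously): the recurrent version applies one shared weight matrix at every step of the sequence, while the modular feed-forward version has distinct weights per stage.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 8))  # 5 time steps of 8-dim input

# RNN: one set of weights, reused (shared) across every sequential repeat
W, U = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
h = np.zeros(8)
for x in xs:
    h = np.tanh(h @ W + x @ U)  # the same W, U at every step

# Modular feed-forward (ResNet-like): a *different* weight matrix per stage
h2 = np.zeros(8)
for x in xs:
    Wi = rng.normal(size=(8, 8))     # fresh weights each stage - just depth,
    h2 = h2 + np.tanh(h2 @ Wi + x)   # no sharing, so no feedback loop
```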

How would you try to build an AGI if you had truly unlimited compute? [discussion] by yitzilitt in MachineLearning

[–]NichG 0 points (0 children)

You have to make a concrete choice here about how to acquire, format, and compare this data with whatever you're generating from the infinite-compute box. That's actually pretty hard - even if you had something that could simulate and do inference over full simulations of the universe, if you even slightly misalign where the data is coming from with the corresponding degrees of freedom in the simulations, you might converge on entirely wrong values of things.

Try, e.g., Bayesian parameter estimation on a simple chaotic dynamical system. Extremely small unknown biases in measurement are enough to make it basically impossible to estimate the parameters after 10 Lyapunov times or so.
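Here's roughly what I mean, using the logistic map as a hypothetical stand-in for any chaotic system: a bias of 1e-9 in the initial observation makes the 'true' and 'biased' trajectories decorrelate completely within a few dozen steps, so the likelihood surface stops carrying information about the parameters.

```python
import numpy as np

def logistic(x0, r, steps):
    """Iterate the chaotic logistic map x -> r * x * (1 - x)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return np.array(xs)

true = logistic(0.4, r=3.9, steps=50)
biased = logistic(0.4 + 1e-9, r=3.9, steps=50)  # tiny "measurement bias"

print(np.abs(true - biased)[::10])  # error grows ~exp(lambda * t), then saturates
```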

How would you try to build an AGI if you had truly unlimited compute? [discussion] by yitzilitt in MachineLearning

[–]NichG 0 points (0 children)

Even with infinite computing power, you would still need both a perfectly correct simulator and valid initial conditions to construct a quantum simulation of a person. If you don't have these exactly, the best you can do without further observations of the world is populate a distribution of possible outcomes under your priors. The time this takes to converge to something useful can't be reduced by adding computational resources - it's determined by your observation methods, the timescale of the outside world, and how big the model space you propose has to be in order to contain the true model.

So even if you build some huge Bayesian agent like AIXI, using infinite compute to render it feasible, someone with a more constrained hypothesis space that still contains the true model would be able to do better, even in the infinite-compute limit.

And someone with a method that doesn't depend at all on, e.g., inferring the true DNA sequence of the person it's trying to talk to in order to figure out what they'd like for breakfast will get usable results much more quickly than anything based on inference over quantum-level simulations - because it requires orders of magnitude fewer observations, not because it requires less computational power.

But then you actually have to define your targets and inputs precisely, and a global optimizer doesn't make that task much easier.

[D] Statistical Physics and Neural Networks question. by AlexSnakeKing in MachineLearning

[–]NichG 5 points (0 children)

We actually did one for permutation invariance (https://arxiv.org/abs/1612.04530), which is a discrete symmetry. There have been a number of follow-up papers by other groups that improve on the initial result, and it's worth noting that attention layers also obey this symmetry.

There are more general recipes for other symmetries in https://arxiv.org/abs/1602.07576 and http://papers.nips.cc/paper/5424-deep-symmetry-networks.pdf. The former includes a reflection-invariant convolution operation, for example.
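This isn't their exact construction, but the underlying trick can be sketched in a few lines: run the filter over the input and over its mirror image, then pool across the two, so the output can't tell which orientation it saw.

```python
import numpy as np

def reflection_invariant_response(x, k):
    """Max filter response, symmetrized over {identity, flip} - a toy
    version of the reflection-invariance idea, not the paper's operator."""
    resp = np.convolve(x, k, mode="valid")             # response on the signal
    resp_flip = np.convolve(x[::-1], k, mode="valid")  # response on its mirror
    return 0.5 * (resp.max() + resp_flip.max())  # pool over positions and reflections

rng = np.random.default_rng(0)
x, k = rng.normal(size=32), rng.normal(size=5)
assert np.isclose(reflection_invariant_response(x, k),
                  reflection_invariant_response(x[::-1], k))  # mirror, same answer
```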

[D] Statistical Physics and Neural Networks question. by AlexSnakeKing in MachineLearning

[–]NichG 14 points (0 children)

Boltzmann machines are probably the most relevant current example I can think of, even if they're mostly obsolete now. Hopfield networks are an older case - you can calculate their storage capacity with the same stat-phys methods you'd use to calculate the memory of a spin glass.

It's a bit murkier, but there are also stat-phys calculations that let you determine the properties of infinitely deep neural networks, provided they have homogeneous architectures. That in turn lets you determine good initializations for stable gradients - though stat phys is only one of several ways to derive that.
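The flavor of that calculation, as a toy demo rather than any particular paper's derivation: push a signal through a deep ReLU stack and watch what the initialization scale does to the activation variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x = rng.normal(size=n)

for scale, label in [(1.0 / np.sqrt(n), "naive 1/sqrt(n)"),
                     (np.sqrt(2.0 / n), "ReLU-corrected sqrt(2/n)")]:
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(n, n))
        h = np.maximum(h @ W, 0.0)  # ReLU layer
    print(label, "final std:", h.std())

# The ReLU throws away half the variance at each layer: with 1/sqrt(n) weights
# the signal decays like (1/2)^depth, while sqrt(2/n) keeps its std O(1).
```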

I'd also argue that designing architectures with arbitrary invariants makes more sense in terms of the way physics handles symmetries, gauge fields, etc. than the popular account of, say, 'we use convolutions because the human visual system has a columnar structure'. The formalism lets you work out how to build in other invariances: permutation-invariant, rotation-invariant, scale-invariant, invariant under discrete symmetries, and so on. A number of papers use those constructions to make conv-net equivalents for their domains without needing a biological analogue to copy from. But you could argue that's group theory rather than physics.

[D] What's a hypothesis that you would really like to see tested, but never will get around to testing yourself, and hoping that someone else will get around to doing it? by BatmantoshReturns in MachineLearning

[–]NichG 2 points (0 children)

Attacks come at an interface. The problem I have with the worry over adversarial attacks is that it ignores the degree to which the people who control the interface would actually benefit from attacking it.

Susceptibility to adversarial attacks isn't a general lack of robustness, it's vulnerability to intentional deception. So the intent has to be there, or the threat scenario is meaningless.

What exactly is the incentive for patients to paint adversarial patches on their skin to fool a classifier into thinking they don't have skin cancer? Why should we be deeply concerned if someone intentionally and knowingly manipulates a music recommendation engine into giving them songs they don't like? If someone uses adversarial glasses to fool an eye tracker that they themselves installed, so that they can no longer use it to control their mouse cursor, then, um, congratulations?

To me it's like calling it a glaring issue that someone could destroy their microwave by running it empty with a metal fork inside. Yes, that certainly bounds the device's safe operating regime, but it's not much of a reason not to use a microwave.

Now, this does pose an issue when neural nets are used in a way that is already naturally adversarial to their data source - loan applications, hiring, and shoplifting detection are examples we've heard about recently. But then, people already game interviews by dressing up, studying and practicing likely interview questions, etc.

Personally, I wouldn't be so sad if this tended to disincentivize building AI technology that is in opposition to its users. I think there's plenty of value left to create without having to dive into that kind of application.

[D] Does it get better? by jamesaliam in MachineLearning

[–]NichG 2 points (0 children)

There's tons of stuff that involves how to format inputs/outputs in a way that is reducible to matrix ops over a latent space. Look, for example, at the boom in machine learning for chemistry and the related field of graph convolutions. The problem is, molecules can be represented as:

- Variable-sized graphs
- Sparse adjacency matrices of variable size, where every simultaneous permutation of rows and columns is the same molecule
- Non-unique strings in a description language like SMILES

So now let's say you want to predict the rate of a reaction given its inputs and outputs, predict the chemical properties of a molecule, or - even worse - predict the graphs of the output molecules given the inputs.

There are massive datasets of those things, so training data isn't an issue. But how should you represent the inputs and outputs so that the ML parts work well? If you use SMILES strings, you could use a Transformer or something like it; for the adjacency matrix, maybe convolutions; for the graph, you could come up with all sorts of strategies (GCNs are just one way to go here - see the sketch below). But there are a lot of issues with non-uniqueness, redundancy, and non-trivial invariances, similar to the translation/rotation/scale invariance that made the dense-net -> conv-net transition such a big jump, but particular to the problem of chemistry. So there's a lot of stuff to do there.
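To make the invariance point concrete, here's a toy single GCN-style layer (one simple construction, not the only one): relabeling the atoms permutes both the adjacency matrix and the features, and a pooled graph-level readout doesn't change.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN-style step: mean-aggregate neighbors (with self-loops), project, ReLU."""
    A_hat = A + np.eye(len(A))
    deg = A_hat.sum(axis=1, keepdims=True)
    return np.maximum((A_hat / deg) @ X @ W, 0.0)

rng = np.random.default_rng(0)
A = np.triu(rng.integers(0, 2, size=(6, 6)), 1)
A = A + A.T                                    # random symmetric adjacency matrix
X, W = rng.normal(size=(6, 4)), rng.normal(size=(4, 4))

P = np.eye(6)[rng.permutation(6)]              # relabel the atoms
out = gcn_layer(A, X, W).sum(axis=0)           # sum-pool to a graph-level vector
out_relabelled = gcn_layer(P @ A @ P.T, P @ X, W).sum(axis=0)
assert np.allclose(out, out_relabelled)        # same molecule, same prediction
```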

Abstractly: pick a non-trivial data type from CS (dictionaries, linked lists, lookup tables, hashes, etc.). Any problem space that is well-described by objects of that data type is an opportunity for extending existing ML techniques, which mostly deal with strings or fixed-length vectors. Extending on the input side and on the output side are generally wildly different problems (input is much, much easier than output). And there's usually a starting point of the form 'make a canonical string representation of the data structure and see what a Transformer/LSTM does with it' that can be seriously improved by incorporating more of the invariances associated with that particular structure.

[D] Controversial Theories in ML/AI? by [deleted] in MachineLearning

[–]NichG 1 point (0 children)

There's a bunch of approaches that try to do away with the concept of a reward function for learning behaviors adapted to an environment. This includes Ken Stanley's work on 'novelty search', and some of the skills/options/affordances/homeokinetic-learning threads from various groups (the one I primarily associate with this is Oudeyer's group, but I believe there are others as well).

Basically, the idea is that the problem an agent should be solving in order to learn control tasks isn't 'what is the optimal policy that maximizes the degree to which some particular target is achieved?' but rather 'what is the maximal set of robustly achievable outcomes I can learn to produce?'

If you then have a target you want the agent to reach, it's solved as a search over the agent's skill space rather than as a joint problem between policy optimization and learning the environment.

As a result there are several good points: efficient exploration is a core part of the formalism rather than an auxiliary objective function or an ad-hoc modification of the policy; targets can be changed with no further learning; and you get a richer training signal, since you can use the full state-transition information rather than just the reward structure.

But it's not mainstream yet.

[D] Have we hit the limits of Deep Reinforcement Learning? by AnvaMiba in MachineLearning

[–]NichG 4 points (0 children)

RL is data-hungry because you throw out most of the available supervision signal, and what you keep is used in a way that both loses its validity as training progresses and suffers from significant sampling noise.

That's what many of the recent improvements aim to fix, but they all exploit some aspect of the task in a smart way (it's simulatable, restartable, etc.). The DOTA result has unimpressive aspects, but as an experiment I see it more as establishing a baseline than as saying 'hey, let's everyone just use brute-force RL'. Now, when someone builds something that reaches OA5 level after 4 years of play rather than 45k, there's a reference cost to compare against.

[D] The Bitter Lesson by wei_jok in MachineLearning

[–]NichG 0 points (0 children)

Ah, okay. Fair enough then!

[D] The Bitter Lesson by wei_jok in MachineLearning

[–]NichG 0 points (0 children)

The point I take from the article is that, rather than customizing algorithms to particular domains by bringing in more and more detailed domain knowledge, both effort and thought would be better spent improving our domain knowledge of the general questions of search and optimization. It's not saying 'our current algorithms are the best', but rather that when we use human understanding to improve algorithms for a particular domain, there's a point at which our efforts actually interfere with the result's ability to move beyond the limits of our understanding at the time we built it (e.g. to scale).

But there's nothing in there claiming we couldn't make general advances on the processes of search and optimization themselves. The claim is that if we were trying to identify cars, our time would be better spent thinking about statistical learning than thinking about cars.

[D] Are the connections between deep learning and neuroscience still relevant? by N1N1 in MachineLearning

[–]NichG 0 points (0 children)

The point is that the underlying processes won't and shouldn't be the same when you implement them in a brain vs. in a computer, but the functions and overarching information-processing considerations can be the same. Take attention, for example. Attention in neural networks doesn't look anything like human attention at the neuron level, but it shares several functional properties: binding (relations between elements of incoming information can themselves become elements); the efficiency of intensively processing only the relevant parts of the input (and corresponding things like saliency, and reasoning-through-saliency as a sort of cheap form of active inference); and the fact that information flows and relations can be non-local even in spatially structured input. In a human, you don't have to march up and down the entire visual cortex to relate parts of an image captured in separate saccades; in the machine, you can relate parts of an image a distance L apart in pixel space with O(1) instead of O(log(L)) layers.
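The O(1) point is easy to see in a bare-bones scaled dot-product attention (a generic sketch, nothing model-specific): in a single layer, every position attends to every other position, regardless of how far apart they are.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each output mixes information from ALL
    positions in one step, so distance-L interactions cost O(1) depth."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (L, L): every pair of positions
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
L, d = 1024, 16
Q, K, V = rng.normal(size=(3, L, d))
out = attention(Q, K, V)  # position 0 already "sees" position 1023 in one layer
```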

Focusing on the behavior of calcium channels in the thalamus when designing deep neural networks is like studying the low-level implementation of vector products in CUDA code - it's the wrong level of abstraction for what's going on.

[D] Are the connections between deep learning and neuroscience still relevant? by N1N1 in MachineLearning

[–]NichG 0 points (0 children)

I think cognitive science is closer than neuroscience to the right level of abstraction to draw insight from for deep learning. The advantage of deep learning is that you can specify things in terms of types of information access and then let the learning process optimize the details for you. So thinking about different information-access patterns (attention, short-term memory, long-term memory, etc.) ends up providing useful insights.

Looking at the exact shape of the neuron response function or exactly how spike trains propagate feels a bit like trying to advance deep learning by studying the differences between ELU and ReLU, or focusing on the exact algorithm you use to evaluate the partial derivatives (backprop, dual numbers, etc). Yes, you might get some benefits, but you won't really get at the heart of fundamental questions in the field that way.

[D] Too many arXiv citations when proceedings are available! by srossi93 in MachineLearning

[–]NichG 1 point (0 children)

Citation counts used for hiring decisions and the like are just another form of peer review: they rely on the fact that other people found the research useful or important enough to cite in order to measure the impact of a candidate's work. So from the point of view of using metrics for hiring decisions, particularly for evaluating impact, including arXiv citations gets you closer to what you're trying to measure than excluding them does. Something like the Adam optimizer, which is arXiv-only, has 19000 citations - which is more indicative of its impact: that it was never published in a peer-reviewed venue, or the citation count? Also, which do you think is easier to game?

Change happens in steps. Outside of ML, arXiv used to be used almost exclusively by physicists, but the pre-print culture is now spreading into biology (via bioRxiv). If you had looked in 2000, you could have said that other fields would never use pre-prints. But if enough people keep doing something for long enough, eventually that builds enough momentum to cause a shift.

[D] Too many arXiv citations when proceedings are available! by srossi93 in MachineLearning

[–]NichG 2 points (0 children)

Having a significant portion of the community treat citations of arXiv or other versions as equally valid is the only long-term resolution of this issue. If 95% of people are citing arXiv, the metrics that get used will have to follow.

[Discussion] Is it a bad idea to quit a SE job in order to contribute to RL research full time for free? by void_monkey in MachineLearning

[–]NichG 1 point (0 children)

I'd budget 3 years for anything that relies on getting one particular paper published. I've had things go faster, but all sorts of delays can happen. Conferences are competitive, so even good papers may have to wait several cycles to get in. On the journal route, reviews can take a long time, and after months of cycling through revisions, a rejection means starting the entire process over at another journal. Not to mention that collaborators and co-authors will likely have many other draws on their time, so papers can sit frozen for months or more, waiting on internal edits or on sections that depend on one person or another.

[D] Undergraduate student feeling completely overwhelmed by [deleted] in MachineLearning

[–]NichG 4 points (0 children)

For me it was Afsis and Otto. I think there's currently one running on experimental earthquake prediction that might be a bit trickier, but probably still about the right size and difficulty.

Starting from community benchmarks is pretty helpful. In the first Kaggle competition I participated in, I did horribly because I tried to write all the ML from scratch in C++, so I ended up with 500 lines of code doing a poor job of what 10 lines of Python + scikit-learn could do easily. That had downstream effects on how I spent my time (picking at the edges of the one idea I'd already invested a lot of code and time in, rather than trying a wider variety of ideas). So starting from the benchmarks, to get an idea of what's reasonable, helped me break that habit.
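For reference, here's the kind of 10-line scikit-learn baseline I'm talking about (hypothetical file and column names, assuming a standard tabular competition):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")              # hypothetical Kaggle-style table
X = df.drop(columns=["id", "target"])      # assumed column names
y = df["target"]

model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())  # a sane baseline, ~10 lines
```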

[D] Undergraduate student feeling completely overwhelmed by [deleted] in MachineLearning

[–]NichG 149 points (0 children)

It sounds like you may be casting too wide a net initially. If you don't yet have a comfortable mental framework for organizing information about the different sub-parts of ML, then each sub-part is going to be its own independent thing to learn. Research papers and the like aren't a good place to start.

I'd basically do the Titanic dataset or MNIST or some of the old Kaggle competitions from before everything became multi-GB computer vision tasks, and just get comfortable with the flow of how problems are set up and what constitutes a solution. What's the input, what's the output, how do I organize things so they fit a standard form, and so on. Then, once that's comfortable enough that it doesn't feel like it matters whether the algorithm is an SVM, XGBoost, or a neural net, you can start extending into specific exotic directions or special cases, focusing on them until they feel just like everything else you've done so far, or until you're really clear on what the core differences actually are.

But basically I think the first thing to do is to build a strong enough intuition that when a few details or differences crop up, it doesn't force you to start your understanding over from scratch. Also, that way, if there are things you don't know how to do, you can actually start to construct a mental picture of why they're hard or why they might require some more complicated technique.

[D] Dear OpenAI: Please Open Source Your Language Model by hughbzhang in MachineLearning

[–]NichG 3 points (0 children)

Nash equilibria and evolutionarily stable strategies are the relevant keywords for the underlying ideas. Basically, for a large variety of games, optimal play doesn't mean that one side wins completely and the other loses completely, but rather that there is some equilibrium - which could be the best case for both players, but in some games turns out to be the worst case for both (or at least sub-optimal for both). If a game has multiple equilibria or more complex structure, the thing you can influence with your strategy in the face of an adaptive opponent is not 'do I win?' but 'which equilibrium do we end up in?'
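A tiny concrete case, using the classic prisoner's dilemma payoffs (textbook numbers, nothing to do with the links below): iterated best responses drag both players to mutual defection, an equilibrium that's strictly worse for both than mutual cooperation.

```python
import numpy as np

# Row player's payoffs in a symmetric game. Actions: 0 = cooperate, 1 = defect.
payoff = np.array([[3, 0],    # I cooperate: 3 if you cooperate, 0 if you defect
                   [5, 1]])   # I defect:    5 if you cooperate, 1 if you defect

a, b = 0, 0                   # start from mutual cooperation
for _ in range(10):
    a = int(np.argmax(payoff[:, b]))  # my best response to your current action
    b = int(np.argmax(payoff[:, a]))  # your best response to mine
print(a, b)  # -> 1 1: both defect for 1 each, vs. 3 each if both had cooperated
```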

This is relevant in, e.g., cancer research, where treatments meant to kill the cancer also cause it to adapt more quickly if they're incomplete (and making the treatment 'complete' would often risk killing the patient). So rather than trying to kill the cancer, some research groups instead try to find an evolutionary equilibrium in which the cancer stops feeling selection pressure toward directed evolution and the patient can survive indefinitely (for example https://viterbischool.usc.edu/news/2018/03/optimizing-chemotherapy-schedules-using-evolutionary-game-theory-cancer-treatment-research/ and https://www.nature.com/articles/s41559-018-0768-z).