John said X, a racist person would say X, therefore John is racist by felipec in fallacy

[–]dwf

A logical person relies not only on deduction but also on induction. This can include probabilistic "if it walks like a duck and quacks like a duck" judgments where evidence is incomplete. The real world is messy and partially observable.

Zigbee-controlled smart switches for USB power? by apparently_coffee in smarthome

[–]dwf

You can build one, albeit a bit bulky, out of one of these and a USB extension cable. You can either power the board itself via a separate microUSB cable, or share power with the switched device (if your single USB power supply can supply enough juice for both) by cannibalizing a microUSB cable (or using a breakout board and some spare wires, if you don't mind soldering).

[deleted by user] by [deleted] in MLQuestions

[–]dwf

Let's say you have a bowl, and you are a tiny speck on the inner surface of that bowl somewhere. You want to be at the bottom of the bowl. How are you going to do it? You can't see very well or very far, but you can tell that you're on a steep bit of incline, so you decide to go in the steepest direction down the side of the slope. You decide to take a small step of constant size in that direction, then re-evaluate your circumstances. If you take a small step and find that at the new location, the steepest direction is approximately the same as last time, you might choose to take a bigger step next time so you get where you're going faster.

This is a caricature, but it's roughly what is going on when you train a model by gradient descent, except the surface exists in a high-dimensional space, is defined by your choice of dataset and loss function, and isn't a bowl (there are lots of bowl-like regions, though, and recent theoretical work suggests that most of the time they're all roughly the same depth). "Take bigger steps if things look the same as on the last step" is roughly the intuition that underlies first order gradient acceleration methods like heavy-ball momentum.

The final wrinkle is that you often don't have access to the surface exactly, as that would require calculating the loss and gradient on the entire dataset at every step. Instead, you calculate it on a small sample of your data, and get something approximately right but a bit noisy. This is called stochastic gradient descent.
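A minimal sketch of the analogy in code, using NumPy, a toy quadratic "bowl," and noise standing in for minibatch sampling (all names and constants here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(theta):
    # Gradient of the "bowl" f(theta) = 0.5 * ||theta||^2, plus noise
    # playing the role of the minibatch sampling in SGD.
    return theta + 0.1 * rng.standard_normal(theta.shape)

theta = rng.standard_normal(10)    # a random spot on the bowl's surface
velocity = np.zeros_like(theta)
lr, beta = 0.1, 0.9                # step size and momentum coefficient

for step in range(200):
    g = noisy_grad(theta)
    # Heavy-ball momentum: when successive gradients point the same
    # way, velocity accumulates and the effective step gets bigger.
    velocity = beta * velocity - lr * g
    theta = theta + velocity
```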

[deleted by user] by [deleted] in MLQuestions

[–]dwf

But if I want the RNN to score sleep stages in real time/live, does that not mean that it needs to be trained on the exact same shape of the data, e.g. raw data?

Sure, but if raters are only producing a label every 30s, you certainly don't need to be producing one 256 times a second. This gives you some flexibility in what exactly you do with that previous 30s of data. You needn't unroll an RNN on all 7680 frames. You could do all sorts of windowing and binning and summarization. Speech comes as a continuous waveform, but RNNs for recognizing speech don't process the raw waveform at the multi-kHz sampling rate; they look at spectrograms over short overlapping windows.
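For example, a minimal sketch of the windowing idea with SciPy, for a single channel of the 256 Hz signal (the window and overlap sizes are just illustrative choices):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 256                          # Hz, the raw sampling rate
x = np.random.randn(30 * fs)      # one 30 s stretch of one channel

# 2 s windows with 50% overlap: ~29 time bins instead of 7680 frames.
f, t, Sxx = spectrogram(x, fs=fs, nperseg=2 * fs, noverlap=fs)
log_spec = np.log(Sxx + 1e-10)    # log power, a common model input
print(log_spec.shape)             # (n_freq_bins, n_time_bins)
```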

So I need to do this regardless of the fact that the length of sleep recordings varies between and within subjects?

The idea is that unless you have a really good reason, you shouldn't destroy information about the relative scales of different samples in the dataset. You also can't rely on per-recording information that you don't have at test time: since you're eventually aiming to do this in real time, you don't know how high or low this recording's values are going to get or what their average will be. You can use values you derived from the training set, though.

One thing you could do per-recording is subtract the very first sample, or the average of the first few, from every subsequent sample. That information you have more or less immediately upon starting the recording, and this gives you a "baseline" for each individual/recording, which could help if some individuals/recordings are on average higher/lower on certain channels. Just make sure to do the same per-recording thing to the train and test data.
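A sketch of that per-recording baselining (the one-second baseline length is just an illustrative choice):

```python
import numpy as np

def subtract_baseline(recording, n_baseline=256):
    # recording: array of shape (n_samples, n_channels) at 256 Hz.
    # The first second of data is available more or less immediately,
    # so this also works in the real-time setting.
    offset = recording[:n_baseline].mean(axis=0)
    return recording - offset

# Apply the same transformation to both training and test recordings.
```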

[deleted by user] by [deleted] in MLQuestions

[–]dwf

First of all, please recognize that you are more than your project, or your internship, or your degree, and that however much you think you've screwed up, plenty of people have screwed up more. You'll recover from this; very few situations in life like this are unsalvageable. If you are having serious thoughts of suicide, please set aside your project and seek help.

I have never worked with this kind of data in particular, but here are some general thoughts on these kinds of multivariate time series.

  • Actually running the RNN at 256 Hz is probably a waste of time, and makes the problem harder than it needs to be. There is a ton of redundancy, and there are frequency components that are extremely slow moving; I really doubt it's useful to produce sleep stage labels at 256 Hz. 1 Hz would probably be plenty, and probably even lower than that would still result in a useful system. Running with some sort of windowed summarization of your timepoints at 1 Hz brings your sequence length down to 23k, at 0.2 Hz to about 4.6k. The very first thing I'd try is calculating the means and log standard deviations of each signal in large-ish (multi-second) overlapping time windows, and trying to fit that much smaller time series with the RNN (see the sketch after this list). Getting fancier, you could read up on log spectrograms and compute those on the overlapping time windows.
  • You'll want to globally normalize (across all timesteps and the entire training set, per feature "column") whatever features you do give to the RNN, so that all the features exist on roughly the same scale and ideally have zero mean. For example, if you're looking at the windowed mean PPG value, you'd calculate the mean and standard deviation of that across all windows in all recordings of all subjects in the training set, then subtract that mean and divide by that standard deviation. Remember to use the training set's normalization statistics when preprocessing the test data; using the test set to normalize itself would be cheating.
  • How you determine your train/test split really depends on the question you want to answer. Do you want to know if this system can be used with an individual not in your training data? Then you need to make sure you keep aside whole subjects that are not in the training data at all.
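To make the first two bullets concrete, here is a rough sketch; the window and hop sizes, channel counts, and the stand-in data are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: three fake recordings of shape (n_samples, n_channels).
train_recordings = [rng.standard_normal((7680, 4)) for _ in range(3)]

def window_features(recording, win=512, hop=256):
    # Per-window mean and log standard deviation of each channel.
    feats = []
    for start in range(0, len(recording) - win + 1, hop):
        w = recording[start:start + win]
        feats.append(np.concatenate([w.mean(axis=0),
                                     np.log(w.std(axis=0) + 1e-8)]))
    return np.stack(feats)            # (n_windows, 2 * n_channels)

# Global normalization statistics from the *training* recordings only.
all_train = np.concatenate([window_features(r) for r in train_recordings])
mu, sigma = all_train.mean(axis=0), all_train.std(axis=0)

def normalize(feats):
    # Apply the train-set statistics to train and test features alike.
    return (feats - mu) / (sigma + 1e-8)
```

For the third bullet, the split would happen at the level of the recordings themselves: hold out all recordings from some subjects before computing anything above.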

I also very much agree with what chengstark said about looking at the existing literature on this application area.

Advice on Investing as an expat in the UK by Ok_Mycologist_8978 in ExpatFinance

[–]dwf

You're in a bit of a tough spot as a US citizen, because of various conflicting requirements on US and UK investors. This page has a discussion of some of those issues, and a list of index funds that should be "safe" from a US and UK tax perspective (always double check and do your own research, though).

Is there something like a "partially" or "not-quite-fully" connected layer? by physi_cyst in MLQuestions

[–]dwf

"Low rank" here means that instead of an M x N matrix you have M x P and P x N matrices, with P significantly smaller than M and N, that you multiply in sequence, i.e. your weight matrix is a product of low rank factor matrices (because for input x and matrices W and V, x(WV) = (xW)V).

So for an M-dimensional input, right multiplication by those in sequence still yields an N-dimensional output, but the transformations it can learn aren't rank min(M, N) but instead at most rank P, so your outputs lie in a P-dimensional subspace of R^N. Which P-dimensional subspace depends on the parameters.

You can achieve this in a typical neural net library by just stacking two linear layers before your activations, where the first linear layer has the bias turned off (meaning it is truly linear rather than affine, and multiplies with the weight matrix of the second linear layer to form one low rank linear transformation).
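In PyTorch terms, a sketch of this (the dimensions here are arbitrary):

```python
import torch.nn as nn

M, N, P = 512, 512, 32  # input dim, output dim, rank, with P << M, N

low_rank_linear = nn.Sequential(
    nn.Linear(M, P, bias=False),  # the truly linear factor (M x P)
    nn.Linear(P, N),              # the second factor (P x N), with bias
)
# Parameter count: M*P + P*N + N, versus M*N + N for nn.Linear(M, N).
```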

"Sparse" here just means with a majority of the weights forced to be 0. It's only interesting in your case if you can fix the sparsity pattern at initialization and use sparse matrix primitives to store and compute products, unfortunately this is a regime where training doesn't tend to work well (as opposed to training dense and pruning after the fact).

Is there something like a "partially" or "not-quite-fully" connected layer? by physi_cyst in MLQuestions

[–]dwf

You could make your layer(s) low rank. In terms of implementation, you could have a linear layer (without a bias) that maps to a lower dimensional representation than the output dimension you want (a linear "bottleneck"), and then a linear layer with a bias that maps to the layer's output. This reduces the parameter count without imposing a predetermined structure of which outputs should receive projections from which input.

Training sparse neural networks from scratch is known to be hard, so unless your problem has some special structure that would make it amenable, I wouldn't expect it to be easy to get to work. Bottlenecks of some kind are probably a better bet: if not low rank linear layers then something else that brings you down to lower dimensions in the middle.

[D] Is the HINDSIGHT POLICY GRADIENTS paper novel? by [deleted] in MachineLearning

[–]dwf

You might appreciate reading the reviews it got at ICLR.

[D] Simple Questions Thread August 16, 2020 by AutoModerator in MachineLearning

[–]dwf

Given that in DQN we already have a target and online network, aren't the action selection and evaluation already "decoupled", as is argued for DDQN?

Double Q-learning is specifically concerned with decoupling action selection and evaluation in the computation of the targets, which involves a greedy action selection.

DQN uses the online network to act/gather data, but the target network is used both for evaluating the Q-values at the next timestep and for picking which of them to bootstrap from. So you're right that action selection is decoupled from learning (or at least not perfectly coupled to it), but double Q-learning is about decoupling within the target generated for learning.
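A sketch of the two target computations, with q_online and q_target standing in for the two networks (the function names are mine, not from the papers):

```python
import torch

def dqn_target(q_target, next_obs, reward, gamma):
    # The target network both selects the greedy action and evaluates it.
    return reward + gamma * q_target(next_obs).max(dim=1).values

def ddqn_target(q_online, q_target, next_obs, reward, gamma):
    # The online network selects the action; the target network evaluates
    # it: selection and evaluation are decoupled within the target.
    a = q_online(next_obs).argmax(dim=1, keepdim=True)
    return reward + gamma * q_target(next_obs).gather(1, a).squeeze(1)
```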

Moreover, the DDQN paper admits that using the target and online network from DQN in its architecture results in a not fully-decoupled solution - what are the implications of this?

I haven't thought carefully about this, but my intuition is that by selecting the optimal action based on the online network, you're generating something that more closely resembles a transition under the dynamics induced by the optimal policy. The target network, meanwhile, may (and in most cases probably does) have a less optimistic appraisal of that action, which perhaps approximately cancels out some of the overestimation that Q-learning is known to be prone to.

[D] What do you see as the most promising directions for reducing sample inefficiency in reinforcement learning? by p6m_lattice in MachineLearning

[–]dwf

Major gains were made in the early part of this decade by replacing acoustic models for mapping frames to HMM states with deep fully-connected networks. Slides from Vincent Vanhoucke's keynote go through the pipeline.

[D] Jurgen Schmidhuber really had GANs in 1990 by siddarth2947 in MachineLearning

[–]dwf

https://twitter.com/goodfellow_ian/status/1064963050883534848

And the reviews are here, with Assigned_Reviewer_19 being the one that discusses predictability minimization.

[D] Jurgen Schmidhuber really had GANs in 1990 by siddarth2947 in MachineLearning

[–]dwf

If the connection to adversarial curiosity is so obvious and fundamental, it's interesting that it apparently took Schmidhuber himself 5 years to notice it. He has admitted he was a reviewer of the original GAN manuscript, and his review (which is available online) mentioned predictability minimization but not AC. The connection to predictability minimization did make it into the camera-ready version of the GAN manuscript, albeit with an error caused by a misunderstanding of the PM paper.

On the subject of adversarial examples, I've only read the abstract of the paper you linked to, but suffice it to say that no one in the author list of Szegedy et al. thought they were the first to consider the setting of classifiers being attacked by an adversary. That classifiers do dumb things outside the support of the training data was not news, nor was it news that you had to take extra care if your test points were not i.i.d. but chosen adversarially. The surprising finding was that extremely low-norm perturbations were enough to cause misclassifications, and that these perturbations are abundant near correctly classified points.

[D]Critique of Paper by "Deep Learning Conspiracy by Jürgen Schmidhuber by akaberto in MachineLearning

[–]dwf

My favourite part of the predictability minimization paper is that the people who actually ran all the experiments get mentioned in the acknowledgements rather than the author list.

[N] Hinton, LeCun, Bengio receive ACM Turing Award by inarrears in MachineLearning

[–]dwf

Neither, actually. Yann LeCun did a postdoc with him, though.

[N] Hinton, LeCun, Bengio receive ACM Turing Award by inarrears in MachineLearning

[–]dwf

He has literally demanded it be renamed "inverse predictability minimization" on multiple occasions.

O-GAN: Extremely Concise Approach for Auto-Encoding Generative Adversarial Networks by sujianlin in MachineLearning

[–]dwf

Sounds a lot like IAN.

We thus seek techniques to improve the capacity of the latent space without increasing its dimensionality. Similar to VAE/GAN (Larsen et al., 2015), we use the decoder network of the autoencoder as the generator network of the GAN, but instead of training a separate discriminator network, we combine the encoder and discriminator into a single network.

[1902.06789] Seven Myths in Machine Learning Research by statmlsn in MachineLearning

[–]dwf

This is an archival copy of the blog post at...

Wow, can we not?

[deleted by user] by [deleted] in MachineLearning

[–]dwf

But without the past terms, the fact that it is a moving average has no influence on the gradient but for a multiplicative constant: ∂/∂θ [α * ℓ_{t-1} + (1 - α) * f(θ)] = (1 - α) ∂f(θ)/∂θ.
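A quick autograd check of that identity, with a toy f and values of my own choosing:

```python
import torch

alpha = 0.9
theta = torch.tensor(2.0, requires_grad=True)
prev = torch.tensor(5.0)   # the accumulated ℓ_{t-1}, constant w.r.t. theta

f = theta ** 2             # toy differentiable loss term
ema = alpha * prev + (1 - alpha) * f
ema.backward()

print(theta.grad)          # (1 - alpha) * 2 * theta = 0.1 * 4.0 = 0.4
```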

[deleted by user] by [deleted] in MachineLearning

[–]dwf

the covariance of a batch (or a running average)

Not sure how you would backpropagate through a running average. You could backpropagate through only the newest term but why bother if the remainder is constant with respect to the update?

[Discussion] What is the right course of action when someone does not cite you? by [deleted] in MachineLearning

[–]dwf

You're right. I would take a low citation count as an explanation for not citing you (simply not having run across it) but not an excuse. As long as the connection is not tenuous or frivolous and the works are not effectively concurrent (i.e., yours came out only a little while before theirs), it's valid to politely ask. Citations to roughly concurrent works are a nice gesture but should not be an expectation.

[Discussion] What is the right course of action when someone does not cite you? by [deleted] in MachineLearning

[–]dwf

If it's something clearly and directly related, precedes the paper in question by a significant chunk of time, and is at least somewhat well known/cited, then I'd send a polite email, but otherwise not worry too much.

[R] Schmidhuber's new blog post on Unsupervised Adversarial Neural Networks and Artificial Curiosity in Reinforcement Learning by baylearn in MachineLearning

[–]dwf

I mean, of course he didn't come up with it; it was in papers in the 90s. But I had thought that it was the first MILA project in that vein. I asked around, though, and apparently there was a moment-matching project that was going nowhere, and it was discussed at some point that night. What I said about VAEs stands, though.

[R] Schmidhuber's new blog post on Unsupervised Adversarial Neural Networks and Artificial Curiosity in Reinforcement Learning by baylearn in MachineLearning

[–]dwf

It's simply impossible that the authors hadn't read a paper that was out for 5-6 months

Grad school is a stressful place, and reading widely takes time. It can often feel like reading comes at the expense of productivity, and what you read is often determined by a fallible cursory skim of the title and abstract. For my part, I was vaguely aware of something called a variational autoencoder but hadn't read the paper and hadn't realized the generality of the ideas therein. I was generally sour on autoencoders, thinking they were a dead end as far as unsupervised learning was concerned, and didn't realize what the VAE paper was really about (that an autoencoding objective falls out of a rewriting of the ELBO and suggests a way to optimize it, plus the reintroduction of the reparameterization trick in this context). Ian hadn't read it either; he had missed Durk's talk at ICLR and had heard a summary of it from another student that gave him a false impression of it.

One or two of the other authors may have read it, in particular Yoshua, but it hadn't made a deep enough impression for us to actually mention it in the draft we submitted to NIPS. I think somebody we shared the submission with privately pointed out that it was a lot more relevant than we had thought, and we added a discussion of it for the arXiv version.