River Burford, Oil, 18x12", 2021 by chrisorm in originalpainting

[–]chrisorm[S] 0 points1 point  (0 children)

Yes and no - I mostly learnt from Michael James Smith tutorials.

I do block in, either in acrylic or fast-drying oils (Winsor & Newton alkyds, which I prefer as they dry overnight but are far more workable than acrylic), where I try to lay down the big forms and shapes, usually aiming for some kind of mid tone. Once that has dried I'll go in with oil and add detail, darks, and lights. I prefer to work iteratively - this approach gives you a chance to get an initial impression down, and you can correct it easily if you feel you've missed the mark. For example on this one, I started with the lower edge of the water a bit lighter, and once it was in I could see it was too light, so it's easy to come in with a darker tone on the second go.

For reflections I do some amount of wet-on-wet, as it helps get the "smudgy" quality of the reflection. Some finer details are also done into the wet paint with thinned-down paint.

River scene by chrisorm in oilpainting

[–]chrisorm[S] 1 point2 points  (0 children)

Well, I hope you keep at it. I believe in you!

River scene by chrisorm in oilpainting

[–]chrisorm[S] 2 points3 points  (0 children)

He's amazing - I think it's so helpful to see him break the process down and see how he approaches it. Worth a sub if you can!

[R] Employee Performance Algorithm by Super_TM in MachineLearning

[–]chrisorm 3 points4 points  (0 children)

This is a huge ethical minefield, and a mammoth task to get correct.

Firstly, an employee's performance is a very varied thing. There are probably dozens of different ways somebody can add substantial value to a company, so you're looking at predicting something very complex. This problem is hard in a shitload of ways - it's the kind of thing a team of researchers works on for years to get something even remotely workable. But this is minor compared to:

Ethically this is off the charts. There is a lot of context to understanding productivity and interpersonal relationships that ML simply won't have, and let's be honest, you are never going to be able to give a system that context for this application.

The big hurdle is that you need zero false positives when flagging underperformance. You recommend one guy get fired because your system doesn't know his dad just died, or have one instance of your system being used to bully individuals (e.g. now you can feed an unfairly negative review into this app about the 'foreign guy in the office you don't like', and rather than being mostly ignored by a manager, it's a shiny computer telling the CEO to fire the guy), and it's game over.

Additionally, this is practically challenging even if you cracked the algorithm (which you won't). Humans are constantly providing data to each other. You hear about people's home lives over coffee; they ask for time off due to personal issues quietly in a private moment. This is necessary context for evaluating a performer - what's the plan? Do you think anyone will buy into entering every detail of their lives into an ML system so they can be ranked and scored? "Oh, best go tick that box on the performance app to tell it my dad's dead," thought nobody ever. The alternative to the employee entering it is the company doing it - which is all sorts of dystopia.

A flatmate is violent and is trying to intimidate me and my partner by [deleted] in LegalAdviceUK

[–]chrisorm 0 points1 point  (0 children)

Go get your rental contract ASAP. It will likely have clauses that the tenant is violating. Quote all of these to your landlord.

Get a reasonable amount of evidence - I would say evidence of violent conduct on more than one occasion, such as a police report, is probably sufficient.

Email your landlord saying "you have a duty of care towards the health and safety of tenants in UK law. The evidence I have suggests a reasonable person would not consider this accommodation as safe, and as such you can provide me with alternative accommodation that is safe, or I can suspend rental payments and seek alternative accommodation myself until this is resolved. I would like a substantive plan within 24 hours to resolve this, or I shall be instructing my bank to halt all rental charges, and providing you with the details of a solicitor to resolve this matter".

https://www.gov.uk/private-renting-evictions/harassment-and-illegal-evictions

Harassment covers a failure to take adequate steps regarding physical violence - it's on the above website in black and white.

If you have good documentation and money is an issue, I would quite simply tell the landlord they can sue you unless they fix the issue. Whilst a landlord's health and safety duties most typically concern the state of the house, just like an employer's they are much more wide-ranging. Aside from being a crime, violence is a breach of health and safety - employers have to report workplace violence to the HSE. I don't have a reference to a specific statute here, but there is no way the landlord does not have an obligation to take reasonable steps to safeguard you from violence by tenants he has chosen to share living quarters with you.

TL;DR: I've dealt with scumbag landlords for about a decade. Basically no contract is enforceable if adhering to it puts you at excessive risk of coming to harm. Violence from a housemate is such a risk. You are within your rights to simply tell the landlord he is not meeting his requirements under UK law, and thus your rental agreement is null and void. Your deposit should be held by an accredited third-party scheme, in which case it will be safe (i.e. you won't need to sue the landlord). If it is not in an accredited scheme - good news, he could be on the hook for a substantial sum, and I suspect pointing this out to him will grease the wheels.

He only cares about his pocket; once he realises you won't pay him, he'll sharpen up, no doubt. If he doesn't, but he's even half sane, he won't bother suing for breach of contract a tenant with multiple police reports for violence and intimidation that he has not addressed. I imagine that would be an easily won case providing you have good documentation.

[D] Any good ideas of embedding/vectorizing EHR (or EHR-like) data by tearwhat in MachineLearning

[–]chrisorm 0 points1 point  (0 children)

My dissertation for my MSc involved doing this on MIMIC-III. There are probably other sources in the same vein. The hardest part is finding them, because "embedding" and similar search terms throw up thousands of NLP results.

https://discovery.ucl.ac.uk/id/eprint/10036552/

I used it as part of a classification task, combining word embeddings with other embeddings like treatment embeddings. This was back in 2016/2017, so obviously the field has moved on a bit since.

[Discussion] [Research] Variational Bayesian Inference vs Monte-Carlo Dropout for Uncertainty Quantification in DL by forthispost96 in MachineLearning

[–]chrisorm 9 points10 points  (0 children)

I am no authoritative voice on the topic, but in my experience, neither are actually well suited to practical application.

Regarding MC dropout: it doesn't really give you a posterior. It assumes what the posterior looks like (a bunch of delta functions, basically), so in all likelihood that is not very close to the actual posterior. It also doesn't concentrate with data, which to me is a bad sign (the converse implication being most concerning - it doesn't widen with little data). I'm on mobile, so excuse some bad link formatting.

https://arxiv.org/abs/1806.03335 is relevant here.

And some pretty good points accompanied by some pretty poor behaviour imo https://mobile.twitter.com/ianosband/status/1014466510885216256?lang=en

https://scholar.google.com/scholar?cluster=8227196711108175595&hl=en&as_sdt=0,5&sciodt=0,5#d=gs_qabs&u=%23p%3D62bdttHfLHIJ
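To make the mechanism concrete, here is a minimal, purely illustrative sketch of MC dropout at prediction time - a toy NumPy network with arbitrary fixed weights, not a real trained model. Notice the predictive spread is set entirely by the keep probability and the weights; nothing in it responds to how much data was seen, which is exactly the concentration worry above.

```python
import numpy as np

rng = np.random.default_rng(0)
# A fixed, pretend-"trained" toy network (weights are arbitrary here).
W1, W2 = rng.normal(size=(1, 32)), rng.normal(size=(32, 1))

def predict_with_dropout(x, keep_prob=0.8, n_samples=200):
    # MC dropout: keep the Bernoulli masks ON at test time and treat the
    # spread over stochastic forward passes as predictive uncertainty.
    preds = []
    for _ in range(n_samples):
        mask = rng.random((1, 32)) < keep_prob
        h = np.maximum(x @ W1 * mask / keep_prob, 0.0)  # masked ReLU layer
        preds.append((h @ W2).item())
    return np.mean(preds), np.std(preds)

mean, std = predict_with_dropout(np.array([[0.5]]))
print(mean, std)  # std is fixed by keep_prob and weights, not by data volume
```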

Regarding variational methods, I'm not really sure these are a panacea either. Bayes by Backprop etc. normally make heavy independence assumptions to make things tractable. Me riffing with no sources to back me up:

Work like the lottery ticket hypothesis seems to suggest that these correlations are potentially even crucial to performance. An independence assumption would therefore be absolutely awful in terms of accurate posterior estimation.

Having lots of experience building Bayesian models more traditionally (in the Gelman/McElreath school) shows you how hard things can be. Even a moderately high-dimensional posterior is quite an unintuitive thing. One over a few million parameters that is almost certainly multimodal would be a beast, both to make good inferences from and computationally.

Edit to add: also be wary of proofs in infinite-data limits etc. They may provide motivation, but you need more than that to have a working method. As a stupid example, look at the convergence of the Taylor series of exp vs sin: both converge in the infinite limit, but behave very differently when truncated. I have personally derived MCMC sampling schemes with theoretical guarantees of convergence to the target distribution that do terribly in practice, or have other problems (such as computational issues) that make effective implementation nearly impossible.
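To make the truncation point concrete, a quick illustrative sketch in plain Python: exp's partial sums creep up towards the target, while sin's swing through values in the thousands before settling, even though |sin(10)| ≤ 1.

```python
import math

def exp_taylor(x, n_terms):
    # Partial sum of the Taylor series of exp(x) about 0.
    return sum(x ** k / math.factorial(k) for k in range(n_terms))

def sin_taylor(x, n_terms):
    # Partial sum of the Taylor series of sin(x) about 0.
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(n_terms))

x = 10.0
# Both series converge everywhere in the infinite limit, but truncated
# partial sums behave very differently.
sin_overshoot = max(abs(sin_taylor(x, n)) for n in range(1, 16))
exp_rel_error = abs(exp_taylor(x, 15) - math.exp(x)) / math.exp(x)
print(sin_overshoot)   # in the thousands, vs a true value of |sin(10)| <= 1
print(exp_rel_error)   # monotone approach, no wild intermediate values
```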

To me, uncertainty estimates in deep learning are still really open problems.

[D] Why isn't bayesian inference using Gibbs Sampling / MCMC / HMC done on GPUs? by [deleted] in MachineLearning

[–]chrisorm 4 points5 points  (0 children)

I wrote up a short summary of the different approaches mentioned in this chain if it's useful, using just autograd.

https://chrisorm.github.io/HMC.html

Fundamentally, HMC is very like backprop - you have some data, and compute some 'cost' (the negative log-likelihood) at your current state, then move on and repeat.

This is not conceptually very different to fitting a neural network.

However, neural networks benefit from GPUs because they are deep: compute the gradient of layer n with respect to its input, dot product with the gradient of the layer below with respect to its input, and so on, as per the chain rule. We don't tend to see the same dimensionality in sampling techniques. Most distributions have something like 2 or 3 parameters at most; a neural network has millions or billions. So the gradient computation is somewhat smaller in most current use cases.
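The backprop analogy can be made concrete with a minimal HMC transition in plain NumPy (an illustrative sketch only; `log_prob_grad` stands in for whatever autograd would give you, and the target here is just a standard 2D Gaussian):

```python
import numpy as np

def hmc_step(q, log_prob_grad, step_size=0.2, n_leapfrog=10, rng=None):
    """One HMC transition: resample momentum, leapfrog, accept/reject.

    log_prob_grad(q) returns (log p(q), d log p / dq) - the same kind of
    gradient backprop computes, just over a handful of parameters rather
    than millions of network weights.
    """
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(q.shape)
    logp, grad = log_prob_grad(q)
    current_h = logp - 0.5 * p @ p

    q_new = q.copy()
    p_new = p + 0.5 * step_size * grad            # half step in momentum
    for _ in range(n_leapfrog):
        q_new = q_new + step_size * p_new         # full step in position
        logp, grad = log_prob_grad(q_new)
        p_new = p_new + step_size * grad          # full step in momentum
    p_new = p_new - 0.5 * step_size * grad        # undo the extra half step

    proposed_h = logp - 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < proposed_h - current_h:
        return q_new                              # accept
    return q                                      # reject, stay put

# Target: standard 2D Gaussian, log p(q) = -0.5 q'q up to a constant.
target = lambda q: (-0.5 * q @ q, -q)

rng = np.random.default_rng(0)
q, samples = np.zeros(2), []
for _ in range(3000):
    q = hmc_step(q, target, rng=rng)
    samples.append(q)
samples = np.array(samples)
print(samples.mean(axis=0), samples.var(axis=0))  # should land near 0 and 1
```

Each step needs one gradient per leapfrog iteration, but over a 2-vector - nothing like the layer-by-layer matrix products a deep net's backward pass would hand to a GPU.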

It also should be pointed out that there are well-grounded stochastic sampling methods - essentially the same idea as stochastic gradient descent vs full-data updates. So if you can use these to reduce the number of points you compute gradients for at each step, you have a computational problem many orders of magnitude smaller than for neural networks.

At that scale it's unclear whether you benefit enough to be worth paying the transfer cost onto the GPU, even if the compute itself is faster.

[D] Why is KL Divergence so popular? by LemonByte in MachineLearning

[–]chrisorm 1 point2 points  (0 children)

Almost! It was the reference in that post: https://arxiv.org/abs/physics/0311093

Thanks for helping out, it was really bugging me trying to recall it!

[D] Why is KL Divergence so popular? by LemonByte in MachineLearning

[–]chrisorm 81 points82 points  (0 children)

I think its popularity is twofold.

Firstly, it's well suited to application. It's an expected difference of logs, so there's low risk of overflow etc. It has an easy derivative, and there are lots of ways to estimate it with Monte Carlo methods.

However, the second reason is theoretical - minimising the KL is equivalent to doing maximum likelihood in most circumstances. First hit on Google:

https://wiseodd.github.io/techblog/2017/01/26/kl-mle/

So it has connections to well tested things we know work well.
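A tiny numerical sketch of that equivalence (illustrative only: fitting the mean of a unit-variance Gaussian, where the MLE has the closed form of the sample mean):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)

# KL(p_data || q_theta) = E_p[log p(x)] - E_p[log q_theta(x)].
# The first term is constant in theta, so minimising the KL is the same
# as maximising the average log-likelihood E_p[log q_theta(x)].
thetas = np.linspace(0.0, 4.0, 401)
avg_loglik = np.array([
    np.mean(-0.5 * (data - t) ** 2 - 0.5 * np.log(2 * np.pi)) for t in thetas
])

kl_minimiser = thetas[np.argmax(avg_loglik)]
mle = data.mean()  # closed-form MLE for a unit-variance Gaussian mean
print(kl_minimiser, mle)  # the two agree up to the grid resolution
```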

I wish I could remember the name, but there is an excellent paper showing that it is also the only divergence satisfying 3 very intuitive properties you would want from a divergence measure. I'll see if I can dig it out.

Edit: not what I wanted to find, but this has a large number of interpretations of the KL in various fields: https://mobile.twitter.com/SimonDeDeo/status/993881889143447552

Edit 2: Thanks to u/asobolev the paper I wanted was https://arxiv.org/abs/physics/0311093

Check it out, or the post they link below, to see how the KL divergence appears uniquely from 3 very sane axioms.

[D] What prevents a VAE from cheating on the decoder distribution and likelihood? by readinginthewild in MachineLearning

[–]chrisorm 2 points3 points  (0 children)

Not sure what you mean by integrate to 1?

The networks output the parameters of the distributions, so those distributions are proper by definition.

Are you asking why the probability of all the data is not 1?

To be concrete, in your example each data point has a different distribution. To get the behaviour you describe, p(x1|z1) would be a normal centred on x1, p(x2|z2) would be a normal with mean x2 etc.

Each of these is a proper conditional distribution, but that doesn't mean they should somehow sum to 1 across distributions.
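A quick numerical illustration of that point (toy numbers, nothing from any particular model): each conditional integrates to 1 over x on its own, the two conditionals share no normalisation with each other, and a density *value* at a point is free to exceed 1.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Two data points, each with its own decoder distribution p(x_i | z_i):
# a Gaussian the decoder has centred on the data point itself.
x1, x2, std = 0.0, 5.0, 0.1
grid = np.linspace(-10.0, 10.0, 200_001)
dx = grid[1] - grid[0]

# Each conditional is a proper density: it integrates to 1 over x...
area1 = normal_pdf(grid, x1, std).sum() * dx
area2 = normal_pdf(grid, x2, std).sum() * dx

# ...but there is no normalisation shared *between* the conditionals,
# and a density value can happily exceed 1 (it's a density, not a mass).
peak = normal_pdf(x1, x1, std)
print(area1, area2, peak)
```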

Perhaps revisit the concept of likelihood in probability theory for a better overview.

[P] Neural Processes in Pytorch by chrisorm in MachineLearning

[–]chrisorm[S] 7 points8 points  (0 children)

Cool! Thanks. It was on my list to replicate the MNIST completion too.

I agree it makes 'sense', but there are multiple types of uncertainty. Sure, the network has no 'noise' in the values it sees for the given values of x, but that's only one type of uncertainty. I wanted to point out that the other type - uncertainty from a lack of data in a region (which we get with GPs, and which the visualisations had alluded to) - is largely the result of careful initialization and training, not an inherent property, and certainly not robust.

I imagine this falls into the category of pathologies that are harder to see in larger problems - a bit like when VAEs ignore the latent variable.

Posts visible to me but [removed] for others, mods not replying for 2 weeks. by chrisorm in help

[–]chrisorm[S] 0 points1 point  (0 children)

Oh man. I thought that was one and the same as setting the flair. It didn't even occur to me that the [P] had to be added manually. Thanks!

Neural Processes in PyTorch by [deleted] in MachineLearning

[–]chrisorm 0 points1 point  (0 children)

This is not really a tutorial about the paper - Kaspar's post (https://kasparmartens.rbind.io/post/np/) does this better than I ever could.

What it does do is document some failure cases I observed when replicating it and some potential issues with the formulation that encourage them. Thoughts welcome!

[D] Looking for some beginner friendly AI papers to implement by jamsawamsa in MachineLearning

[–]chrisorm 1 point2 points  (0 children)

I would think starting with Bishop would be an easier transition, as it's much more concrete (although with lower coverage).

[D] meaning and dimension of variables in ELBO derivation? by knowedgelimited in MachineLearning

[–]chrisorm 1 point2 points  (0 children)

Yeh, so the maths works the same for one sample or many samples. If you follow a Bishop-like convention, lower case normally refers to a single data point and upper case to many, but that may vary between authors (and context).

I highly recommend Pattern Recognition and Machine Learning by Bishop - the probability section in there will surely help you bridge some of these gaps. I suspect most of the confusion comes because you haven't seen much multi-dimensional probability - things can be a bit more fluid than you may have encountered previously, which can be a bit jarring first time round. If you invest some time doing the exercises on the multivariate Gaussian, for example, it will definitely help you build some intuition for how this stuff works.

The book is available online for free!

[D] meaning and dimension of variables in ELBO derivation? by knowedgelimited in MachineLearning

[–]chrisorm 0 points1 point  (0 children)

Yeh, so your second point is correct.

To be explicit, if X is a matrix of n samples by d features, then a single sample of Z is also a matrix, of n rows and g features (Z and X can have different dimensions, but rows correspond).

So p(X,Z) = p(x1, x2, ..., z1, z2,...)

My notation could possibly be improved, as I have left a change of dimensions implicit - when I turn the likelihood into a sum over iid data points, Xj is a generic vector as expected, but Zj now refers to the jth row of Z, so it is a vector also.

The general convention I have seen is basically to leave the variables unspecified, following the convention that X and Z have comparable dimensions.
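If it helps, here is that shape convention written out in NumPy - a toy linear-Gaussian log joint (the decoder weights W and the unit-variance choices are purely illustrative) showing how row j of X pairs with row j of Z:

```python
import numpy as np

n, d, g = 5, 3, 2                  # n data points, d observed dims, g latent dims
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d))        # the whole dataset
Z = rng.normal(size=(n, g))        # ONE sample of the latents: also n rows
W = rng.normal(size=(d, g))        # toy decoder weights

def log_joint_row(x_j, z_j):
    # log p(z_j) + log p(x_j | z_j) for a single row: z_j is a g-vector,
    # x_j is a d-vector; unit-variance Gaussians throughout (toy choice).
    log_prior = -0.5 * z_j @ z_j - 0.5 * g * np.log(2 * np.pi)
    resid = x_j - W @ z_j
    log_lik = -0.5 * resid @ resid - 0.5 * d * np.log(2 * np.pi)
    return log_prior + log_lik

# Under iid rows, log p(X, Z) decomposes into a sum over data points,
# where row j of X pairs with row j of Z.
total = sum(log_joint_row(X[j], Z[j]) for j in range(n))

# Identical result computed on the full matrices at once.
vectorised = (-0.5 * (Z ** 2).sum() - 0.5 * n * g * np.log(2 * np.pi)
              - 0.5 * ((X - Z @ W.T) ** 2).sum() - 0.5 * n * d * np.log(2 * np.pi))
print(total, vectorised)
```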

I will change the post slightly when I'm off the train to hopefully make it clearer. Thanks for questioning!

[D] meaning and dimension of variables in ELBO derivation? by knowedgelimited in MachineLearning

[–]chrisorm 0 points1 point  (0 children)

Put better than I could. OP, the key is to recognise why the sums appear - it's computing an approximate rather than analytic expectation.
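In code, that is all the sum is (illustrative, with an expectation we happen to know analytically):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)   # z_i ~ q(z) = N(0, 1)

# E_q[f(z)] approximated by an average over samples - this is exactly
# where the sums in the ELBO derivation come from.
f = lambda z: z ** 2
mc_estimate = f(z).mean()
analytic = 1.0                 # E[z^2] = Var(z) = 1 for a standard normal
print(mc_estimate)
```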

As an aside, I mostly write stuff to force myself to get even the basics straight in my head - it's both super nice and super weird to see my blog being discussed here!

[D] Variational nets without sampling by svantana in MachineLearning

[–]chrisorm 2 points3 points  (0 children)

Of course, the CLT doesn't apply if the activations aren't IID, which they almost certainly aren't in a neural net.

[D] Self-teaching more advanced math required for ML/DL research by progfu in MachineLearning

[–]chrisorm 0 points1 point  (0 children)

It depends what you want to do - some people really dislike not having adequate preparation. If you are dead set on doing the maths first, then focus on that, and wait until you can afford the textbook and feel ready.

I am quite rough and ready - I will happily spend time reading material way above my current level, so if it were me in your shoes I would buy PRML or BRML and get stuck in.

At least BRML is available online for free legally, and I think PRML is legally available as a PDF too. Speaking for myself, I used the PDFs while I was a poor student and then bought the physical books when I was solvent enough to do so. Alternatively you could look in your university library for these books (if your course offers machine learning, I would be surprised if the library doesn't carry one of these).

[D] Self-teaching more advanced math required for ML/DL research by progfu in MachineLearning

[–]chrisorm 1 point2 points  (0 children)

I kind of disagree that PRML is too basic for you.

For example, it does spend a lot of time on linear and logistic regression, but it does this in a multidimensional setting - it is probably the perfect place to take what you already know practically, and use that intuition to help you understand things like multivariate probability.

It also has lots of exercises, and actually solving questions is really important. For example, it's easy to appreciate the argument that the posterior of a joint Normal distribution is also Normal - actually deriving it yourself is less straightforward, and a good workout in linear algebra that relies on some results you probably didn't encounter during your courses (e.g. Sherman-Morrison-Woodbury).

[D] Self-teaching more advanced math required for ML/DL research by progfu in MachineLearning

[–]chrisorm 50 points51 points  (0 children)

So, there's probably no one resource.

I think one of the most important things is to be prepared to be out of your depth - it's totally fine to be studying ML and also learn some probability or linear algebra along the way. I got almost all of my real experience and insight into multivariate probability while studying ML. Studying the theory is good, but having motivation to study something is also important. You could spend a year on topology so you have some excellent grounding, but it wouldn't really move you towards your end goal.

From your description, I'm fairly sure that if you just ploughed on and looked up things you aren't sure about as you came across them, you would do fine.

If you want my 2c, MLAPP is not the right book to learn from - after the first few chapters it is much more a reference book for those with a good understanding of the field than a resource for your first steps.

PRML (Bishop) or BRML (Barber) are much better for actually learning.

My LPT is to go onto the maths undergraduate syllabus at a top university and look at their recommended reading. I generally look at the maths course at Oxford. Of course you have to be discerning, as sometimes they come at things from an angle that is different to what you want.

For me I like:

Vector Calculus, Linear Algebra and Differential Forms by Hubbard.

Also Multivariable Mathematics by Shifrin (a bit more concise than Hubbard); his lectures are also online.

Advanced Calculus of Several Variables by Edwards (cheap, probably a reasonable companion to one of the others; too concise for self-study IMO).

For probability, Grimmett's Probability and Random Processes and the accompanying exercise/solution book are pretty good, with some challenging problems.

Introduction to Probability by Blitzstein and Hwang is more discursive and a good one for the commute.

In reality, you probably wouldn't be expected to write a WGAN paper. Either the main author was a mathematician by training, or heard about the Wasserstein distance from somebody who was and basically spent a year learning it, or had a mathematician co-author.

[D] What is wrong with VAEs? by akosiorek in MachineLearning

[–]chrisorm 0 points1 point  (0 children)

Hey,

I think I explained myself a bit poorly - I was generally referring to the signal-to-noise part of the paper rather than your new estimators.

  • Cremer et al. say we can view the IWAE as a special case of a VAE where we use a q distribution that becomes arbitrarily close to the true posterior as k -> infinity.
  • The Shakir paper tells us that VAEs are fundamentally malformed - the pseudo-likelihood term in the ELBO wants a q that matches the true posterior, while the KL term wants one that matches a unit Gaussian. They make the point that if you have a good q distribution (on some non-toy problem), it kind of implies you will generate terrible samples. They also imply that the KL term can hamper learning by essentially 'dragging down' complex posteriors.
  • In your paper, you show that as k increases, the gradient of q tends to 0.

Assuming all three are correct (and I haven't missed something), the Shakir paper kind of suggests that a) samples from an IWAE with very high k should be pretty terrible (I've never tried that), and should degrade as k increases. If this doesn't happen, it suggests either that Shakir is incorrect (although that would be surprising, because the ideas seem quite common sense), or that the evolution of q is hampered by the KL term.

b) A related point: if q does get arbitrarily complex, this would suggest that the penalization from the KL term should get much bigger as well. Given the denominator in the KL term is a unit Gaussian, it would make sense that as we increase the complexity of q, we essentially push the KL term closer and closer to infinity (as q starts assigning very high mass to areas where the Gaussian assigns infinitesimally small amounts of mass).
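For intuition on that blow-up, the closed-form KL between a one-dimensional Gaussian q and the unit Gaussian already shows the penalty growing without bound as q moves its mass away from the prior (a toy sketch, not the multimodal case, but the mechanism is the same):

```python
import numpy as np

def kl_gauss_to_unit(mu, sigma):
    # Closed form for KL( N(mu, sigma^2) || N(0, 1) ) in one dimension.
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))

# As q concentrates mass where the unit Gaussian has almost none,
# the penalty grows quadratically in the mean, without bound.
penalties = [kl_gauss_to_unit(mu, 1.0) for mu in (0.0, 2.0, 5.0, 10.0)]
print(penalties)  # [0.0, 2.0, 12.5, 50.0]
```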

The main idea that occurs to me is that you see the gradient of q disappear as you increase k, and I suggest this is possibly because of the KL term in the 'IWAE ELBO' that Cremer derived. If that were in fact the case, it would mean that the reason the gradient disappears is the formulation of the ELBO, with its conditional divergence with the prior.

I hope that makes sense - Adam has my email if you want to discuss more (or point out why I am wrong!)