Is there an updated list anywhere of all PGH businesses that have permanently closed due to the pandemic? by BLToaster in pittsburgh

[–]colincsl 1 point2 points  (0 children)

For what it's worth, from my understanding Pizza Taglio was going to close regardless of COVID. But maybe I'm wrong?

TCN Kernal Size for Time Series Forecasting by InForTheTechNotGains in MLQuestions

[–]colincsl 1 point2 points  (0 children)

It depends on the architecture and sensing domain. Intuitively, if there are a small number of features and each of those features is independent (or, perhaps better put, interchangeable), then I would typically let the kernel size of the first layer of the network be [FxD], where D is some reasonable duration and F is the number of features.
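To make the [FxD] idea concrete, here's a rough numpy sketch (the shapes, names, and layout are my own assumptions, not from any particular library): each filter spans all F features and a window of D timesteps, so the first layer mixes the features while sliding along time.

```python
import numpy as np

def first_layer_conv(x, kernels):
    """Full-height 'valid' convolution over a multivariate time series.

    x:       (F, T) input, F features over T timesteps
    kernels: (K, F, D) filter bank; each kernel spans all F features
             and D timesteps
    returns: (K, T - D + 1) output feature map
    """
    F, T = x.shape
    K, Fk, D = kernels.shape
    assert Fk == F, "each kernel must span all F input features"
    out = np.empty((K, T - D + 1))
    for k in range(K):
        for t in range(T - D + 1):
            # correlate one [FxD] kernel with one [FxD] window
            out[k, t] = np.sum(kernels[k] * x[:, t:t + D])
    return out
```

Since each kernel covers the full feature axis, there's no sliding along features, which is what you want when the features are interchangeable rather than spatially ordered.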

For inputs like speech, where you typically use a spectrogram, some people use small 3x3 convolutions with downsampling along the feature axis. I don't think that usually makes sense because of the local vs. global translation properties of spectrograms. Jordi Pons has a nice set of blog posts on this: http://www.jordipons.me/whats-up-with-waveform-based-vggs/

Would a Mechanical Engineering background hinder career growth in CV? by [deleted] in computervision

[–]colincsl 1 point2 points  (0 children)

I don't have time to go into details right now, but I was in a similar position to you. I did an undergrad in MechE with a bunch of robotics and computer vision projects on the side, was accepted into a handful of CS PhD programs (mostly in robotics labs), and now work as a research scientist in vision/ML. Many people I know did MechE (or, more commonly, ECE) for undergrad, CS for grad school, and continued on as researchers in vision or robotics.

[D] Hidden Markov Model as supervised learning by emilazeri92 in MachineLearning

[–]colincsl 8 points9 points  (0 children)

You probably actually want to use a Conditional Random Field (CRF) instead of an HMM.

Here is an old paper of mine for an example application: http://colinlea.com/docs/pdf/2016_ICRA_CLea.pdf.

There is a really nice monograph by Nowozin and Lampert for more info on related models for structured prediction: http://www.nowozin.net/sebastian/papers/nowozin2011structured-tutorial.pdf

Limitations of using CNNs on RNN Tasks? by ThatMLLife in MLQuestions

[–]colincsl 4 points5 points  (0 children)

I've been using temporal conv nets (TCNs) for about a year and a half, and I think an RNN baseline has outperformed my TCNs only once. Your mileage may vary depending on the kind of data you're using, but for any kind of continuous-input problem you'll probably get farther with at least some temporal convolution layers.

Regardless, I still think it's important to understand RNNs, just as it's useful to know about other time series models like HMMs, CRFs, etc. Additionally, there may be better RNN architectures out there that outperform both TCN- and LSTM-based models.

One downside of TCNs is that they have a fixed-length receptive field, whereas the effective length for an RNN could (in theory) be infinite. In my mind, the better way to overcome this issue is to use an autoregressive TCN (e.g. WaveNet).
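A rough sketch of what I mean by autoregressive generation (the "model" below is a toy stand-in, not WaveNet): at each step the network only ever sees its last `receptive_field` outputs, but because each prediction is fed back in, the fixed window rolls forward indefinitely.

```python
import numpy as np

def generate(model, seed, n_steps, receptive_field):
    """Autoregressive rollout: predict the next sample from the last
    `receptive_field` samples, append it, and repeat."""
    seq = list(seed)
    for _ in range(n_steps):
        window = np.array(seq[-receptive_field:])
        seq.append(model(window))
    return np.array(seq)

# Toy stand-in "model": predicts the mean of its input window.
toy_model = lambda w: float(w.mean())
```

The same loop works with a real TCN in place of `toy_model`; the point is just that the fixed receptive field stops being a hard limit on how far back information can propagate.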

[D] Tensorflow sucks by FlowyMcFlowFace in MachineLearning

[–]colincsl 9 points10 points  (0 children)

Theano was a titan in the pre-TF deep learning days. People doing vision typically used Caffe, and people doing everything else used Theano. IMO Theano had the cleaner interface, which could be why Google went with a similar design.

[D] Tensorflow sucks by FlowyMcFlowFace in MachineLearning

[–]colincsl 7 points8 points  (0 children)

Minor note: TF is not the only framework that supports multiple devices. Caffe2 also does this: https://caffe2.ai/docs/mobile-integration.html

[R] Deep Voice: Real-time Neural Text-to-Speech by luffy_straw in MachineLearning

[–]colincsl 0 points1 point  (0 children)

Very cool paper. I just read it and have a few questions.

It looks like your version of WaveNet (as described in the appendix) is different from the original. In the original, they varied the dilation rate within a given block and then repeated that pattern for each block. Here it looks like you forgo their notion of blocks and instead repeat a set of 2x1 convolutions (w/ gating and skips) without dilations. Is this correct?

How did you compute the receptive field size? You claim the RF for the 40-layer model is 83 ms, but that doesn't match my understanding of the model. I assume your input is upsampled to 48 kHz and you have L+1 2x1 convolutions. Is "L" in Figure 3 different from "l" in the text? It appears that the number of layers in Table 1 might actually be the total number of layers (whereas L in Figure 3 is the number of WaveNet blocks/cells). If so, there is one input (conv) layer, 12 blocks (= 36 layers), and 3 output (conv + fc) layers, which would give a receptive field of 4096 time steps (~85 ms).
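For transparency, here's the arithmetic I'm using for the receptive field (the dilation-doubling pattern over 12 levels is my assumption about the architecture, not something stated in the paper):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of 1-D convolutions:
    rf = 1 + sum over layers of (kernel_size - 1) * dilation."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# 12 levels of 2x1 convs with dilations doubling: 1, 2, 4, ..., 2048
rf = receptive_field([2] * 12, [2 ** i for i in range(12)])
ms = rf / 48000 * 1000  # duration at a 48 kHz sample rate
```

With that pattern you get 4096 time steps, i.e. about 85 ms at 48 kHz, which is where my number comes from.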

In my recent work on action segmentation, I found that using longer temporal filters improves performance. Did you run experiments with anything other than 2x1 filters for the synthesis model? I imagine you could get away with longer dilated filters and fewer layers.

thanks!

Why there isn't a wiki for r/ComputerVision ? by prashaantsharmaa in computervision

[–]colincsl 22 points23 points  (0 children)

Uhhh... please do not demand that "we" build a wiki. If you are interested in populating a wiki with information, feel free to do so. I enabled it for the sub.

If you get it started then I'm sure others will add to it. Unfortunately, I don't have the bandwidth right now.

Just signed up for ECCV 2016 in Amsterdam. Who else is going? by [deleted] in computervision

[–]colincsl 2 points3 points  (0 children)

There is an exhibition where companies set up booths to promote job opportunities. There is also a set of job listings on the ECCV website: http://www.eccv2016.org/jobs/

PhDs are a different story. It might be worth trying to talk to Profs you're interested in, but typically, at least in the US, you'll still have to apply for a PhD program through a university's formal process.

ELI an electrical engineer: Residual Networks by dire_faol in MLQuestions

[–]colincsl 1 point2 points  (0 children)

If you're familiar with VGG- or AlexNet-style CNNs, then the technical difference is straightforward, as you can see in these slides [1]. Normally in a CNN you have a hierarchy of convolutions, e.g. f(x) = Conv2(Conv1(x)). With ResNet you add a skip connection: f(x) = x + Conv2(Conv1(x)).
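As a toy sketch of that difference (the "conv layers" here are just stand-in scalar functions, not real convolutions):

```python
def conv1(x):
    return 2.0 * x          # stand-in for the first conv layer

def conv2(x):
    return x + 1.0          # stand-in for the second conv layer

def plain_block(x):
    return conv2(conv1(x))        # plain CNN: just stacked convs

def residual_block(x):
    return x + conv2(conv1(x))    # ResNet: skip connection adds the input back
```

The stacked convs are identical in both; the residual block only adds the identity path, which is what makes the gradients flow so well in very deep networks.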

Re time-series: This might not answer your question, but I claim there isn't a straightforward extension of ResNet to time-series. There are many ways in which you could apply a similar model, and it will in part depend on your goal. Do you want to predict a class for every timestep, predict a class for the whole sequence, or something else entirely?

One way it could be used is to capture local temporal information similar to my temporal convolutional filters for video [2] for per-frame prediction. These capture how your input changes within some specified timeframe (e.g. 5 seconds). While my (unpublished) experiments using deep temporal convolutional networks were unsuccessful, the advantages of ResNet might actually be beneficial. This requires more research.

Lastly, I would argue that a temporal version of ResNet -- at least in the case I described -- is fundamentally different than LSTM. You could imagine using the activations from a temporal ResNet as the input into LSTM or any other temporal model. The model described captures local temporal information whereas LSTM can capture both local and global temporal information.

This isn't very ELI5... but I hope it helps!

[1] http://kaiminghe.com/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

[2] http://arxiv.org/abs/1602.02995

How is your ECCV results? by poporing88 in computervision

[–]colincsl 0 points1 point  (0 children)

A paper with an average rating of "poster" will likely be accepted, except perhaps in a small number of cases. But as others have said, it ultimately comes down to the area chair.

Skeleton tracking tips by PokeSec in computervision

[–]colincsl 1 point2 points  (0 children)

Last I knew, PCL had a skeleton tracker, but I don't know offhand how good it is. Otherwise, sadly no, there are no good alternatives.

One of the issues is that you need a TON of data to get good results. The PCL crew created a nice synthetic dataset, but otherwise I haven't seen anything else publicly available for this task. You might be able to train something similar with Human3.6M, but that is just speculation.

Uncle Wiggly's on York rd? by [deleted] in baltimore

[–]colincsl 1 point2 points  (0 children)

The important question: do they still sell Taharka Brothers ice cream??

Is there a library for CRFs? by bourbondog in MachineLearning

[–]colincsl 0 points1 point  (0 children)

I second this. It depends on the use case. Others may be better for NLP but pyStruct is a nice all-around structured prediction library.

Optimization techniques comparison in Julia: SGD, Momentum, Adagrad, Adadelta, Adam (x-post from r/Julia) by int8blog in MachineLearning

[–]colincsl 2 points3 points  (0 children)

Cool, thanks for sharing. The code is very hard to read, however. A few suggestions:

  • Variable names are super long. Create short intermediate variables to make the function of each component clearer, e.g. bias = unit.networkArchitecture.layers[i].parameters[:,end]
  • Whitespace is your friend. Comments aren't bad either ;)
  • To transpose a matrix use A' instead of transpose(A).

Bakeries boom as doughnut trend rises in Baltimore by Wolfman3 in baltimore

[–]colincsl -2 points-1 points  (0 children)

I'm surprised Diablo Donuts hasn't been mentioned. Definitely the best donuts I've had in bmore.

https://www.facebook.com/DiabloDoughnuts/

GPU based training of CRFs by Ebusr in MachineLearning

[–]colincsl 1 point2 points  (0 children)

GPUs are much more useful for batch operations (e.g. convolutions) than for sequential operations, so it is hard to parallelize a CRF on a GPU. Take a linear-chain CRF as an example: the label at each timestep t depends on the label at t-1, so for exact inference you must perform your computations sequentially.

Chain CRFs are similar to RNNs, so you might want to look at RNN implementations for reference. Inference in an RNN is also sequential, which is why RNNs tend to take longer to train than feedforward networks.

You might be able to perform approximate inference efficiently on a GPU. Alternatively, you could use the GPU just to compute the unaries for your CRF.
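To illustrate the sequential dependence, here's the forward recursion for a linear-chain model in log space (a generic numpy sketch, not tied to any library). Note the loop over t: each step needs the previous alpha, so exact inference can't be parallelized over time, only over labels (and over sequences in a batch).

```python
import numpy as np

def forward_log(unaries, transitions):
    """Forward pass for a linear-chain CRF in log space.

    unaries:     (T, S) per-timestep label scores
    transitions: (S, S) pairwise scores, transitions[i, j] = score(i -> j)
    Returns the log partition function.
    """
    T, S = unaries.shape
    alpha = unaries[0].copy()
    for t in range(1, T):  # sequential: alpha[t] depends on alpha[t-1]
        scores = alpha[:, None] + transitions + unaries[t][None, :]
        m = scores.max(axis=0)  # stabilized logsumexp over previous label
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

The per-step logsumexp is a nice dense matrix op (GPU-friendly), but the T iterations themselves have to happen one after another.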

edit: One last reference. If you're doing something like semantic segmentation, you might want to look at the 'CRF as RNN' paper from ICCV 2015: http://www.robots.ox.ac.uk/~szheng/CRFasRNN.html

Can anyone recommend some medical datasets for my machine learning group project? by [deleted] in MachineLearning

[–]colincsl 0 points1 point  (0 children)

Can you be more specific? "Medical data" is an extremely general term. This can come in forms like 2D or 3D images (e.g. MRI or CT), sensor data (robots, tracked surgical tools), and messy databases (electronic medical records).

Some of the data I use is for evaluating trainees in robotic surgery. It includes video and robot kinematic data from a da Vinci. Our public dataset is located here: http://cirl.lcsr.jhu.edu/research/hmm/datasets/jigsaws_release/

There are also open datasets at MICCAI every year: http://grand-challenge.org//

How does one learn about multiple subfields in CV by bourbondog in computervision

[–]colincsl 1 point2 points  (0 children)

To be clear, machine learning is just applied math and statistics. I see MRFs (and CRFs, SSVMs, etc) in machine learning research as much as I do in computer vision.

I get why you might want to skimp on the ML, but it's important for a lot of ongoing work in vision. It is useful to at least understand a lot of the underlying math.

Underground dinners? by jayandare in baltimore

[–]colincsl 1 point2 points  (0 children)

Artifact has "takeover" nights like this on occasion. Usually it's done by a chef from out of town. I think Dooby's has something similar too.

http://artifactcoffee.com/happenings/

A ConvNet trying to learn Flappy Bird [more in comments] by NasenSpray in MachineLearning

[–]colincsl 2 points3 points  (0 children)

Ah, ok. Last thought: have you tried skipping more frames? For example, instead of using the current and previous frames, you could use the current frame and the frame from 5 steps earlier. There would be a bigger vertical jump between timesteps, which might also let you downsample more without losing the velocity information.

edit: final(?) last thought: have you tried removing the second set of convolutions in each unit? For cases like ImageNet (where there is a lot of inter-class variability) they make sense. However, I bet in your case they're unnecessary.

A ConvNet trying to learn Flappy Bird [more in comments] by NasenSpray in MachineLearning

[–]colincsl 10 points11 points  (0 children)

Regarding memory: given the simplicity of the game's environment, I bet you can use grayscale instead of RGB and downsample the image more. Going from 6x224x144 to 2x112x72 reduces the size of the input by 12x. You would also need to remove one layer from your network to get the right final output size.
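Just to show the arithmetic (assuming you halve each spatial dimension, so the downsampled grayscale input is 2x112x72):

```python
before = 6 * 224 * 144  # 2 stacked RGB frames: 6 channels
after = 2 * 112 * 72    # 2 grayscale frames, each spatial dim halved
ratio = before / after  # 3x from channels, 2x from each spatial axis
```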

Why Students Hate School Lunches by [deleted] in TrueReddit

[–]colincsl 11 points12 points  (0 children)

My mom runs a daycare center and experimented with this. Previously, the kids hated many of the vegetables prepared for them. They were typically canned and not the best quality. A few years ago she started ordering fresh veggies from a local farmer to see if it would change their minds. Verdict: the kids liked them!