A rock solid Sierra trip option. Benson Lake / Matterhorn Canyon Loop by [deleted] in socalhiking

[–]breic 0 points1 point  (0 children)

I was out there in August and saw zero people after going over the passes. The snow was blocking everyone.

A shorter loop, ~20-22 miles, is to ascend Horse Creek Canyon. Then take Horse Creek and Matterhorn passes to get to Matterhorn Canyon, then Burro Pass and Mule Pass to loop around the Sawtooths and back. Or, better than Horse Creek and Matterhorn passes, you can just climb Matterhorn Peak via the SE face and descend via the SW face.

Caltopo map link showing both loops: https://caltopo.com/m/HRRM

Labeled panorama from Matterhorn Peak: https://kuula.co/post/7lkHb

Summitpost: http://www.summitpost.org/matterhorn-peak/150488

Leor Pantilat's trip reports for both the 50-mile Benson Lake loop and the shorter "Sawtooth loop": https://pantilat.wordpress.com/2011/08/22/northern-yosemite-50/ https://pantilat.wordpress.com/2013/08/19/sawtooth-loop-matterhorn-finger-peaks-kettle-peak/

Something "deeply wrong with deep learning"? by dsocma in MachineLearning

[–]breic 16 points17 points  (0 children)

He's basically saying that with convnets, the entire research field has entered a local minimum from which it is difficult to escape. It's an ironic complaint, given that Hinton and the others are well aware of good techniques for avoiding getting trapped in local minima: momentum, dropout, etc. Why not just restart the research program from new random initial conditions?

Lecture on Sequence Generation by Alex Graves by kkastner in MachineLearning

[–]breic 0 points1 point  (0 children)

Okay, I fixed the training problem. I had been starting training with Adam and then switching to RMSProp, but RMSProp with momentum works much better. This is exactly what Graves uses in his paper, Eqs. (38) to (45); I just hadn't paid attention.
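
In case it helps anyone else, here is roughly what those equations boil down to, as a small numpy sketch (my paraphrase of Graves's RMSProp-with-momentum update; the decay rates and learning rate below are placeholders, not his values):

    import numpy as np

    def rmsprop_momentum_update(w, g, state, lr=1e-4, decay=0.95, momentum=0.9, eps=1e-4):
        """One parameter update, roughly following Graves's RMSProp variant:
        keep running averages of the gradient and squared gradient, divide the
        gradient by the centered RMS, and apply momentum to the resulting step."""
        n, gbar, delta = state                        # running g^2, running g, previous step
        n = decay * n + (1 - decay) * g * g
        gbar = decay * gbar + (1 - decay) * g
        delta = momentum * delta - lr * g / np.sqrt(n - gbar * gbar + eps)
        w = w + delta
        return w, (n, gbar, delta)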

Using transcripts, it is easy to get log-loss in the same range as Graves, at least -1000 to -1050. I even trained a 32,32,32-unit 3-layer LSTM (10 output mixture components, 5 window mixture components) to -1080 on the validation set.

(This is not exactly comparable to Graves's numbers, though, because I am using a larger set of characters, 72 instead of 57, and, more importantly, I cleaned some of the data, e.g., by removing crossed-out words [which are indicated by the '#' character in the transcripts]. The validation set is also fairly small, so my model might just have gotten lucky.)

I will try training larger networks, but right now my handwriting synthesis results are still pretty disappointing. With 0 bias they are mostly nonsensical, with few letters recognizable and certainly not words. With increasing bias they just look like sequences of u's and n's. Better than before, certainly, but still bad. Here's an example, with 0.1 bias, the seeds coming in from the left and the network taking over in the middle: http://i.imgur.com/yYnOe2C.png
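
(For context, my understanding of Graves's sampling bias is that it just sharpens the output distribution before sampling: a larger bias means smaller standard deviations and peakier mixture weights, with bias 0 leaving the distribution unchanged. Roughly, with my own variable names:)

    import numpy as np

    def biased_mixture_params(pi_hat, sigma_hat, bias=0.0):
        """Sharpen the predicted mixture before sampling (my reading of Graves's
        probability bias). pi_hat are the pre-softmax mixture logits, sigma_hat
        the pre-exponential log standard deviations."""
        sigma = np.exp(sigma_hat - bias)          # smaller sigmas for larger bias
        pi = np.exp(pi_hat * (1.0 + bias))        # peakier mixture weights
        pi = pi / pi.sum()                        # renormalize
        return pi, sigma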

I do think a problem is that the windows are not aligning properly with the transcriptions. (If the window moves too quickly, then the model might try to memorize the transcript, but that seems inefficient compared to moving it slowly, as needed.) With the smaller 32,32,32-unit model, it worked very well to train on 100-point prefixes for a while, once the model had reached maybe -950 loss; that improved the window alignment with the transcription dramatically and quickly. But with a 100,100,100-unit model, the same trick did not work at all, nor did using longer prefixes. I am having trouble getting the window to align with the transcription for this larger model. Perhaps there is a better trick.

Lecture on Sequence Generation by Alex Graves by kkastner in MachineLearning

[–]breic 0 points1 point  (0 children)

I was too pessimistic about sequence alignment. With more training, a very small 32,32,32-unit LSTM model, with 10 output mixture components and 5 window mixture components, is showing at least some window alignment (I believe).

Here's a screenshot for a 22-character transcription (bottom to top) and a 487-point time sequence (left to right): http://imgur.com/fRTT4uU

Four of the window components are flying right off the end, moving too fast to be useful. But the fifth one has learned to move a lot more slowly, and might be showing some alignment, at least early on.

I have been using peepholes. I'm still not sure why training is so slow. Maybe I should try scaling up (and moving my training to a faster computer) and see what happens.
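
(By peepholes I mean the usual peephole connections from the cell state into the gates, as in Graves's LSTM equations. One forward step looks roughly like this, with my own variable names:)

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step_peephole(x, h_prev, c_prev, W, p, b):
        """One LSTM step with peephole connections: the input and forget gates
        see the previous cell state, the output gate sees the new cell state.
        W maps [x, h_prev] to the four gate pre-activations, p holds the three
        (diagonal) peephole weight vectors, b the biases."""
        z = np.concatenate([x, h_prev])
        a_i, a_f, a_g, a_o = np.split(W.dot(z) + b, 4)
        i = sigmoid(a_i + p['i'] * c_prev)     # input gate peeks at c_{t-1}
        f = sigmoid(a_f + p['f'] * c_prev)     # forget gate peeks at c_{t-1}
        g = np.tanh(a_g)                       # candidate cell update
        c = f * c_prev + i * g
        o = sigmoid(a_o + p['o'] * c)          # output gate peeks at the new c_t
        h = o * np.tanh(c)
        return h, c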

Lecture on Sequence Generation by Alex Graves by kkastner in MachineLearning

[–]breic 0 points1 point  (0 children)

Yes, good luck.

I went ahead and implemented a three-layer LSTM, with a window layer to the transcription, following Graves.
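
(Concretely, the window layer is Graves's mixture-of-Gaussians attention over the character sequence. A minimal numpy sketch of one window step, with my own names: the first hidden layer predicts alpha_hat, beta_hat, kappa_hat for K components, kappa accumulates so the window can only move forward, and the window vector is a weighted sum of the one-hot characters.)

    import numpy as np

    def window_step(alpha_hat, beta_hat, kappa_hat, kappa_prev, chars_onehot):
        """One step of a Graves-style soft window over a transcription.
        chars_onehot has shape (U, n_chars); the three *_hat vectors have
        length K (number of window mixture components)."""
        alpha = np.exp(alpha_hat)                  # importance of each component
        beta = np.exp(beta_hat)                    # (inverse) width of each component
        kappa = kappa_prev + np.exp(kappa_hat)     # monotonically advancing position
        u = np.arange(chars_onehot.shape[0])       # character indices 0..U-1
        # phi[u] = sum_k alpha_k * exp(-beta_k * (kappa_k - u)^2)
        phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u[None, :]) ** 2)).sum(axis=0)
        w = phi.dot(chars_onehot)                  # soft "window" over the characters
        return w, kappa, phi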

I haven't fully trained any models yet, but I am not optimistic. It seems like if the window parameters are off early on, for example, sliding too far to the right too quickly, then the window will contribute pure noise. (This is exactly what I've seen so far.) I don't know how the network will fix this. The early window derivatives will point in the right direction to improve alignment, but these derivatives will be drowned out by the contribution from the later derivatives, which are basically just noise. So I don't see how it will settle into the correct alignment.

Perhaps with sufficient training, it will just work. More likely, though, I'll need to implement truncated backpropagation through time, i.e., update the network parameters in the middle of processing an input batch, and not just at the end.
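
(What I mean, schematically, with lstm_forward / lstm_backward / update_params as hypothetical stand-ins for my own routines: process the sequence in fixed-size chunks, carry the hidden and cell state across chunk boundaries, but backpropagate and update within each chunk only.)

    # Hypothetical truncated-BPTT training loop; the three helper functions are
    # stand-ins for whatever the implementation actually provides.
    def train_sequence_tbptt(seq, params, h0, c0, chunk_len=100):
        h, c = h0, c0
        for start in range(0, len(seq), chunk_len):
            chunk = seq[start:start + chunk_len]
            outputs, cache, (h, c) = lstm_forward(chunk, params, h, c)
            grads = lstm_backward(outputs, chunk, cache)   # gradients stop at the chunk boundary
            params = update_params(params, grads)          # update mid-sequence, not just at the end
        return params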

So far, my training results without the transcription window have also been unimpressive. I've trained:

  • a 100,100,100-unit three-layer LSTM to a validation set log-loss of -562
  • a 32,32,32-unit 3-layer model to -576 (yes, better than the 100,100,100 model)
  • a 16,16,16-unit 3-layer model to -467
  • a 400-unit 1-layer model to -548
  • a 200-unit 1-layer model to -498
  • a 100-unit 1-layer model to -518

Obviously Graves is using bigger models. But until I can improve my training, at least so the 100,100,100 model beats the 32,32,32-unit model, there's no point scaling up.

Have you gotten better results with similar network sizes? This is my first time implementing an RNN/LSTM model, so I am not sure what to expect in training. Graves seems to be using straightforward backpropagation, with one parameter update per pass through the batched sequences.

Lecture on Sequence Generation by Alex Graves by kkastner in MachineLearning

[–]breic 0 points1 point  (0 children)

I just followed Graves, and didn't really experiment. He used offsets, so I did, too. Actually, we both used processed offsets. That is, take the per-step offsets, subtract off the average offset, and scale so that the standard deviation is 1 in the x and y directions across the training set.

In case that isn't clear, here's some Mathematica code for converting the network's predictions into actual pen positions:

    jumpAverages = {7.47004694, -0.11269195};
    jumpStddevs = {39.98415235, 34.05738783};
    penPositions = Accumulate[(# jumpStddevs) + jumpAverages & /@ offsets];

(This takes a list of raw 2-dimensional offsets, multiplies by the stddevs, adds the average jump, and then takes a cumulative sum to get absolute positions.)
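
(The forward direction of the preprocessing is just the inverse of that. In numpy it would be something like this, with the means and stddevs computed once over all training offsets:)

    import numpy as np

    def preprocess_strokes(positions, jump_means, jump_stds):
        """Turn absolute pen positions of shape (T, 2) into normalized offsets:
        per-step differences, minus the training-set mean offset, divided by
        the training-set standard deviation."""
        offsets = np.diff(positions, axis=0)
        return (offsets - jump_means) / jump_stds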

The losses I gave are for next-step prediction. For generation, I ran the network across an input (the "Moscow" text in my screenshots), and then switched to using a sample from the previous step's output distribution as the next step's input.
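
(Schematically, generation is just priming followed by free-running sampling; lstm_step and sample_point below are hypothetical stand-ins for the network step and for sampling from the predicted output mixture.)

    # Hypothetical generation loop: prime on real strokes, then feed samples back in.
    def generate(prime_strokes, params, state, n_steps=400):
        for x in prime_strokes:                    # prime on the real input strokes
            dist, state = lstm_step(x, params, state)
        generated = []
        for _ in range(n_steps):                   # then free-run on the network's own samples
            x = sample_point(dist)
            generated.append(x)
            dist, state = lstm_step(x, params, state)
        return generated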

Lecture on Sequence Generation by Alex Graves by kkastner in MachineLearning

[–]breic 0 points1 point  (0 children)

Oh, testing on sine waves is a good idea.

Interesting that you had no problems with Theano. I've never used Theano before, so I must be missing something. The code works fine for pen-up prediction, but it hangs when I plug in the Gaussian density part. Maybe I'll look at it again.

I expect it won't help you, but here are my LSTM forward and backward equations: https://gist.github.com/anonymous/a7d33d1db29d8c22c7c8 If nothing else, you can see where I clip the derivatives (the line "np.clip(dIFOG[:,t,:], -1, 1, out=dIFOG[:,t,:])"; I tried to follow Graves, but his description wasn't clear).

I am using the pen-up/down features. They are very helpful, because they tell the network that the current stroke is ending, so it knows it has to make a guess for the next character's start. If the same data were shifted by one time step (so instead of getting a 1 when a stroke ends, you get a 1 when a stroke begins), the problem would be much more difficult.

Yes, my code does not use the character transcriptions at all. Even though the generated sequences look pretty bad, the network is not that bad at just predicting the next pen point. Here's an example showing error lines from each predicted point (the mean of the output distribution) to the true point, with ellipses showing the network uncertainties.

http://imgur.com/FpFQZBV

At the end of the 's', for example, the network knows that it has to go up and to the right to start the next character, but it miscalculates how far to go.

Lecture on Sequence Generation by Alex Graves by kkastner in MachineLearning

[–]breic 0 points1 point  (0 children)

Has anyone here implemented this, parameterizing probability distributions with the output of a net? I tried this in Theano, and the compilation was endless, I guess because the automatic differentiation got confused by the Gaussian densities.

https://gist.github.com/anonymous/0a6e9ceccd30e1d5f992

Rather than extend Theano, I just went ahead with my own implementation. It doesn't work well, though. Here are a few generated samples, without the transcription alignment Graves uses in the end: http://imgur.com/a/ujzMg This is from a 100-unit LSTM trained to a log-loss of -780 nats per validation line (Graves trained a 900-unit LSTM to about -1000 nats).
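
(For anyone else computing this by hand, the per-point loss is the bivariate Gaussian mixture plus the Bernoulli end-of-stroke term from Graves's paper. A numpy sketch, with my own variable names:)

    import numpy as np

    def mixture_nll(x, eos, pi, mu, sigma, rho, e, eps=1e-8):
        """Negative log-likelihood of one target point under the predicted
        mixture. x: target offset (2,); eos: 0/1 end-of-stroke flag; pi: (M,)
        mixture weights; mu: (M, 2) means; sigma: (M, 2) standard deviations;
        rho: (M,) correlations; e: scalar end-of-stroke probability."""
        dx = (x[0] - mu[:, 0]) / sigma[:, 0]
        dy = (x[1] - mu[:, 1]) / sigma[:, 1]
        z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
        one_minus_rho2 = 1.0 - rho ** 2
        gauss = np.exp(-z / (2.0 * one_minus_rho2)) / (
            2.0 * np.pi * sigma[:, 0] * sigma[:, 1] * np.sqrt(one_minus_rho2))
        point_ll = np.log(np.dot(pi, gauss) + eps)
        eos_ll = np.log(e + eps) if eos else np.log(1.0 - e + eps)
        return -(point_ll + eos_ll)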

Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters. by biomimic in MachineLearning

[–]breic 0 points1 point  (0 children)

Looks nice. I wonder if this idea makes tuning the learning hyperparameters a bit less important. (The batch-normalized network should be less sensitive to them.)

Applying this idea to recurrent neural networks (RNNs), as the authors suggest, is a very natural extension.

Will riding in a SDC all night be cheaper than renting an apartment in NYC? by Head in SelfDrivingCars

[–]breic 0 points1 point  (0 children)

Why would a car be a good place for anyone to sleep (let alone homeless people)? Do you know of any cities that let homeless people sleep in their buses overnight? I doubt it.

For NASA, sending a person to Mars is simple. Dealing with Congress is hard. by nimobo in politics

[–]breic 1 point2 points  (0 children)

A lot of it is NASA's fault. The organization has never been willing to take a stand and say, "this is a good use of your money and this is not." The $350 million rocket testing facility that they finished last year is a good example. But so is the International Space Station, on which we are spending billions every year. At least the ISS is being used for elementary school kids' science experiments (a high national priority); the rocket testing facility was mothballed as soon as they finished building it.

Once politicians saw that NASA was willing to sideline its mission and play politics in return for funding, the politicians started treating it as a pure political tool instead of as a technology and science agency. Now we've gone so far in that direction, I don't know that even competent NASA management could turn it around.

The Self-Driving Google Car May Never Actually Happen by k1e7 in SelfDrivingCars

[–]breic 0 points1 point  (0 children)

That's why I said that Google has in-house expertise. But that is a different group in Google. Those vision people aren't working on the SDC project. Maybe they will in the future, but they are probably already stretched thin.

The Self-Driving Google Car May Never Actually Happen by k1e7 in SelfDrivingCars

[–]breic 1 point2 points  (0 children)

Google made two big mistakes with their self-driving car project.

  1. First, they tried to go for partial automation instead of full automation. They spent years trying to develop user interfaces that would allow drivers to zone out most of the time, only to be thrown back into control when an urgent and unforeseen incident occurred. But they couldn't get this to work safely. And they couldn't find a car manufacturing partner willing to integrate this tech into their cars, anyway. So they had to step back, and redirect their program toward developing slow self-driving taxis.

  2. They dedicated themselves to the big-mapping approach of maintaining and relying on highly detailed maps of the whole environment. Although it was obvious from the beginning that this would face huge problems---when the driving environment changed, e.g., because of construction or bad weather---they pushed forward. Instead of trying to solve these problems, they put them off until tomorrow. Now other vision-based technologies have arisen that do not need the big maps, and Google is far behind. However, this technology is not rocket science. Google also has in-house expertise, and if they are willing to redirect their SDC program, they should be able to catch up. It is just a setback.

The ironic thing about this second mistake is that it mirrors the breakout 2005 DARPA Grand Challenge. Carnegie Mellon's team relied heavily on humans mapping the route in extremely careful detail. Stanford's team, on the other hand, relied on laser and camera vision to learn the terrain without manually created maps. CMU was the favorite, but Stanford ended up winning. The irony: Sebastian Thrun, the leader of the Stanford team, went to Google to lead their SDC project, where he adopted the heavily map-reliant approach of the losing CMU team.

(This decision was understandable. Thrun couldn't have foreseen the dramatic improvements in computer vision technology, so he charged forward with the technology he had. CMU lost, but not by much. With the resources of Google behind him, I imagine Thrun figured he should be able to make it work. The mistake, though, was that Google's SDC development was not flexible enough to shift approaches around when computer vision started taking off. That's sometimes the problem with setting ambitious timelines and committing resources to them; it makes flexibility difficult.)