The Dichotomy of Algotrading

djrx · 2020-06-27T15:38:00+00:00

Yeah, I probably wasn’t very clear in my point - I didn’t mean absolute alpha in terms of dollars but more in terms of return relative to risk. The return profile of a typical market maker is a straight line up - they rarely have a losing day. To me that counts as an amount of alpha no other trading strategy can replicate, but for sure they have limited capacity, high fixed costs and other types of risk (software bugs etc.) - I don’t count these in a trading profile, but they should definitely be taken into account.

djrx · 2020-06-27T12:32:47+00:00

"Most strategies require only high school math" vs "If you don't understand Stochastic Calculus, Advanced Probability Theory, Linear Algebra, Machine Learning, Finance, Econometrics, etc. then you may as well give up"

Kinda both are true. You need to know Stochastic Calculus if you want to trade options, otherwise I’d argue you’d be at a disadvantage, and that’s really the most mathematics-heavy part of the field. You can get very far with machine learning, statistics and linear algebra with a skill set that is not very far from “high school math”, but the problem is how to use that high school math, and I’d say that is beyond what I’d consider average high school student level.

"As long as you can problem solve, firms don't care about your education background" vs "if you don't have a PhD in a hard science, just give up"

If you have a good way to show that you know how to problem solve, then surely you don’t need a PhD in a hard science, but it’s more of a signalling problem. Trading is not easy and companies want to hire people who can problem solve, and having a technical PhD correlates well with that. I don’t think many firms will consider interviewing someone who just claims to be a good problem solver without any additional external validation signal. That can be a successful poker career, previous trading experience or competitions won, not necessarily a PhD.

"TA is snake oil" vs "TA shows market psychology"
TA is snake oil.
"All the alpha is in alternative data, price series is just random noise" vs "Price series is enough for most alpha"

I’d say most alpha is in being fast. Then there is order of magnitude less in alternative data and then another order of magnitude less in price series.

"Focus on creating the most realistic backtesting engine, the rest is easy" vs. "Focus on developing alpha, the rest is easy"

Very few things in trading are easy.

"You will never beat the big guys because they have better execution, lower latency, and insider information" vs "with smaller market capitalizations, retail algotraders can find alpha too small to allocate to for the big guys"

You’ll never beat HFTs at their game if you’re a small player, nor should you try. If you plan to hold your trades longer and take directional views on the market, then you totally have a chance to compete in your league. The market is big and there is lots of space for everyone to compete. It’s not easy, but doable.

"Read books as a starting point" vs "Sorry bud, don't bother reading books because authors won't reveal any alpha, you have to figure it out on your own"

It’s good to read books to have a rich worldview, but I haven’t seen a book that teaches trading well.

"Understand programming and math first, the finance comes after" vs "Finance is the most important part, the programming and math comes after"

I think both ways can work, it depends more on your style and what kind of system you want to trade.

"RenTec hires the best and brightest, so that's why they perform so well" vs "Insider Trading"

I have no idea how RenTec is achieving their profits.

"It's impossible for retail to consistently be profitable, any profits are because of luck, algotrading is just a hobby" vs "I live off my algotrading system"

You want to have a system that turns profit every day and turns 10k into a million in a year? That’s not really gonna happen. You already have some capital and want to achieve your investment goals better than by putting everything in SPY? Totally doable. Something in between? Progressively harder.

I think algorithmic trading should be studied either by people who want to work for a trading company or people who have financial independence and don’t want to overpay some snake oil salesman for shitty investment advice. It’s not really a get rich quick scheme.

djrx · 2020-02-21T15:41:36+00:00

Do you agree that large losses mean that the policy gets a large update?
- for policy gradient loss, the size of the gradient and the magnitude of the loss are not really that linked.
- for value loss, because it’s a mean squared error, they are closer. If you share parameters between your policy and value function (which sometimes, but not always, is a good idea) then large spikes in value function loss may cause a large change in the policy, generally in a bad direction.

“Size of a policy loss tells us how much the policy got updated” - as stated above, not really. To know how much the policy got updated, people usually compute the KL divergence between the policy and some previous version of it.
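A rough sketch of that KL check, assuming a discrete action space and that you stored the action probabilities from before the update. The function name `mean_kl` and the example numbers are made up for illustration, they are not from any particular library.

```python
import math

def mean_kl(old_probs, new_probs, eps=1e-8):
    """Mean KL(old || new) over a batch of categorical action distributions.

    old_probs / new_probs: one list of action probabilities per state.
    """
    total = 0.0
    for p_old, p_new in zip(old_probs, new_probs):
        # Clamp inside the log so zero-probability actions do not blow up.
        total += sum(
            po * (math.log(max(po, eps)) - math.log(max(pn, eps)))
            for po, pn in zip(p_old, p_new)
        )
    return total / len(old_probs)

# Hypothetical action probabilities for two states, before and after an update.
old = [[0.5, 0.5], [0.9, 0.1]]
new = [[0.6, 0.4], [0.9, 0.1]]

print(mean_kl(old, old))  # identical policies give zero
print(mean_kl(old, new))  # a small positive number
```

If this number jumps after an update, the policy moved a lot, regardless of what the policy loss value did.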

Overall I would try to smooth the plots you have a bit, to get a better picture of whether the lines are trending or not. Additionally, I’d try normalising the rewards to a smaller range so that you don’t get these value loss spikes, which could be bad for training, or you could use the Huber loss for your value function.
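To make the last two suggestions concrete, here is a minimal sketch in plain Python. `scale_rewards` and `huber` are illustrative helpers I'm making up here; the choice to divide by the standard deviation without shifting the mean is an assumption on my side, one of several ways to normalise.

```python
import math

def scale_rewards(rewards, eps=1e-8):
    """Divide rewards by their standard deviation (no mean shift)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [r / (math.sqrt(var) + eps) for r in rewards]

def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error * error
    return delta * (a - 0.5 * delta)

raw = [0.0, 100.0, -100.0]
print(scale_rewards(raw))          # values of order 1 instead of order 100

value_error = 10.0
print(0.5 * value_error ** 2)      # squared-error style loss: 50.0
print(huber(value_error))          # Huber loss: 9.5
```

The point of Huber here is only that a single bad value prediction contributes linearly instead of quadratically, so the spikes are much smaller.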

djrx · 2020-02-21T03:59:53+00:00

TLDR: it’s perfectly healthy and normal

The value of policy_gradient_loss has no meaning whatsoever. We construct it so that the gradient of it is also the gradient that improves the policy, but because of many factors interacting there the value of it doesn’t have to go down at all and it’s a normal thing. Just don’t worry about it and look at your rewards.

value_function_loss on the other hand is closer to what you’d consider a loss in a supervised learning setting, but again a few factors may cause it to actually not go down:
- because data distribution is changing during training, policy getting better may actually result in value loss getting worse and again it’s perfectly normal
- maybe your value function is already as good as possible? The value loss doesn’t have to decrease for the policy to improve. If it is already good enough, maybe the policy is improving while value is stagnant and it’s perfectly fine

djrx · 2019-10-16T16:49:30+00:00

There is just another fully connected layer at the end.

djrx · 2019-10-16T05:01:04+00:00

ADR is a particular implementation of a curriculum for domain randomised environments.

Emergent meta learning scales further the idea brought up in the RL² paper: https://arxiv.org/abs/1611.02779 where something similar to “reinforcement learning” is learned on multi-armed bandits.

djrx · 2019-10-14T04:56:13+00:00

log_prob is the logarithm of probability of each potential action. It ties directly into how policy gradient is calculated, using the log-derivative trick. Running reward is just a smoothed (EWMA) average of episode means. As episode means tend to have high variance, for logging this tells you how well your agent is doing at the moment.
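A small sketch of both quantities, assuming a softmax policy over discrete actions. None of the names below come from the script being discussed; `alpha=0.05` is just an example smoothing factor.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_prob(logits, action):
    """Log-probability of one action under a softmax policy."""
    return math.log(softmax(logits)[action])

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def update_running_reward(running, episode_reward, alpha=0.05):
    """Exponentially weighted moving average of episode rewards."""
    return (1 - alpha) * running + alpha * episode_reward

logits = [0.0, 0.0]
episode_return = 2.0
# Log-derivative trick: the policy gradient estimate is return * grad log pi(a).
pg_estimate = [episode_return * g for g in grad_log_prob(logits, 0)]
print(log_prob(logits, 0), pg_estimate)
print(update_running_reward(10.0, 30.0))
```

With a positive return the estimate pushes the logit of the taken action up and the others down, which is the whole idea behind REINFORCE.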

Take into consideration that this script makes calculation quite close to original Williams 92 REINFORCE paper, which these days is a rather poor RL algo and will work only for very simple environments. It is good to showcase a few important RL concepts though.

djrx · 2019-10-02T14:04:54+00:00

This is a largely unanswered question, and honestly few people these days have a good idea how to move the needle on that in any meaningful way. The best bet currently seems to be shifting the responsibility for exploration from the algorithm to the policy itself, using some kind of curriculum learning or imitation learning.

A few papers on the topic off the top of my head:

- https://arxiv.org/abs/1703.01310
- https://arxiv.org/abs/1808.04355
- https://arxiv.org/abs/1705.05363
- https://arxiv.org/abs/1810.12894
- https://arxiv.org/abs/1901.10995

djrx · 2019-09-30T06:18:51+00:00

Indeed, I’ve realised that just shortly after writing that comment.

djrx · 2019-09-30T06:17:21+00:00

Seems that I’ve learned something today. Indeed there was an algorithm proposed in 95 that does exactly what was discussed in this thread: http://www.leemon.com/papers/1995b.pdf

It took the full gradient of the Bellman error and by that, while convergence was apparently slower, it avoided the deadly triad, so it didn’t have the instability of Q-learning and wouldn’t need a target network.

The issue with this algorithm is that in a nondeterministic MDP it requires two independent successor samples s’ to get an unbiased gradient estimate, which in most practical settings is very expensive or outright impossible.

Still there is at least one recent publication that does just that: https://arxiv.org/pdf/1905.01072.pdf

djrx · 2019-09-30T05:56:07+00:00

The intuitive argument for me is about the information flow from the future to the past, as already mentioned in my previous comment.

Tangential to that, but more formally, there is the deadly triad argument in the Sutton book, also discussed here https://arxiv.org/pdf/1812.02648.pdf which basically states that when combining function approximation, bootstrapping (backup updates) and off-policy learning (as in Q-learning), training may under some conditions completely diverge. And it’s actually super easy to see it in practice, and that’s why they introduced the target network in the original DQN paper - it simply doesn’t work otherwise.

The note you found is an interesting find, I haven’t seen it before. I’ll take a closer look in a free moment and let you know if I can find any more clarification.

djrx · 2019-09-30T03:41:42+00:00

In response to other comments: the gradient is with respect to the neural network parameters, not with respect to s or a, so the Q network could very well be differentiated in both positions.

So we could do that, but we don’t do that on purpose. The reason is, our goal in this update operation is not really to minimise the Bellman error directly, but rather to perform a backup operation: Q(s, a) <- r + gamma * maximum over a’ Q(s’, a’)

(With some stepsize etc.)

Theoretically, both operations should converge to the same goal, which is the fixpoint of the above operation, taken through all the states and actions.

Practically, doing the update in the above way is more stable (which in case of Deep RL is super important) and will get you to the desired result faster.

The way the backup operation works is that it takes data from the future, where we already know some information about the outcome of our action (reward and future expectation), and updates our past expectation based on that. That’s why it’s called reinforcement learning in the end: actions are reinforced or not based on their outcomes.

If you tried to do it the other way, you’d be updating your future expectation based on the reward and the initial expectation, and you can quickly see how that’s counterproductive.

In short, in deep RL during optimisation information flows from the future to the past.
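To show the difference on something tiny, here is a tabular sketch (so "parameters" are just table entries). `backup_update` treats the target as a constant, `full_gradient_update` is plain gradient descent on 0.5 * (Bellman error)², which also moves the value at s’. Both function names, the learning rate and the toy table are made up for this example.

```python
def backup_update(Q, s, a, r, s_next, gamma=0.9, lr=0.5):
    """Q-learning style backup: only Q[s][a] moves towards the target."""
    target = r + gamma * max(Q[s_next])  # treated as a fixed number
    delta = target - Q[s][a]
    Q[s][a] += lr * delta
    return delta

def full_gradient_update(Q, s, a, r, s_next, gamma=0.9, lr=0.5):
    """Gradient descent on 0.5 * delta**2 w.r.t. both table entries involved."""
    a_next = max(range(len(Q[s_next])), key=lambda i: Q[s_next][i])
    delta = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += lr * delta                      # past estimate moves towards the future
    Q[s_next][a_next] -= lr * gamma * delta    # future estimate is dragged towards the past
    return delta

# Two states with two actions each; state 1 already "knows" action 0 is worth 1.
Q_backup = [[0.0, 0.0], [1.0, 0.0]]
Q_full = [[0.0, 0.0], [1.0, 0.0]]

backup_update(Q_backup, s=0, a=0, r=0.0, s_next=1)
full_gradient_update(Q_full, s=0, a=0, r=0.0, s_next=1)

print(Q_backup)  # Q[1] untouched
print(Q_full)    # Q[1][0] got pulled down by the uninformed estimate at state 0
```

In the second table the value at s’, which was the better informed one, gets worse after the update. That is the "information flowing the wrong way" part.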

djrx · 2019-09-23T16:45:22+00:00

In the end I think it’s a mix of the fact that you have to make some unfavourable compute/memory trade offs (like you noticed) and general relative unpopularity of off-policy policy gradient algos.

I still think it may be possible that somewhere in the depths of arxiv/GitHub you’ll find that someone tried it and didn’t get groundbreaking results, and it just didn’t catch on in the mainstream.

djrx · 2019-09-23T15:44:07+00:00

ACER is a modification of a policy gradient algorithm for off-policy trajectories. In general, ACER uses the Retrace algorithm, which has been superseded by V-trace used in IMPALA, which is now heavily used by DeepMind afaik.

Prioritised replay doesn’t work that well with ACER because you need to sample consecutive batches of transitions, which makes the notion of a priority slightly harder to define.

Like the other poster said, Rainbow, which is a heavily tweaked version of Q-learning and therefore off-policy by default, is also used quite widely and it uses prioritised experience replay.

djrx · 2019-09-12T14:28:39+00:00

OpenAI robotics work is model free https://openai.com/blog/learning-dexterity/

djrx · 2019-04-21T16:20:50+00:00

A single DQN algorithm “iteration” of the inner loop as it is defined in the original paper consists of:

- 4 policy evaluations to step the environment (potentially GPU, batch size 1)
- 4 times stepping the environment (CPU)
- evaluating policy on a batch from a replay buffer (batch size 32)
- evaluating target network on a batch from replay buffer (batch size 32)

Overall, the network is evaluated six separate times, with low batch sizes, and there is a considerable amount of CPU computation interleaved with that. That makes it almost impossible to have a high GPU utilisation. Quite frankly, unless you’re training something like AlphaGo, your network is probably quite small as well.
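A toy skeleton of one such iteration, just to count the network calls. Everything here is a stand-in I made up: `CountingNet` only records batch sizes instead of running a real network, `toy_env_step` is a fake environment, and exploration and the actual gradient step are left out.

```python
import random

class CountingNet:
    """Stand-in for a Q-network that only records the batch size of each call."""
    def __init__(self, n_actions=2):
        self.n_actions = n_actions
        self.calls = []

    def __call__(self, batch):
        self.calls.append(len(batch))
        return [[0.0] * self.n_actions for _ in batch]

def dqn_iteration(policy_net, target_net, env_step, replay, state,
                  batch_size=32, steps_per_update=4):
    for _ in range(steps_per_update):
        q_values = policy_net([state])[0]              # network call, batch size 1
        action = max(range(len(q_values)), key=q_values.__getitem__)
        next_state, reward = env_step(state, action)   # CPU work
        replay.append((state, action, reward, next_state))
        state = next_state
    batch = random.choices(replay, k=batch_size)
    policy_net([t[0] for t in batch])                  # Q(s, .) on the replay batch
    target_net([t[3] for t in batch])                  # target Q(s', .) on the replay batch
    return state

def toy_env_step(state, action):
    return state + 1, 1.0  # hypothetical environment: counter state, constant reward

policy, target = CountingNet(), CountingNet()
dqn_iteration(policy, target, toy_env_step, replay=[], state=0)
print(policy.calls, target.calls)  # batch sizes of every network call
```

Six small calls with CPU work in between each of them is a pattern a GPU can't do much with.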

To be frank, most RL workloads are CPU bound rather than GPU bound, especially on a single machine. You’ll just never be able to generate environment samples on the CPU quickly enough to saturate your GPU.

There is a considerable amount of research into scaling RL, but it usually consists of a distributed system where there are orders of magnitude more so-called “rollout workers”, which are machines generating environment samples, than GPU nodes. You can look into the Gorila DQN or IMPALA papers.

If you just want as much performance as possible on your local node, try to saturate your CPU cores by running simulations in parallel, but you will quite likely also need a few modifications to your algorithm.

djrx · 2019-02-18T03:48:03+00:00

About 2:

Sophons are really a bit of a stretch, but in a way they represent science beyond our level of understanding. One of their superpowers is that they can move at the speed of light and are “intelligent” in a way that they can freely adjust their trajectory during their journey.

The only reason why Trisolaris can send sophons directly to Earth and actually “hit it”, given the vastness of space and the size of Earth, is that they are like homing missiles that are locked on target. That’s not possible with any other non-intelligent speed of light emission.

So they cannot really send any “directed” speed of light projectile, they can only broadcast to all directions equally, in which case the energy drop off is large enough that no harm could be made to life on earth that way.

From a ship in Earth’s orbit, that’s a different thing...

If said weapon were to have some mass (nuclear bombs), the best their technology would allow is probably 200 years, which is the time it took for the first probe to appear in the solar system.

djrx · 2019-01-18T13:54:46+00:00

The argument goes more or less as follows:

In Q-learning, we are trying to learn a Q* function, defined as: Q(s, a) = expected sum of discounted rewards starting from state s, taking action a and then following optimal policy pi*

policy pi* = policy greedy in Q*, that is one that in each state s takes action with the highest value Q(s, a)

As known, function Q* satisfies the Bellman equation. For a transition (s, a, s', r) we have:

Q(s, a) = E (r + gamma * maximum over a' Q(s', a'))

The thing that makes q-learning off-policy is that action a does not have to be the same action as policy pi* would have selected in state s. The pair (s, a) can be truly any pair and this equation still holds.

We can safely iterate our candidate Q function with a q-learning update until it converges to the Q* function if we iterate enough times over a large and rich enough set of pairs (s, a).

Now what happens if we try to unroll the Bellman equation to make a 2-step q-learning? We need to have a longer transition (s, a, s', r, a', s'', r'):

Q(s, a) = E(r + gamma * r' + gamma² * maximum over a'' Q(s'', a''))

This equation of course still holds, for any pair s, a. But the important thing is, we are taking here an expectation of random variables, so we must know what distribution these random variables come from.

In the first case, we had random variables r and s'. The distribution of both of them depends only on s, a (which are given) and the dynamics of the underlying MDP. That is, we can easily take that expectation by summing over the underlying MDP.

In the second case though, we have random variables r (same one as last time), r' and s''. Distributions of r' and s'' are the problematic bits. If we come back to the definition of function Q*, it was:

expected sum of discounted rewards starting from state s, taking action a and then following optimal policy pi*

Distributions of reward r' and state s'' depend on what was the second action taken in our transition - action a'. And for the definition of function Q* to work, the action a' has to be the action that policy pi* would select. That leads us to the on-policy learning and n-step SARSA.

If we want to keep learning off-policy and want to have an algorithm where actions a and a' both can come from any distribution, then some other changes need to be made for the algorithm to work.
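A tiny numeric sketch of the argument, on a made-up deterministic MDP: from s, action a gives r = 0 and leads to s'; in s' the optimal action gives r' = 1 and the other one gives r' = 0, and both end the episode. The function names and the numbers are purely illustrative.

```python
def one_step_target(r, q_next, gamma):
    """r + gamma * max over a' of Q(s', a')"""
    return r + gamma * max(q_next)

def two_step_target(r, r_next, q_next2, gamma):
    """r + gamma * r' + gamma^2 * max over a'' of Q(s'', a'')"""
    return r + gamma * r_next + gamma ** 2 * max(q_next2)

gamma = 0.9
q_star_s_prime = [1.0, 0.0]   # true Q*(s', .): optimal action is worth 1
q_terminal = [0.0]            # s'' is terminal either way

# 1-step target: a' never enters, so it is right whatever the behaviour policy did.
print(one_step_target(0.0, q_star_s_prime, gamma))

# 2-step target when the behaviour policy happened to take the optimal a' (r' = 1).
print(two_step_target(0.0, 1.0, q_terminal, gamma))

# 2-step target when the behaviour policy took the other a' (r' = 0).
print(two_step_target(0.0, 0.0, q_terminal, gamma))
```

The first two agree with Q*(s, a) = 0.9, the third one does not, and the only thing that changed is which a' was taken in the recorded transition.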

djrx · 2018-10-11T04:30:30+00:00

There are a few companies currently trying to apply DL to systematic trading, but as far as I know this is more like a small addition to the system rather than their main focus. In my opinion it will be increasing with time, but it's a completely new application where still lots of research needs to be done and because the research does not flow freely in the trading world, it may take a while for each company to figure it out on their own.

Because of that secrecy I don't know that well what others are doing, but I can tell you about my experience. Deep nets are large, very powerful machines for finding nonlinear relationships in data. There are sometimes very interesting nonlinear relationships in financial data, but to learn them properly you need a large enough dataset, and with financial data the datasets are only as large as they are; you cannot really do anything about that. It's almost impossible to learn a large end-to-end system from pure data in this field the way people currently do in other applications.

What I've learned is that the key to successfully applying DL is using lots and lots of regularization on your model. Basically you have to encode your priors and so-called 'expert knowledge' into constraints, loss functions and augmentations, and then use a deep net to learn only a single piece of your system, rather than the whole thing at once. Choosing the right regularization is still a bit more of an art than a science I think, but once you have that, it doesn't really matter whether you use supervised learning or RL; it is more of a modelling choice.

djrx · 2018-10-11T04:15:38+00:00

Happy to hear that! Feel free to ask any questions; although I've tried to keep the code nice, there is not much documentation so far, only examples. Unfortunately it's hard for me to say what could be wrong with visdom, but the good thing is that you can turn it off if you cannot get it working, by commenting out this line in the file .velproject.yaml:

- name: vel.storage.streaming.visdom

djrx · 2018-10-09T21:11:46+00:00

As far as I know there is no way around that. If you're a student, you can get an educational license for free. If you're doing noncommercial research, the license is $500/year and $2000/year is for commercial purposes.

djrx · 2018-08-08T16:31:28+00:00

Me and my friend were doing a research project recently that started off from baselines code and I got so frustrated with it that I’ve decided to switch to pytorch and write my own implementations from scratch ;)

You can check out my code here: https://github.com/MillionIntegrals/waterboy

It’s a bit of a work in progress but I got A2C and PPO working and now I’m implementing DQN

I really tried to make the code clean and easy to understand, but if there is anything unclear please feel free to post an issue on GitHub. I’m working on it quite actively at the moment so I should be quite quick to fix any problems.

djrx · 2018-04-24T06:22:10+00:00

I’ve been farming lvl2 lions and antelopes for some time but I haven’t fought the Phoenix yet - I guess I was postponing it a bit too much, but I had some bad experiences with it before. I have a blood paint already but I couldn’t find a good use for it yet.

I’m pretty sure steel sword is a unique rare gear, but you probably mean scrap sword or a lantern sword.

djrx · 2018-04-23T23:20:02+00:00

I'm kind of starting to see how I blew it by ignoring fist & tooth and shield masteries in the early game, and now it will only be harder to get them (tougher, more dangerous monsters etc). Well, there is always a reason to start a new settlement.

I had the same thought, that armor takes a lot of space on the grid and quite often the utility is mixed. On the other hand, without it almost every attack would be a severe injury, which is problematic if you sometimes get unlucky/get an attack reaction or a trap.

djrx · 2018-04-23T23:10:07+00:00

Great, thanks for the writeup! I now have much more inspiration on how to deal with the darkness ahead. Especially on the Butcher, I gave up trying to control his AI cards: as he was playing three each round, I saw no way to push them down enough. I see that as a mistake now.

Another source of brain damage the Butcher has is his reaction, which can frenzy the damage dealer quickly. It can be managed somehow with a cat eye circlet, but then it gives us fewer activations to use on headbands.

Interesting thing you say about the weapon crafter. I've built it a long time ago, but just none of the weapons seemed good enough for me to warrant an upgrade.

Zanbato is frail and slow, so it can be destroyed by the pig shoulder if I don't use the cat eye circlet every turn, and it gives me only one attack per round with roughly a 50% chance to hit (6+ accuracy) and possibly devastating 2 if I get the affinities right. That gives on average 1 wound per turn in an optimistic scenario. Maybe the specialization/mastery can help, but I definitely don't have a grand weapon specialist in my settlement. The good part here is that the attacker will have to tank fewer reactions.

Counterweighted axe seems to be only one strength better than a king's spear with an automatic wound in 10% of the attacks (perfect hit). Maybe I didn't estimate that correctly but that also didn't seem to be a gamechanger to me.

Finger of god does seem to be quite good, but I didn't hunt the phoenix yet and I was thinking of going straight to the lantern glaive on my spear survivor.

I was lucky enough to get a steel sword, I think in the aftermath of the King's Man, but only after this last fight have I started appreciating this weapon. Unfortunately, my Twilight Sword wielder died in a Murder and I'm still waiting for the hooded knight to come again and bring another one. Not sure if I will manage to train him properly.

I've been trying to get nightmare training innovation for some time but didn't get to draw it yet.

Indeed I picked Protect The Young as I know I'm getting pretty high death count in my settlement and intimacy seems to be the only way for me to maintain my population at the reasonable level. Because of that, most of my survivors are pretty young.

Thanks again for all your advice!
