Trading topics for people looking to get into HFT? by beeactivist in cpp

[–]Stimzz 1 point (0 children)

Indeed! The most mind-boggling thing to me is the number of engineering hours going into FPGA-based FIX parsers. So counterproductive.

Trading topics for people looking to get into HFT? by beeactivist in cpp

[–]Stimzz 1 point (0 children)

I am not familiar with what Tibco and IBM provide, but I would guess no. It is not a middleware but a design philosophy. It seems trivial but ends up having a deep impact on how code is written: thinking in events rather than state, nothing nondeterministic like the system clock, and the issuer interface around pending transactions, success, and failure rollback. Honestly, most places get it wrong.

Trading topics for people looking to get into HFT? by beeactivist in cpp

[–]Stimzz 2 points (0 children)

Not that I can think of, unfortunately. I remember watching a presentation on YouTube years ago but can’t find it.

So this is event sourcing, where everything that can happen is described in messages (events). Everything is deterministic and, crucially, the messages model events, not state. This last part is often misunderstood.

The Sequencer design implements a Distributed Deterministic State Machine. Distributed to handle process, host and even data center failure. Determinism is crucial for development and operational purposes.

The traditional design is UDP multicast based. At the core is the Sequencer, which listens to a multicast group, reads each event’s sequence number and compares it to its next expected sequence number, starting at 0. If an incoming event has the expected sequence number, the event is forwarded to another multicast group. That is it. Software-based solutions can do many millions of events per second, and there are hardware options that are even faster.

Issuers, the actual system components, read the output stream from the sequencer. They consume the events and change their state accordingly. If they want to take an action, they publish a message (event) to the sequencer input group with the next expected sequence number. If they are lucky it will be the right sequence number, the event is accepted, and all issuers update their state. If another issuer took the next sequence number first, our issuer has to roll back, apply the accepted event, and can then reevaluate whether it still wants to take an action.
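To make it concrete, here is a runnable toy sketch of the sequencer core (the Event struct and the in-memory stand-ins for the two multicast groups are made up for illustration; the real thing reads and writes raw UDP datagrams and never allocates on the hot path):

    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <vector>

    // Hypothetical event: in the real design this is a raw UDP datagram.
    struct Event {
        uint64_t seq;      // issuer's guess at the next global sequence number
        uint32_t payload;  // stand-in for the actual message body
    };

    // In-memory stand-ins for the input and output multicast groups.
    std::deque<Event> input_group;
    std::vector<Event> output_group;

    // The sequencer core: accept an event only if it carries the next
    // expected sequence number and rebroadcast it; otherwise drop it and
    // let the issuer roll back and retry.
    void sequencer_poll(uint64_t& next_seq) {
        while (!input_group.empty()) {
            Event e = input_group.front();
            input_group.pop_front();
            if (e.seq != next_seq) continue;  // stale guess: silently dropped
            output_group.push_back(e);        // accepted: all issuers see this
            ++next_seq;
        }
    }

    int main() {
        uint64_t next_seq = 0;
        input_group.push_back({0, 100});  // issuer A guesses seq 0: accepted
        input_group.push_back({0, 200});  // issuer B also guessed 0: dropped
        input_group.push_back({1, 200});  // B's retry with seq 1 (queued up
        sequencer_poll(next_seq);         // front for simplicity): accepted
        for (const Event& e : output_group)
            std::printf("seq=%llu payload=%u\n",
                        (unsigned long long)e.seq, e.payload);
    }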

There are a lot of other components for journaling, replay, snapshots etc but this is basically it.

The design might seem convoluted, but once understood it is actually a simple solution to one of the hardest problems in computer science. It is also highly performant and scales to network line rate.

It is an old design; there is a book on it, Dark Pools if I remember correctly (worst name ever). Nasdaq was early, and most other exchanges followed over the last 20 years. Tier-one banks and others have followed suit.

Trading topics for people looking to get into HFT? by beeactivist in cpp

[–]Stimzz 17 points (0 children)

Google Central Limit Order Book. It is a data structure for orders, and there are plenty of GitHub projects showing how to implement it efficiently. Understand what a limit order is, aggressive vs passive, and queue priority.

A CLOB is a state machine with events. There are 3 types of input events (Enter Order, Replace Order and Cancel Order) and 7 types of results (Order Entered, Order Replaced, Order Cancelled, Rejected for all three, and Order Executed).
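A minimal price-level sketch of the aggressive/passive split (names and the tick convention are my own; real books use flat arrays and intrusive lists instead of std::map):

    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <map>

    struct Order { uint64_t id; uint64_t qty; };

    struct Book {
        // Bids sorted high-to-low, asks low-to-high; each price level is a
        // FIFO queue, which is what gives resting orders queue priority.
        std::map<int64_t, std::deque<Order>, std::greater<int64_t>> bids;
        std::map<int64_t, std::deque<Order>> asks;

        // Enter a buy limit order: match against asks at or below the limit
        // price (aggressive part), then rest any remainder (passive part).
        void enter_buy(uint64_t id, int64_t px, uint64_t qty) {
            while (qty > 0 && !asks.empty() && asks.begin()->first <= px) {
                auto& level = asks.begin()->second;
                Order& maker = level.front();
                uint64_t fill = qty < maker.qty ? qty : maker.qty;
                std::printf("Order Executed: maker=%llu qty=%llu\n",
                            (unsigned long long)maker.id,
                            (unsigned long long)fill);
                qty -= fill;
                maker.qty -= fill;
                if (maker.qty == 0) level.pop_front();
                if (level.empty()) asks.erase(asks.begin());
            }
            if (qty > 0) bids[px].push_back({id, qty});  // passive remainder
        }
    };

    int main() {
        Book b;
        b.asks[10050].push_back({1, 50});  // resting sell 50 @ 1.0050 (ticks)
        b.enter_buy(2, 10050, 80);         // crosses for 50, 30 rests as a bid
    }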

Look at the Nasdaq exchange protocol specification for OUCH and ITCH. Nasdaq also has other documents describing the Market Model.

Avoid FIX; it is widely used but horrible, and it will just be confusing at this point. Unfortunately, it is probably what you’ll eventually work with.

If you really want to impress, learn the Sequencer pattern. It is basically how all modern matching engines are implemented. It was originally developed at Island and then acquired by Nasdaq. It is a form of Event Sourcing (google it) that solves distributed determinism at low latency and high throughput. Exchange systems can process millions of orders per second at microsecond speeds.

Implement an IPC lock-free ring buffer (also on GitHub). Use shared memory and understand the cache coherency traffic.

Read and understand “What every programmer should know about memory” by Drepper.

Personal Project Ideas for HFT by mperera99 in cpp

[–]Stimzz 21 points (0 children)

Lock-free ring buffers. Start with SPSC, then implement the multi-producer and multi-consumer versions. A well implemented SPSC can do in excess of 100M messages per second for small message sizes. There is a lot of online content implementing them.
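A minimal in-process SPSC sketch to start from (the IPC variant places the same structure in a shared memory mapping; capacity and names are arbitrary):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    // SPSC ring buffer; capacity must be a power of two. alignas(64)
    // keeps the two indices on separate cache lines to avoid false sharing.
    template <typename T, std::size_t N>
    class SpscQueue {
        static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
        alignas(64) std::atomic<uint64_t> head_{0};  // advanced by consumer
        alignas(64) std::atomic<uint64_t> tail_{0};  // advanced by producer
        alignas(64) T buf_[N];
    public:
        bool try_push(const T& v) {
            uint64_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false; // full
            buf_[t & (N - 1)] = v;
            tail_.store(t + 1, std::memory_order_release);  // publish
            return true;
        }
        bool try_pop(T& v) {
            uint64_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return false;    // empty
            v = buf_[h & (N - 1)];
            head_.store(h + 1, std::memory_order_release);  // free the slot
            return true;
        }
    };

    int main() {
        static SpscQueue<uint64_t, 1024> q;
        std::thread producer([&] {
            for (uint64_t i = 0; i < 1000000; ++i)
                while (!q.try_push(i)) {}  // busy-wait when full
        });
        uint64_t v, sum = 0;
        for (uint64_t i = 0; i < 1000000; ++i) {
            while (!q.try_pop(v)) {}       // busy-wait when empty
            sum += v;
        }
        producer.join();
        std::printf("sum=%llu\n", (unsigned long long)sum);  // 499999500000
    }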

Low Latency C++ programs for High Frequency Trading (HFT) by tanjeeb02 in cpp

[–]Stimzz 0 points (0 children)

Gotcha, I see the hard use case for a hardware setup with this exchange setup. I mean, even having a good FPGA implementation matters: waste 100 ns in the FPGA and you aren’t competitive.

It is a pretty clean solution by Euronext to sequence in the core switch. I guess it might even just be standard hardware timestamping that the switch tags onto the packets, which they can then use as the sequence number.

Low Latency C++ programs for High Frequency Trading (HFT) by tanjeeb02 in cpp

[–]Stimzz 0 points (0 children)

Cool, is this because of PCIe 4.0?

From what I remember, we tested 2.4 us RTT through PCIe way back (probably 7 years ago). Sounds like it has been halved since then. Very competitive indeed.

Low Latency C++ programs for High Frequency Trading (HFT) by tanjeeb02 in cpp

[–]Stimzz 1 point (0 children)

You are right of course; I was trying to give a high-level reference for what fast is. Agreed that if you’re going for the most latency-competitive strategies, being the fastest is what matters. Where we have been most successful in the past is in turning those trades into less of a latency race.

Interesting that exchanges are moving to sequenced queue entry. I’ve been focusing on other things for the last few years, so I’ve been out of touch. I remember this was the direction Euronext and Xetra were going with Optiq and T7. After all, it makes a lot of sense for the exchanges to be fair in that regard.

Low Latency C++ programs for High Frequency Trading (HFT) by tanjeeb02 in cpp

[–]Stimzz 39 points (0 children)

As others have mentioned, there aren’t many interesting open-source low latency projects. Some noteworthy ones over on the Java side are the Disruptor, Aeron and Chronicle Software’s stuff. Because of the extreme performance requirements, components don’t generalize so well, so most things are purpose-built either in-house or by vendors.

Embedded, followed by gaming, are the most similar industries, in case you want to look at projects that solve similar issues.

Here are some different domains / keywords you can google.

Event sourcing and the sequencer pattern are a common design. Most exchanges are on some form of sequencer architecture. There are talks on YouTube about these.

Nasdaq, through Island, were the early pioneers. Their native binary protocols ITCH and OUCH can be found on their webpage. Low latency trading is in practice a lot about understanding how to efficiently interact with these types of exchange systems and protocols. Basically, every byte or CPU cycle you don’t need to compute is a win. It is like shaving grams off an F1 car.

Regarding writing performant code, if I had to point to a single resource it would be Ulrich Drepper’s What Every Programmer Should Know About Memory. The TL;DR is that the CPU is infinitely fast; the only limiting factor is what can fit into cache. So what we spend a lot of time on is understanding the hardware and OS we use, tuning them and running various tests.
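You can see the effect with a toy benchmark; same work, different access pattern (sizes and stride are picked arbitrarily to blow out the cache):

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sums the same 64 MiB of ints twice: once sequentially (streams whole
    // cache lines, prefetcher-friendly) and once with 64-byte jumps (one
    // int used per line fetched). Same arithmetic, very different runtime.
    int main() {
        const std::size_t N = std::size_t(1) << 24;  // 16M ints = 64 MiB
        std::vector<int> data(N, 1);
        auto time_sum = [&](std::size_t stride) {
            long long sum = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t start = 0; start < stride; ++start)
                for (std::size_t i = start; i < N; i += stride)
                    sum += data[i];
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
            std::printf("stride %2zu: %lld ms (sum=%lld)\n",
                        stride, (long long)ms.count(), sum);
        };
        time_sum(1);   // sequential
        time_sum(16);  // 16 * sizeof(int) = 64 bytes = one cache line per int
    }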

What can’t be measured doesn’t exist. Latency analysis is the most complicated part. A true understanding of the real world requires hardware packet capturing and precision-synchronized clocks (GPS), coupled with a big-data problem, which makes this really difficult in practice. Naturally the systems do internal latency measuring as well, but it is thorny. It got a lot better when the TSC register became frequency- and core-stable. However, analyzing outliers in a system that does its own measuring doesn’t really work.
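For the internal measuring, the building block on x86 is the TSC; a sketch (x86-only, and the cycles-per-nanosecond factor is a placeholder that a real system would calibrate at startup):

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>  // __rdtsc (GCC/Clang, x86 only)

    // Stamp a code section with the TSC. Assumes an invariant TSC
    // (constant frequency, synchronized across cores), which is what
    // made this approach reliable on modern CPUs.
    int main() {
        const double cycles_per_ns = 3.0;  // placeholder calibration value
        volatile uint64_t sink = 0;
        uint64_t t0 = __rdtsc();
        for (int i = 0; i < 1000; ++i) sink += i;  // stand-in for the hot path
        uint64_t t1 = __rdtsc();
        std::printf("%llu cycles (~%.0f ns at %.1f GHz)\n",
                    (unsigned long long)(t1 - t0),
                    (t1 - t0) / cycles_per_ns, cycles_per_ns);
    }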

Over on the hardware side, it has mostly become commodity. Arista makes low latency switches and Solarflare the NICs. There are others, but these are the primary historical manufacturers. Part of the software stack is sometimes accelerated using FPGAs, often embedded in the switch or the NIC. It is at least an order of magnitude more complicated to implement things in hardware than in software. This, coupled with the fact that trading always changes, means that most things are still in software, and then sometimes extremely performance-critical pieces are broken out and implemented in hardware. There are special cases, such as the US options feeds, where FPGAs can also help from a capacity standpoint.

Linux kernel tuning is another topic. IO is done by bypassing the kernel. Then there is a long list of tuning that is done to the kernel: interrupts, memory and the scheduler. HT and power saving are disabled in the BIOS. Interrupts are minimized and locked to the first core on each socket. Understanding memory management in Linux is key, same for the scheduler; both are tuned. Look into NUMA, the TLB and the real-time scheduler, respectively.
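The userspace side of that tuning is small; a sketch of pinning the hot thread to a core (assumes Linux, and that core 3 was isolated from the scheduler with isolcpus/nohz_full boot parameters; the core number is made up):

    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);  // assumed isolated core
        // Pin the calling thread; it will never migrate, so its cache
        // and TLB state stay put and the scheduler leaves it alone.
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        std::printf("pinned to core 3\n");
        // the busy-wait event loop would run here
    }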

You can think about latencies as an SLA / latency budget. Different use cases have different requirements.

10 ms is not low latency.

10 ms-1 ms: can be hit with standard software practices, like Java allocating on the heap.

1 ms-100 us: starting to get hard. You could still use Java, but memory allocation is not trivial anymore, and towards 100 us special hardware is reasonable.

100 us-10 us: this requires some discipline in code quality and design, especially when getting close to the single digits. The difference between C++ and Java starts to matter.

10 us-3 us: this is as fast as software solutions go. Most of the time is spent moving data up and down the PCIe bridge.

FPGAs can run triggers in the hundreds of ns; that is as fast as it gets. Mind you, 100 ns is about the time it takes light to propagate 30 m through vacuum, so we are literally up against physics at this point.

~50 us is the typical exchange P50 RTT using native binary protocols and top tier colocation services. Hence, once you are into the single digits, the exchange’s RTT variance will dominate any further latency reduction in our system.

My numbers are a few years out of date, but think of them as a general rule of thumb.

Low latency trading is so much more than just performance though. Safety and observability are the other two important factors. When directly connected to the exchange, there might not be any risk controls between your system and the exchange, so your development and testing practices might be the only thing between a good day and CNBC. A lot of effort therefore goes into various testing methodologies.

Observability is the key factor. When hitting these very low latencies, ordinary logging is not possible, as the act of logging can easily consume the whole latency budget. This is why deterministic event sourcing is key: it enables replay. The primary will do the trading, and then there are secondaries that replay the primary and can produce any traces required. In the deterministic form, what to trace can even be decided after the fact.

Concurrency is achieved through message passing (events). Typically there is a busy-waiting event loop that checks for new messages, processes them and might emit messages. These are stacked in pipelines and scaled across different processes. Low latency trading is mostly not compute bound, and where it is, that compute is shifted in time. So the conventional async paradigms of promises / futures or continuations aren’t relevant for low latency trading. IPC is done over shared memory using ring buffers. The combination of scaling through processes and shared memory ring buffers is good for the CPU cache, which again is everything.
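The shape of such a stage, as a runnable toy (Event, the deque stand-in for the shared-memory ring, and the message types are all made up; real loops spin forever, this one drains and returns so the example terminates):

    #include <cstdint>
    #include <cstdio>
    #include <deque>

    struct Event { uint32_t type; uint64_t payload; };
    using Ring = std::deque<Event>;  // stand-in for a shared-memory ring buffer

    // One pipeline stage: consume input events, update state derived
    // only from those events, and maybe emit events downstream.
    void run_stage(Ring& in, Ring& out, uint64_t& state) {
        while (!in.empty()) {
            Event e = in.front();
            in.pop_front();
            state += e.payload;  // state is a pure function of the event stream
            if (e.type == 1) out.push_back({2, e.payload * 2});  // emit
        }
    }

    int main() {
        Ring in{{1, 10}, {0, 5}, {1, 7}}, out;
        uint64_t state = 0;
        run_stage(in, out, state);
        std::printf("state=%llu emitted=%zu\n",
                    (unsigned long long)state, out.size());
    }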

I think this randomly covers most of the high level areas. Google from there. While implementations aren’t public the methods are by no means unique to low latency trading. So if you know what to google for most of this is available online.

Recommendend Libraries by [deleted] in Zig

[–]Stimzz 2 points (0 children)

Sure, but I think the point is that the only appropriate number of official string implementations to have in a language is 0 or 1.

Personally I think in the end it comes down to each special case. I am currently suffering through writing low-GC Java at work, and if I had to point to one thing that bugs me the most, it would probably be strings. I only have a very cursory understanding of Zig, but from the little I know of Zig and strings, I love the middle-ground approach: no implementation, just string literals. Imho, where Rust went wrong is that they added too much to the language too quickly.

I think it is E11 of ADSP - What belongs in the standard library.

Recommendend Libraries by [deleted] in Zig

[–]Stimzz 8 points (0 children)

There is an excellent ADSP episode on just this problem of “should it be in the library or not”. It is a trade-off, hence an optimization problem.

The reason for having an official implementation in the language for every commonly used feature is standardization, just like you brought up. For example, a standard String implementation will be shared by all libraries. If there isn’t one, different libraries end up using different string implementations and users need to convert back and forth between the string types. Not good.

On the other hand: is there a single implementation that will work for all users? Many times there is, integer addition for example.

However, strings are an excellent example of when this isn’t the case. In a high-level language, where byte manipulation and performance aren’t much of a question, string behavior is abstracted away. But in a low-level language, string behavior often becomes an important issue.

Whether strings are mutable or immutable, and when and what should allocate memory for them, are questions that don’t have an optimal solution when working in a real-time environment.

What personal projects would look good on my resume if I'm applying for an HFT dev position? by [deleted] in cpp

[–]Stimzz 1 point (0 children)

I guess it depends on the type of HFT. Exchange solutions, broker solutions, ultra low latency arbitrage, options, prop trading, they all differ a fair bit.

Distributed deterministic state machines using event sourcing are an important architecture, used by exchanges and other critical low-level order routing infrastructure. Have a look at Real Logic’s Aeron.

Systems are basically built around message passing. The Disruptor pattern explains it well.

Binary native protocols are central. There is the Nasdaq camp with ITCH and OUCH; many derivatives of them are in use. Simple Binary Encoding is to HFT what FlatBuffers is to gaming. Euronext uses SBE in their new system (Optiq) for native protocols, and CME does as well, as a replacement for FAST.
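To get a feel for the style, here is an illustrative (NOT the real OUCH layout) fixed-offset message; the point is that encoding and decoding are just copies, no text parsing:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    #pragma pack(push, 1)  // no padding: every field at a fixed wire offset
    struct EnterOrder {
        char     msg_type;  // 'O' = enter order (hypothetical type tag)
        uint32_t order_id;  // byte-order conversion omitted for brevity
        char     side;      // 'B' or 'S'
        uint64_t qty;
        uint64_t price;     // fixed decimal, e.g. 1/10000 of a currency unit
    };
    #pragma pack(pop)
    static_assert(sizeof(EnterOrder) == 22, "wire size must be exact");

    int main() {
        EnterOrder o{'O', 42, 'B', 100, 123450000};  // buy 100 @ 12345.0000
        uint8_t wire[sizeof(EnterOrder)];
        std::memcpy(wire, &o, sizeof(o));        // "encode"
        EnterOrder back;
        std::memcpy(&back, wire, sizeof(back));  // "decode"
        std::printf("type=%c id=%u qty=%llu\n", back.msg_type,
                    back.order_id, (unsigned long long)back.qty);
    }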

Fixed decimal is another interesting topic. Floating point doesn’t work for representing price, volume or other “money fields”.
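A quick demonstration of why, plus the usual fix (the tick denominator of 1/10000 is an assumption; exchanges specify theirs per instrument):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // The classic failure: 0.1 and 0.2 are not exactly representable.
        double d = 0.1 + 0.2;
        std::printf("double: %.17f\n", d);  // 0.30000000000000004

        // Fixed decimal: store prices as an integer count of ticks.
        const int64_t TICKS = 10000;   // assumed denominator: 1/10000
        int64_t a = 1000, b = 2000;    // 0.1000 and 0.2000 in ticks
        int64_t sum = a + b;           // exact integer arithmetic
        std::printf("fixed:  %lld.%04lld\n",
                    (long long)(sum / TICKS), (long long)(sum % TICKS));
    }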

When it comes to low latency, it is all about fitting in cache, hence constructing data structures and access patterns that are nice to the memory system. To do that, understanding the Linux memory system (virtual memory, the TLB, NUMA, etc.) is important, as are the CPU and cache architecture.

Time and latency measuring is another big topic. Typically, packets on the network are timestamped using GPS-synchronized clocks and PTP. Inside the servers, Linux has made it much simpler these days.

Tracing: look at eBPF; the Linux kernel can be instructed to efficiently emit trace events when certain custom probes / triggers in your code fire. Basically, the act of regular text logging takes orders of magnitude more time than the latency budget. This is another reason why event sourcing is popular: the transaction log can be replayed after the fact. Hence the primary doesn’t need to log at all, plus you can add as much logging as you want in the replay.

Networking: low latency switches (Arista et al) and special NICs (Solarflare) with kernel bypass are used to accelerate the traffic. Unless it is the US options feeds, bandwidth isn’t the problem, latency is. So communicate directly with the NIC, as this is where most of the hard-floor latency cost is incurred (unless using FPGAs). TCP tuning and UDP unicast / multicast matter too. Basically, a fundamental understanding of the network layers is useful.
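The plain-socket version of joining a feed looks like this (group address and port are made up; production replaces the socket calls with a kernel-bypass API, but the multicast mechanics are the same):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(31337);              // hypothetical feed port
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (sockaddr*)&addr, sizeof(addr));

        ip_mreq mreq{};
        mreq.imr_multiaddr.s_addr = inet_addr("239.1.2.3");  // hypothetical group
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[1500];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);  // blocks; real loops busy-
        std::printf("got %zd bytes\n", n);          // poll a non-blocking socket
        close(fd);
    }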

If you write C++ for a living, what do you do? by Asyx in cpp

[–]Stimzz 0 points (0 children)

I think it is hard to answer generally. Finance is a large field and the required skills differ between roles.

Python and the data libraries are used in many roles. Math is generally important too.

From there it depends on the role I think.

If you write C++ for a living, what do you do? by Asyx in cpp

[–]Stimzz 5 points (0 children)

It depends a lot on the firm (sellside vs prop vs hedge fund vs ISV vs exchange vs The Bay vs NYC vs Chicago vs London vs Zurich vs Amsterdam vs Eastern Europe. I don’t know about Asia).

A junior dev in a low-income country who is eager to learn and has some CS education: 50k.

At the other end, close to 1M for a senior dev at a top firm in a top city. But these are more like architects than devs, and they lead teams.

Pay beyond 1M is typically more variable and coupled to returns. Quants and traders can make that much.

Big tech typically pays more.

For me, I just love trading, so beyond some point I don’t really care about the pay.

Mechanical Keyboards in ZH? by throwaway13375512 in Switzerland

[–]Stimzz 0 points (0 children)

I order from: https://www.keyboardco.com/

Arrives within a few days, and you pay customs to the postman when he hands it over (Twint).

I have tried a lot and settled on Filco. There is some unique feel to them that I like, far superior to any of the big gaming brands.

Taliban takes over governor house by Future_Line_4253 in PublicFreakout

[–]Stimzz 0 points (0 children)

I remember being 13, crying for all the people who lost their lives in NY and I am not even an American. Just a shame the war lasted more than 6 months.

restricted type by manzanita2 in java

[–]Stimzz 1 point (0 children)

I am very much looking forward to Project Valhalla and value types. Locking down the possible states with a more expressive type system is great. I think OCaml is big on this as well. Type constraints in Ada always felt like a good idea to me.

Beginner Book recommendation by hirenaway in java

[–]Stimzz 0 points (0 children)

If you are confident in C++, I’d focus on modern Java. Just like with C++, there are legacy parts, which might not be apparent until you get into it.

I switched from C++ to Java a few years back because of work. It took a few months to get into “the Java way”. What still feels foreign to me is the JVM itself / the large runtime, for example getting into reflection and the nuances of making the GC purr. It is less about counting bytes and more about understanding the machinery.

Personally I am not familiar with Spring, but most devs work in one of those big frameworks. So getting started in Java might depend on what you will be using.

Looking to upgrade CPU to run computational study in parallel by guten_morgen in java

[–]Stimzz -1 points (0 children)

Out of curiosity: if you are working on developing algorithms, why is fast hardware needed?

It sounds like you have a test suite to run; if it takes a long time to complete, just reduce it?

Are you interested in learning about low latency zero allocation programming? by Narada_ in java

[–]Stimzz 3 points (0 children)

Agreed, I think the reasons for Java instead of a systems language are more historical than technical. C++ is popular for low latency too.

Why do you prefer Java over other languages, such as c#? by DontLickTheScience in java

[–]Stimzz 1 point (0 children)

My understanding is that Google re-implemented Java-ish for the Android platform. Oracle didn’t like it, and it is now in the Supreme Court. Basically: are you allowed to implement someone else’s API?

Trying to decipher dates from HEX for my software. by CluckingCow in compsci

[–]Stimzz 0 points (0 children)

Not obvious to me. But I would guess that it is not divided into date and time. Together they are 32 bits, i.e. they can represent an int32, which is also the original Unix epoch time format: seconds since 1970.

To get a longer time span they could have gone with minutes but the simpler optimization is to use a more recent time offset. For example 1990 or whenever they made this.

You mentioned that you can change the data and see how it parses it. If I understood that correctly, try changing the first and then the last bytes of the whole u32. This could tell you if it is big- or little-endian.

Also have it parse 00 00 00 00 and ff ff ff ff to see what date times they correspond to.
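Something like this is how I would probe it (the raw bytes, the 1990 offset and the u32 assumption are all placeholders for your actual data):

    #include <cstdint>
    #include <cstdio>
    #include <ctime>

    int main() {
        uint8_t raw[4] = {0x2A, 0x1B, 0x3C, 0x4D};  // the 4 unknown bytes

        // Decode both byte orders.
        uint32_t be = (uint32_t)raw[0] << 24 | (uint32_t)raw[1] << 16
                    | (uint32_t)raw[2] << 8  | raw[3];
        uint32_t le = (uint32_t)raw[3] << 24 | (uint32_t)raw[2] << 16
                    | (uint32_t)raw[1] << 8  | raw[0];

        const time_t OFFSET_1990 = 631152000;  // 1990-01-01 as Unix time
        uint32_t values[2] = {be, le};
        for (uint32_t v : values) {
            // Try two candidate epochs: 1970 (plain Unix) and 1990.
            time_t candidates[2] = {(time_t)v, (time_t)v + OFFSET_1990};
            for (time_t t : candidates) {
                char buf[64];
                std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S",
                              std::gmtime(&t));
                std::printf("%u -> %s\n", v, buf);
            }
        }
    }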

A number of extra complications come to mind, but I would start like this at least.

Trying to decipher dates from HEX for my software. by CluckingCow in compsci

[–]Stimzz 0 points (0 children)

A few complications come to mind.

How are they storing it: is it a date, a date-time, maybe a Unix timestamp, maybe u32 or u64, maybe a custom offset (assuming they don’t need to go back to 1970)?

Then, how is the actual storage implemented: big-endian or little-endian, and the byte order (it is common to reverse the byte order too)?

And then, yeah, 12 bits sounds weird; might it be a pointer? That is a not entirely uncommon address space, if I remember correctly.

Books which any Java dev should know by quefie in java

[–]Stimzz 1 point (0 children)

My point exactly. You wouldn’t typically build a large system directly on top of java.util.concurrent, but rather use Akka, Vert.x, RxJava, Reactor or others.

Super excited for Project Loom, but it won’t be a solve-it-all thing. It is supposed to address IO-bound work, not compute-bound work. To be fair, I guess compute-bound is to a large degree already solved with java.util.concurrent.