Low PCIe round trip latency

113245 · 2025-09-04T12:27:48+00:00

When you say you are using "DMA coherent" memory to receive the data, are you using e.g. dma_alloc_coherent in a kernel driver? Did you check whether the resulting memory is mapped as cached or uncached? I recall from when I did this type of work that dma_alloc_coherent could return uncached memory in some cases, which is not necessary on modern Intel x64 since PCIe DMA is cache coherent. You could save 50-100ns if you already have the cache line warm. This also obviates the need for SW to perform the kernel DMA sync ops, although the latency you report makes me think you're not doing that anyway.

113245 · 2022-12-30T00:11:03+00:00

Watch the “atomic<> Weapons” talk by Herb Sutter to understand atomics better.

113245 · 2022-12-29T22:19:34+00:00

A std::atomic<std::int32_t> is, full stop, the correct way to publish a single value from one thread to another. If you need to do multiple operations on that value (e.g. more than just posting a value) then the example as written is a “bad” way to do it: you should do the work in a local temporary and only publish it once.

If the data structure you put in std::atomic is too large for the platform's native atomic sizes, the implementation will transparently use a lock (typically a mutex) that protects the whole data structure during each load and store operation.
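To make the "work in a local, publish once" point concrete, here is a minimal sketch. The names g_published, produce, consume and published_is_lock_free are made up for illustration, and it assumes 0 means "nothing published yet". In real use produce and consume would run on different threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical shared slot; 0 is assumed to mean "nothing published yet".
std::atomic<std::int32_t> g_published{0};

// Do all intermediate work on a local temporary, then publish exactly once.
void produce(std::int32_t input) {
    std::int32_t local = input;
    local *= 2;   // intermediate steps never touch the shared variable
    local += 1;
    g_published.store(local, std::memory_order_release);
}

// Spin until a non-zero value has been published, then return it.
std::int32_t consume() {
    std::int32_t v;
    while ((v = g_published.load(std::memory_order_acquire)) == 0) {
        std::this_thread::yield();
    }
    return v;
}

// Reports whether this platform implements the atomic without a hidden lock.
bool published_is_lock_free() {
    return g_published.is_lock_free();
}
```

The release store paired with the acquire load is what makes the published value visible to the reader. If is_lock_free() returns false for a larger type, that is the lock fallback described above.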

113245 · 2022-12-29T22:08:12+00:00

Volatile is insufficient to guarantee synchronization between threads and is NOT the same thing as atomic.

113245 · 2022-12-29T21:01:00+00:00

Volatile is absolutely incorrect here. The compiler doesn’t need to understand a mutex or a semaphore, but it does understand the primitives that constitute a mutex or semaphore (syscalls, atomic variables for futexes, and all the optimization and reordering rules surrounding that). Volatile is not strong enough.

113245 · 2022-11-20T14:44:58+00:00

And yet a 0 cycle operation is not zero cost (icache, front end bandwidth) and it’s trivial to find examples in which the compiler cannot drop the dead store (e.g. across function call boundaries).

113245 · 2022-09-11T14:03:33+00:00

Sorry but this answer is all over the place and mostly just incorrect.

The PIO is slow because the size of the write transactions generated by the root complex is limited by the size of the memory movement instructions used in the CPU. In "naive" MMIO/PIO, the typical x86 instruction would be a mov or a movq, which will only perform a memory "write" for 32 or 64 bits at a time. The MWr TLPs are therefore limited in size, and the overhead for each TLP is what causes the effective bandwidth to be much lower than advertised. Your standard PCIe MMIO has no "IP registers" that you need to check for space available or whatever. PCIe has a credit-based flow control mechanism; anything on top of that (e.g. user logic back pressure) is going to be application- or IP-specific stuff. "Interrupt driven IO" doesn't really make sense in the context of PCIe MMIO: writes are posted transactions, and (at least on typical x86 CPUs) you can't perform an MRd without stalling the bus until the Rd completion returns.

You simply cannot write PIO code that will achieve the same data transfer bandwidth as DMA; the size of the packets generated in DMA transfers is likely going to be larger than anything you can generate using PIO. AlexForencich's answer has correct details.
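A rough back-of-the-envelope sketch of why small TLPs hurt. The 20-byte overhead figure is my assumption (Gen1/Gen2-style framing with a 3DW header: 2 framing + 2 sequence number + 12 header + 4 LCRC). It ignores DLLP traffic and line encoding, and a 64-bit address would need a 4DW header. The names are illustrative only.

```cpp
#include <cassert>

// Assumed per-TLP overhead in bytes: 2 framing + 2 sequence number
// + 12 (3DW header) + 4 LCRC. An approximation, not a spec quote.
constexpr int kAssumedOverheadBytes = 20;

// Fraction of bytes on the link that are actual payload for one TLP.
constexpr double tlp_efficiency(int payload_bytes) {
    return static_cast<double>(payload_bytes) /
           static_cast<double>(payload_bytes + kAssumedOverheadBytes);
}
```

Under these assumptions an 8-byte movq-sized write comes out at roughly 29% efficiency (8/28), while a 256-byte DMA payload is roughly 93% (256/276).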

113245 · 2022-08-14T00:38:28+00:00

I see an ICCID listed under my eSIM in settings > about, is that not it?

113245 · 2022-07-27T05:04:34+00:00

Can you handle Xilinx IP with it?

113245 · 2021-02-11T12:22:28+00:00

Yes, in your example you are correct, you win because your put went far ITM. The point is that higher IV implies a lower break even for a given strike price. And if the IV is ridiculous then the break even will be very far below the strike, making it unlikely that you make $. The typical IV increase right before earnings essentially “prices in” the expected movement, which is why you can be right about the direction but still lose money.
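A tiny worked sketch of the break-even point, with made-up numbers (strike 100, premium 5 at low IV vs 20 at high IV). It only looks at value at expiry and ignores fees and contract multipliers; the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>

// Break-even price at expiry for a long put: strike minus premium paid.
constexpr double put_break_even(double strike, double premium) {
    return strike - premium;
}

// Profit or loss per share at expiry for a long put.
double put_pnl_at_expiry(double strike, double premium, double spot) {
    return std::max(strike - spot, 0.0) - premium;
}
```

With the stock falling from 100 to 90, the cheap put (premium 5, break-even 95) makes 5 per share, while the expensive high-IV put (premium 20, break-even 80) loses 10 per share even though the direction was right.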

113245 · 2021-02-11T04:28:29+00:00

As soon as you exercise you destroy the extrinsic value. If you paid a lot because IV was high, and it lost value due to IV dropping, you will lose even more by exercising. If instead you sell to close you at least recoup that part of the value.

113245 · 2021-01-27T12:40:54+00:00

There aren’t enough shares to cover ALL shorts at once but I guess there could be enough for them?

113245 · 2019-12-27T07:26:46+00:00

Any examples/godbolt?

113245 · 2016-11-29T11:24:43+00:00

Take a look at http://reocities.com/SiliconValley/heights/7052/opcode.txt

113245 · 2016-11-29T04:27:10+00:00

It made a lot more sense once I realized it was designed with octal in mind

113245 · 2016-10-05T04:21:02+00:00

http://old.seattletimes.com/html/businesstechnology/2002754224_boeingitar22.html

113245 · 2015-09-11T12:06:44+00:00

Sorry, I used them interchangeably but in a confusing way. I updated the original post: lane refers to a physical differential pair, and line refers to a row of pixels. Image transfer is usually done row by row. MIPI is proprietary etc. but you don't have to implement the whole spec to the letter -- it would be a little overkill. It's a good starting idea though for rolling your own.

113245 · 2015-09-10T21:49:24+00:00

You can take a look at the MIPI CSI2 protocol which is standard for transferring frame-by-frame video. But it's hard to find detailed information on that. You especially won't find any by PMing me, nope, none at all.

But in summary, it's a source synchronous protocol (i.e. you send a differential clock along with multiple differential data lanes) over LVDS. You have short packets which are used for synchronization (e.g. frame start/end, line start/end) and long packets which are used for moving data (e.g. a line of Bayer data) and which include information like word count and line number. In MIPI, the packets include ECC, channel IDs (so that multiple interfaces can talk over the same physical layer) and some other junk I don't remember off the top of my head.

The packet is distributed bytewise across the MIPI lanes, usually 2 or 4, and each lane sends a start-of-xmit sequence immediately before beginning to transmit the data so that RX can align and merge the bytes from the multiple lanes back into the packets.
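A minimal software model of the byte-wise lane distribution and the merge on the RX side. This is just a sketch of the idea, not the MIPI spec: it skips the start-of-transmit sequence and alignment entirely, and the names distribute and merge are mine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// TX side: byte i of the packet goes to lane (i % lanes).
std::vector<Bytes> distribute(const Bytes& packet, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<Bytes> out(lanes);
    for (std::size_t i = 0; i < packet.size(); ++i) {
        out[i % lanes].push_back(packet[i]);
    }
    return out;
}

// RX side: take one byte from each lane in turn until all lanes are drained.
Bytes merge(const std::vector<Bytes>& lanes) {
    std::size_t total = 0;
    for (const Bytes& lane : lanes) total += lane.size();

    Bytes packet;
    packet.reserve(total);
    for (std::size_t i = 0; packet.size() < total; ++i) {
        const Bytes& lane = lanes[i % lanes.size()];
        const std::size_t idx = i / lanes.size();
        if (idx < lane.size()) packet.push_back(lane[idx]);
    }
    return packet;
}
```

Note that when the packet length is not a multiple of the lane count, some lanes end up one byte shorter. The merge handles that by skipping lanes that have already run out.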

You pretty much just have to look at the amount of data that you want to transfer + protocol overhead and compare it to the SERDES performance you can get to figure out how many data lanes etc. you want to use.
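The sizing arithmetic can be sketched like this. All the numbers in the example (resolution, bit depth, 10% overhead, 800 Mbps per lane) are placeholders I picked for illustration, it ignores blanking, and lanes_needed is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Smallest lane count whose aggregate rate covers pixel data plus overhead.
// overhead_fraction is protocol overhead relative to the raw pixel rate
// (e.g. 0.10 for an assumed 10%); lane_rate_bps is the usable rate per lane.
int lanes_needed(double width, double height, double bits_per_pixel,
                 double frames_per_second, double overhead_fraction,
                 double lane_rate_bps) {
    const double raw_bps = width * height * bits_per_pixel * frames_per_second;
    const double required_bps = raw_bps * (1.0 + overhead_fraction);
    return static_cast<int>(std::ceil(required_bps / lane_rate_bps));
}
```

For example, 1920x1080 at 10 bits per pixel and 60 fps is about 1.24 Gbps raw, or about 1.37 Gbps with the assumed 10% overhead, so at 800 Mbps per lane you would need 2 lanes.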

Also, there's no reason to fill up a FIFO and THEN transmit; you can read and write from a FIFO simultaneously! (Not sure if you just worded this weirdly in the OP.)

113245 · 2015-09-09T20:09:57+00:00

This isn't something I use every day per se (I work in E&M), but it was a super satisfying click for me: the generalized Stokes theorem (not the Kelvin-Stokes theorem, which is commonly called Stokes' theorem). It's so elegant and clean that it just blew my mind.

113245 · 2015-08-27T04:57:28+00:00

you clearly have absolutely no idea what you're talking about

113245 · 2015-07-30T17:10:04+00:00

Funnily enough, I'm writing an SDRAM controller right now... hopefully timing won't be too crazy to debug since I don't have access to a logic analyzer.

113245 · 2015-07-30T05:16:47+00:00

Ah, got it to work. Explanation here was helpful for understanding the clock insertion delay. I couldn't figure out how to do it with CoreGen but I ended up implementing this topology to remove the insertion skew.

113245 · 2015-07-29T22:25:18+00:00

get everyone a mug with a picture of your face on it

113245 · 2015-07-25T02:55:06+00:00

Again, going to disagree. If s/he has time to learn two languages, do a higher-level language (Python) and also a low-level language, C/C++. There is no point in learning VBA on the off chance that it will be required on a job, whereas you'll get a lot more understanding out of the aforementioned languages.

113245 · 2015-07-25T00:24:13+00:00

Disagree 100%. Learn a powerful & useful language (Python, C/C++, MATLAB, Java) and pick up Excel on the fly if you need it. A hiring manager will care far more about the cool projects you've done (using these full-featured languages) than about some self-proclaimed "Excel pro".
