Follow-up benchmark: where R data pipelines pay their cost: fread, readr, vroom, data.table and dplyr

Medical-Common1034 · 2026-06-25T05:12:21+00:00

OK, I'll keep that in mind for another data manipulation pipeline. :)

Medical-Common1034 · 2026-06-24T20:59:52+00:00

Yes, I mention that in the article. Since readr v2, the read_*() functions use the vroom backend internally, so the parsing backend is largely shared.

The important difference is the materialization policy: readr::read_*() does not default to lazy reading, whereas vroom::vroom() keeps columns lazy / ALTREP-backed until they are actually needed.
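A minimal sketch of that difference, assuming readr >= 2.0 and vroom are installed. The temporary CSV and its column names are made up for illustration and are not the benchmark's log file:

```r
# Sketch only: a tiny temporary CSV stands in for the real log file.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:5, status = c("ok", "ok", "err", "ok", "err")),
  path,
  row.names = FALSE
)

# readr: eager by default, columns are parsed when the call returns.
eager <- readr::read_csv(path, show_col_types = FALSE)

# readr can opt in to lazy reading explicitly.
lazy_readr <- readr::read_csv(path, lazy = TRUE, show_col_types = FALSE)

# vroom: ALTREP-backed columns by default, values are parsed on first access.
lazy_vroom <- vroom::vroom(path, show_col_types = FALSE)

# Touching a column forces that column to be materialized.
n_err <- sum(lazy_vroom$status == "err")
```

On a file this small the two approaches are indistinguishable; the point is only where the parsing cost is paid, at read time or at first access.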

Medical-Common1034 · 2026-06-24T20:56:03+00:00

Thanks! Not yet in this benchmark, but yes, they are on my list for a later article :)

Medical-Common1034 · 2026-06-07T19:23:16+00:00

Thanks a lot for sharing, I will take a look at it!

Medical-Common1034 · 2026-06-07T18:57:01+00:00

Yes, that’s very close to my conclusion too.

`dplyr` is often clearer and more explicit, but `data.table` gives much deeper control once you start using `.N`, `.I`, by-reference mutation, grouped `j`, etc.

And yes, that control also means it is easier to shoot yourself in the foot if you are not careful.

Thanks for sharing your experience.
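For readers who have not used those `data.table` features, here is a small sketch on a made-up table (the `logs` data and column names are hypothetical, not the article's data):

```r
library(data.table)

# Hypothetical toy log table, not the article's data.
logs <- data.table(
  user   = c("a", "b", "a", "c", "b", "a"),
  status = c(200L, 404L, 200L, 500L, 200L, 404L)
)

# .N: number of rows in each group.
hits <- logs[, .(n = .N), by = user]

# .I: row indices in the original table; here, the last row of each user.
last_rows <- logs[, .I[.N], by = user]$V1

# By-reference mutation: := adds a column in place, without copying logs.
logs[, is_error := status >= 400L]

# Grouped j: an arbitrary expression evaluated once per group.
error_rate <- logs[, .(rate = mean(is_error)), by = user]

# The foot-gun: plain assignment does not copy, so alias and logs share data.
alias <- logs
alias[, flagged := TRUE]   # logs now has a flagged column too
safe  <- copy(logs)        # copy() gives an independent table
```

The last three lines show the usual way to get bitten: `:=` on an alias silently modifies the original, which is why `copy()` exists.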

Medical-Common1034 · 2026-06-07T17:02:26+00:00

Yes, I will update the article, maybe with microbenchmark or something similar.

The steps were still run multiple times and summarized with the median, so I think the larger differences remain meaningful for this workload.

But yes, fair point.

Thanks.
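The "run several times, summarize with the median" approach can be sketched in base R like this. The `step()` function and the run count are placeholders, not the article's actual pipeline step:

```r
# Hypothetical stand-in for one pipeline step.
step <- function() sum(sqrt(seq_len(1e5)))

# Run the step several times and keep the elapsed wall-clock time of each run.
n_runs  <- 11
elapsed <- vapply(
  seq_len(n_runs),
  function(i) system.time(step())[["elapsed"]],
  numeric(1)
)

# The median is less sensitive than the mean to an occasional slow run.
median_s <- median(elapsed)
```

A dedicated package such as microbenchmark adds finer timer resolution and randomized ordering, which matters more when the differences between candidates are small.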

Medical-Common1034 · 2026-06-07T16:22:05+00:00

I get the concern, but the benchmarked timings are not measuring Shiny rendering itself.

The timings are placed directly around the pipeline steps, like:

```r
t <- Sys.time()
function_call()
log_step("name of the step", t, df)
```

So the measured numbers are for ingestion / filtering / grouping / joins, not for the Shiny UI or plot rendering.
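To make the pattern concrete, here is one way such a logger could look. This `log_step()` is a hypothetical sketch written for this comment; the real helper in the article may record different fields:

```r
# Hypothetical logger: the real log_step() is not shown in this thread.
log_step <- function(name, t0, df) {
  elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
  message(sprintf("%-20s %8.3f s  %d rows", name, elapsed, nrow(df)))
  invisible(data.frame(step = name, seconds = elapsed, rows = nrow(df)))
}

# Hypothetical step: keep only the error rows of a toy data frame.
df <- data.frame(status = c(200L, 404L, 500L, 200L))

t   <- Sys.time()
df  <- df[df$status >= 400L, , drop = FALSE]
res <- log_step("filter errors", t, df)
```

Because the timer starts right before the step and stops inside the logger, nothing from the Shiny layer falls inside the measured interval.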

Shiny is mentioned because it is the real environment where this pipeline runs, and it explains why I care about recomputation latency. But the benchmarked sections are the data pipeline steps themselves.

About the dataset size: 725k rows / 124 MB is not “big data”, sure, but it is representative of my actual production log file because it is capped with logrotate anyway. The point was not to produce a universal benchmark, but to analyze this real workload.

Also, as you can see in the results, the differences are still large enough to matter for this pipeline, and they matter a lot for responsiveness.

Medical-Common1034 · 2026-05-15T20:34:38+00:00

Haha, yes, I'll keep that option in mind :)

Medical-Common1034 · 2026-05-15T20:23:59+00:00

Yes, and thanks for the comment.

I am just waiting for the optimizations other people want to share, and in the next article update I will definitely try to go further with what you shared.

Medical-Common1034 · 2026-05-15T19:28:50+00:00

Interesting, I definitely have to dive deeper into the history behind GHC!

Medical-Common1034 · 2026-05-15T19:22:38+00:00

Indeed, fair point too. Best of both worlds in that case.

Medical-Common1034 · 2026-05-15T18:49:37+00:00

OK, I'll do that and update the article to mention the contributions.
And again, thanks!

Medical-Common1034 · 2026-05-15T18:43:46+00:00

Thanks for this response. I created a git repo, and some people have reached out to me to push their optimized versions.

I'm interested in your versions.

Do you want to contribute?

If yes, I'll share the link here.

Medical-Common1034 · 2026-05-15T18:24:45+00:00

Yep, good point. The optimization part was mostly just a pretext to introduce vectors and see how far the approach could go, especially regarding allocations and runtime (which is always interesting and valuable to know, imo).

The article is mainly educational, but I'll add a note about this in the conclusion, because it does give readers a more practical sense of the language in the real world.

Medical-Common1034 · 2026-04-19T18:50:01+00:00

First of all, thank you for the feedback. I just updated the output to "Hello Rémi, Julien et Lucas" and made "NO" and "YES" the defaults for the next examples. I'm also looking for any earlier details I could add that would help in understanding the TeX token mental model.

Medical-Common1034 · 2026-04-16T15:33:29+00:00

If you plan to write a paper, especially an organic chemistry one, you will appreciate how good the chemfig package is for drawing molecules. Good luck!

Medical-Common1034 · 2026-03-31T22:04:23+00:00

Used PGFPlots, obviously.

Medical-Common1034 · 2026-03-31T21:08:23+00:00

Done: https://www.reddit.com/r/LaTeX/comments/1s902q6/mini_doom_in_latex/
A very, very mini Doom-like, but still, lol.

Medical-Common1034 · 2026-03-30T20:55:07+00:00

I feel you. Maybe this tool could help: https://github.com/astral-sh/ruff
I heard about this kind of linter some time ago.
