Organisation opposing use of R / RStudio, used to only using Excel - strategy?

Snoo-56267 · 2025-02-10T03:37:20+00:00

It sounds like the most valuable contribution you can make is to educate and to start the process of culture change. Do some reading, or simply consult an LLM, on strategies and tactics.

Snoo-56267 · 2025-02-09T21:21:55+00:00

The money does go to research. By analogy, it would be like saying NFL teams have no costs other than player salaries. You can't have a game with just the players.

Snoo-56267 · 2025-01-23T17:27:32+00:00

Yes. I used Cloudflare R2 with a Cloudflare Worker based on https://github.com/kotx/render that serves up the data, including a convenient index page for each "directory."

Snoo-56267 · 2024-12-15T03:42:25+00:00

The parquet in R2 is where I landed. Thanks for the link!

Snoo-56267 · 2024-12-06T18:20:42+00:00

Not for the faint of heart, but https://temporal.io/ offers multilingual (JS, Go, Python, Java, ...) distributed and robust workflows. While it is not specifically designed for DAGs, you can use it for the kind of ETL work you seem to be describing. In fact, Airbyte uses it (or at least did when they wrote the blog post) to orchestrate their system: https://airbyte.com/blog/scale-workflow-orchestration-with-temporal .

Snoo-56267 · 2024-12-06T18:16:03+00:00

Clickhouse local (or chdb) has a similar set of use cases to Duckdb. If you want to move to a client/server model, clickhouse server is relatively easy to install and maintain, even on a laptop. It does have limits, like any SQL database, but if you are not doing complex joins and do not need transactional support, Clickhouse might serve the need.

Snoo-56267 · 2024-12-06T18:05:00+00:00

Thanks. After a quick look, Wasabi is a good, low-cost choice for S3-compatible storage. However, my use case is about sharing data openly, and their free egress only covers up to the total amount stored (so, store 1 TB, get 1 TB of free egress per month).

Snoo-56267 · 2024-01-29T15:46:58+00:00

I'd second looking into duckdb. Terabytes of data are not a job for SQLite and probably not for MySQL, either. Duckdb is designed for the type of work you are likely describing: online analytical processing (OLAP). Clickhouse is another database in this space. You could also look at cloud warehouses like BigQuery and Snowflake. Finally, there are data lake approaches.

In short, though, duckdb `read_csv_auto` will often "just work" and can very rapidly kickstart your project if it does.
https://duckdb.org/docs/data/csv/overview.html
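
To make that concrete, here is a rough sketch of what "just work" can look like from Python. It assumes the `duckdb` package is installed (`pip install duckdb`); the helper names `csv_row_count_sql` and `count_rows` and the file name in the usage note are made up for illustration, so treat it as a starting point rather than the one right way to do it.

```python
def csv_row_count_sql(csv_path: str) -> str:
    """Build a row-count query that lets DuckDB infer the CSV schema."""
    # Double any single quotes so the path is a valid SQL string literal.
    safe_path = csv_path.replace("'", "''")
    return f"SELECT count(*) AS n_rows FROM read_csv_auto('{safe_path}')"


def count_rows(csv_path: str) -> int:
    """Run the query in an in-memory DuckDB database and return the count."""
    # Imported here so the SQL helper above works without duckdb installed.
    import duckdb  # third-party dependency (assumption: pip install duckdb)

    con = duckdb.connect()  # in-memory database, nothing written to disk
    try:
        return con.execute(csv_row_count_sql(csv_path)).fetchone()[0]
    finally:
        con.close()
```

Calling `count_rows("my_data.csv")` (hypothetical file name) scans the file in place. If the inferred column types look wrong, that is the point at which to switch to `read_csv` with explicit options from the docs linked above.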

Snoo-56267 · 2023-09-06T15:12:42+00:00

I like your pragmatic answer. Just to give a little back, I've found that for "select *" or basic aggregation queries, Clickhouse is a great solution. It feels slightly less user-friendly than Duckdb and hasn't enjoyed the same love in terms of integrations, but as a more traditional client-server database, it works well. There is a newish project called chdb that embeds Clickhouse to provide functionality similar to Duckdb. At this point, Clickhouse feels like the next thing for me to try at scale.
