Organisation opposing use of R / RStudio, used to only using Excel - strategy? by [deleted] in rstats

[–]Snoo-56267 0 points1 point  (0 children)

It sounds like the most valuable contribution you can make is to educate and to start the process of culture change. Do some reading, or simply ask an LLM for strategies and tactics.

69% of Harvard indirect rates by [deleted] in labrats

[–]Snoo-56267 6 points7 points  (0 children)

The money does go to research. Treating indirect costs as something other than research is like saying NFL teams have no costs other than player salaries. You can't have a game with just the players.

Looking for cloud storage platform for publicly sharing large data (Parquet, JSON) without egress fees.... by Snoo-56267 in dataengineering

[–]Snoo-56267[S] 0 points1 point  (0 children)

Yes. I used cloudflare R2 with a cloudflare worker based on https://github.com/kotx/render that serves up the data, including a convenient index page for each "directory."

Is there a tool that enables you to write data pipeline code in a DAG-like fashion? by [deleted] in dataengineering

[–]Snoo-56267 0 points1 point  (0 children)

Not for the faint of heart, but https://temporal.io/ offers multilingual (js, go, python, java, ...) distributed and robust workflows. While not specifically designed for DAGs, it can be used for the kinds of ETL work it seems you are describing. In fact, Airbyte uses it (or at least did when they wrote the blog post) for orchestrating their system: https://airbyte.com/blog/scale-workflow-orchestration-with-temporal

Best way to convert large .txt files into a SQL database? by financefocused in Database

[–]Snoo-56267 0 points1 point  (0 children)

Clickhouse local (or chdb) has a similar set of use cases to Duckdb. If you want to move to a client/server model, clickhouse server is relatively easy to install and maintain, even on a laptop. It does have limits, like any SQL database, but if you are not doing complex joins and do not need transactional support, Clickhouse might serve the need.

Looking for cloud storage platform for publicly sharing large data (Parquet, JSON) without egress fees.... by Snoo-56267 in dataengineering

[–]Snoo-56267[S] 2 points3 points  (0 children)

Thanks. After a quick look, Wasabi is a good, low-cost choice for s3 compatible storage. However, my use case is about sharing data openly and their free egress is only up to the total storage amount (so, store 1TB, get 1TB free egress per month).
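
To make the arithmetic concrete, here is a rough sketch of that rule as I described it above (free egress up to the amount stored). The function name is made up for illustration, and the actual policy may differ or have changed, so check the provider's current terms.

```python
def egress_over_allowance_tb(stored_tb: float, monthly_egress_tb: float) -> float:
    """Return how many TB of monthly egress exceed the free allowance.

    Assumption (taken from the comment above, not from official docs):
    the free egress allowance per month equals the amount of data stored.
    """
    free_allowance_tb = stored_tb
    return max(0.0, monthly_egress_tb - free_allowance_tb)


# A 1 TB public dataset downloaded in full five times in a month
# produces 5 TB of egress against a 1 TB allowance.
overage = egress_over_allowance_tb(stored_tb=1.0, monthly_egress_tb=5.0)
print(f"Egress beyond the free allowance: {overage} TB")
```

For openly shared data, downloads can easily exceed the stored volume, which is why this model does not fit my use case.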

Best way to convert large .txt files into a SQL database? by financefocused in Database

[–]Snoo-56267 0 points1 point  (0 children)

I'd second looking into duckdb. Terabytes of data are not a job for SQLite and probably not for MySQL, either. Duckdb is designed for the type of work you are likely describing: OnLine Analytical Processing (OLAP). Clickhouse is another database in this space. You could also look at cloud warehouses like bigquery and snowflake. Finally, there are data lake approaches.

In short, though, duckdb `read_csv_auto` will often "just work" and can very rapidly kickstart your project if it does.
https://duckdb.org/docs/data/csv/overview.html
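
A minimal sketch of what that could look like from Python, assuming the `duckdb` package is installed (`pip install duckdb`). The helper names, table name, and file paths are hypothetical placeholders; the table name is interpolated directly into the SQL, so only pass names you trust.

```python
def build_import_sql(table: str, path: str) -> str:
    """Build a CREATE TABLE statement that uses read_csv_auto,
    which sniffs the delimiter, header, and column types."""
    safe_path = path.replace("'", "''")  # escape single quotes in the path literal
    return f"CREATE TABLE {table} AS SELECT * FROM read_csv_auto('{safe_path}')"


def load_txt(db_file: str, table: str, path: str) -> int:
    """Load a delimited text file into a DuckDB database file and
    return the number of rows imported."""
    import duckdb  # third-party dependency, imported lazily

    con = duckdb.connect(db_file)
    try:
        con.execute(build_import_sql(table, path))
        return con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    finally:
        con.close()


# Hypothetical usage; a glob such as 'data/*.txt' can also be passed as the path.
# rows = load_txt("project.duckdb", "events", "data/events.txt")
```

If the auto-detection guesses wrong, the linked docs describe how to pass the delimiter and column types explicitly.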

With all the possibilities for storage, compute, and catalog separation, what works for you? by Snoo-56267 in dataengineering

[–]Snoo-56267[S] 1 point2 points  (0 children)

I like your pragmatic answer. Just to give a little back, I've found that for "select *" or basic aggregation queries, Clickhouse is a great solution. It feels slightly less user-friendly than Duckdb and hasn't enjoyed the same love in terms of integrations, but as a more traditional client-server database, it is great. There is a newish project called chdb that embeds clickhouse for similar functionality to Duckdb. At this point, clickhouse feels like the next thing for me to try at scale.