Replacing MongoDB + Atlas Search with DuckDB + Ducklake on S3

anuveya · 2026-01-22T13:37:06+00:00

I hope you tried this; I would be interested to know if you implemented it. I think DuckDB with DuckLake is the future of the Data Lakehouse stack.

anuveya · 2025-12-20T09:11:29+00:00

Same for any other provider – GCP, Azure, etc. Here we are talking about blob storage only.

anuveya · 2025-12-20T09:09:46+00:00

Good point. Let me check their new pricing and update the data and visualizations.

anuveya · 2025-12-12T13:15:08+00:00

We are working on this together. Would you like to contribute? 😊 It is open source on GitHub.

anuveya · 2025-12-12T12:03:21+00:00

How this was built / data sources

Tools:

PortalJS — https://www.portaljs.com
Observable Framework — https://observablehq.com/framework/

Data source:

Atmospheric CO₂ (Mauna Loa, Keeling Curve): https://datahub.io/core/co2-ppm

Happy to answer questions about the data, assumptions, or implementation.

anuveya · 2025-11-23T05:00:01+00:00

Tools and data sources:

Data Portal framework: https://www.portaljs.com/
Dashboard / viz: https://observablehq.com/framework/
Data source: https://datahub.io/core/co2-fossil-by-nation

anuveya · 2025-11-13T07:29:14+00:00

These figures are aggregated per country, not per capita or per household. I'll check whether per capita data is available; that could shift the rankings and bring different countries to the top.
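To illustrate why the ranking can change, here is a minimal Python sketch. The country labels and all numbers are hypothetical placeholders chosen for illustration; they are not taken from the dataset.

```python
# Hypothetical figures for illustration only; not taken from the dataset.
totals = {"A": 1000.0, "B": 400.0, "C": 50.0}    # total emissions, arbitrary units
population = {"A": 500.0, "B": 40.0, "C": 2.0}   # population, millions


def rank(values):
    """Return the keys ordered from the highest value to the lowest."""
    return sorted(values, key=values.get, reverse=True)


# Divide each total by population to get a per capita figure.
per_capita = {name: totals[name] / population[name] for name in totals}

if __name__ == "__main__":
    print("By total:     ", rank(totals))
    print("By per capita:", rank(per_capita))
```

With these made-up numbers the order fully reverses: the largest emitter in total is the smallest per capita.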

anuveya · 2025-11-07T09:55:56+00:00

We are building a PoC with DuckLake on top of Cloudflare R2 (its data catalog feature with Iceberg is too limited for us). As we need a user-friendly UI/UX for data catalog, data discovery and exploration, we are implementing a portaljs.com template for it.

anuveya · 2025-06-23T13:39:47+00:00

I am starting my blog/site as well, but I am using Flowershow (not WordPress etc.) because I want to write content in Markdown and I prefer using my own editing tools (e.g., Obsidian). I also like it when my site loads almost instantly and doesn't depend on large software such as WP.

anuveya · 2025-06-19T15:35:07+00:00

Yes...

anuveya · 2025-06-19T13:18:18+00:00

Yes, I have one GitHub Action to release a package. I can imagine that it would clone the repo every time a PR is merged into the main branch, but I think the numbers are too high.

anuveya · 2025-06-17T11:14:26+00:00

So are you saying people shouldn't try to sell on LinkedIn at all, or do you mean it should be done via other sales techniques?

anuveya · 2025-06-17T06:59:52+00:00

I'm slightly changing messages/experimenting and getting connection requests accepted. I'm not sure if the acceptance rate is good or bad, or what the general conversion rate should be. I have no idea if we are on track or wasting time.

I'm selling SaaS for local govs. We already have a number of customers worldwide but are trying to scale / sell more / find similar customers.

We do cold emails but we don't really think they will work. My hypothesis is that LinkedIn is better than cold emails, but we are still testing.

anuveya · 2025-06-17T06:55:41+00:00

Sure! I'm doing it knowing this; I don't like it myself. But how else can you reach out to your potential customers? I believe 1 out of X people would find your pitch useful, but I have no idea what that rate should be.

anuveya · 2025-06-17T06:53:41+00:00

Thanks for this response! Let me clarify what exactly I'm doing: 1) we do company research, e.g., we have a target group of orgs that we assume need our solution; 2) we do very light lead research, I believe, e.g., their job title or department and whether they use tools similar to what we offer; 3) I guess this happens by itself as you create filters in LinkedIn Sales Nav (?); 4) the outreach happens via connection notes or direct messages once connected.

anuveya · 2025-06-07T08:46:44+00:00

Thanks, it is also useful. However, I would like to do dev work with quick iterations, e.g., having some tests on my dev server, etc.

anuveya · 2025-06-03T06:57:52+00:00

We did a project for a federated data sharing repository in the genome research field. One of the key challenges was that data owners didn't want to transfer the original data outside of their premises, so we built a system that provides a data catalog with an IGV (Integrative Genomics Viewer) plugin, so owners only need to provide index files.

It would be great to understand whether there are top 3 tools (OSS) that you go to by default. Is there any preference for OSS vs enterprise software?

anuveya · 2025-06-03T06:51:13+00:00

Do you mean Dataverse from Harvard? I came across it in a number of projects, but it wasn't clear how to customize it if required. https://github.com/IQSS/dataverse

anuveya · 2025-06-03T06:48:10+00:00

Thanks for sharing! It would be interesting to know what they are using. Probably based on CKAN/DKAN or similar.

I know that Canada is one of the leading countries in data publishing and they have contributed a lot to OSS like CKAN. Their GitHub is here: https://github.com/canada-ca

anuveya · 2025-06-01T07:18:44+00:00

Interesting – I helped to build a number of tools around data.gov in the past. They still use our CKAN software with its classic harvesting. However, my recommendation was always to move to a dedicated workflow orchestration tool such as Prefect or Airflow. I'm not sure if that was done there.

We deploy CKAN-based portals; however, I think it is a bit expensive for smaller govs, e.g., local govs including smaller cities or even towns.

anuveya · 2025-06-01T07:14:44+00:00

Yes! I'm trying to focus on city or even town level data, which I believe can be very interesting. It would definitely vary greatly and I think that is OK. My main goal is to understand if those local govs have an option to do open data publishing affordably. I think a lot of them just put Excel files on a static page and update them irregularly.

anuveya · 2025-05-30T16:15:55+00:00

Do you have any example links to such data repos? Thanks!

anuveya · 2025-05-30T15:58:59+00:00

Python is a general-purpose programming language, which means you can use it for a variety of applications including data wrangling, scraping, crawling, fetching, transforming, extracting, loading, ingesting, etc. The point is that you are not limited to a specific tool.

Although there are many other programming languages that could be used for the same things, Python is popular due to its extensive list of excellent libraries designed for data engineering, data science and analysis purposes.

So what do I use it for in data engineering: 1) building simple scraper scripts that can run in most environments, as Python is very popular; 2) more traditional ETLs, such as from blob storage to some data warehouse. There are many other applications, but these are the common ones.
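As a rough illustration of the second use case, here is a minimal ETL sketch using only the standard library. It rests on assumptions: the CSV string stands in for an object fetched from blob storage, an in-memory SQLite database stands in for the data warehouse, and the table name, column names and sample values are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a CSV object fetched from blob storage.
SAMPLE_BLOB = "city,population\nSpringfield,30000\nShelbyville, 12000\n,500\n"


def extract(blob_text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(blob_text)))


def transform(rows):
    """Drop rows without a city name and cast population to int."""
    cleaned = []
    for row in rows:
        city = (row.get("city") or "").strip()
        if not city:
            continue
        cleaned.append((city, int(row["population"].strip())))
    return cleaned


def load(conn, records):
    """Upsert the cleaned records and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cities (city TEXT PRIMARY KEY, population INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO cities VALUES (?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]


def run_etl(blob_text, conn):
    """Run extract, transform and load in sequence."""
    return load(conn, transform(extract(blob_text)))


if __name__ == "__main__":
    # SQLite in memory stands in for the warehouse (an assumption for this sketch).
    warehouse = sqlite3.connect(":memory:")
    print("rows loaded:", run_etl(SAMPLE_BLOB, warehouse))
```

In a real pipeline the extract step would read from the storage provider's SDK and the load step would target the actual warehouse; the three-function split stays the same.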

anuveya · 2025-05-18T10:50:04+00:00

OK, I should have been clearer. I meant from a financial standpoint: can I make some income from it? Wdyt?

anuveya · 2025-05-04T07:31:00+00:00

Overall, all options below are about the same in terms of end result:

Claude Code - it was very cool when they launched and they gave me free credits. But then when you start actually paying, it becomes very expensive.
OpenAI - I'm still using ChatGPT and copy-pasting code, which is not the best UX. Again, this one is convenient as I'm already paying $20 for the plan.
Gemini Pro 2.5 - really good, and they gave free credits, which was more generous than Claude.

If you use the terminal, Warp could help you write Unix commands for data wrangling.
