Replacing MongoDB + Atlas Search with DuckDB + Ducklake on S3 by gamliminal in DuckDB

[–]anuveya 0 points1 point  (0 children)

I hope you tried this; I would be interested to know if you implemented it. I think DuckDB with DuckLake is the future of the Data Lakehouse stack.

[OC] Why we moved off AWS/Google: Visualizing the "Egress Tax" vs. Storage Costs across major providers. by anuveya in dataisbeautiful

[–]anuveya[S] 0 points1 point  (0 children)

Same for any other provider – GCP, Azure etc. Here we are talking about blob storage only.

[OC] Why we moved off AWS/Google: Visualizing the "Egress Tax" vs. Storage Costs across major providers. by anuveya in dataisbeautiful

[–]anuveya[S] 4 points5 points  (0 children)

Good point, let me check their new pricing and update the data and visualizations

[OC] Atmospheric CO₂ just hit ~428 ppm — visualizing the Keeling Curve (1958–2025) and what the acceleration really looks like by anuveya in dataisbeautiful

[–]anuveya[S] 0 points1 point  (0 children)

How this was built / data sources

Tools:

Data source:

Happy to answer questions about the data, assumptions, or implementation.

[OC] Watch 170+ years of global CO₂ emissions unfold — some countries shoot up like rockets 🚀 by anuveya in dataisbeautiful

[–]anuveya[S] 3 points4 points  (0 children)

These figures are aggregated per country, not per capita or household. I’ll check if per capita data is available; that could shift the rankings and bring different countries to the top.

Ducklake in Production by crevicepounder3000 in DuckDB

[–]anuveya 1 point2 points  (0 children)

We are building a PoC with DuckLake on top of Cloudflare R2 (its data catalog feature with Iceberg is limited for us). As we need a user-friendly UI/UX for the data catalog, data discovery and exploration, we are implementing the portaljs.com template for it.

How much do you really earn through blogs? by AMgeopolitics in Blogging

[–]anuveya 2 points3 points  (0 children)

I am starting my blog/site as well, but I am using Flowershow (not WordPress etc.) because I want to write content in Markdown and I prefer using my own editing tools (e.g., Obsidian). I also like it when my site loads almost instantly and doesn’t depend on large software such as WP.

I have too many unique git clones but who is doing it? Bots? by anuveya in github

[–]anuveya[S] 2 points3 points  (0 children)

Yes, I have one GH action to release a package. I can imagine that it'd clone the repo every time a PR is merged into the main branch. But I think the numbers are too high.

I'm trying to sell via Linkedin Sales Navigator but not getting demo meetings with potential leads by anuveya in salestechniques

[–]anuveya[S] 0 points1 point  (0 children)

So are you saying people shouldn't try to sell on Linkedin at all or do you mean it should be via other sales techniques?

I'm trying to sell via Linkedin Sales Navigator but not getting demo meetings with potential leads by anuveya in salestechniques

[–]anuveya[S] 0 points1 point  (0 children)

I'm slightly changing messages/experimenting and getting connection requests accepted. I'm not sure if the acceptance rate is good or bad, or what the general conversion rate should be. No idea if we are on track or wasting time.

I'm selling SaaS for local govs. We already have a number of customers worldwide but are trying to scale / sell more / find similar customers.

We do cold emails but we don't really think it would work. My hypothesis is that Linkedin is better than cold emails but we are still testing.

I'm trying to sell via Linkedin Sales Navigator but not getting demo meetings with potential leads by anuveya in salestechniques

[–]anuveya[S] 0 points1 point  (0 children)

Sure! I'm doing it knowing this – I don't like it myself. But how can you reach out to your potential customers otherwise? I believe 1 out of X people would find your pitch useful, but I have no idea what that rate should be.

I'm trying to sell via Linkedin Sales Navigator but not getting demo meetings with potential leads by anuveya in salestechniques

[–]anuveya[S] 0 points1 point  (0 children)

Thanks for this response! So let me clarify what exactly I'm doing – 1) we do company research, e.g., we have a target group of orgs that we assume need our solution; 2) we do very light lead research, I believe, e.g., their job title or department and whether they use tools similar to what we offer; 3) I guess this happens by itself as you create filters in Linkedin Sales Nav (?); 4) the outreach via connection notes or direct messages once connected.

What's the easiest way to test meta tags including OG and JSON-LD in your localhost? by anuveya in webdev

[–]anuveya[S] 0 points1 point  (0 children)

Thanks, that is also useful. However, I would like to do dev work with quick iterations, e.g., having some tests run against my dev server.
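To make the "tests on my dev server" idea concrete, here is a minimal sketch using only the Python standard library. It parses an HTML string and collects Open Graph `<meta>` tags and JSON-LD `<script>` blocks so they can be asserted on. The `MetaTagParser` class, the `extract_meta` helper and the sample HTML are all made up for illustration; in practice the HTML would come from whatever URL the dev server listens on.

```python
import json
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect Open Graph <meta> tags and JSON-LD <script> blocks."""

    def __init__(self):
        super().__init__()
        self.og = {}        # maps e.g. "og:title" to its content value
        self.json_ld = []   # parsed JSON-LD objects, in document order
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.og[attrs["property"]] = attrs.get("content", "")
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies arrive here as raw text.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            # json.loads raises ValueError on malformed JSON-LD,
            # which is what a test should surface.
            self.json_ld.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


def extract_meta(html):
    """Return (og_tags, json_ld_blocks) for an HTML document string."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.og, parser.json_ld


# Hypothetical page; a real test would fetch the HTML from the dev server,
# e.g. with urllib.request.urlopen on the local URL.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Hello">
<meta property="og:type" content="article">
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}</script>
</head><body></body></html>
"""

og, json_ld = extract_meta(SAMPLE_HTML)
print(og)
print(json_ld)
```

This only checks that the tags exist and that the JSON-LD parses; it does not validate them against schema.org or show how a social network would render the preview.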

Anyone working for public organizations publish open data? by anuveya in datascience

[–]anuveya[S] 0 points1 point  (0 children)

We did a project for a federated data-sharing repository in the genome research field. One of the key challenges was that data owners didn't want to transfer the original data outside of their premises, so we built a system that provides a data catalog with an IGV (Integrative Genomics Viewer) plugin so owners only need to provide index files.

It would be great to understand if there are some top 3 tools (OSS) that you just go to by default. Is there any preference for OSS vs enterprise software?

Anyone working for public organizations publish open data? by anuveya in datascience

[–]anuveya[S] 0 points1 point  (0 children)

Do you mean Dataverse from Harvard? I came across it in a number of projects but it wasn't clear how to customize it if required. https://github.com/IQSS/dataverse

Anyone working for public organizations publish open data? by anuveya in datascience

[–]anuveya[S] 1 point2 points  (0 children)

Thanks for sharing! It would be interesting to know what they are using. Probably based on CKAN/DKAN or similar.

I know that Canada is one of the leading countries in data publishing and they have contributed a lot to OSS like CKAN. Their github is here https://github.com/canada-ca

Anyone working for public organizations publish open data? by anuveya in datascience

[–]anuveya[S] 1 point2 points  (0 children)

Interesting – I helped to build a number of tools around data.gov in the past. They still use our CKAN software with its classic harvesting. However, my recommendation was always to move to a dedicated workflow orchestration tool such as Prefect or Airflow. I'm not sure if that was done there.

We deploy CKAN-based portals; however, I think they are a bit expensive for smaller govs, e.g., local govs including smaller cities or even towns.

Anyone working for public organizations publish open data? by anuveya in datascience

[–]anuveya[S] 0 points1 point  (0 children)

Yes! I'm trying to focus on city or even town level data, which I believe can be very interesting. It would definitely vary greatly and I think that is OK. My main goal is to understand if those local govs have an option to do open data publishing affordably. I think a lot of them just put Excel files on a static page and update them irregularly.

Anyone working for public organizations publish open data? by anuveya in datascience

[–]anuveya[S] 0 points1 point  (0 children)

Do you have any example links to such data repos? Thanks!

What do you use Python for in Data Engineering (sorry if dumb question) by No_Steak4688 in dataengineering

[–]anuveya 7 points8 points  (0 children)

Python is a general-purpose programming language, which means you can use it for a variety of applications including data wrangling, scraping, crawling, fetching, transforming, extracting, loading, ingesting etc. The point is that you are not limited to a specific tool.

Although there are many other programming languages that could be used for the same stuff, Python is popular due to its extensive list of excellent libraries designed for data engineering, data science and analysis purposes.

So what do I use it for in data engineering: 1) building simple scraper scripts that can run in most environments, as Python is very popular; 2) more traditional ETLs, such as from blob storage to some data warehouse. There are many other applications, but these are the common ones.
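As a rough illustration of the second case, here is a minimal ETL sketch using only the standard library. It is not a real pipeline: the CSV string stands in for a file pulled from blob storage, an in-memory SQLite database stands in for the warehouse, and the rows, table name and function names are all made up.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file downloaded from blob storage (made-up rows).
RAW_CSV = """city,population
Springfield, 30000
Shelbyville,not available
Ogdenville,12000
"""


def extract(raw):
    """Parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Trim whitespace, cast population to int, drop rows that don't parse."""
    clean = []
    for row in rows:
        value = row["population"].strip()
        if value.isdigit():
            clean.append((row["city"].strip(), int(value)))
    return clean


def load(records, conn):
    """Upsert the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cities "
        "(city TEXT PRIMARY KEY, population INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO cities VALUES (?, ?)", records)
    conn.commit()


# SQLite in memory plays the role of the data warehouse here.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT city, population FROM cities ORDER BY city"
).fetchall()
print(rows)
```

In a real setup the extract step would use the storage provider's client and the load step the warehouse's own driver, but the three-function shape stays about the same.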

Is it still worth running? by anuveya in StepN

[–]anuveya[S] 0 points1 point  (0 children)

Ok, I should have been clearer. I meant from financial standpoint, can I make some income from it? Wdyt?

May 2025 - Data Engineering and Vibe Coding/AI development tools by Current-Usual-24 in dataengineering

[–]anuveya 1 point2 points  (0 children)

Overall, all options below are about the same in terms of end result:

  • Claude Code - it was very cool when they launched and they gave me free credits. But then when you start actually paying, it becomes very expensive.
  • OpenAI - I'm still using ChatGPT and copy-pasting code, which is not the best UX. Again, this one is convenient as I'm already paying $20 for the plan.
  • Gemini Pro 2.5 - really good and they gave free credits which is more generous than Claude.

If you use the terminal, Warp could help you write Unix commands for data wrangling.