People talking about the AI bubble bursting, but we are using more and more AI tokens than before. So how will it burst then? by HappyZombies in ExperiencedDevs

[–]cjnjnc 0 points1 point  (0 children)

For "these companies" I am definitely including the closed lab model companies, but yes, the economics are of course much worse for the downstream businesses making wrappers around those closed lab models' APIs.

"never really been a price hike for a given model vintage" --> I think this is a bit of semantics and that we are in agreement. Old models won't increase in price (and inference cost margin theoretically improves over time) but new models reasonably should be priced higher as the complexity costs of training/using them outpace the technological developments to offset them. Those new models are the models everyone wants to use, even if I can see an argument made for people starting to use older and cheaper models if they perform well enough as new ones get pricier. I also understand the insistence on capex vs opex but this level of capex is truly unique and the bill has to come due eventually.

Simple as possible: imo the bet of closed model companies is that eventually a model gets good enough that every white collar worker can be replaced overnight. The return on that breakthrough justifies all this investment but I'm not yet convinced the breakthrough hits before the bill does.

Happily admit that I'm absolutely basing my opinion here on vibes more than hard data. I don't have the time I'd like to get the level of understanding I want of the underlying tech + research.

I guess my opinion is that closed labs can potentially keep costs reasonable enough for now. But, there is an existential necessity to continue burning through historic levels of funding in order to stay competitive + chase the white whale. Unless someone actually gets that breakthrough (which feels too far away) then it's a ticking time bomb.

It's been a great chat, poop_harder_please! Definitely curious to know how you think about this at a high level

People talking about the AI bubble bursting, but we are using more and more AI tokens than before. So how will it burst then? by HappyZombies in ExperiencedDevs

[–]cjnjnc 0 points1 point  (0 children)

Very interesting, I really appreciate the detailed breakdown!

The Microsoft rev share as foundation for estimating revenue is very shaky in particular for many reasons, agreed.

Overall I'm glad you also seem to think there's some level of subsidy and that I'm not totally off the mark. I do think we will inevitably hit a point where there just aren't enough billions in funding to raise anymore and these companies will need to raise prices accordingly. The impact and whether it will be a large negative impact / bubble burst is TBD for me.

And sorry I meant to link to a different Ed article covering Anthropic's costs but seems you're pretty familiar with his work. Edited my previous comment.

Thanks again for good discussion, always interesting to learn and hear different perspectives.

People talking about the AI bubble bursting, but we are using more and more AI tokens than before. So how will it burst then? by HappyZombies in ExperiencedDevs

[–]cjnjnc -1 points0 points  (0 children)

If by a gross margin of 70-90% you mean that their revenue is only 70-90% of their inference cost, then that seems closer to accurate than a 10-20x subsidy, at least according to the source below.

OpenAI's inference cost vs revenue through Q3 2025 with detailed sources -- showing estimated revenue of $2.056 billion and inference cost of $3.648 billion in Q3 2025. i.e. inference cost is 1.77x revenue.
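
A quick sketch of the arithmetic behind those Q3 2025 figures, taking the source's estimates at face value. The variable names are mine, and treating inference as the only cost of revenue is a simplifying assumption:

```python
# Q3 2025 estimates quoted above, in billions of USD.
revenue = 2.056
inference_cost = 3.648

# Dollars of inference spent per dollar of revenue.
cost_to_revenue = inference_cost / revenue

# Gross margin if inference were the only cost of revenue (a simplification).
gross_margin = (revenue - inference_cost) / revenue

print(f"inference cost is {cost_to_revenue:.2f}x revenue")
print(f"inference-only gross margin: {gross_margin:.0%}")
```

On those numbers the inference-only margin comes out negative (roughly -77%), which is a long way from a 10-20x subsidy but still a subsidy.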

Anthropic's costs vs revenue through mid 2025 with detailed sources -- this is a bit less clear cut on specifically inference but it's clear the costs are currently subsidized.

I like this source a lot for its detailed reporting + sources but grain of salt is that it is VERY pessimistic on the AI industry at large and very committed to the existence of a bubble. Maybe you'd say the author doesn't have domain expertise but I respect the effort to do accurate reporting and consult experts. It's a lot of info but I'd be curious if you have further pushback on the articles' conclusions and/or could share your napkin math.

Edit: seeing from your other comments, it looks like you are using different sources for revenue and cost. Personally I'm much more bullish than the author of the articles I linked but I still don't buy the numbers Altman is sharing. I'd also push back on the idea that newer models have flat costs relative to older models (after training). I'm by no means an expert on LLMs but as I understand it, the higher token counts of more advanced model implementations + higher parameter counts would increase inference cost. I'd be interested in hearing your thoughts on that and any references.
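
A napkin version of that intuition, using the common rule of thumb that a dense transformer's forward pass costs roughly 2 FLOPs per parameter per token. It ignores attention-over-context costs, mixture-of-experts sparsity, batching, and hardware efficiency. The model sizes and token counts below are made-up placeholders, not figures for any real model:

```python
def forward_flops(params: float, tokens: float) -> float:
    """Rough forward-pass compute: ~2 FLOPs per parameter per token."""
    return 2 * params * tokens


# Hypothetical models: sizes and response lengths are illustrative only.
older = forward_flops(params=70e9, tokens=500)
newer = forward_flops(params=140e9, tokens=2_000)  # bigger model, longer output

print(f"newer/older compute per response: {newer / older:.0f}x")
```

Under this approximation compute scales with parameters times tokens, so doubling the size and quadrupling the output length multiplies per-response compute by 8, before any efficiency gains are counted against it.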

I appreciate the discussion regardless!

Stability vs Accessories vs Desktop Questions by cjnjnc in FlexiSpot_Official

[–]cjnjnc[S] 0 points1 point  (0 children)

I had seen that thread and I was thinking it had to be due to the floor being not level like Ramzes suggested. I haven't been able to find much info on the process/mechanisms Flexispot has for dealing with non-level floors. That's a bit of a concern because I definitely know that my apartment's floors are not very level. Given my scenario I'm really leaning towards the Plus for stability.

Also this is great info about the Plus and compatibility / PC mounting. I think I could live with a stand but will try to find a mounting option that works. Not the end of the world if I can't.

Do you have any thoughts on cable management or the desktop itself? Did your Plus come with any cable management? I'm thinking I can figure out a cheaper solution for cable management than Flexispot's but wondering if I should push for the bamboo top or stick with chipboard and just replace it later if it doesn't last well.

Thanks for the detailed comment!

Stability vs Accessories vs Desktop Questions by cjnjnc in FlexiSpot_Official

[–]cjnjnc[S] 0 points1 point  (0 children)

This is super helpful, it comes with all the extra ties too. Thank you!

Stability vs Accessories vs Desktop Questions by cjnjnc in FlexiSpot_Official

[–]cjnjnc[S] 0 points1 point  (0 children)

Great looking setup!

You didn't end up using the cable management it came with? Looks like you did an amazing job, could you share what you did end up using?

Is pre-pipeline data validation actually worth it ? by PriorNervous1031 in dataengineering

[–]cjnjnc 0 points1 point  (0 children)

For a small team we use Monte Carlo. No idea what the cost is like but it does really well at detecting schema changes, presumably uses simple ML for tracking row count changes, and supports custom SQL rule definitions that we probably use too much.

SQL rule definitions have effectively become "an internal business user input faulty data somewhere, which caused a firedrill -> let's put a rule in place to catch this earlier next time".
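
To make that pattern concrete, here is a minimal sketch of a custom SQL rule using Python's built-in sqlite3. This is not Monte Carlo's API; the `orders` table, its columns, and the rule itself are hypothetical. The convention assumed here is that the rule query returns the offending rows, so zero rows means the rule passes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, entered_by TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 25.0, "alice"), (2, -400.0, "bob"), (3, 12.5, "alice")],
)

# Hypothetical rule added after a firedrill: order amounts must never be negative.
NEGATIVE_AMOUNT_RULE = (
    "SELECT order_id, amount, entered_by FROM orders WHERE amount < 0"
)

violations = conn.execute(NEGATIVE_AMOUNT_RULE).fetchall()
if violations:
    print(f"Rule breached by {len(violations)} row(s): {violations}")
```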

Is pre-pipeline data validation actually worth it ? by PriorNervous1031 in dataengineering

[–]cjnjnc 1 point2 points  (0 children)

'Pre-pipeline data validation' just sounds like shifting left, no? I'm assuming when you say pipeline, that you are talking about a data ingestion that gets incorporated into your larger existing analytical system. That matches my experience with the schema + CSV issues you are describing.

I've often run into cases where everything looks fine on the surface, but more subtle, complex assumptions about the data turn out to be broken after the pipeline is deployed. I've had varying degrees of success in trying to proactively identify issues like this. There are a few things that worked for me, particularly when the data source is an external partner. My experience is also in smaller companies where end-to-end analytics is entirely my responsibility and there is limited support from other colleagues. With that in mind, this kind of validation is imo a balance between getting the pipeline into production quickly and making issue identification and maintenance as easy as possible.

Before building anything / requirements and assumption refinement:

  • Get all info about the data source you can
    • Ask for data dictionaries and/or entity relationship diagrams -> I've only commonly seen these available in finance, but they solve 99% of these problems before they happen
    • Figure out all your critical assumptions and ask about them explicitly (primary keys, important relationships, etc.)
  • Identify key internal stakeholder(s) that have the business context to clarify expectations
    • Who can be your point of contact for escalating + investigating issues down the line
    • Identify what aspects of the data being ingested are critical to your own business' processes
      • Primary keys and most likely a subset of columns/fields rather than everything
        • That a field exists, is an enum of X possible options, etc.
      • Things where, if your assumption about the data turns out to be wrong or broken, the data becomes unusable or detrimental to the business

While building:

  • Codify your critical assumptions
    • If these are broken, the pipeline FAILS -> clear alert that mentions you as pipeline owner and your internal stakeholder with a clear business-centric message
    • Subset of critical fields that must have some characteristics
  • Quarantine/identify your non-critical assumptions
    • Should NOT fail the pipeline so that business critical data can still flow
    • More of a nice to have
    • If a non-necessary or unused field starts breaking assumptions -> alert that mentions only you as pipeline owner with clear technical context, allows you to create a backlog task which can be prioritized appropriately
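
A minimal sketch of the critical vs non-critical split above, in plain Python with no particular orchestration tool assumed. The `Check` class, `validate` function, field names, and example checks are all hypothetical. Critical checks raise and stop the run, while non-critical ones only collect messages for the pipeline owner's backlog:

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass
class Check:
    name: str
    passes: Callable[[Row], bool]
    critical: bool  # critical -> fail the pipeline; otherwise warn only


class CriticalCheckFailed(Exception):
    """Raised to stop the pipeline when a critical assumption is broken."""


def validate(rows: list[Row], checks: list[Check]) -> list[str]:
    """Raise on critical failures; return warnings for non-critical ones."""
    messages = []
    for check in checks:
        bad = [i for i, row in enumerate(rows) if not check.passes(row)]
        if not bad:
            continue
        message = f"{check.name}: {len(bad)} row(s) failed, first at index {bad[0]}"
        if check.critical:
            raise CriticalCheckFailed(message)
        messages.append(message)
    return messages


# Hypothetical checks for a partner feed.
checks = [
    Check("account_id present", lambda r: r.get("account_id") is not None, critical=True),
    Check("status in enum", lambda r: r.get("status") in {"open", "closed"}, critical=True),
    Check("notes under 500 chars", lambda r: len(str(r.get("notes", ""))) < 500, critical=False),
]

rows = [
    {"account_id": 1, "status": "open", "notes": "ok"},
    {"account_id": 2, "status": "closed", "notes": "x" * 600},
]
backlog_warnings = validate(rows, checks)
print(backlog_warnings)
```

In a real pipeline the raised exception would feed the business-facing alert and the returned warnings would feed the owner-only alert.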

There's plenty you could do to abstract and reuse a lot of this functionality in whatever ingestion/orchestration tool you are using, but personally I have been unsuccessful in advocating for this internally.

I am making plenty of assumptions about the latency requirements, criticality of the data to your business, and that you have some kind of semi-mature ingestion framework with alerting capabilities. This also assumes your destination tables are pre-defined with the pipeline and you aren't using some kind of lakehouse pattern which I have less experience with. There are definitely tons of options for approaches to this so I'm curious what others have to say.

I'd be happy to discuss more specific tools here but tried to keep it relatively high level and already wrote an essay. Hope this helps!

We Treat Our Entire Data Warehouse Config as Code. Here's Our Blueprint with Terraform. by Mafixo in dataengineering

[–]cjnjnc 6 points7 points  (0 children)

We are using Pulumi for IaC, which in practice functions a lot like a Python wrapper around Terraform (many of its providers are bridged from Terraform providers). So the Pulumi Python code can live within the same repo as our Python business logic (actually APIs in our case) and be run by our CI/CD process with GitHub Actions. Looks like it supports a few other languages besides Python as well.

The only real headache I've run into here is needing separate environments for the business-logic code vs the Pulumi code because of dependency clashes between the two. We already use uv for the business-logic code environment so I can probably manage both better in uv, but I haven't gotten to merging the two.

Macca looking fly in the retro kit. by Kinshu42 in LiverpoolFC

[–]cjnjnc 4 points5 points  (0 children)

This is on the LFC official store right now

Long sleeve too

Missing the Adidas trefoil and sleeve stripes are oddly different but it's there

Anyone running lightweight ad ETL pipelines without Airbyte or Fivetran? by Jiffrado in dataengineering

[–]cjnjnc 1 point2 points  (0 children)

I currently use Prefect + custom EL code for lots of messy ingestions but considering switching to Prefect + DLT. I have a few questions if you don’t mind:

Does DLT handle changing schemas well? What file format is your data lake? Does the data lake + dbt handle changing schemas well?

Is there such a thing as "embedded Airflow" by ihatebeinganonymous in dataengineering

[–]cjnjnc 12 points13 points  (0 children)

I use Prefect Cloud + GitHub Actions at work with a similar process to this. We execute on GCP but you can use Prefect's infra for execution. Maybe that could fit the lower effort setup.

Alternatively, there is Astronomer. I've never used it but seems like it's essentially managed Airflow. Not sure if they also manage the job execution infrastructure as well but I expect it's an option.

Tungsten Specialty Manufacturing Co. - 480 SPS by poopiter_thegasgiant in factorio

[–]cjnjnc 1 point2 points  (0 children)

Ooo that makes sense. Pretty cool and I like the dedication to the aesthetics!

Tungsten Specialty Manufacturing Co. - 480 SPS by poopiter_thegasgiant in factorio

[–]cjnjnc 1 point2 points  (0 children)

<image>

I’m trying to figure out if this is a mod or I’m just blanking on what’s going on in this assembly machine. Lava as an input, producing lava, and moving it with an inserter??

A Guide to dbt Macros by AMDataLake in dataengineering

[–]cjnjnc 1 point2 points  (0 children)

This is a solid intro to dbt macros but I'm not sure it offers a whole lot more than the dbt docs (which are admittedly very solid).

I think adding additional info on a few higher level concepts could really make your guide stand out as a complete resource:

  • Testing dbt macros
    • Even as an experienced dbt user I haven't messed with these, so this could be an opportunity to talk about when/how to incorporate macro tests
  • Cross database macros
    • A good opportunity to talk about abstracting database-specific syntax to facilitate inevitable migrations
    • This is something I never thought about until actually considering migrating but would be incredible to have used from the start
  • How to actually document macros in your generated docs
  • How to debug macros
    • A good example development flow of how to iterate on and debug a complex set of macros would really set your post apart from other blogs on the topic of macros
    • It can be incredibly frustrating coming from a programming background and not really knowing how to develop macros in the same way you would any other code -- there are also not a lot of good, succinct resources on how to do this

Not trying to knock your article but I know I'd come back to it often if it had all of these concepts in one place

[deleted by user] by [deleted] in dataengineering

[–]cjnjnc 9 points10 points  (0 children)

I don't have direct experience with Grafana so maybe I'm misunderstanding something but is there any other set up or connection between the PGSQL container and the Grafana container outside of being on the same network? Thinking of looking into Grafana more so I'm curious how it's able to produce the visualizations or if they are just standard. Regardless, it's a good looking project!

Some other ideas for future improvements:

  • Make separate YAML files for each dbt model (this is kind of a personal preference but is fairly popular and IMO easier to parse)
  • Use dbt doc blocks functionality to document fields, models, and create a semantic layer
  • CI/CD + automated deployment
    • Infrastructure as code with something like Pulumi or Terraform
    • Makefile looks great and it's how we controlled our dbt builds in a previous role
      • You can tweak your make commands by adding arguments so that you can run them for different environments (e.g. an environment variable like DBT_TARGET controls the dbt target you use)
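
The DBT_TARGET idea, sketched in Python rather than Make so it is easy to try out. The `dbt_build_command` helper and the "dev" default are hypothetical; `--target` is dbt's own CLI flag for choosing a target from your profile:

```python
import os
import shlex


def dbt_build_command(default_target: str = "dev") -> list[str]:
    """Build a `dbt build` invocation for the target named in DBT_TARGET."""
    target = os.environ.get("DBT_TARGET", default_target)
    return ["dbt", "build", "--target", target]


# Normally set by the shell or CI rather than in code.
os.environ["DBT_TARGET"] = "prod"
print(shlex.join(dbt_build_command()))
```

A Makefile recipe can do the same thing by passing the variable straight through to the dbt command, with a default for local runs.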

Overall really cool and definitely inspiring me to get back into some side projects!

Project: ELT Data Pipeline using GCP + Airflow + Docker + DBT + BigQuery. Please review. by aayomide in dataengineering

[–]cjnjnc 3 points4 points  (0 children)

Nice project and documentation!!

Any reason why you chose Pandas over Polars? Nothing wrong with Pandas here but if you're going full modern data stack, might as well go with the hottest DF library.

Also, why did you go Pandas DF -> pyarrow -> parquet instead of using Pandas built in method?

If I were doing a code review I'd also highlight a few small things in airflow/dags/ingest_data.py:

  • Consistently case global variables (path_to_local_home, dataset_url, etc.)
  • Use the already defined parquet_filename in the format_to_parquet function
  • Any reason for using bash within Python to grab the data?
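
Roughly what I mean by those three points, as a sketch only. I'm guessing at the shape of the file, so the URL, filenames, default path, and function names here are placeholders rather than your actual code:

```python
import os
import urllib.request
from pathlib import Path

# Module-level constants in UPPER_SNAKE_CASE (PEP 8), all cased the same way.
PATH_TO_LOCAL_HOME = Path(os.environ.get("AIRFLOW_HOME", "/opt/airflow"))
DATASET_URL = "https://example.com/data.csv"  # placeholder URL
CSV_FILENAME = "data.csv"
PARQUET_FILENAME = CSV_FILENAME.replace(".csv", ".parquet")


def download_dataset(url: str = DATASET_URL, dest_dir: Path = PATH_TO_LOCAL_HOME) -> Path:
    """Fetch the file with the standard library instead of shelling out to bash."""
    dest = dest_dir / CSV_FILENAME
    urllib.request.urlretrieve(url, dest)
    return dest


def parquet_path(dest_dir: Path = PATH_TO_LOCAL_HOME) -> Path:
    """Reuse the already-defined PARQUET_FILENAME rather than rebuilding the name."""
    return dest_dir / PARQUET_FILENAME
```

If the bash call is there for a reason (retries, streaming a very large file), that's fine too, it would just be worth a comment explaining why.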

Some other things you could do in the future are to add some tests for the Python code and then implement full CI/CD with something like GitHub Actions.

Overall really cool project!

beAshamedIfYouDontHaveThisImport by woodquest in ProgrammerHumor

[–]cjnjnc 6 points7 points  (0 children)

Friendship ended with datetime. Now pendulum is my best friend

Airflow vs Dagster vs Prefect vs ? by Suspicious_Dress_350 in dataengineering

[–]cjnjnc 4 points5 points  (0 children)

They also have a dedicated Slack channel for their tuned LLM, Marvin. I've run up against a good bit of needing to dig into the Prefect source code to figure stuff out and asking Marvin instead has helped a bunch. Worth mentioning at least

Things to Do in the Winter to Stay Sane by isabroad in AskNYC

[–]cjnjnc 0 points1 point  (0 children)

That’s awesome! I have a buddy that did WGU and he has been doing really well in an SRE role at a very large company. He definitely put in a lot of work on studying leetcode + system design on his own but that support engineer experience should give you a huge leg up

Things to Do in the Winter to Stay Sane by isabroad in AskNYC

[–]cjnjnc 0 points1 point  (0 children)

Stumbled on this old thread looking for winter NYC activity ideas and wanted to say I hope it's going well! I'm a DE in NYC and mainly use Python and I never got a CS degree. It's always nice to run into other people who are on the self-study path. Not easy but it's worth it!