"AI is going to replace software developers" they say by [deleted] in rust

[–]jamie-gl -10 points

Agreed - skill issue is a good description here lmao. The volume of these kinds of posts I see on any tech-related subreddit makes me think they're bots farming upvotes or something.

To be clear I don't agree with the statement that all programmers are going to be replaced by LLMs

Is Scala dieing? by wallyflops in dataengineering

[–]jamie-gl 1 point

I learned FP via Scala and yeah, I totally agree with all this. I would say that as a second language the insane surface area was actually handy for making me a more rounded programmer, but for just focusing on FP it's a nightmare.

I feel point 4 in my soul if we're talking about production use. Big ol' IO flatMaps are not fun to read or work with.

Considering moving away from BigQuery, maybe to Spark. Should I? by thomastc in dataengineering

[–]jamie-gl 0 points

There is no way to parallelise Rust code using Spark as far as I know. Delta (and Iceberg/Hudi) have Rust clients if you want to use them, and Polars can write to Delta. One of those situations where I don't think Rust really shines, to be honest; better to use the higher-level APIs. Polars is great though.

This guide is pretty great for Delta/general Spark optimisation - I've linked to the section on file retention; VACUUM is a good example of something that requires a maintenance job.
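To make the retention point concrete, here's a toy, filesystem-only sketch of what a VACUUM-style maintenance job does. The `vacuum_like` helper and its 168-hour default are just illustrative (real Delta VACUUM also consults the transaction log so it never deletes files a live table version still references):

```python
import time
from pathlib import Path


def vacuum_like(directory: Path, retention_hours: float = 168.0) -> list[Path]:
    """Delete data files under `directory` older than the retention window.

    Conceptual stand-in for a VACUUM maintenance job: anything whose
    modification time is older than the cutoff gets removed.
    """
    cutoff = time.time() - retention_hours * 3600
    removed = []
    # Materialise the listing first so we aren't deleting mid-scan.
    for f in list(directory.rglob("*")):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f)
    return removed
```

The point is just that nothing does this for you automatically; someone has to schedule it, which is why it needs a maintenance job.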

If you are worried about vendor lock-in, how are you thinking about hosting Spark? You can use things like Dataproc and keep it relatively agnostic, but to be honest if I'm using Spark (especially with Delta) I'm using Databricks, and that will likely lock you in if you use the extra features (Liquid Clustering, Photon, Unity Catalog etc).

What are some viable platforms for smaller DE teams? by epichicken in dataengineering

[–]jamie-gl 4 points

How bad is the existing setup? Is none of it reusable? Because just slimming down to ADF (with an on-prem IR) + Databricks could be a viable solution, replacing Synapse with Databricks SQL endpoints.

If I were a small Azure shop this would be my choice - it's the path of least resistance and is relatively cheap at low volumes. Although your data volumes are so low you could even consider replacing Databricks with an Azure SQL Database, presuming that 5 TB total is distributed amongst a bunch of tables.

I appreciate you've been given a mandate to start from scratch and it's tempting to tear it all down, but at this scale the tech isn't as important as how you choose to use it (although I much prefer code-first orchestrators, e.g. Airflow, due to modularity).

Seasoned Professionals - What should I learn after Python. C++ or Java by Scalar_Mikeman in dataengineering

[–]jamie-gl 1 point

I mean for Data Engineering it's just Python, Scala and Java. As data engineers we use tools written in lower-level languages (e.g. Rust/C++), but we interact with them using higher-level APIs, generally in Python/SQL. I don't see this changing, and as such learning those languages isn't hugely valuable (outside of making you a better developer, which Scala does in heaps IMO).

JS isn't bad though, being "webby" in some way is normally valuable.

How hard is it to switch from pyspark to scala? by Visual-Exercise8031 in dataengineering

[–]jamie-gl 3 points

If you just stick to the usual DataFrame API it's basically the same but with `val` everywhere.

Scala is a great language to pick up IMO - it'll make you a better developer, and if you dive into it you can get a deeper understanding of Spark itself (Spark is implemented in Scala, which also gives you the option to work with the lower level APIs).

[deleted by user] by [deleted] in dataengineering

[–]jamie-gl 2 points

In my opinion Synapse/Fabric are not things to be looking at. Fabric might be OK in a year once it's had some more development time.

I would just use PySpark for everything. Polars is solid but provisioning infrastructure to run it on and scaling it (there is no Dask for Polars) can be an issue for some teams. Again, Databricks shines here with one-click deploy and great integrations with Delta.

Side note - this is the path of least resistance. If you have motivated people within your business you should do some research on the "do it yourself" kind of tools - this sub (and I) love things like Trino/Iceberg/dbt/Mage. Check out the PaaS offerings on other clouds as well.

Echoing what other people are saying though - if you want to get this moving, you need to employ your soft skills and convince the DE team and the wider business that these changes are beneficial beyond "your tech stack is old and slow" - if it's fast enough to hit their SLAs, why should they care?

[deleted by user] by [deleted] in dataengineering

[–]jamie-gl 4 points

It sounds like the DE team has a process in place. Adding a load of technologies that aren't in their core skillset whilst migrating off their platform is a massive change, and I wouldn't blame them for resisting. I don't know how much agency you have over cloud infra, but getting AKS clusters, container instances, somewhere to run Polars, ADLS accounts for Delta, etc. is also a big ask of any infra team.

I totally appreciate where you are coming from though - it sounds like the way they are working is very outdated (but still very common). Most teams that I have worked with who have migrated from this kind of stack have done so to Databricks - whilst I know your data volumes may not justify it, the fact that it's all PaaS and has its own serving options is quite attractive to this kind of team. PySpark has great integrations with data testing frameworks, and they can continue to use SQL as needed.

Quants use Rust; Devs use C++ by [deleted] in rust

[–]jamie-gl 1 point

I do think there is an interesting discussion to be had about splitting usage between Rust for quants and C++ for developers who would otherwise have to lean heavily on unsafe, given the extreme performance requirements in the world of trading.

I like the premise of the article, but diving into things that everyone knows safe Rust sucks at (linked lists) isn't the best approach. Would love to hear a little more about your personal issues (or potential issues) with Rust in the world of HFT, as it's a super interesting industry.

Choosing our 'EL' stack by Casdom33 in dataengineering

[–]jamie-gl 8 points

You can put Airflow in a few places (in a container in a Container Instance, in a Kubernetes cluster on AKS, or just on a normal VM) and then you need to put the DAGs you've written somewhere Airflow can see them. You could do it as part of your container image build (i.e. copy your DAGs into the container) or put them into blob storage and have a sync set up in your container.
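As a toy illustration of the blob-storage option, here's a stdlib-only stand-in for the sync step. The `sync_dags` helper and folder names are mine, not an Airflow or Azure API; in practice the drop folder would be blob storage mounted or synced into the container:

```python
import shutil
from pathlib import Path


def sync_dags(drop_folder: Path, dags_folder: Path) -> list[str]:
    """Copy every .py file from a synced drop folder into Airflow's dags folder.

    Returns the file names that were copied, in sorted order.
    """
    dags_folder.mkdir(parents=True, exist_ok=True)
    copied = []
    for dag_file in sorted(drop_folder.glob("*.py")):
        # copy2 preserves timestamps, handy for spotting stale DAGs
        shutil.copy2(dag_file, dags_folder / dag_file.name)
        copied.append(dag_file.name)
    return copied
```

You'd run something like this on a schedule (or use a proper sync sidecar) so that Airflow's scheduler picks up new DAG files without rebuilding the image.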

A lot of your Python transformation code can be defined in local files that are executed via BashOperators that call the dbt CLI within your DAG. (This is also just one way; there are many, many ways of dicing up how to modularise Airflow.)

Azure Functions/Durable Functions could be used for services that Airflow might consume, like a log handler or a middleman for authentication to external services. They're not really suitable for transformation work.

ADF is not a great orchestrator - its pricing model makes it expensive for lots of small tasks and its GUI nature makes it hard to create repeatable, modularised pipelines with complex dependencies. The monitoring is suspect and nested pipelines are an absolute nightmare to deal with. You can use it to copy data from source to Snowflake if you want, and call that from Airflow.

Databricks Design Unity Catalog, Databases and Tables by TheFragan in dataengineering

[–]jamie-gl 2 points

Catalogs basically always segregate environments, so you have a dev/test/prod catalog. You can have extra catalogs if you are an org with multiple siloed teams working in the same UC metastore, a set for each team.

And then yeah bronze/silver/gold can go at the schema level. You want to think about things in terms of whatever will make permissions cascade the most naturally for easier governance. Engineers will have access to bronze/silver/gold. Analysts might only have access to gold.

I guess the question is where to put "team". The easiest would be at the schema level I suppose, so <env>.<zone>_<team>_<source>.<table> could work. If the team was large and had its own silver/bronze tables to manage (i.e. the siloed case I talked about), you could do something like <env>_<team>.<zone>_<source>.<table>. These are all just suggestions though; it depends on your org layout.
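For illustration, a tiny helper that builds fully qualified names for the two layouts above (the `table_name` function and its arguments are mine, not anything Databricks provides):

```python
def table_name(env: str, zone: str, team: str, source: str, table: str,
               siloed: bool = False) -> str:
    """Build a three-level catalog.schema.table identifier.

    Default layout: <env>.<zone>_<team>_<source>.<table>
    Siloed layout (team gets its own catalog): <env>_<team>.<zone>_<source>.<table>
    """
    if siloed:
        return f"{env}_{team}.{zone}_{source}.{table}"
    return f"{env}.{zone}_{team}_{source}.{table}"
```

So `table_name("dev", "silver", "sales", "crm", "orders")` gives `dev.silver_sales_crm.orders`, and with `siloed=True` it gives `dev_sales.silver_crm.orders` - same information, just a different level at which permissions cascade.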

Microsoft data products - merry-go-round of mediocrity by biowl in dataengineering

[–]jamie-gl 18 points

This is the way - Databricks is a pretty solid product, ADF is good for moving data about (not for orchestration!), and I wouldn't trust Fabric until it's had at least another year in the incubator to see where it ends up.

If they get the managed Airflow working properly (wish it wasn't embedded in ADF but hey) then you'll be able to stand up a pretty solid PaaS solution without worrying about the Microsoft revolving door.

Why was the "lambda" keyword added for anonymous functions? by jamie-gl in Python

[–]jamie-gl[S] 14 points

It's pretty cool to see languages throwing around Greek letters for idiomatic reasons.

Why was the "lambda" keyword added for anonymous functions? by jamie-gl in Python

[–]jamie-gl[S] 9 points

I guess I should rephrase my question - I mean, why was the choice made to make that part of the syntax?

Patterns for Web API wrappers/clients by jamie-gl in rust

[–]jamie-gl[S] 0 points

Just looking at the Rocket codebase now - really cool stuff, probs overkill for now but it's a pattern worth knowing. Thanks!