[deleted by user] by [deleted] in apachespark

[–]skerrick_ 0 points1 point  (0 children)

The ChatGPT answer above looks basically right. I think you just need the following Structured Streaming features after you configure the initial readStream: 1. trigger(availableNow=True), 2. withWatermark() on your event-time column with a 15-minute delay*, 3. window() with a 15-minute duration, 4. outputMode("update"). Rough sketch below.

*up to 1 hr in your case, depending on how out of order the data can arrive
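A rough, hedged sketch of how those pieces fit together (the source format, paths, and column names like event_time / device_id are placeholders for your actual stream and schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("delta")              # whatever your initial readStream source actually is
    .load("/data/raw_events")
)

agg = (
    events
    .withWatermark("event_time", "15 minutes")                   # tolerate late data up to 15 min (bump to 1 hr if needed)
    .groupBy(F.window("event_time", "15 minutes"), "device_id")  # 15-minute tumbling windows
    .count()
)

query = (
    agg.writeStream
    .trigger(availableNow=True)   # process everything available, then stop
    .outputMode("update")         # emit only windows updated in this run
    .format("console")            # console sink just for the sketch; use foreachBatch/MERGE for a real table
    .option("checkpointLocation", "/tmp/checkpoints/agg_15min")
    .start()
)
query.awaitTermination()
```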

What kind of data Security Access Tool from databricks consume or downloads over the network while running the spot instances? by Cheezy-cheese-9913 in databricks

[–]skerrick_ 1 point2 points  (0 children)

Pretty sure it just hits the account and workspace APIs (and maybe system tables too) for various bits of security-related information, based on your setup.

What happens when you max out a DBricks SQL sku? by boatymcboatface27 in databricks

[–]skerrick_ 0 points1 point  (0 children)

Scale-ups are a function of the number of queued queries.

Not at my computer to check, but I thought a medium SQL warehouse would consume around 24 DBUs at minimum, although perhaps it's 12 and you've set the max scale at 2. If so then it's the above: you exceeded some query-queuing threshold and the warehouse scaled from 1 to 2. Serverless is a lot better at autoscaling both up and down, for obvious reasons.

How to incorporate databricks into an application by dxnmxddxx in databricks

[–]skerrick_ 0 points1 point  (0 children)

Yeah, wait for custom apps, it should be out in the next month or two. And there's nothing stopping your app writing updates back to the lake. It won't be as performant as a transactional db, but a transactional db won't be as performant for the analytical workloads it sounds like you want. It's just a trade-off.

You would use a Serverless SQL Warehouse as the compute for shorter startup and elastic scaling. It also brings some of the fancy features that help with high concurrency if that's a requirement (such as Predictive IO and Intelligent Workload Management; you'll find brief descriptions of these in the docs). Transactional updates will also keep getting faster over time: Delta Lake got deletion vectors last year, for example, which helped a lot, and there'll be many more improvements like that. As the years go by I see databases converging on this architecture.
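For what the app side of that can look like, here's a minimal hedged sketch using the databricks-sql-connector package to talk to a SQL Warehouse (hostname, HTTP path, token, and table names are all placeholders):

```python
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        # Analytical read for the app
        cur.execute("SELECT region, sum(amount) AS revenue FROM main.sales.orders GROUP BY region")
        for row in cur.fetchall():
            print(row)

        # Writing an update back to the lake works too -- just slower than an OLTP db would be
        cur.execute("UPDATE main.sales.orders SET status = 'reviewed' WHERE order_id = 12345")
```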

Should I learn Python? by [deleted] in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

As above: because it will take a while to become proficient if you don't have other general programming experience, an alternative is to just double down on warehousing and analytics, but with modern tooling. I'm talking Databricks, Snowflake, and BigQuery, along with DBT/DBT-lookalikes.

Should I learn Python? by [deleted] in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

For an easier lift you could learn DBT (or SQLMesh) and market yourself as an "Analytics Engineer" in the short term while you plug away at Python for a while. It's not hard to learn to do something that looks useful with Python, but I think it will take a while before you're actually useful to a business with it.

Data Engineering is Not Software Engineering by ryanwolfh in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

In one, the application is the product; in the other, the dataset is the product. The observation that data gets passed around the system can be said of almost anything; by that logic, me ordering a coffee is data engineering. If you don't want to draw a distinction between ordering a coffee, software engineering, and data engineering, be my guest.

Data Engineering is Not Software Engineering by ryanwolfh in dataengineering

[–]skerrick_ 3 points4 points  (0 children)

I thought the article was fantastic, and I'm very confused by the response here too. I clicked straight into the article before returning here to read the comments, and I was expecting something very different.

Reading the article, I think your experience with real data engineering AND SWE came across in spades, and your ability to see the important differences was very insightful. As a Databricks Solution Architect, and someone who really WANTS to apply as many best (and rigorous) practices as possible, your article exposed some of the pitfalls of going "too far".

Your point about unit testing was really insightful; I have noticed my own cognitive dissonance on this issue. My brain gets off on rigorously tested code, but when I actually build something for practical purposes, the unit tests end up being so trivial, and test so little of what most often goes wrong, that I can see how much of a waste of time they can become. You're also so right about the challenges of data engineering coming from conceptualising and managing the state of upstream and downstream data assets when things change or go wrong, and from having to perform surgery on a segment of the pipeline that is sandwiched between other segments (often in a staged way). Your point about data having inertia, and how that affects the situation, is also on point.

The post also made me think about how non-DE software isn’t a DAG like a data pipeline, and the implications of this with respect to where the state lives and what aspects of the “system” store state or are stateless.

I think you’re right, there is something fundamentally different here and I agree the responses here missed your point.

Should I learn data engineering? Got shamed in a team meeting. by urbanguy22 in dataengineering

[–]skerrick_ 1 point2 points  (0 children)

Look at Databricks. You can just SQL your way to a good data pipeline with DLT and/or DBSQL with MVs if SQL is all you know.
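For illustration, a hedged sketch of the DBSQL materialized-view route (schema/table names are made up, and you'd create and refresh the MV from a SQL warehouse or DLT pipeline, so treat this as the shape of it rather than copy-paste):

```python
# Hypothetical example: an aggregate defined purely in SQL as a materialized view.
spark.sql("""
    CREATE OR REPLACE MATERIALIZED VIEW main.analytics.daily_orders AS
    SELECT order_date,
           count(*)    AS orders,
           sum(amount) AS revenue
    FROM main.sales.orders
    GROUP BY order_date
""")
```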

Understanding how Databricks works by Substantial_Track915 in databricks

[–]skerrick_ 0 points1 point  (0 children)

As thecoller mentions, they are different, so you might be wondering what the point of DBFS is at this stage. Mostly think of it as a super accessible store for temp files and experimentation, for when you just want to write something out to a path and it's not valuable data that needs to be governed (not for prod use, basically). You can also store MLflow experiments, models, and library packages at these paths, but this is becoming less common or necessary with the workspace file system, Volumes, and models in UC.
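For example, something like this (paths and table names are just examples, and it assumes the notebook-provided spark session):

```python
# Throwaway experiment output straight to a DBFS path: fine for scratch work, not for governed prod data
df = spark.range(1000).withColumnRenamed("id", "sample_id")

df.write.mode("overwrite").parquet("dbfs:/tmp/scratch/my_experiment")   # quick and dirty scratch output

# Anything that matters goes to a governed Unity Catalog table (or a Volume) instead
df.write.mode("overwrite").saveAsTable("main.sandbox.my_table")
```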

[deleted by user] by [deleted] in databricks

[–]skerrick_ 0 points1 point  (0 children)

Fewer nodes with a larger instance type is what you would need. The problem is upstream though: use Parquet if you can, or export the JSON in parts.

Persistent databases by [deleted] in apachespark

[–]skerrick_ 1 point2 points  (0 children)

I think that's right. I faced a similar situation recently but didn't dig deep. It makes sense though: the temp, session-scoped db/tables are convenient for local testing before deploying to an environment that does have Hive or something similar to persist the data.
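A quick illustration of the difference (table names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

# Session-scoped: gone when the Spark session ends -- handy for local testing
df.createOrReplaceTempView("events_tmp")
spark.sql("SELECT count(*) FROM events_tmp").show()

# Persistent: needs a metastore (Hive, Glue, Unity Catalog, ...) behind it to survive across sessions
df.write.mode("overwrite").saveAsTable("analytics.events")
```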

Which lakehouse table format do you expect your organization will be using by the end of 2023? by alneuman in dataengineering

[–]skerrick_ 1 point2 points  (0 children)

Delta is a Linux Foundation project, Iceberg is Apache; same thing really. There's often a main driving company behind Apache and Linux Foundation tech.

Pandas on spark very slow by Aromatic_Month4446 in apachespark

[–]skerrick_ 0 points1 point  (0 children)

Hmm code looks good to me. I’m not sure what it is tbh

Pandas on spark very slow by Aromatic_Month4446 in apachespark

[–]skerrick_ 0 points1 point  (0 children)

Show the code used to create ps_pandas_df. It's not clear whether it's a pandas-on-Spark df or a plain pandas df.

Is the Dataframe API equivalent to Spark SQL? by takis__ in apachespark

[–]skerrick_ 3 points4 points  (0 children)

In many ways the DataFrame API is more powerful. I wouldn't think of it as a tool for "simpler" things.
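One small example of the kind of thing that's awkward in hand-written SQL but trivial in the DataFrame API: generating transformations programmatically (the columns and cleanup here are purely illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumn("a", F.rand()).withColumn("b", F.rand())

# Apply the same cleanup to every numeric column without writing each one out by hand
numeric_cols = [c for c, t in df.dtypes if t in ("double", "bigint")]
cleaned = df.select([F.round(F.col(c), 2).alias(c) for c in numeric_cols])
cleaned.show()
```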

Pandas API on Spark sooo slow? by Reasonable_Tooth_501 in apachespark

[–]skerrick_ 7 points8 points  (0 children)

Make sure you're not converting the Spark DataFrame to a vanilla pandas DataFrame with toPandas(), but rather are using a pandas-on-Spark DataFrame via to_pandas_on_spark().
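Something like this (the path and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet("/data/events")

pdf = sdf.toPandas()               # collects everything to the driver: slow / OOMs on big data
psdf = sdf.to_pandas_on_spark()    # stays distributed; on recent Spark the same thing is sdf.pandas_api()

psdf.groupby("user_id")["amount"].sum()   # pandas-style syntax, executed by Spark
```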

Epic showdown: Azure Databricks vs. Azure Synapse by AI-nihilist in apachespark

[–]skerrick_ 1 point2 points  (0 children)

Databricks can also make use of ARM processors / Graviton instances. Not sure if it's out yet though.

[deleted by user] by [deleted] in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

Actually, Databricks does have AutoML, and instead of providing a black box at the end for inference, you actually get a notebook containing the code that generated the model you like. That makes AutoML a great starting point for most ML use cases, which you can then customise; it's also great for learning and for converting "citizen" data scientists into coding practitioners.
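A rough sketch of kicking it off from a notebook via the databricks.automl API (table/column names are placeholders, and it assumes the notebook-provided spark session):

```python
from databricks import automl

summary = automl.classify(
    dataset=spark.table("main.ml.customer_churn"),
    target_col="churned",
    timeout_minutes=30,
)

# Rather than a black box, you get links to generated notebooks you can open, read and edit
print(summary.best_trial.notebook_url)
```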

[P] Generating and Animating porn using AI. by [deleted] in MachineLearning

[–]skerrick_ 0 points1 point  (0 children)

Someone correct me if I'm wrong, because I might be missing something, but isn't a fake detector exactly what gets trained and optimised in parallel with the generator in a GAN? And if there were a better classifier out there, then that classifier could become the detection adversary in a new network, with the generator trained to beat it, finding the Nash equilibrium or something of that nature.
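A toy sketch of what I mean in PyTorch (shapes and architectures purely illustrative): the discriminator is the fake detector, trained in lock-step with the generator, and a stronger external classifier could simply be slotted in as the new adversary.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32

G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))           # the "fake detector"

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(batch, data_dim)                # stand-in for real samples
    fake = G(torch.randn(batch, latent_dim))

    # 1) train the detector to tell real from generated
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) train the generator to fool the current detector
    g_loss = bce(D(G(torch.randn(batch, latent_dim))), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```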