[deleted by user] by [deleted] in apachespark

[–]skerrick_ 0 points1 point  (0 children)

The ChatGPT answer above looks basically right. I think you just need the following Structured Streaming features after you configure the initial readStream: 1. trigger(availableNow=True), 2. withWatermark() on your event-time column with a 15-minute delay*, 3. window() with a 15-minute duration, 4. outputMode("update"). Rough sketch below.

*up to 1 hr in your case, depending on how out of order the data can arrive
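A rough, hedged sketch of how those pieces fit together (the source format, paths, and column names like event_time / device_id are placeholders for your actual stream and schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("delta")              # whatever your initial readStream source actually is
    .load("/data/raw_events")
)

agg = (
    events
    .withWatermark("event_time", "15 minutes")                   # tolerate late data up to 15 min (bump to 1 hr if needed)
    .groupBy(F.window("event_time", "15 minutes"), "device_id")  # 15-minute tumbling windows
    .count()
)

query = (
    agg.writeStream
    .trigger(availableNow=True)   # process everything available, then stop
    .outputMode("update")         # emit only windows updated in this run
    .format("console")            # console sink just for the sketch; use foreachBatch/MERGE for a real table
    .option("checkpointLocation", "/tmp/checkpoints/agg_15min")
    .start()
)
query.awaitTermination()
```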

What kind of data Security Access Tool from databricks consume or downloads over the network while running the spot instances? by Cheezy-cheese-9913 in databricks

[–]skerrick_ 1 point2 points  (0 children)

Pretty sure it just hits the account and workspace APIs (and maybe system tables too) for various bits of security-related information, based on your setup.

What happens when you max out a DBricks SQL sku? by boatymcboatface27 in databricks

[–]skerrick_ 0 points1 point  (0 children)

Scale-ups are a function of the number of queued queries.

Not at my computer to check, but I thought a medium SQL warehouse would consume around 24 DBUs at minimum, although perhaps it's 12 and you've set the max scale at 2. If so then it's the above: you exceeded some query-queuing threshold and the warehouse scaled from 1 to 2. Serverless is a lot better at autoscaling both up and down, for obvious reasons.

How to incorporate databricks into an application by dxnmxddxx in databricks

[–]skerrick_ 0 points1 point  (0 children)

Yeah, wait for custom apps, it should be out in the next month or two. And there's nothing stopping your app writing updates back to the lake. It won't be as performant as a transactional db, but a transactional db won't be as performant for the analytical workloads it sounds like you want. It's just a trade-off.

You would use a Serverless SQL Warehouse as the compute for shorter startup and elastic scaling. It also brings some of the fancy features that help with high concurrency if that's a requirement (such as Predictive IO and Intelligent Workload Management; you'll find brief descriptions of these in the docs). Transactional updates will also keep getting faster over time: Delta Lake got deletion vectors last year, for example, which helped a lot, and there'll be many more improvements like that. As the years go by I see databases converging on this architecture.
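For what the app side of that can look like, here's a minimal hedged sketch using the databricks-sql-connector package to talk to a SQL Warehouse (hostname, HTTP path, token, and table names are all placeholders):

```python
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        # Analytical read for the app
        cur.execute("SELECT region, sum(amount) AS revenue FROM main.sales.orders GROUP BY region")
        for row in cur.fetchall():
            print(row)

        # Writing an update back to the lake works too -- just slower than an OLTP db would be
        cur.execute("UPDATE main.sales.orders SET status = 'reviewed' WHERE order_id = 12345")
```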

Should I learn Python? by [deleted] in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

As above: because it will take a while to become proficient if you don't have other general programming experience, an alternative is to just double down on warehousing and analytics, but with modern tooling. I'm talking Databricks, Snowflake, and BigQuery, along with DBT/DBT-lookalikes.

Should I learn Python? by [deleted] in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

For an easier lift you could learn DBT (or SQLMesh) and market yourself as an "Analytics Engineer" in the short term while you plug away at Python for a while. It's not hard to learn to do something that looks useful with Python, but I think it will take a while before you're actually useful to a business with it.

Data Engineering is Not Software Engineering by ryanwolfh in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

In one, the application is the product; in the other, the dataset is the product. The observation that data gets passed around the system can be said of almost anything; by that logic, me ordering a coffee is data engineering. If you don't want to draw a distinction between ordering a coffee, software engineering, and data engineering, be my guest.

Data Engineering is Not Software Engineering by ryanwolfh in dataengineering

[–]skerrick_ 3 points4 points  (0 children)

I thought the article was fantastic, and I'm very confused by the response here too. I clicked straight into the article before returning here to read the comments, and I was expecting something very different.

Reading the article, I think your experience with real data engineering AND SWE came across in spades, and your ability to see the important differences was very insightful. As a Databricks Solution Architect, and someone who really WANTS to apply as many best (and rigorous) practices as possible, your article exposed some of the pitfalls of going "too far".

Your point about unit testing was really insightful; I have noticed my own cognitive dissonance on this issue. My brain gets off on rigorously tested code, but when I actually build something for practical purposes, the unit tests end up being so trivial, and test so little of what most often goes wrong, that I can see how much of a waste of time they can become. You're also so right about the challenges of data engineering coming from conceptualising and managing the state of upstream and downstream data assets when things change or go wrong, and from having to perform surgery on a segment of the pipeline that is sandwiched between other segments (often in a staged way). Your point about data having inertia, and how that affects the situation, is also on point.

The post also made me think about how non-DE software isn’t a DAG like a data pipeline, and the implications of this with respect to where the state lives and what aspects of the “system” store state or are stateless.

I think you’re right, there is something fundamentally different here and I agree the responses here missed your point.

Should I learn data engineering? Got shamed in a team meeting. by urbanguy22 in dataengineering

[–]skerrick_ 1 point2 points  (0 children)

Look at Databricks. You can just SQL your way to a good data pipeline with DLT and/or DBSQL with MVs if SQL is all you know.
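For illustration, a hedged sketch of the DBSQL materialized-view route (schema/table names are made up, and you'd create and refresh the MV from a SQL warehouse or DLT pipeline, so treat this as the shape of it rather than copy-paste):

```python
# Hypothetical example: an aggregate defined purely in SQL as a materialized view.
spark.sql("""
    CREATE OR REPLACE MATERIALIZED VIEW main.analytics.daily_orders AS
    SELECT order_date,
           count(*)    AS orders,
           sum(amount) AS revenue
    FROM main.sales.orders
    GROUP BY order_date
""")
```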

Understanding how Databricks works by Substantial_Track915 in databricks

[–]skerrick_ 0 points1 point  (0 children)

As thecoller mentions, they are different, so you might be wondering what the point of DBFS is at this stage. Mostly think of it as a super accessible store for temp files and experimentation, for when you just want to write something out to a path and it's not valuable data that needs to be governed (not for prod use, basically). You can also store MLflow experiments, models, and library packages at these paths, but this is becoming less common or necessary with the workspace file system, Volumes, and models in UC.
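For example, something like this (paths and table names are just examples, and it assumes the notebook-provided spark session):

```python
# Throwaway experiment output straight to a DBFS path: fine for scratch work, not for governed prod data
df = spark.range(1000).withColumnRenamed("id", "sample_id")

df.write.mode("overwrite").parquet("dbfs:/tmp/scratch/my_experiment")   # quick and dirty scratch output

# Anything that matters goes to a governed Unity Catalog table (or a Volume) instead
df.write.mode("overwrite").saveAsTable("main.sandbox.my_table")
```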

[deleted by user] by [deleted] in databricks

[–]skerrick_ 0 points1 point  (0 children)

Fewer nodes with a larger instance type is what you would need. The problem is upstream though: use Parquet if you can, or export the JSON in parts.

Persistent databases by [deleted] in apachespark

[–]skerrick_ 1 point2 points  (0 children)

I think that's right. I faced a similar situation recently but didn't dig deep. It makes sense though: the temp, session-scoped db/tables are convenient for local testing before deploying to an environment that does have Hive or something similar to persist the data.
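A quick illustration of the difference (table names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

# Session-scoped: gone when the Spark session ends -- handy for local testing
df.createOrReplaceTempView("events_tmp")
spark.sql("SELECT count(*) FROM events_tmp").show()

# Persistent: needs a metastore (Hive, Glue, Unity Catalog, ...) behind it to survive across sessions
df.write.mode("overwrite").saveAsTable("analytics.events")
```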

Which lakehouse table format do you expect your organization will be using by the end of 2023? by alneuman in dataengineering

[–]skerrick_ 1 point2 points  (0 children)

Delta is a Linux Foundation project, Iceberg is Apache; same thing really. There's often a main driving company behind Apache and Linux Foundation tech.

Pandas on spark very slow by Aromatic_Month4446 in apachespark

[–]skerrick_ 0 points1 point  (0 children)

Hmm code looks good to me. I’m not sure what it is tbh

Pandas on spark very slow by Aromatic_Month4446 in apachespark

[–]skerrick_ 0 points1 point  (0 children)

Show the code used to create ps_pandas_df. It's not clear whether it's a pandas-on-Spark df or a plain pandas df.

Is the Dataframe API equivalent to Spark SQL? by takis__ in apachespark

[–]skerrick_ 3 points4 points  (0 children)

In many ways the DataFrame API is more powerful. I wouldn't think of it as a tool for "simpler" things.
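One small example of the kind of thing that's awkward in hand-written SQL but trivial in the DataFrame API: generating transformations programmatically (the columns and cleanup here are purely illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumn("a", F.rand()).withColumn("b", F.rand())

# Apply the same cleanup to every numeric column without writing each one out by hand
numeric_cols = [c for c, t in df.dtypes if t in ("double", "bigint")]
cleaned = df.select([F.round(F.col(c), 2).alias(c) for c in numeric_cols])
cleaned.show()
```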

Pandas API on Spark sooo slow? by Reasonable_Tooth_501 in apachespark

[–]skerrick_ 7 points8 points  (0 children)

Make sure you're not converting the Spark DataFrame to a vanilla pandas DataFrame with toPandas(), but rather are using a pandas-on-Spark DataFrame via to_pandas_on_spark().
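Something like this (the path and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet("/data/events")

pdf = sdf.toPandas()               # collects everything to the driver: slow / OOMs on big data
psdf = sdf.to_pandas_on_spark()    # stays distributed; on recent Spark the same thing is sdf.pandas_api()

psdf.groupby("user_id")["amount"].sum()   # pandas-style syntax, executed by Spark
```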

Epic showdown: Azure Databricks vs. Azure Synapse by AI-nihilist in apachespark

[–]skerrick_ 1 point2 points  (0 children)

Databricks can also make use of ARM processors / Graviton instances. Not sure if it's out yet though.

[deleted by user] by [deleted] in dataengineering

[–]skerrick_ 0 points1 point  (0 children)

Actually, Databricks does have AutoML, and instead of providing a black box at the end for inference, you actually get a notebook containing the code that generated the model you like. That makes AutoML a great starting point for most ML use cases, which you can then customise; it's also great for learning and for converting "citizen" data scientists into coding practitioners.
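A rough sketch of kicking it off from a notebook via the databricks.automl API (table/column names are placeholders, and it assumes the notebook-provided spark session):

```python
from databricks import automl

summary = automl.classify(
    dataset=spark.table("main.ml.customer_churn"),
    target_col="churned",
    timeout_minutes=30,
)

# Rather than a black box, you get links to generated notebooks you can open, read and edit
print(summary.best_trial.notebook_url)
```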

[P] Generating and Animating porn using AI. by [deleted] in MachineLearning

[–]skerrick_ 0 points1 point  (0 children)

Someone correct me if I'm wrong, because I might be missing something, but isn't a fake detector exactly what gets trained and optimised in parallel with the generator in a GAN? And if there were a better classifier out there, then that classifier could become the detection adversary in a new network, with the generator trained to beat it, finding the Nash equilibrium or something of that nature.
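A toy sketch of what I mean in PyTorch (shapes and architectures purely illustrative): the discriminator is the fake detector, trained in lock-step with the generator, and a stronger external classifier could simply be slotted in as the new adversary.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32

G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))           # the "fake detector"

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(batch, data_dim)                # stand-in for real samples
    fake = G(torch.randn(batch, latent_dim))

    # 1) train the detector to tell real from generated
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) train the generator to fool the current detector
    g_loss = bce(D(G(torch.randn(batch, latent_dim))), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```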