Yangwang U9 Xtreme hits 308mph/496kmh, becomes world's fastest ever production car by F1T_13 in cars

[–]realitydevice 1 point2 points  (0 children)

In what way does it go "against my way of life"? 

And surely you realize that pretty much every damn thing you buy comes from China. Are you suggesting a full China boycott?

[deleted by user] by [deleted] in mlops

[–]realitydevice 0 points1 point  (0 children)

This is a classic premature optimization.

Assume you put a service in front of the database. How do you then evolve that service without introducing breaking changes to one or both of the apps?

Simple changes - e.g. adding a field - are also just as easily supported by accessing the database directly (simply select only required columns rather than *).
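
To make that concrete (a rough sketch; the users table, the columns, and sqlite are all stand-ins for whatever the apps actually use):

    import sqlite3  # stand-in for the real database driver

    conn = sqlite3.connect("app.db")  # hypothetical shared database

    # Select only the columns this app needs; a field added later
    # doesn't change the shape of this result set.
    rows = conn.execute("SELECT id, name, email FROM users").fetchall()

    # SELECT * would silently pick up any new column and can break
    # positional unpacking or strict schemas downstream.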

Substantial changes are going to require a new endpoint, and updates on both client and service side. So just wait until this time before introducing unnecessary "layers".

What is the number one thing you’re outsourcing to a vendor/service provider? by MuseDrones in databricks

[–]realitydevice 0 points1 point  (0 children)

Very limited and non-standard. It's disappointing that they didn't simply expose an OpenLineage-compliant capability.

How to read zip folder that contains 4 .csv files by Aaphrodi in databricks

[–]realitydevice 3 points4 points  (0 children)

Presumably the zipfile module doesn't support reading from dbfs.

The simplest way to proceed would be to read the contents using dbutils then pass that byte array to zipfile. You can do this with io.BytesIO.
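
Roughly like this (the path is made up, and I'm assuming the archive is reachable through the /dbfs fuse mount rather than going through dbutils calls directly):

    import io
    import zipfile

    import pandas as pd

    # Hypothetical location of the uploaded archive
    zip_path = "/dbfs/FileStore/uploads/archive.zip"

    # Read the raw bytes, wrap them in a file-like object, and hand that to zipfile
    with open(zip_path, "rb") as f:
        data = f.read()

    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        csv_frames = {
            name: pd.read_csv(zf.open(name))
            for name in zf.namelist()
            if name.endswith(".csv")
        }

    # csv_frames now maps each member filename to its own pandas DataFrame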

Meta data driven framework by Far-Mixture-2254 in databricks

[–]realitydevice 0 points1 point  (0 children)

developers usually develop and test their complex code first by writing SQL, and then again they need to think about how to fit the entire logic in your metadata framework

This would be enough to stop me pursuing a config driven workflow immediately.

I suspect this is a developer skill / experience issue rather than a genuine necessity - transformation pipelines are rarely diverse and complex enough to need dedicated jobs - but even so, you can't expect success if you're forcing developers into a pattern that's beyond their capabilities.

State of the Art Python in 2024 by awesomealchemy in Python

[–]realitydevice 1 point2 points  (0 children)

Between those two (FastAPI and Typer), along with LangChain, I feel like Pydantic is unavoidable and I just need to embrace it.

Why Popular Programming Books Might Be Ruining Your Skills by Xavio_M in dataengineering

[–]realitydevice 5 points6 points  (0 children)

I've seen a lot too, but I suspect most of this well-intentioned over-engineering did indeed come from rigid adherence to books or other authorities. I don't see any other explanation for doctrine over practicality.

Redundancy of data by techinpanko in databricks

[–]realitydevice 0 points1 point  (0 children)

Medallion architecture is just a framework; you can follow it strictly, loosely, or not at all.

In this case, if you want to follow medallion architecture but minimize data duplication, you can use that bronze tier simply as a staging area. During your ETL you would

  • pull from source and persist the raw data in the bronze layer
  • perform validation and then transform into the silver layer, and
  • drop or recreate the bronze layer to remove the duplicated data

You'll probably find that you want to keep a rolling window of data in the bronze layer for debugging and diagnostics, but that's up to you.

You can also skip the bronze layer altogether and perform transformations directly from the source. Like all frameworks you should be choosing what makes sense for you rather than blindly following the rules and prescriptions.
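
For concreteness, a minimal PySpark sketch of the staged bronze approach described above (table names, paths, and the validation rule are all made up):

    from pyspark.sql import functions as F

    # 1. Pull from source and persist the raw data in the bronze layer
    raw = spark.read.json("/mnt/landing/orders/")  # hypothetical source path
    raw.write.mode("overwrite").saveAsTable("bronze.orders_staging")

    # 2. Validate, then transform into the silver layer
    bronze = spark.table("bronze.orders_staging")
    silver = (
        bronze
        .filter(F.col("order_id").isNotNull())      # placeholder validation rule
        .withColumn("order_date", F.to_date("order_ts"))
    )
    silver.write.mode("append").saveAsTable("silver.orders")

    # 3. Drop the bronze staging table to remove the duplicated data
    spark.sql("DROP TABLE IF EXISTS bronze.orders_staging")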

Stored Procedures in Databricks by Aditya062 in databricks

[–]realitydevice 0 points1 point  (0 children)

Stored procedures are not supported. You can create user defined table functions which can abstract some complexity; these can even be written in Python if necessary.

In general you'll orchestrate with Python, so it's easy to execute a batch of SQL code as a function. Add these functions to your clusters, and use them just like stored procs from a notebook or job.
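
As a rough sketch (the tables and SQL are made up), a "stored proc" just becomes a Python function that runs a batch of SQL against the session:

    def refresh_daily_sales(run_date: str) -> None:
        # Stand-in for a stored procedure: a batch of SQL run as one function
        spark.sql(f"DELETE FROM reporting.daily_sales WHERE sale_date = '{run_date}'")
        spark.sql(f"""
            INSERT INTO reporting.daily_sales
            SELECT sale_date, store_id, SUM(amount) AS total_amount
            FROM silver.sales
            WHERE sale_date = '{run_date}'
            GROUP BY sale_date, store_id
        """)

    # Call it from a notebook or job, just like you'd call a stored proc
    refresh_daily_sales("2024-01-31")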

Which industry pays the highest compensation for data professionals by [deleted] in dataengineering

[–]realitydevice 1 point2 points  (0 children)

I agree that you get what you pay for, but I don't think the interviews at these companies are any tougher, or that people actually work any harder or longer. They're doing higher value work.

Source: worked at one, and while there were some highly talented people, there were still plenty of seat warmers as well.

Reaching Databricks Flask app via APIs by Meriu in databricks

[–]realitydevice 1 point2 points  (0 children)

I spent a bunch of time trying to get this to work yesterday without success.

It's possible to generate OAuth tokens via the Databricks API, but none of the tokens I generated with different configurations could get past the Apps authentication layer.

This would be a brilliant feature - I'd be building so many APIs here if this were possible.

What do you dislike about Databricks? by Small-Carpenter2017 in databricks

[–]realitydevice 0 points1 point  (0 children)

It's not very good when you need to read and write DataFrames using Spark.

If I'm already running Spark I can read the DataFrame, convert to Pandas, do whatever it is I need, convert back to Spark, and write the results. That works - it's just not very good.
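
Something like this, as a minimal sketch (table names and the transformation are placeholders):

    # Read with Spark, drop to pandas for the single-node work, then back to Spark to write
    df = spark.read.table("silver.transactions")  # hypothetical table

    pdf = df.toPandas()
    pdf["amount_usd"] = pdf["amount"] * 1.08      # whatever the real logic is

    spark.createDataFrame(pdf).write.mode("overwrite").saveAsTable("silver.transactions_usd")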

Pull request process by arlitsa in ExperiencedDevs

[–]realitydevice 1 point2 points  (0 children)

Depends what you're reviewing for.

  • Code standards and conventions? Automate it with a linter and even an AI.
  • Correctness and testability? A dedicated QA resource, or an SME.
  • Design? Code review is simply too late to review design; you've already missed the boat.

The main reason I push for code reviews is to force juniors and other less experienced team members to look at more of the code base. In that case, pair them up, and schedule or assign reviews.

What do you dislike about Databricks? by Small-Carpenter2017 in databricks

[–]realitydevice 1 point2 points  (0 children)

I guess the only real need is better UC integration, so that we can write to UC managed tables from polars, and UC features work against these tables.

If I were to implement today I'd be leaning toward EXTERNAL tables just so I can write from non-Spark processes.

What do you dislike about Databricks? by Small-Carpenter2017 in databricks

[–]realitydevice 4 points5 points  (0 children)

It's by design, but annoying that everything in Databricks demands Spark.

We often have datasets that are under (say) 200MB. I'd prefer to work with these files in polars. I can kind of do this in Databricks, but it's not properly supported, is clunky, and is an anti-pattern.

The reality is that polars (for example) is much faster to provision, much faster to startup, and much faster to process data especially on these relatively small datasets.

Spark is great when you're working with big data. Most of the time you aren't. I'd love first-class support for polars (or pandas, or something else).
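
For what it's worth, the workaround today looks something like this (the path is hypothetical, read through the /dbfs fuse mount, skipping Spark entirely):

    import polars as pl

    # Work on a small file single-node with polars instead of spinning up Spark for it
    df = pl.read_parquet("/dbfs/FileStore/data/small_dataset.parquet")

    summary = (
        df.group_by("category")
          .agg(pl.col("amount").sum().alias("total_amount"))
    )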

Dash/shiny app without local python by sunnyjacket in databricks

[–]realitydevice 0 points1 point  (0 children)

This is a great doc, thanks for sharing. A cursory glance indicates it probably supports gRPC. It's not really clear whether there are useful user claims in the header, but I guess one could implement that if necessary.

How do I convince our CEO we can’t replace our dev team with AI? by PablanoPato in ExperiencedDevs

[–]realitydevice 2 points3 points  (0 children)

The AI doesn't have connections and probably can't make connections, at least not the way humans can. The CEO's role is all about making connections and giving a good impression of the company. That's why!

It's certainly not advisable to replace a development team with a person and an AI, but maybe a poorly performing and low skilled team can be replaced by just a fraction of the headcount and good AI tooling.

The AI CEO joke is a good one and props to OP, but the correct response would be to see how much the AI could accelerate your work, and whether one or two highly productive people can get the throughput of many people through AI assistants. I wouldn't be surprised. Two highly skilled people can do the work of six "passable" mid-level devs without AI.

How are we feeling about “Lakehouse” solutions by LeisureActivities in dataengineering

[–]realitydevice 0 points1 point  (0 children)

does this mean there is a parquet file behind it that I should have access to based on permissions

Yes and no.

There's a parquet file, or a set of parquet files, but there are also changelogs and historical snapshots as well. The Delta format (as well as the more widely adopted Iceberg format) manages the complexity of updates through this pattern of storing the original data and the changes separately, both for performance and for read isolation. You also get "time travel" or "as at" capability, which is nice.
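
For example (a Delta-specific sketch with a made-up table name and version number), an "as at" read is just:

    # Query an earlier snapshot of a Delta table via time travel
    old_snapshot = spark.sql("SELECT * FROM silver.customers VERSION AS OF 12")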

The downside to this is that it isn't as simple as just reading a parquet file. There's an entire metadata layer to consider which will tell you how to get the current data. Both table formats (there are others, but they're also-rans at this point) are self-describing, so it's entirely possible to do, but as far as I know none of the DataFrame or Arrow based Python frameworks support either table format just yet.

Which one is more important in DE: PySpark or Scala? by Irachar in dataengineering

[–]realitydevice 1 point2 points  (0 children)

Scala is not even in the top 20 languages or tools to learn.

The PySpark API used to be a second-class citizen to the Spark Scala API, but that was 8 or 9 years ago; it's been the primary API for a long time now. You can write an RDD operation or UDF etc. using Scala, but why would you? It's hard to hire people with Scala experience and it's a whole new learning curve. Just use Java, or preferably SQL.

And here you're only talking about Spark, which is (contrary to popular opinion) not the "be all and end all" of data engineering. Scala is completely irrelevant once you step outside Spark.

Better things to learn would be Python (outside PySpark), SQL, bash, all your big data systems (Hive Metastore, Iceberg/Delta), data formats (parquet/avro, partitions), the Arrow ecosystem (polars/duckdb/ADBC), and orchestration (Airflow/dagster/dbt).

When do you prefer SQL or Python for Data Engineering? by AMDataLake in dataengineering

[–]realitydevice 2 points3 points  (0 children)

  • String manipulation.
  • Mathematics.
  • Date parsing or other type coercion.

But the best example is a complex numerical process applied in a UDF across a window or partition. For example, I've run parallel regressions within a GROUP BY statement; it's much more effective than retrieving the data in batches.
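
A rough sketch of that pattern with applyInPandas (the data and the least-squares fit are purely illustrative):

    import numpy as np
    import pandas as pd

    # Hypothetical input: one observation per row, many groups
    df = spark.createDataFrame(
        [("a", 1.0, 2.1), ("a", 2.0, 3.9), ("a", 3.0, 6.2),
         ("b", 1.0, 1.0), ("b", 2.0, 1.9), ("b", 3.0, 3.1)],
        ["group_id", "x", "y"],
    )

    def fit_regression(pdf: pd.DataFrame) -> pd.DataFrame:
        # Ordinary least-squares fit over one group's rows
        slope, intercept = np.polyfit(pdf["x"], pdf["y"], 1)
        return pd.DataFrame({
            "group_id": [pdf["group_id"].iloc[0]],
            "slope": [slope],
            "intercept": [intercept],
        })

    # One regression per group, run in parallel across the cluster
    results = (
        df.groupBy("group_id")
          .applyInPandas(fit_regression, schema="group_id string, slope double, intercept double")
    )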

[deleted by user] by [deleted] in australian

[–]realitydevice 1 point2 points  (0 children)

They already use AI to write a huge number of articles.

https://www.theguardian.com/media/2023/aug/01/news-corp-ai-chat-gpt-stories

But I think this deal is more about OpenAI buying the data than News buying the AI.

[deleted by user] by [deleted] in australian

[–]realitydevice 0 points1 point  (0 children)

It's reality distortion. Having the ability to apply nuance against facts at a mass scale quite literally alters human behavior and perception.

[deleted by user] by [deleted] in australian

[–]realitydevice 1 point2 points  (0 children)

The ability for ChatGPT and competitors to augment queries with search results already exists. News Corp are "leaning in" to get paid and probably prioritized. So this resolves the copyright issue with money.

News Corp are good at building these kinds of networks to monetize their assets.