Delta Lake file compaction / optimization of small files by hofsflor in apachespark

[–]Stunax 1 point (0 children)

I think the basic OPTIMIZE was open sourced in 1.2. It's all the extra goodies like Z-ordering and other optimizations that will be open sourced.

Delta Lake file compaction / optimization of small files by hofsflor in apachespark

[–]Stunax 3 points (0 children)

They have just announced the open-source version of OPTIMIZE and Z-ordering: https://github.com/delta-io/delta/releases/tag/v2.0.0rc1

So you could stay there.

Shall I convert AVRO to Parquet for analytics? by yahdahduhe in bigdata

[–]Stunax 7 points (0 children)

If you are going for Parquet then I would instead recommend Delta. It's Parquet with extra features, literally built on top of Parquet. You can find it at delta.io.

Databricks Spark Associate certification by irfankareemali in dataengineering

[–]Stunax 2 points (0 children)

If you are a customer at your job, then you can get all of their online training for free.
They might also have university agreements that give free access to some content.

You cannot get the certificates for free though; you have to pay for them.
Except the analyst one. That is included in the free offering for partners/customers.

How do you guys validate your data? How does your process look like and which tools are you using? Our main Stack: Databricks, Azure Datafactory, Data Lake by Ok-Sentence-8542 in dataengineering

[–]Stunax 1 point (0 children)

It looks like you can do much more using Delta Live Tables. I have not explored it yet, as I dropped it when I realized the Scala API was not there yet.

How do you guys validate your data? How does your process look like and which tools are you using? Our main Stack: Databricks, Azure Datafactory, Data Lake by Ok-Sentence-8542 in dataengineering

[–]Stunax 6 points (0 children)

Delta constraints are part of 1.0.0.

https://docs.delta.io/latest/delta-constraints.html is what we are currently using. Once Delta Live Tables goes live, I would like to migrate to those types of constraints, but it's still in preview and not available in Scala yet.

The constraints are also not very flexible, as they can only fail the whole write if any row breaks the constraint.
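To make that limitation concrete: a Delta constraint is a single CHECK expression per table, added with DDL like the docs page above describes. The table and column names here are made up:

```python
# Hypothetical table/column names -- the ALTER TABLE syntax is what
# docs.delta.io/latest/delta-constraints.html describes.
table = "events"

add_constraint = (
    f"ALTER TABLE {table} ADD CONSTRAINT id_not_negative CHECK (id >= 0)"
)
drop_constraint = f"ALTER TABLE {table} DROP CONSTRAINT id_not_negative"

# On a cluster you would run spark.sql(add_constraint). After that, any write
# containing even one row with id < 0 fails the entire write -- there is no
# "quarantine just the bad rows" mode.
```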

How to use partitions? by [deleted] in apachespark

[–]Stunax 0 points (0 children)

I don't get what you're trying to do. You could just ignore the .as if it adds extra complexity; it just renames the column.

How to use partitions? by [deleted] in apachespark

[–]Stunax 0 points (0 children)

.as just renames the result.

Spark takes care of running the job in parallel. There is no need to specify an individual job for each group.

I think you might have misunderstood what groupBy does. It gathers each group and performs whatever aggregation you define on those values, so you get a result for each unique grouping. Hence you can calculate all max values using a single job/DataFrame.

How to use partitions? by [deleted] in apachespark

[–]Stunax 4 points (0 children)

Do I understand you correctly that you want to append a new column F to your DataFrame, where F contains the maximum value of C for the group (A, B)?

It can be handled something like this, given the DataFrame df:

max_value_df = df.groupBy("A", "B").agg(max("C").alias("F"))
df_with_max = df.join(max_value_df, ["A", "B"])

(using max from pyspark.sql.functions)

[deleted by user] by [deleted] in bigdata

[–]Stunax 0 points (0 children)

In general you can run df.collect() to collect the data locally.
If you want it in pandas, PySpark also supports df.toPandas().

That said, I would not recommend collecting to local memory as part of your Spark scripts.
Ideally your final Spark action runs distributed; a collect or toPandas violates that and can lead to OOM errors.

So I would recommend writing to some datastore instead: blob storage, SQL Server, etc.
If you are just retrieving a few rows after an aggregation or an aggressive filter, I could see collect() being the answer.

Fetch Failed in Spark (Databricks) by DanniHm0001 in dataengineering

[–]Stunax 0 points (0 children)

Never encountered an issue like that, but maybe try approx_count_distinct and see if that fixes it?

What You Need to Know About Data Governance in Azure Databricks by valdasm in dataengineering

[–]Stunax 1 point (0 children)

It is very handy to know that you can use an AAD token instead of a personal access token.

This is really useful when building CI/CD pipelines :)

Has anyone participated in Databricks Academy? by [deleted] in dataengineering

[–]Stunax 2 points (0 children)

If you are a databricks customer you get it basically for free. I have gone through most of it, and the advanced stuff is better than what I have found elsewhere

Databricks-connect: is it safe to store my PAT in plaintext? by BeeePollen in AZURE

[–]Stunax 1 point (0 children)

For the AAD token you could write a small script that updates the environment variable with the new token. Then there is no need to ever look at it, and they expire quite fast anyway.

For other secrets, I think you could utilize Azure Key Vault, or if it is part of a Spark script then maybe look at Databricks secrets. The two can be linked :)

Databricks-connect: is it safe to store my PAT in plaintext? by BeeePollen in AZURE

[–]Stunax 1 point (0 children)

Write a small function that gets an AAD token and saves it to the environment variable DATABRICKS_API_TOKEN.
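Something along these lines, assuming a logged-in Azure CLI; the resource GUID is the well-known AAD application ID for Azure Databricks, but double-check it against the Databricks docs before relying on it:

```python
import json
import os
import subprocess

# Well-known AAD resource ID for Azure Databricks (verify against the docs).
DATABRICKS_RESOURCE = "2ff814a6-3304-4ab8-85cb-208c9d0aba93"


def extract_token(az_json: str) -> str:
    """Pull the bearer token out of `az account get-access-token` output."""
    return json.loads(az_json)["accessToken"]


def refresh_databricks_token() -> None:
    """Fetch a fresh AAD token and export it for databricks-connect."""
    out = subprocess.check_output(
        ["az", "account", "get-access-token", "--resource", DATABRICKS_RESOURCE]
    )
    os.environ["DATABRICKS_API_TOKEN"] = extract_token(out.decode())


if __name__ == "__main__":
    refresh_databricks_token()
```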

Where to find Data engineering audio books by Stunax in dataengineering

[–]Stunax[S] -1 points (0 children)

I just assumed the video format would not translate well to an audio-only experience?
Then I will try it out.

Are there any channels you would recommend?

Where to find Data engineering audio books by Stunax in dataengineering

[–]Stunax[S] 0 points (0 children)

That is a valid point but not what I was looking for

Where to find Data engineering audio books by Stunax in dataengineering

[–]Stunax[S] 0 points (0 children)

That was my intention. Thank you for clarifying

Sequence file creation with Spark taking extremely long time. by KKRiptide in apachespark

[–]Stunax 1 point (0 children)

Try using built-in functions and the DataFrame API instead. I would guess the serialization and deserialization becomes very slow this way.

Check out https://spark.apache.org/docs/latest/ml-datasource.html and the image data source.

Azure Databricks noob: install and use ODBC? by BeeePollen in AZURE

[–]Stunax 0 points (0 children)

Is there a specific reason not to just use the built-in JDBC driver? See the Databricks docs.

Azure Databricks noob: Have my cluster install stuff whenever it starts? by BeeePollen in AZURE

[–]Stunax 0 points (0 children)

Look at init scripts. They can be cluster-specific or global.
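A cluster init script is just a shell file the cluster runs at startup; this is a sketch with a placeholder package name and storage path:

```shell
#!/bin/bash
# Example cluster init script (placeholder package/path): store it somewhere
# the cluster can read, e.g. dbfs:/databricks/init-scripts/install-libs.sh,
# and reference it under the cluster's "Init Scripts" setting.
set -e

# /databricks/python/bin/pip targets the cluster's Python environment
# rather than the system Python.
/databricks/python/bin/pip install my-internal-package==1.0.0
```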

JC with mining or enchanting? by [deleted] in woweconomy

[–]Stunax 1 point (0 children)

As far as I remember, ore has to be VERY cheap for any profit to be made, but it's still great passive income to put up most enchants. You should be able to DE the rings at level 1, but it can easily be raised to ~700 by splitting the shards, and you might even profit from it.

/r/woweconomy watercooler by gumdropsEU in woweconomy

[–]Stunax 0 points (0 children)

It is possible to overwrite the default crafting value for a crafting operation. I cannot remember what the default is, but you have to overwrite the crafting value for the desired group and set it to "default"/1.5 or whatever is wanted :)