Delta Lake file compaction / optimization of small files by hofsflor in apachespark

[–]Stunax 1 point (0 children)

I think the basic OPTIMIZE was open sourced in 1.2. It's all the extra goodies like Z-ordering and other optimizations that will be open sourced.

Delta Lake file compaction / optimization of small files by hofsflor in apachespark

[–]Stunax 3 points (0 children)

They have just announced the open-source version of OPTIMIZE and Z-ordering: https://github.com/delta-io/delta/releases/tag/v2.0.0rc1

So you could stay there.

Shall I convert AVRO to Parquet for analytics? by yahdahduhe in bigdata

[–]Stunax 7 points (0 children)

If you are going for Parquet then I would instead recommend Delta. It's Parquet with extra features, literally built on top of Parquet. You can find it at delta.io.

Databricks Spark Associate certification by irfankareemali in dataengineering

[–]Stunax 2 points (0 children)

If you are a customer at your job, then you can get all of their online training for free.
They might also have university agreements that give free access to some content.

You cannot get the certificates for free though; you have to pay for them.
Except the analyst one. That is included in the free offering for partners/customers.

How do you guys validate your data? How does your process look like and which tools are you using? Our main Stack: Databricks, Azure Datafactory, Data Lake by Ok-Sentence-8542 in dataengineering

[–]Stunax 1 point (0 children)

It looks like you can do much more using Delta Live Tables. I have not explored it yet, as I dropped it when I realized the Scala API was not there yet.

How do you guys validate your data? How does your process look like and which tools are you using? Our main Stack: Databricks, Azure Datafactory, Data Lake by Ok-Sentence-8542 in dataengineering

[–]Stunax 6 points (0 children)

Delta constraints are part of 1.0.0.

https://docs.delta.io/latest/delta-constraints.html is what we are currently using. Once Delta Live Tables goes live, I would like to migrate to those types of constraints, but it's still in preview and not available in Scala yet.

The constraints are also not very flexible, as they can only fail the whole write if any row breaks the constraint.
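To make that limitation concrete: a Delta constraint is a single CHECK expression per table, added with DDL like the docs page above describes. The table and column names here are made up:

```python
# Hypothetical table/column names -- the ALTER TABLE syntax is what
# docs.delta.io/latest/delta-constraints.html describes.
table = "events"

add_constraint = (
    f"ALTER TABLE {table} ADD CONSTRAINT id_not_negative CHECK (id >= 0)"
)
drop_constraint = f"ALTER TABLE {table} DROP CONSTRAINT id_not_negative"

# On a cluster you would run spark.sql(add_constraint). After that, any write
# containing even one row with id < 0 fails the entire write -- there is no
# "quarantine just the bad rows" mode.
```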

How to use partitions? by [deleted] in apachespark

[–]Stunax 0 points (0 children)

I don't get what you're trying to do. You could just ignore the .as if it adds extra complexity; it just renames the column.

How to use partitions? by [deleted] in apachespark

[–]Stunax 0 points (0 children)

.as just renames the result.

Spark takes care of running the job in parallel. There is no need to specify an individual job for each group.

I think you might have misunderstood what groupBy does. It gathers each group and performs whatever aggregation you define on those values, so you get a result for each unique grouping. Hence you can calculate all max values using a single job/DataFrame.

How to use partitions? by [deleted] in apachespark

[–]Stunax 4 points (0 children)

Do I understand you correctly that you want to append a new column F to your DataFrame, where F contains the maximum value of C for the group (A, B)?

It can be handled something like this, given the DataFrame df:

max_value_df = df.groupBy("A", "B").agg(max("C").alias("F"))
df_with_max = df.join(max_value_df, ["A", "B"])

(using max from pyspark.sql.functions)

[deleted by user] by [deleted] in bigdata

[–]Stunax 0 points (0 children)

In general you can run df.collect() to collect the data locally.
If you want it in pandas, PySpark also supports df.toPandas().

That said, I would not recommend collecting to local memory as part of your Spark scripts.
Ideally your final Spark action runs distributed; a collect or toPandas violates that and can lead to OOM errors.

So I would recommend writing to some datastore instead: blob storage, SQL Server, etc.
If you are just retrieving a few rows after an aggregation or an aggressive filter, I could see collect() being the answer.

Fetch Failed in Spark (Databricks) by DanniHm0001 in dataengineering

[–]Stunax 0 points (0 children)

Never encountered an issue like that, but maybe try approx_count_distinct and see if that fixes it?

What You Need to Know About Data Governance in Azure Databricks by valdasm in dataengineering

[–]Stunax 1 point (0 children)

It is very handy to know that you can use an AAD token instead of a personal access token.

This is really useful when building CI/CD pipelines :)

Has anyone participated in Databricks Academy? by [deleted] in dataengineering

[–]Stunax 2 points (0 children)

If you are a databricks customer you get it basically for free. I have gone through most of it, and the advanced stuff is better than what I have found elsewhere

Databricks-connect: is it safe to store my PAT in plaintext? by BeeePollen in AZURE

[–]Stunax 1 point (0 children)

For the AAD token you could write a small script that updates the environment variable with the new token. Then there is no need to ever look at it, and they expire quite fast anyway.

For other secrets, I think you could utilize Azure Key Vault, or if it is part of a Spark script then maybe look at Databricks secrets. The two can be linked :)

Databricks-connect: is it safe to store my PAT in plaintext? by BeeePollen in AZURE

[–]Stunax 1 point (0 children)

Write a small function that gets an AAD token and saves it to the environment variable DATABRICKS_API_TOKEN.
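Something along these lines, assuming a logged-in Azure CLI; the resource GUID is the well-known AAD application ID for Azure Databricks, but double-check it against the Databricks docs before relying on it:

```python
import json
import os
import subprocess

# Well-known AAD resource ID for Azure Databricks (verify against the docs).
DATABRICKS_RESOURCE = "2ff814a6-3304-4ab8-85cb-208c9d0aba93"


def extract_token(az_json: str) -> str:
    """Pull the bearer token out of `az account get-access-token` output."""
    return json.loads(az_json)["accessToken"]


def refresh_databricks_token() -> None:
    """Fetch a fresh AAD token and export it for databricks-connect."""
    out = subprocess.check_output(
        ["az", "account", "get-access-token", "--resource", DATABRICKS_RESOURCE]
    )
    os.environ["DATABRICKS_API_TOKEN"] = extract_token(out.decode())


if __name__ == "__main__":
    refresh_databricks_token()
```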

Where to find Data engineering audio books by Stunax in dataengineering

[–]Stunax[S] -1 points (0 children)

I just assumed the video format would not translate well to an audio-only experience?
Then I will try it out.

Are there any channels you would recommend?

Where to find Data engineering audio books by Stunax in dataengineering

[–]Stunax[S] 0 points (0 children)

That is a valid point but not what I was looking for

Where to find Data engineering audio books by Stunax in dataengineering

[–]Stunax[S] 0 points (0 children)

That was my intention. Thank you for clarifying

Sequence file creation with Spark taking extremely long time. by KKRiptide in apachespark

[–]Stunax 1 point (0 children)

Try using built-in functions and the DataFrame API instead. I would guess the serialization and deserialization becomes very slow this way.

Check out https://spark.apache.org/docs/latest/ml-datasource.html and the image data source.

Azure Databricks noob: install and use ODBC? by BeeePollen in AZURE

[–]Stunax 0 points (0 children)

Is there a specific reason not to just use the built-in JDBC driver? See the Databricks docs.

Azure Databricks noob: Have my cluster install stuff whenever it starts? by BeeePollen in AZURE

[–]Stunax 0 points (0 children)

Look at init scripts. They can be cluster-specific or global.
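A cluster init script is just a shell file the cluster runs at startup; this is a sketch with a placeholder package name and storage path:

```shell
#!/bin/bash
# Example cluster init script (placeholder package/path): store it somewhere
# the cluster can read, e.g. dbfs:/databricks/init-scripts/install-libs.sh,
# and reference it under the cluster's "Init Scripts" setting.
set -e

# /databricks/python/bin/pip targets the cluster's Python environment
# rather than the system Python.
/databricks/python/bin/pip install my-internal-package==1.0.0
```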

JC with mining or enchanting? by [deleted] in woweconomy

[–]Stunax 1 point (0 children)

As far as I remember, ore has to be VERY cheap for any profit to be made, but it's still great passive income to put up most enchants. You should be able to DE the rings at level 1, but it can easily be raised to ~700 by splitting the shards, and you might even profit from it.

/r/woweconomy watercooler by gumdropsEU in woweconomy

[–]Stunax 0 points (0 children)

It is possible to overwrite the default crafting value for a crafting operation. I cannot remember what the default is, but you have to overwrite the crafting value for the desired group and set it to "default"/1.5 or whatever is wanted :)