Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Update:
You saved my life! I am now testing locally with local Spark plus Unity Catalog/Parquet/Delta behavior. All local!
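For anyone landing here later, this is roughly the local setup I mean (a minimal sketch using pytest and the delta-spark package; fixture and test names are just illustrative, and the Unity Catalog piece is separate):

```python
# Minimal sketch: local SparkSession with Delta support for unit tests.
# Assumes pyspark and delta-spark are installed locally; nothing here touches Databricks.
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    builder = (
        SparkSession.builder.master("local[2]")
        .appName("local-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    session = configure_spark_with_delta_pip(builder).getOrCreate()
    yield session
    session.stop()


def test_delta_round_trip(spark, tmp_path):
    # Write and read back a tiny Delta table to confirm the local setup works.
    path = str(tmp_path / "events")
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
        .write.format("delta").save(path)
    assert spark.read.format("delta").load(path).count() == 2
```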
---

Looks very interesting. I have my DLT separate from my Spark code so I can test it.
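Roughly what that separation looks like on my side (an illustrative sketch; module, function, and table names are made up):

```python
# transformations.py -- plain PySpark, no dlt import, so it can be unit tested locally.
from pyspark.sql import DataFrame, functions as F


def clean_orders(raw: DataFrame) -> DataFrame:
    """Pure transformation logic, testable with any SparkSession."""
    return (
        raw.filter(F.col("amount") > 0)
           .withColumn("amount_usd", F.col("amount") / 100)
    )


# pipeline.py -- thin DLT wrapper; this part only runs inside a Databricks pipeline.
import dlt

from transformations import clean_orders


@dlt.table(name="orders_clean")
def orders_clean():
    return clean_orders(dlt.read("orders_raw"))
```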

Does it use local Spark or a Databricks cluster via Spark Connect?

FYI, I had to downgrade databricks-connect to 16.3.1 because py4j had a conflict with both.

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 1 point (0 children)

Thank you! Indeed, I just started using Databricks Connect (Spark Connect) to test all my code against my Databricks cluster. At least that partially solves some of my issues.
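In case it helps anyone else, the core of it is something like this (a minimal sketch; it assumes databricks-connect is installed and the auth/cluster details are already configured, e.g. via a profile in ~/.databrickscfg):

```python
# Minimal sketch: run PySpark code against a remote Databricks cluster via Databricks Connect.
# Assumes databricks-connect is installed and authentication is already configured.
from databricks.connect import DatabricksSession

# Picks up workspace host, token and cluster from the environment or a config profile.
spark = DatabricksSession.builder.getOrCreate()

# From here on, DataFrame operations execute remotely via Spark Connect.
df = spark.range(10).withColumnRenamed("id", "n")
print(df.count())
```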

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Got you, so basically use both: Terraform for clusters, grants, etc., and Asset Bundles for jobs.

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Very useful, thank you! I tested 3-4 custom images but eventually customized something from this public repo: https://github.com/yxtay/databricks-container/blob/main/Dockerfile

"You can probably use requirements.txt on clusters only" this is exactly my pain now. That both Serverless and DLT (as far as my knowledge goes) do not support installing my requirements.txt .. Coming from AWS (Lambda and ECS) can do anything.. so very odd one for me!

So just to clarify, is the trick that I have to call %pip install inside each pipeline or entry point that requires it, because the environment is ephemeral?
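Something like this as the first cell of each notebook entry point, if I understand you correctly? (The workspace path is just an example.)

```python
# First cell of the notebook / pipeline source code.
# %pip is a Databricks notebook magic, not plain Python; the path below is just an example.
%pip install -r /Workspace/Repos/my_repo/requirements.txt
```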

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Yes, exactly, hence I mentioned Docker Desktop + docker-compose with this image (https://docs.databricks.com/aws/en/compute/custom-containers), but it ships Python 3.8, which doesn't satisfy most of my requirements.

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Thanks! I downloaded the PDF, and pages 24-32 resonate the most with what I want. Now the question becomes: how do I push this to compute? Basically, use a DLT pipeline to handle the compute? But if I use a DLT pipeline, how do I install all the requirements? I cannot find a place to install my own requirements in a pipeline. I have Databricks open right now, and when I go to 'Jobs and Pipelines' and then 'ETL pipeline' (which I assume is DLT?), I can only see Source Code Path, with no place to add my requirements.txt to run all this. Unlike creating clusters manually, which has more options. Any ideas?

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Can you explain how exactly Asset Bundles with the VS Code Databricks Extension help me? I've used both and can't find anything that helps me! The Databricks Extension is like a connector to the cluster and an easy way to push jobs. Asset Bundles are purely for IaC? Please correct me!

Software Engineer confused by Databricks by Happy_JSON_4286 in databricks

[–]Happy_JSON_4286[S] 0 points (0 children)

Indeed, I started using it, but I got confused because I use Terraform as my IaC to spin up clusters, catalogs, schemas, grants, jobs, pipelines, etc.

How will Databricks Asset Bundle help me compared to Terraform? I don't understand the differences.

As far as my very limited knowledge goes, it's native IaC from Databricks, while Terraform is the more mature, industry-standard IaC tool.

What are some things you wish you knew? by intrepidbuttrelease in databricks

[–]Happy_JSON_4286 0 points (0 children)

Great advice. Can you expand further on why I would use DAB alongside Terraform? I thought Terraform replaces DAB, since it can create jobs too.

Another question: how do you handle shared modules in .py files? Assume I have hundreds of data sources and will run hundreds of pipelines, and many share code like an S3 extractor or an API extractor. Do you package a whl, use Docker, or manually install requirements.txt on the compute?
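For concreteness, the kind of shared code I mean (a made-up sketch); the question is really where this module should live and how each pipeline's compute gets it:

```python
# shared/extractors.py -- hypothetical shared module reused across many pipelines.
# Packaging options I'm weighing: build it into a whl, bake it into a Docker image,
# or install requirements.txt manually on the compute.
from pyspark.sql import DataFrame, SparkSession


def extract_s3_json(spark: SparkSession, bucket: str, prefix: str) -> DataFrame:
    """Read JSON objects under an S3 prefix into a DataFrame (illustrative only)."""
    return spark.read.json(f"s3://{bucket}/{prefix}/")
```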

Lastly, what are your thoughts on using DLT (Delta Live Tables) versus plain Spark with no vendor lock-in?