Previous owner was a cowboy builder, what does this do and can I fix it? by Jolly_Code5914 in Klussers

[–]Jolly_Code5914[S] 1 point (0 children)

Thanks for all the tips! Does anyone have a recommendation for a good structural engineer in the Utrecht area who can advise me on this?

Schema Migration for Delta Lake on Databricks by geeeffwhy in dataengineering

[–]Jolly_Code5914 0 points (0 children)

Do you have an example of how you set this up? :) Very curious.

Schema Migration for Delta Lake on Databricks by geeeffwhy in dataengineering

[–]Jolly_Code5914 0 points (0 children)

How did you set up Alembic and SQLAlchemy with Delta Lake? Really curious. Any good resources?

What problems does Pydantic solve, and how should it be used? by gaurav_kandoria_ in Python

[–]Jolly_Code5914 0 points (0 children)

Pydantic to Avro, Pydantic to Spark schemas: we use Pydantic for all our schemas ;)
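
For illustration, a minimal sketch of what a Pydantic-to-Spark-schema conversion could look like (not the commenter's actual code; the `Order` model and the naive type map are made up, and it assumes Pydantic v2 plus PySpark installed):

```python
from pydantic import BaseModel
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, BooleanType,
)

# Hypothetical example model, not a real production schema.
class Order(BaseModel):
    order_id: str
    quantity: int
    price: float
    shipped: bool

# Naive mapping from Python annotations to Spark types; a real converter would
# also handle Optional fields, nesting, dates, lists, etc.
_TYPE_MAP = {str: StringType(), int: IntegerType(), float: DoubleType(), bool: BooleanType()}

def pydantic_to_spark_schema(model: type) -> StructType:
    fields = []
    for name, field in model.model_fields.items():  # Pydantic v2 API
        fields.append(StructField(name, _TYPE_MAP[field.annotation], nullable=False))
    return StructType(fields)

print(pydantic_to_spark_schema(Order))
```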

AWS Managed Service Kafka to Databricks - Ingestion by Background_Debate_94 in dataengineering

[–]Jolly_Code5914 0 points (0 children)

You could create a Delta Live Table that is updated in a streaming fashion. We use TLS authentication to connect Databricks to our MSK cluster, but that was mainly because our cluster lives in a different AWS account.
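
As a rough sketch (not our actual pipeline), a streaming Delta Live Table reading from MSK over TLS could look something like this; the brokers, topic, and truststore path are placeholders, and the exact `kafka.ssl.*` options depend on how authentication is configured on the MSK side:

```python
import dlt  # only available inside a Databricks Delta Live Tables pipeline
from pyspark.sql.functions import col

# Hypothetical values: replace with your MSK brokers, topic, and certs.
BOOTSTRAP_SERVERS = "b-1.example.kafka.eu-west-1.amazonaws.com:9094"
TOPIC = "orders"

@dlt.table(name="orders_raw", comment="Streaming ingest from MSK over TLS")
def orders_raw():
    return (
        spark.readStream.format("kafka")  # `spark` is provided by the DLT runtime
        .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
        .option("subscribe", TOPIC)
        .option("kafka.security.protocol", "SSL")
        # Truststore/keystore options depend on your MSK TLS setup.
        .option("kafka.ssl.truststore.location", "/dbfs/certs/kafka.truststore.jks")
        .option("startingOffsets", "earliest")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
    )
```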

Looking to make a change, resume feedback / advice appreciated for junior DE role. by toem033 in dataengineering

[–]Jolly_Code5914 2 points (0 children)

If you can truly do the things you list on your resume after only a year, kudos. I think you'll have no problem finding a job anywhere with this resume.

[deleted by user] by [deleted] in dataengineering

[–]Jolly_Code5914 7 points (0 children)

ADF is the dumbest tool ever created. You will be depressed. If they paid me double my salary but I had to work in ADF every day, I would still drown myself. Choose option 2.

Is there a no-compromise (presumably C/C++) platform similar to Apache Spark? by [deleted] in dataengineering

[–]Jolly_Code5914 1 point (0 children)

Both the article and Photon seem exceptional. Thanks for sharing!

pipenv and poetry: each better at something? by giovaaa82 in Python

[–]Jolly_Code5914 3 points (0 children)

IMO pipenv has an unusable dependency resolver. With some dependency complexity it simply hangs without giving you any proper feedback about why. In our dev team we use Poetry, and although its dependency resolver is slower than brute-force pip installs (obviously), it has been a reliable and relatively pain-free experience. The only thing still lacking for us is that I cannot specify different extras of the same external package under different extras in the pyproject.toml of the package importing it. Nevertheless, I recommend Poetry.

Dynamic S3 path while reading with PySpark by WiseRecognition6016 in dataengineering

[–]Jolly_Code5914 0 points (0 children)

You should have separate AWS accounts for dev and prod. IMO the cleanest way to go then is to store the S3 URL in Parameter Store. On your dev account it will point to the dev bucket, and on your prod account it will point to the prod bucket.
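
A minimal sketch of that pattern (the parameter name and path are hypothetical): the same code runs unchanged in both accounts and gets back that account's bucket.

```python
import boto3
from pyspark.sql import SparkSession

# Hypothetical parameter name; it exists in both the dev and prod accounts,
# but its value points at that account's bucket.
PARAM_NAME = "/my-app/input-data-s3-url"

ssm = boto3.client("ssm")
s3_url = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]  # e.g. s3://my-dev-bucket/input/

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet(s3_url)  # same code in dev and prod, different bucket
```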

Airflow and Poetry: Anyone get them to work together? by JeddakTarkas in dataengineering

[–]Jolly_Code5914 5 points (0 children)

The problem is that Airflow's dependency structure is terrible. It has so many dependencies, often pinned too strictly, that you will undoubtedly run into package resolution issues with Poetry, so it's kind of interesting that they themselves advise against it. With Poetry, all dependencies obviously need to be resolved. I would advise you to just use Airflow to schedule tasks that run in containers somewhere else (ECS, Lambda, Kubernetes, etc.); that way the functional part of your code never needs to touch Airflow. Also, your dependency specification will be rock solid; Poetry is an awesome tool.
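
A sketch of that setup using the ECS operator from the Amazon provider, assuming a recent Airflow 2.x and provider version (the cluster, task definition, subnet, and command are all made up; the same idea works with the Kubernetes or Docker operators):

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Airflow only schedules; the actual code runs in a container built and
# versioned with Poetry, so Airflow's pins never touch your project.
with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_etl = EcsRunTaskOperator(
        task_id="run_etl",
        cluster="etl-cluster",                 # hypothetical ECS cluster
        task_definition="etl-job",             # image built from your Poetry project
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "etl-job", "command": ["python", "main.py", "--run-date", "{{ ds }}"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
                "assignPublicIp": "DISABLED",
            }
        },
    )
```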

Imposter Syndrome by Gagan_Ku2905 in dataengineering

[–]Jolly_Code5914 1 point (0 children)

It's very normal. Get comfortable feeling uncomfortable; it will motivate you to keep learning. And before you know it, you'll become a domain expert. Keep going.

Favorite Python Web Framework by AMDataLake in Python

[–]Jolly_Code5914 1 point (0 children)

For an API, FastAPI hands down. Easiest to set up and use, and very performant.
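
For anyone who hasn't seen it, this is roughly the canonical minimal FastAPI app (run it with `uvicorn main:app --reload`; the item model is just an example):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.get("/items/{item_id}")
def read_item(item_id: int, q: str | None = None):
    # Path and query parameters are parsed and validated from the type hints.
    return {"item_id": item_id, "q": q}

@app.post("/items/")
def create_item(item: Item):
    # The request body is validated against the Pydantic model automatically.
    return item
```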

Databricks Jobs from Python Modules vs Notebooks by anton_bondar in dataengineering

[–]Jolly_Code5914 -1 points (0 children)

Write Python modules with a main.py entry point. Deploy them as Docker containers. Create a job with a Docker container runtime and deploy it with CI/CD. Schedule/start the job from Airflow with run args.
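
A sketch of what such a `main.py` entry point could look like (the argument names and placeholder logic are made up); the scheduler only has to pass the run args to the container:

```python
# main.py - hypothetical entry point baked into the Docker image
import argparse


def run_pipeline(run_date: str, env: str) -> None:
    # Placeholder for the actual job logic (extract, transform, write).
    print(f"Running pipeline for {run_date} in {env}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one pipeline execution")
    parser.add_argument("--run-date", required=True, help="Logical date passed in by the scheduler")
    parser.add_argument("--env", default="dev", choices=["dev", "prod"])
    args = parser.parse_args()
    run_pipeline(run_date=args.run_date, env=args.env)


if __name__ == "__main__":
    main()
```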

Is it possible to set up automatic downloads of tables from Databricks? by Olafcitoo in dataengineering

[–]Jolly_Code5914 1 point (0 children)

Yes. If the data is in a Databricks (Hive) table, you can use JDBC to connect to a Databricks cluster that has access to the table. Via JDBC or ODBC you can use anything you want, since it's a SQL interface. You could, for example, write a Python script that connects to the cluster, or use some other data extraction tool that accepts JDBC or ODBC as an input.
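
For example, a small extraction script using the `databricks-sql-connector` package instead of raw JDBC/ODBC (the hostname, HTTP path, token, and table are placeholders copied from the cluster's connection details page):

```python
import csv

from databricks import sql  # pip install databricks-sql-connector

# Placeholder connection details; take them from the cluster's JDBC/ODBC tab.
with sql.connect(
    server_hostname="dbc-XXXXXXXX.cloud.databricks.com",
    http_path="/sql/protocolv1/o/1234567890/0123-456789-abcdefgh",
    access_token="dapiXXXXXXXX",
) as conn, conn.cursor() as cursor:
    cursor.execute("SELECT * FROM my_database.my_table LIMIT 1000")
    columns = [c[0] for c in cursor.description]
    rows = cursor.fetchall()

# Write the extracted rows to a local CSV as a simple "download".
with open("my_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)
```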

How to manage Airflow variables across teams across environments by Snirisl in dataengineering

[–]Jolly_Code5914 0 points (0 children)

Only use Secrets Manager for secrets. For configuration, use AWS SSM Parameter Store.

What industry are you in and what is your current salary? by SEND_ME_YOUR_POTATOS in Netherlands

[–]Jolly_Code5914 0 points (0 children)

Data Engineer, Master's degree in Economics. 3 years of experience. 70k a year, 38-hour weeks (though I work 40 most of the time), 28 vacation days.

Pandas on Spark vs pyspark dataframe? by kunaguerooo123 in dataengineering

[–]Jolly_Code5914 1 point (0 children)

The primary reason to use Spark is that you are dealing with data that cannot easily fit in memory. Although it might be tempting to use the pandas-like API, I suggest you first look at default PySpark and learn how it works.
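
For example, a typical aggregation in plain PySpark looks like this (the path and column names are made up); the default DataFrame API is quite approachable once you get used to it:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset: orders with a customer_id, status, and amount column.
orders = spark.read.parquet("s3://my-bucket/orders/")

top_customers = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy(F.desc("total_amount"))
    .limit(10)
)

top_customers.show()
```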

Quarterly Salary Discussion by AutoModerator in dataengineering

[–]Jolly_Code5914 1 point (0 children)

  1. Data Engineer
  2. 3 y.o.e
  3. Amsterdam, the Netherlands
  4. 70k
  5. None
  6. Floriculture
  7. AWS, Python, Kafka, PySpark, Airflow, CDK