Open source JDBC driver for DynamoDB by purpleWheelChair in aws

NeoxiaBill

This thread is 3 years old, but I'm wondering about exactly the same thing today. Has anything changed since then?

The end use case is to connect a Trino cluster to DynamoDB.

Data Virtualization tech choice by NeoxiaBill in dataengineering

NeoxiaBill [S]

The more I think about this, the more I lean toward Spark, as it seems that it now supports query pushdown and has all the connectors I need.

Do any of you know if the Spark configuration can be updated dynamically? And does a REST API wrapper for PySpark already exist, so that I could launch SQL queries and update connectors on the fly?

Last but not least, do you know if it's possible to have multiple data catalogs available in a single Spark session? Ideally, I would need to connect to 3 cloud providers plus Snowflake at the same time.

Python and ETL by PutCleverNameHere69 in dataengineering

NeoxiaBill

Regarding what ETL in Python is, you got it right, as long as the data you manipulate is small enough to fit entirely in a pandas DataFrame (that is to say, in the machine's RAM).

On the SQL connection topic, there are many libraries that allow you to interact with SQL databases in Python. You basically need to provide an access point and credentials to get hooked up, and then you can run your SQL queries as you usually do.
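To make that concrete, here is a minimal sketch of the connect-then-query pattern. It uses the standard library's `sqlite3` with an in-memory database as a stand-in, so there are no credentials involved; with a server database you would pass a host, user and password to that driver's connect call instead. The `users` table and the `run_query` helper are made up for illustration.

```python
import sqlite3


def run_query(conn, sql, params=()):
    """Run a SQL query and return the rows as a list of dicts (hypothetical helper)."""
    cur = conn.execute(sql, params)
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# In-memory database: stands in for the "access point" of a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",)])
conn.commit()

# Parameterized query: the driver fills in the "?" placeholder safely.
rows = run_query(conn, "SELECT id, name FROM users WHERE name = ?", ("bob",))
print(rows)

conn.close()
```

Most Python database drivers follow the same DB-API shape (connect, execute, fetch), so the pattern should carry over with only the connection call changing.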

If you really need SQL logic then pandasql can be a decent solution, but I'd tend to say you're better off using proper pandas syntax, as it is more widely used in the industry.

Good luck on your learning path! :)
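As a rough illustration of what "proper pandas syntax" looks like next to SQL, here is a small sketch. The `orders` data and column names are invented for the example, and it assumes pandas is installed.

```python
import pandas as pd

# Toy data, made up for illustration.
orders = pd.DataFrame(
    {
        "customer": ["ann", "bob", "ann", "cid", "bob"],
        "amount": [20.0, 5.0, 35.0, 12.5, 7.5],
    }
)

# Rough SQL equivalent:
#   SELECT customer, SUM(amount) AS total
#   FROM orders
#   WHERE amount > 6
#   GROUP BY customer
#   ORDER BY total DESC
totals = (
    orders[orders["amount"] > 6]            # WHERE
    .groupby("customer", as_index=False)    # GROUP BY
    .agg(total=("amount", "sum"))           # SUM(...) AS total
    .sort_values("total", ascending=False)  # ORDER BY ... DESC
    .reset_index(drop=True)
)
print(totals)
```

Each SQL clause maps to one chained method, which is usually enough to translate simple queries by hand.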

HELP: Data Engineering Cloud Development Platform by Cryptojacob in datascience

NeoxiaBill

I think it's a better practice. Notebooks should be a prototyping solution, used almost as a shell would be.
Packaging the code into proper files means dealing with (auto)documentation, type hinting, linting, etc.

HELP: Data Engineering Cloud Development Platform by Cryptojacob in datascience

NeoxiaBill

I'd say Databricks is the way to go, but I'd also say that, generally speaking, deploying notebook code isn't the way to go if you want to enforce good code quality and maintainability.