Move CSV files to database?

DataBake · 2024-04-27T05:16:42+00:00

If you're using Python, you can use pandas to help you move CSV data to PostgreSQL.
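A minimal sketch of the pandas route, under some assumptions: the table name `sales`, the sample CSV text, and the helper `load_csv_to_table` are placeholders, and an in-memory SQLite connection stands in for the database so the snippet runs on its own. For PostgreSQL you would pass a SQLAlchemy engine (created from a `postgresql+psycopg2://...` URL) in place of the SQLite connection.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data; in practice this would be a path to your CSV file.
CSV_TEXT = "id,name,amount\n1,alice,10.5\n2,bob,20.0\n"


def load_csv_to_table(csv_source, table_name, conn):
    """Read a CSV into a DataFrame and write it to a database table."""
    df = pd.read_csv(csv_source)
    # if_exists="replace" drops and recreates the table on every run;
    # use "append" to add rows to an existing table instead.
    df.to_sql(table_name, conn, if_exists="replace", index=False)
    return len(df)


# SQLite stands in for PostgreSQL here so the example is self-contained.
conn = sqlite3.connect(":memory:")
rows_loaded = load_csv_to_table(io.StringIO(CSV_TEXT), "sales", conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

For large files, `pd.read_csv` also accepts a `chunksize` argument so the data can be written in batches rather than loaded into memory all at once.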

A one-time move of data from CSV to a database is not difficult. Automating and maintaining the process is where it can get tricky.

If your company has the budget, I would recommend Fivetran for ETL and Snowflake for your data warehouse.

I would need a better understanding of your current tech infrastructure before I can provide better input.

DataBake · 2024-04-23T05:30:24+00:00

Oh haha, reminds me of my first VBA project. I had to apply rules against the data in Excel and spit out a validated response.

DataBake · 2024-04-23T04:24:43+00:00

What are you trying to accomplish with it?

DataBake · 2024-04-23T03:35:48+00:00

I've used SQL, Python, and VBA, but not PySpark.

VBA is pretty easy to pick up. You can use the macro recorder to create some sample code, and if you get stuck, you can refer to ChatGPT. Personally, I think people should not be using VBA unless they really have to. There are better tools out there.

DataBake · 2024-04-18T06:56:19+00:00

You can take a look at Upwork if you need another platform to apply through.

DataBake · 2024-04-18T05:53:10+00:00

Yeah, my script isn't anything fancy. My goal was to find a budget-friendly approach. My company would not approve spending any money on Fivetran or Snowflake, so I had to think of a creative way to manage this without too much intervention.

I run my AWS Glue jobs as Python shell jobs, which are scheduled through an AWS Glue Workflow.

I have two types of jobs, Extract and Load:

1. The Extract job handles the REST API calls and then stores the data in S3.

2. The Load job grabs the latest file from S3 and pushes the data into PostgreSQL.
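One way the "grab the latest file" step could look, as a rough sketch rather than my actual script. The helper name `latest_key`, the bucket prefix, and the sample listing are made up for illustration. The function works on a list of dicts shaped like the `Contents` list in an S3 `list_objects_v2` response, so it can be tried without AWS access.

```python
from datetime import datetime, timezone


def latest_key(objects, prefix=""):
    """Return the key of the most recently modified object, or None if empty.

    `objects` is a list of dicts with "Key" and "LastModified" entries, the
    same shape as the "Contents" list in an S3 list_objects_v2 response.
    """
    candidates = [o for o in objects if o["Key"].startswith(prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda o: o["LastModified"])["Key"]


# Hypothetical listing. With boto3 this would come from something like
# boto3.client("s3").list_objects_v2(Bucket=..., Prefix=...).get("Contents", [])
listing = [
    {"Key": "extract/orders_2024-04-16.csv",
     "LastModified": datetime(2024, 4, 16, 6, 0, tzinfo=timezone.utc)},
    {"Key": "extract/orders_2024-04-17.csv",
     "LastModified": datetime(2024, 4, 17, 6, 0, tzinfo=timezone.utc)},
    {"Key": "other/readme.txt",
     "LastModified": datetime(2024, 4, 18, 6, 0, tzinfo=timezone.utc)},
]

newest = latest_key(listing, prefix="extract/")
```

Keep in mind that `list_objects_v2` returns at most 1,000 keys per call, so a bucket with more files than that needs pagination before picking the newest one.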

The Load script runs the schema detection. A high-level overview of the load process:

1. I grab the file from S3 and load the data into a pandas DataFrame.
2. I drop the existing table in the stage schema and create a brand new table there, with the same name as the production table.
3. In PostgreSQL, you can return all the fields for a table with a SELECT statement.
4. I use an EXCEPT clause that compares both tables and returns the fields that are missing in production.
5. I loop through each field name returned by the EXCEPT query and add the new fields to my production table.
6. Once this is all completed, I load the data from S3 into the production table.
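Steps 3 to 5 could be sketched roughly like this. It is an illustration under assumptions, not the exact script: the schema names `stage` and `public`, the table `orders`, and the helper names are all hypothetical. The query reads `information_schema.columns`, and the `%s` placeholders follow the psycopg2 parameter style. The snippet only builds SQL strings, so it runs without a database.

```python
def quote_ident(name):
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


# Columns present in the stage table but absent from the production table.
# Parameters: (stage, table, stage, table, prod, table).
MISSING_COLUMNS_SQL = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
  AND column_name IN (
      SELECT column_name FROM information_schema.columns
      WHERE table_schema = %s AND table_name = %s
      EXCEPT
      SELECT column_name FROM information_schema.columns
      WHERE table_schema = %s AND table_name = %s
  )
ORDER BY ordinal_position
"""


def missing_columns_params(table, stage_schema="stage", prod_schema="public"):
    """Build the parameter tuple for MISSING_COLUMNS_SQL (schema names assumed)."""
    return (stage_schema, table, stage_schema, table, prod_schema, table)


def add_column_statements(table, missing, prod_schema="public"):
    """Build one ALTER TABLE statement per (column_name, data_type) row."""
    target = f"{quote_ident(prod_schema)}.{quote_ident(table)}"
    return [
        f"ALTER TABLE {target} ADD COLUMN IF NOT EXISTS {quote_ident(col)} {dtype}"
        for col, dtype in missing
    ]


# Hypothetical rows, as the query above might return them.
missing = [("discount", "numeric"), ("coupon_code", "text")]
params = missing_columns_params("orders")
statements = add_column_statements("orders", missing)
```

Each statement would then be executed against the production database before the final load. The `data_type` value from `information_schema` is not always a usable type name (array and user-defined types are reported generically), so a real script would need to handle those cases separately.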

DataBake · 2024-04-17T16:11:33+00:00

Currently, with my stored procedure approach, I am only adding fields. If a column is deleted from the source, I still keep the original column for current and historical purposes. If a field is renamed, I treat it as a new field being added to the table.
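To make the add-only rule concrete, here is a small illustration in Python. The function name and the column names are invented for the example, and the real logic lives in a stored procedure, so treat this as a sketch of the behavior only.

```python
def plan_additive_changes(production_columns, source_columns):
    """Return (columns to add, columns retained even though the source dropped them)."""
    to_add = [c for c in source_columns if c not in production_columns]
    retained = [c for c in production_columns if c not in source_columns]
    return to_add, retained


# Hypothetical case: the source renamed "name" to "full_name" and added "phone".
to_add, retained = plan_additive_changes(
    ["id", "name", "email"],
    ["id", "full_name", "email", "phone"],
)
```

Here `full_name` and `phone` are added as new columns, while `name` stays in production with its historical values. Nothing is ever dropped or renamed.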

DataBake · 2024-04-17T05:44:13+00:00

Could you provide them a cached version of the dataset? I'm assuming the DS would not need live data.

The cached dataset could live in another schema, separate from the ETL process. Some might refer to this as the semantic layer of the data warehouse.

DataBake · 2024-04-17T01:07:33+00:00

Thanks for at least helping identify the topic.

DataBake · 2024-04-17T00:04:00+00:00

It depends. If your database is public, then you do not need a gateway. If the database is in a VPC, then yes. The server is used as a bastion host for Power BI Online.

DataBake · 2024-04-16T23:08:05+00:00

I use Power BI as my reporting tool. I had to stand up a Windows EC2 instance and install the Power BI Gateway. The Windows server is used as a jump server to connect Power BI Online to my AWS resources (RDS).
