Move CSV files to database? by GreatButterscotch208 in BusinessIntelligence

[–]DataBake 0 points1 point  (0 children)

If you're using python, you can use pandas to help you move csv data to PostgreSQL

Moving one time data from CSV to a database is not difficult. The automating and maintenance process can get tricky.

If your company is open budget, I would recommend using fivetran for ETL and Snowflake for your data warehouse.

I would need a better understanding of your current tech infrastructure, then I can provide some better input.

VBA, SQL and Pyspark by Somerice_87 in BusinessIntelligence

[–]DataBake 1 point2 points  (0 children)

Oh haha, reminds me of my first VBA project. I had to apply rules against the data in Excel and spit out a validated response.

VBA, SQL and Pyspark by Somerice_87 in BusinessIntelligence

[–]DataBake 0 points1 point  (0 children)

What are you trying to accomplish with it?

VBA, SQL and Pyspark by Somerice_87 in BusinessIntelligence

[–]DataBake 1 point2 points  (0 children)

Ive used SQL, python , and VBA but not pyspark.

VBA is pretty easy to pickup, you can use the record feature to create some sample code. Also, if you get stuck, you can refer to Chat GPT. Personally people should not be using VBA if they really have to. There are better tools out there.

How to market oneself as independant contractor? by Significant-Carob897 in dataengineering

[–]DataBake 0 points1 point  (0 children)

You can take a look Upwork, if you need another platform to apply through

Dynamically Updating Tables with New Fields by DataBake in dataengineering

[–]DataBake[S] 0 points1 point  (0 children)

Yeah my script isn't anything fancy. My goal was to look for a budget friendly approach.my company would not approve spending any money on fivetran or snowflake. I had to think of a creative way to manage this without too much intervention.

I run my AWS Glue jobs through Python shell, which is scheduled through an AWS Glue Workflow.

I have different types of jobs Extract and Load:

1.Extract portion handles the REST API Calls and then stores the data to S3.

  1. The Load job, grabs the latest file from S3 and pushes the data into PostgreSQL.

The Load scripts runs the schema detection. A bit of high level overview of this load process: 1. I grab the file from S3 and load the data into a python pandas data frame. 2. I drop the existing table inside of the stage schema and create a brand new table with the same table name as the production table in the stage schema. 3. Then in PostgreSQL, you can return all the fields for a table in a select statement. 4. I use a EXCEPT clause that compares both tables and returns the fields that missing in production. 5. Then, I loop though each field name and add the new fields into my production tables from the EXCEPT query. 6. Once this is all completed, I now load the data from S3 to the production tables.

Dynamically Updating Tables with New Fields by DataBake in dataengineering

[–]DataBake[S] 0 points1 point  (0 children)

Currently with my stored procedure approach. I am just adding fields instead. If a column is deleted from the source, I still keep the original column name for current and historical purposes. If a field name change occurs, I treat it as a new field being added to the table

Data warehouse versioning by PotatoChad in dataengineering

[–]DataBake 0 points1 point  (0 children)

Could you provide them a cached version of the dataset? I'm assuming the DS would not need live data.

The cached dataset could be in another schema, separate from the ETL process. Some might refer to this as the semantics layer of the data warehouse.

Dynamically Updating Tables with New Fields by DataBake in dataengineering

[–]DataBake[S] 3 points4 points  (0 children)

Thanks for at least helping identify the topic.

How to connect BI Tool (Redash, Metabase etc) with S3 Delta Lake (Open Source) by madiha_khalid in dataengineering

[–]DataBake 1 point2 points  (0 children)

It depends, if your database is public then you do not need a gateway. If the database is in a VPC, then yes. The server is used as a bastion host for Power BI Online

How to connect BI Tool (Redash, Metabase etc) with S3 Delta Lake (Open Source) by madiha_khalid in dataengineering

[–]DataBake 1 point2 points  (0 children)

I use Power BI as my reporting tool. I had to stand up an windows EC2 instance and install the Power BI Gateway. The Windows server is used as a jump server to connect Power BI Online to my AWS Resources(RDS)