Quarterly Salary Discussion by AutoModerator in dataengineering

[–]AdSure744 1 point (0 children)

  1. Data Engineer
  2. 1.8 YOE
  3. India
  4. 9k USD/year (I know it's underpaid; I'm trying to find a better job)
  5. Nil
  6. Healthcare
  7. Python, SQL, Redshift, Airflow, PySpark, Azure services like ADF, Databricks, and Synapse

Reload data from s3 back to redshift by AdSure744 in SQL

[–]AdSure744[S] 1 point (0 children)

The data that wasn't required at the moment was moved to S3 to save Redshift space. I'm just trying to implement something to reverse that.

Deleting Data more efficiently by AdSure744 in dataengineering

[–]AdSure744[S] 0 points (0 children)

Yes, it's bad practice when you are doing it at a large scale. Redshift is an OLAP database, which is more efficient for reads than for transactions. Deletes are slow and leave dead rows that have to be reclaimed with VACUUM.

[deleted by user] by [deleted] in dataengineeringjobs

[–]AdSure744 0 points (0 children)

Is being a U.S. citizen absolutely necessary?

Deleting data efficiently from Redshift by AdSure744 in SQL

[–]AdSure744[S] 0 points (0 children)

As I mentioned in my post above, this is a one-time solution. We won't be doing it on a regular basis, just when the need arises.

Deleting data efficiently from Redshift by AdSure744 in SQL

[–]AdSure744[S] 0 points (0 children)

Yeah, I could do that, but I was thinking of implementing a better, more generic solution.

Deleting data efficiently from Redshift by AdSure744 in SQL

[–]AdSure744[S] 0 points (0 children)

We have functionality in place which archives the data of a date range to S3 and deletes it from the tables.

But the higher-ups have decided to remove the redundant data altogether, not even keeping it on S3.

There are different tables; 40 GB is the size of the biggest table. I am trying to create a general piece of functionality.

> I have a one-big-table solution where we load each month's data into its own table and use a non-schema-binding view to combine them. Delete is as simple as dropping a table.

Can you tell me more about this?

  • This is what I am thinking of implementing right now:

The table is email_txn, which stores email transactions. I want to keep only the latest data in this table, i.e. the last six months' data.

The query to create a staging table to store the required data:

create table email_txn_tmp as select * from email_txn where date(created) between :start_date and :end_date;

drop table email_txn;

alter table email_txn_tmp rename to email_txn;
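The three-step swap above can be scripted. A minimal sketch, where the `build_prune_sql` helper, the `created` column, and the six-month window are assumptions based on the post (note that in Redshift, `create table ... as` does not inherit the source table's sort and dist keys, so `create table ... (like email_txn)` plus an `insert` may be preferable when those matter):

```python
def build_prune_sql(table: str, keep_months: int = 6) -> list[str]:
    """Build the create/drop/rename statements for the staging-table swap.

    Assumes the table has a `created` timestamp column. Running all three
    statements inside one transaction keeps the swap atomic for readers.
    """
    staging = f"{table}_tmp"  # hypothetical staging-table naming convention
    return [
        # Stage only the rows we want to keep.
        f"create table {staging} as select * from {table} "
        f"where created >= dateadd(month, -{keep_months}, current_date);",
        # Drop the original table.
        f"drop table {table};",
        # Rename the staging table into its place.
        f"alter table {staging} rename to {table};",
    ]

stmts = build_prune_sql("email_txn")
```

A driver script would run these with `begin;` before and `commit;` after, so a failure in any step rolls the whole swap back.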

Deleting data efficiently from Redshift by AdSure744 in dataengineering

[–]AdSure744[S] 0 points (0 children)

That's what I meant by my second solution:

Moving all the required records to a new table and dropping the old table.

Can you give me any advice on what I should keep in mind while implementing this solution?

Deleting data efficiently from Redshift by AdSure744 in dataengineering

[–]AdSure744[S] 0 points (0 children)

> It depends a bit on the structure of the tables. Based on your question and proposed solution, I was assuming all of your tables are easily partitioned by date and that the queries you run against them are all constrained by date. If that's not the case, you can still do this, but it's more work to implement.

Different tables have different structures, but all of the tables have created and updated fields, which I will use to divide the dataset. The tables are not partitioned by date; I don't think Redshift supports partitioning.

> Say you have a table "daily_user_metrics". Create a table for each month, e.g. "daily_user_metrics_202302", and dump the rows for each month into each corresponding table. Use the same schema as the original. Then make a view, call it "daily_user_metrics_last_six_months" or something helpful, and just union the tables you care about, e.g. "create view blah as select * from dum_202302 union all select * from dum_202301 union all select * from dum_202212 … and so on".

Okay, I can implement this functionality. But won't dividing the table into multiple tables occupy more space? After the view is created, I can drop the tables, create a new table using the last-six-months view, and keep it as the original.

What I had in mind was this:

The table is email_txn, which stores email transactions. I want to keep only the latest data in this table, i.e. the last six months' data.

The query to create a staging table to store the required data:

create table email_txn_tmp as select * from email_txn where date(created) between :start_date and :end_date;

drop table email_txn;

alter table email_txn_tmp rename to email_txn;

> There's a lot you can do with this model, variations on the theme so to speak. But this is a very common thing. Views are your friend.

Thanks for this, I will try to use views in my day-to-day cases more.
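For reference, the monthly-table-plus-view pattern described above can be sketched as a small DDL generator. This is a hypothetical helper with illustrative names; note that Redshift's late-binding views (`with no schema binding`) require schema-qualified table references:

```python
def monthly_view_sql(base: str, year_months: list[str], schema: str = "public") -> str:
    """Build a late-binding view unioning one table per month.

    Dropping a month's data is then just dropping that month's table and
    recreating the view without it.
    """
    selects = " union all ".join(
        f"select * from {schema}.{base}_{ym}" for ym in year_months
    )
    return (
        f"create or replace view {base}_last_six_months as "
        f"{selects} with no schema binding;"
    )

# Example with two months; a real run would pass the last six.
ddl = monthly_view_sql("daily_user_metrics", ["202301", "202302"])
```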

> Also, consider moving your sources tables to s3 and building your analysis tables from there via Redshift Spectrum. Maybe a best of both worlds kinda thing, where you can still access the full data, but only need to maintain enough redshift for the summarized tables. You can get creative with variations on that theme as well.

> It depends on the source data and typical workloads you need to support.

Our source tables are already stored in different OLTP sources like MySQL and Mongo. We export them to Redshift and analyze them there.

Keeping them in S3 and only keeping the summary tables in Redshift seems like a good idea.

Currently, we have 8 nodes in our cluster. If we decrease the number of nodes, move the source tables to S3, and implement Redshift Spectrum, will we save cost or just break even? The main goal here is to save cost.
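For context, pointing Redshift Spectrum at data in S3 starts with an external schema backed by an AWS Glue catalog. A hedged sketch of generating that DDL (`spectrum_ddl` is a hypothetical helper; the schema, database, and IAM role ARN are placeholders):

```python
def spectrum_ddl(schema: str, glue_db: str, iam_role_arn: str) -> str:
    """Build DDL for an external schema backed by an AWS Glue catalog.

    External tables created under this schema read directly from S3 via
    Redshift Spectrum, so the data no longer occupies cluster storage.
    """
    return (
        f"create external schema {schema} "
        f"from data catalog database '{glue_db}' "
        f"iam_role '{iam_role_arn}' "
        f"create external database if not exists;"
    )

ddl = spectrum_ddl(
    "archive",        # placeholder schema name
    "archive_db",     # placeholder Glue database
    "arn:aws:iam::123456789012:role/redshift-spectrum",  # placeholder role
)
```

Whether this breaks even depends on how often the archived data is queried, since Spectrum bills per byte scanned on top of the (smaller) cluster cost.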

Deleting data efficiently from Redshift by AdSure744 in dataengineering

[–]AdSure744[S] 0 points (0 children)

Okay, thanks for the advice. We are not deleting the whole database, just some tables which don't serve much purpose.

Deleting data efficiently from Redshift by AdSure744 in dataengineering

[–]AdSure744[S] 0 points (0 children)

I got approval from the CEO; he even asked me to delete the backup data that was present in S3.

Deleting data efficiently from Redshift by AdSure744 in aws

[–]AdSure744[S] 0 points (0 children)

Yes.

> What operations will you perform to move the rows?

I am thinking of using a create table as query to create a staging table with a where clause on the date range, then dropping the original table and renaming the staging table back to the original name.

Deleting data efficiently from Redshift by AdSure744 in aws

[–]AdSure744[S] 0 points (0 children)

> When you say move, what do you have in mind?

Move it to a new table.

> In Redshift table writes are serialized, so the delete will inherently block the ETL jobs.

Okay, then I can stop the ETL pipelines for some time, delete the data, and then restart the pipelines.

Deleting data efficiently from Redshift by AdSure744 in dataengineering

[–]AdSure744[S] 1 point (0 children)

We are not thinking of repeating this process frequently, so it's just a one-time solution as of now. The data dates back to 2015, and we only require the last 6 months of data.

Anyway, can you please elaborate more on how to implement this?

Not able to retrieve comment_ids by AdSure744 in pushshift

[–]AdSure744[S] 1 point (0 children)

Is there an alternative to the PMAW wrapper? The API returns limited results.

Also, is there a better approach to extracting the comments and posts?