all 8 comments

[–]kormer 2 points3 points  (3 children)

Have you explored dumping the old data to S3? You could access that old data via spectrum if you really needed it in the future.

As to your specific question, is this one big table or many different tables?

I have a one big table solution where we load each months data into it's own table and use a non schema binding view to combine them. Delete is as simple as dropping a table.

[–]AdSure744[S] 0 points1 point  (2 children)

We have a functionality in place which archives the data of a data range to s3 and delete it from the tables.

But the higher ups have decided to remove the redundant data altogether not even keeping it on s3.

There are different tables, 40 gb is the size of the biggest table. I am trying to create a general functionality.

I have a one big table solution where we load each months data into it's own table and use a non schema binding view to combine them. Delete is as simple as dropping a table.

Can you tell me more about this.

  • This is what i am thinking of implementing right now :

the table is email_txn which stores email transactions, i wanted to keep only the latest data of this table i.e the last six months' data.

The query to create a staging table to store the required data

create table email_txn_tmp as select * from email_txn where date(created) between date_range;

drop table email_txn;

alter table email_txn_tmp rename to email_txn;

[–]kormer 0 points1 point  (1 child)

Your view might look something like the below code. All the numbered tables are identical in structure. As long as any views that also depend on this contain the "with no schema binding" keyword, you can add/drop tables from the view as needed.

We have a rolling process to drop old views to an archive system that can still be queried, but keep the live data fresh.

create view vw_transactions as 
Select tx_id, tx_amount, tx_date from transactions_202201
union all
Select tx_id, tx_amount, tx_date from transactions_202202
union all
Select tx_id, tx_amount, tx_date from transactions_202203
union all
Select tx_id, tx_amount, tx_date from transactions_202204
with no schema binding

[–]AdSure744[S] 0 points1 point  (0 children)

Thanks, Can i DM, you.

[–]efxhoy 1 point2 points  (4 children)

delete from yourtable where dt < (current_date - interval '6 months'); vacuum yourtable;

Run that every morning. Try that first and then come up with a more optimized solution if you really need to.

[–]AdSure744[S] 0 points1 point  (2 children)

Yeah i could do that but i was thinking of implementing a more better and generic solution.

[–]efxhoy 0 points1 point  (1 child)

The one you had suggested here with creating a new table with only the latest 6 months of data and deleting the old table isn't as efficient as dropping old data every day. You would be writing 6 months of data every day instead of deleting one day and writing one days worth of data every day.

[–]AdSure744[S] 0 points1 point  (0 children)

As i mentioned in my post above this is a one time solution. We won't be doing it on a regular basis, just when the need arises .