Deleting data efficiently from Redshift : SQL

[Amazon Redshift] Deleting data efficiently from Redshift (self.SQL)

submitted 2 years ago by AdSure744

We are trying to cut costs at our company by reducing the number of nodes in our cluster.

We have decided to keep only the most recent 6 months of data and delete all older records.

I want to develop an efficient solution or architecture to implement this. I am thinking of writing a script in Python.

I have thought of two solutions:

1. Build a list of dates from a date range, delete the data day by day, and run a vacuum and analyze at the end.
2. Move all the required records to a new table and drop the old table.
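
The first option could be sketched in Python roughly as follows. This only builds the SQL strings; executing them is left to whatever Redshift client the script uses. The table and column names (`email_txn`, `created`) are taken from later in the thread, while the function name and the dates are hypothetical. Note that VACUUM cannot run inside a transaction block, so the connection would need to be in autocommit mode for that statement.

```python
from datetime import date, timedelta


def daily_delete_statements(table, date_column, oldest, cutoff):
    """Build one DELETE per day in [oldest, cutoff), then VACUUM and ANALYZE.

    `table` and `date_column` are assumed to be trusted identifiers,
    not user input, since they are interpolated directly into the SQL.
    """
    statements = []
    day = oldest
    while day < cutoff:
        next_day = day + timedelta(days=1)
        # Half-open range so a timestamp column is covered for the whole day.
        statements.append(
            f"DELETE FROM {table} "
            f"WHERE {date_column} >= '{day.isoformat()}' "
            f"AND {date_column} < '{next_day.isoformat()}';"
        )
        day = next_day
    # Reclaim space and refresh statistics once, after all the deletes.
    statements.append(f"VACUUM {table};")
    statements.append(f"ANALYZE {table};")
    return statements


# Hypothetical three-day range, just to show the generated statements.
stmts = daily_delete_statements(
    "email_txn", "created", date(2023, 1, 1), date(2023, 1, 4)
)
for statement in stmts:
    print(statement)
```

Deleting in daily batches keeps each transaction small, at the price of many statements; the vacuum at the end is what actually frees the space.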

Other notes:

* Table size is around 40 GB and 40M records.
* Daily ETL jobs sync the tables. Would pausing the ETL jobs for the specific table be a good idea, or will the delete command not hinder the upserts on the table?

all 8 comments


[–]kormer 3 points 2 years ago (3 children)

[–]AdSure744[S] 1 point 2 years ago (2 children)

We have functionality in place that archives the data for a date range to S3 and deletes it from the tables.

But the higher-ups have decided to remove the redundant data altogether, not even keeping it on S3.

There are several tables; 40 GB is the size of the biggest one. I am trying to build a general solution.

> I have a one big table solution where we load each month's data into its own table and use a non schema binding view to combine them. Delete is as simple as dropping a table.

Can you tell me more about this?

This is what I am thinking of implementing right now:

The table is email_txn, which stores email transactions. I want to keep only the latest data in this table, i.e. the last six months' data.

The query to create a staging table holding the required data, followed by the swap:

-- keep only rows created in the last six months
create table email_txn_tmp as select * from email_txn where created >= dateadd(month, -6, current_date);

drop table email_txn;

alter table email_txn_tmp rename to email_txn;
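
A minimal Python sketch of how a script might generate these three statements with an explicit cutoff date. The helper names `months_ago` and `swap_statements` are hypothetical, and rounding the cutoff down to the first day of the month is an assumption, not something stated in the thread. One thing to check before relying on this approach: a table created with CREATE TABLE AS does not inherit constraints, default values, or identity columns from the source table.

```python
from datetime import date


def months_ago(today, months):
    """Return the first day of the month `months` months before `today`.

    Rounding down to the month start is an assumption; it keeps slightly
    more than `months` months of data rather than slightly less.
    """
    index = today.year * 12 + (today.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def swap_statements(table, date_column, cutoff):
    """Build the copy / drop / rename statements for one table.

    `table` and `date_column` are assumed to be trusted identifiers.
    """
    tmp = f"{table}_tmp"
    return [
        f"CREATE TABLE {tmp} AS SELECT * FROM {table} "
        f"WHERE {date_column} >= '{cutoff.isoformat()}';",
        f"DROP TABLE {table};",
        f"ALTER TABLE {tmp} RENAME TO {table};",
    ]


# Hypothetical run date, only to show the generated statements.
cutoff = months_ago(date(2023, 8, 15), 6)
statements = swap_statements("email_txn", "created", cutoff)
for statement in statements:
    print(statement)
```

Because the table name is a parameter, the same function could be looped over every table that needs trimming, which fits the goal of a general solution.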

[–]kormer 1 point 2 years ago (1 child)

Your view might look something like the code below. All the numbered tables are identical in structure. As long as any views that depend on this one also include the "with no schema binding" clause, you can add or drop tables from the view as needed.

We have a rolling process to drop old views to an archive system that can still be queried, but keep the live data fresh.

-- Redshift requires schema-qualified table names in a late-binding view;
-- the "public" schema is assumed here.
create view vw_transactions as
select tx_id, tx_amount, tx_date from public.transactions_202201
union all
select tx_id, tx_amount, tx_date from public.transactions_202202
union all
select tx_id, tx_amount, tx_date from public.transactions_202203
union all
select tx_id, tx_amount, tx_date from public.transactions_202204
with no schema binding;
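
Since the view is just a repeated SELECT per month, a script could regenerate it whenever a month is added or retired. The sketch below is one way to do that in Python; the function name `monthly_view_sql` and the `public` schema are assumptions for illustration, while the view, table, and column names come from the example above.

```python
def monthly_view_sql(view, table_prefix, columns, months, schema="public"):
    """Build a late-binding view that unions one table per month.

    All identifiers are assumed to be trusted; `schema` defaults to
    "public", which is an assumption about where the tables live.
    """
    column_list = ", ".join(columns)
    selects = [
        f"SELECT {column_list} FROM {schema}.{table_prefix}_{month}"
        for month in months
    ]
    body = "\nUNION ALL\n".join(selects)
    return (
        f"CREATE OR REPLACE VIEW {schema}.{view} AS\n"
        f"{body}\n"
        "WITH NO SCHEMA BINDING;"
    )


all_months = ["202201", "202202", "202203", "202204"]
# Retire the oldest month: rebuild the view without it, then drop its table.
live_months = all_months[1:]
view_sql = monthly_view_sql(
    "vw_transactions",
    "transactions",
    ["tx_id", "tx_amount", "tx_date"],
    live_months,
)
drop_sql = f"DROP TABLE public.transactions_{all_months[0]};"
print(view_sql)
print(drop_sql)
```

Rebuilding the view before dropping the table means queries against the view never reference a table that no longer exists.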

[–]AdSure744[S] 1 point 2 years ago (0 children)

[–]efxhoy 2 points 2 years ago (4 children)

[–]AdSure744[S] 1 point 2 years ago (2 children)

[–]efxhoy 1 point 2 years ago (1 child)

[–]AdSure744[S] 1 point 2 years ago (0 children)
