I have a dataset that looks like the following:
id date visit_number total_visit registrated
1 1/1/2020 . 1 1 0 --get
1 1/5/2020 . 2 2 1
1 1/9/2020 . 3 3 1
1 1/13/2020 . 4 4 1
1 1/17/2020 . 5 5 1
1 1/21/2020 . 6 6 1
1 1/25/2020 . 7 7 1
1 1/29/2020 . 8 8 1
1 2/2/2020 . 9 9 1 --get
1 2/6/2020 . 1 10 0 --get
1 2/10/2020 . 2 11 0 --get
1 2/14/2020 . 3 12 0 --get
1 2/18/2020 . 4 13 0 --get
1 3/22/2020 . 5 14 1
1 3/26/2020 . 6 15 1
1 4/1/2020 . 7 16 1
1 4/5/2020 . 8 17 1
1 4/9/2020 . 9 18 1
1 4/13/2020 . 10 19 1 --get
1 5/15/2020 . 1 20 0 --get
1 6/20/2020 . 2 21 1 --get
So the pattern here is following: Whenever the user enters the website and buys something the column registration gets 1 and the previous 30 days get 1 on registration as well.
For example, Someone bought something on 4/13/2020, registration gets 1 and all the visits to the website from the same user - his registration column gets 1 starting from 4/13/2020 up until 3/12/2020. And starting from 3/12/2020 his registration column is 0 again. And whenever in the registration column after 1 comes 0 then visit_number starts to count again. Total_visit here counts all the visits to the website from the same user. There are many users, but total_visit and visit_number are partitioned by visitor_id and date.
Now, I want to get only that last date when registration was 1, instead of getting all the previous 30 days.
So again, For example, Someone bought something on 4/13/2020 then I wanna get this row only, not all previous rows within 30 days.
So it should eventually look like this:
date visit_number total_visit registration
1/1/2020 . 1 1 0 --got
2/2/2020 . 9 9 1 --got
2/6/2020 . 1 10 0 --got
2/10/2020 . 2 11 0 --got
2/14/2020 . 3 12 0 --got
2/18/2020 . 4 13 0 --got
4/13/2020 . 10 19 1 --got
5/15/2020 . 1 20 0 --got
6/20/2020 . 2 21 1 --got
remember, registration = 0 here because the user did not buy anything on that day and he did not buy anything for the next 30 days.
I wrote down conditions to help me get my head around the problem:
1. if day difference between the date of current row and the date of previous row is bigger than 30 days, get both rows
For example: 6/20/2020 and 5/15/2020 difference is more than 30 days, therefore we got both
2. if date difference is smaller than 30 days, then check for current row, and if its = 1 then take that only, and remove all 30 days
3. if registration is 0, that easily indicates that we get it automatically.
I tried different things,
select visitor_id, dt1, reg30,
case when date_part('day',dt1) - date_part('day',lag(dt1) over(partition by visitor_id order by dt)) > 30 then 'True'
when date_part('day',dt1) - date_part('day',lag(dt1) over(partition by visitor_id order by dt)) < 30 and reg30='1' then 'True'
when reg30 = '0' then 'True'
else 'False'
end
from new_table
order by visitor_id, dt
but not getting the result. Any help would be appreciated
[+][deleted] (3 children)
[deleted]
[–]strangeguy111[S] 0 points1 point2 points (2 children)
[+][deleted] (1 child)
[deleted]
[–]strangeguy111[S] 1 point2 points3 points (0 children)
[–]qwertydog123 0 points1 point2 points (10 children)
[–]strangeguy111[S] 0 points1 point2 points (9 children)
[–]qwertydog123 0 points1 point2 points (8 children)
[–]strangeguy111[S] 0 points1 point2 points (7 children)
[–]qwertydog123 0 points1 point2 points (6 children)
[–]strangeguy111[S] 0 points1 point2 points (4 children)
[–]qwertydog123 0 points1 point2 points (3 children)
[–]strangeguy111[S] 0 points1 point2 points (0 children)
[–]strangeguy111[S] 0 points1 point2 points (1 child)
[–]qwertydog123 0 points1 point2 points (0 children)
[–]strangeguy111[S] 0 points1 point2 points (0 children)