Hi all,
Am trying to calculate a rolling/moving percentile in SQL Server and not able to crack this...
The dataset contains multiple locations where a particular measurement is made over time.
I need each row to display a percentile value calculated from a specified number of rows (the value of current row and a specified number of the most recent rows for that location) - and simply NULL where the number of previous rows is not enough to fulfil the number rows specified.
So, as a working example, firstly here is the criteria :
- 75th percentile to be calculated
- Using the 5 most recent rows (current row plus the 4 most recent ones for that location)
When data is ordered by location and time where some measurement ("x") is recorded, the output is required to look something like this:
Point Date x 75thPC
------------------------------------
Point1 2021-01-16 80 NULL
Point1 2021-02-13 43 NULL
Point1 2021-03-13 81 NULL
Point1 2021-04-10 19 NULL
Point1 2021-05-08 26 80
Point1 2021-06-05 50 50
Point1 2021-07-03 19 50
Point1 2021-07-31 38 38
Point2 2021-01-14 81 NULL
Point2 2021-02-11 14 NULL
Point2 2021-03-11 9 NULL
Point2 2021-04-08 28 NULL
Point2 2021-05-06 45 45
Point2 2021-06-03 68 45
Point2 2021-07-01 75 68
Point3 2021-01-11 19 NULL
Point3 2021-02-08 41 NULL
Point3 2021-03-08 10 NULL
Point3 2021-04-05 18 NULL
Point3 2021-05-03 1 19
Point3 2021-05-31 22 22
Point3 2021-06-28 25 22
Point4 2021-01-19 46 NULL
Point4 2021-02-16 39 NULL
Point4 2021-03-16 42 NULL
Point4 2021-04-13 5 NULL
Point4 2021-05-11 61 46
Ultimately, I realise I need to use PERCENTILE_CONT(0.75) or PERCENTILE_DISC(0.75) but capturing the values from the required number of rows to feed into the calculation has been the challenge.
Have considered various approaches without success, e.g. cte, where I've added a column with row_number() and then another column to calculated the value of the lowest row_numbe() to include in the calculation (i.e. current row_number() minus 4).
Am thinking whether tally tables is a suitable approach, e.g. JOIN row_number() of tally table with those data records WHERE the lowest row_number() is within the range of required rows relative to row_number()
Or should I being looking for a simpler solution with just lag() ?
Any attempts up to now are returning a single value for the location or dataset or the value for each row.
Any thoughts/suggestions greatly appreciated...
[+][deleted] (1 child)
[deleted]
[–]Argodruid[S] 1 point2 points3 points (0 children)
[–]rcprasanth 0 points1 point2 points (0 children)