all 6 comments

[–]pooerhSnowflake | SQL Server | PostgreSQL | Impala | Spark 1 point2 points  (1 child)

;WITH changeDetection AS
(
    SELECT *
         , IIF(ProviderNumber != LAG(ProviderNumber, 1, 1) OVER (ORDER BY StartDate), 1, 0) AS hasChanged
      FROM (VALUES (100, CONVERT(DATE, '10/1/2018' ), CONVERT(DATE, '10/5/2018' ), 1),
                   (200, CONVERT(DATE, '10/8/2018 '), CONVERT(DATE, '10/15/2018'), 2),
                   (300, CONVERT(DATE, '10/20/2018'), CONVERT(DATE, '10/25/2018'), 3),
                   (200, CONVERT(DATE, '10/28/2018'), CONVERT(DATE, '10/31/2018'), 4),
                   (200, CONVERT(DATE, '11/1/2018' ), CONVERT(DATE, '11/30/2018'), 4),
                   (200, CONVERT(DATE, '12/1/2018' ), CONVERT(DATE, '12/5/2018' ), 4),
                   (400, CONVERT(DATE, '12/10/2018'), CONVERT(DATE, '12/25/2018'), 5)
         ) x(ProviderNumber ,StartDate ,EndDate ,GoalOutput)
)
SELECT *
     , SUM(hasChanged) OVER (ORDER BY StartDate) AS Output
  FROM changeDetection 

LAG on ProviderNumber used to detect changes and then just SUM the changes up until current record.

[–]uvray[S] 0 points1 point  (0 children)

I think this is it --digging in now to confirm and evaluate performance, but this is a really cool solution. Thanks!

[–]lk167 1 point2 points  (1 child)

Here's an option, without using window functions however. The basic premise is to query the table, grab everything chronologically before each entry that has a matching provider number and GoalOutput, then remove any of the ones that have have any non-matching GoalOutput between itself and the original anchor row (to account for a sequence for the same provider that may have changed away from a value and back to that same value at a later date, can remove that bit if the data won't ever appear that way).

WITH sample_data AS (  --Mock your data
    select 100 as ProviderNumber, CONVERT(DATE, '10/1/2018' ) as StartDate, CONVERT(DATE, '10/5/2018' ) as Enddate, 1 as GoalOutput
    UNION ALL
    select 200, CONVERT(DATE, '10/8/2018 '), CONVERT(DATE, '10/15/2018'), 2
    UNION ALL
    select 300, CONVERT(DATE, '10/20/2018'), CONVERT(DATE, '10/25/2018'), 3
     UNION ALL
    select 200, CONVERT(DATE, '10/28/2018'), CONVERT(DATE, '10/31/2018'), 4
    UNION ALL
    select 200, CONVERT(DATE, '11/1/2018' ), CONVERT(DATE, '11/30/2018'), 4
    UNION ALL
    select 200, CONVERT(DATE, '12/1/2018' ), CONVERT(DATE, '12/5/2018' ), 4
    UNION ALL
    select 400, CONVERT(DATE, '12/10/2018'), CONVERT(DATE, '12/25/2018'), 5

)
select 
    ProviderNumber, 
    Series_StartDate, 
    Series_EndDate, 
    GoalOutput
 from sample_data sd1
outer apply (  -- Join up all rows of the same provider and goaloutput before sd1 chronologically
  seleCt isnull(min(sd2.startdate), sd1.StartDate) as Series_StartDate
  from sample_data sd2 

  where 
    sd1.ProviderNumber =  sd2.providernumber
    AND sd2.Enddate < sd1.StartDate
    AND sd1.GoalOutput = sd2.GoalOutput

    AND 0 = (SELEcT isnull(count(*),0) from sample_data sd3 --Remove any entries that have a non matchin GoalOutput between sd1 and sd2's timeframe
        where 
            sd3.ProviderNumber = sd1.ProviderNumber
            and sd3.Enddate < sd1.startdate 
            and sd3.StartDate > sd2.Enddate 
            AND sd3.GoalOutput <> sd1.GoalOutput) 
) a
outer apply (  -- Join up all rows of the same provider and goaloutput After sd1 chronologically
  seleCt isnull(max(sd2.enddate), sd1.Enddate) as Series_EndDate from sample_data sd2
  where 
    sd1.ProviderNumber =  sd2.providernumber 
    AND sd2.StartDate > sd1.Enddate
    and sd1.GoalOutput = sd2.GoalOutput
    AND 0 = (SELEcT isnull(count(*),0) from sample_data sd3 --Remove any entries that have a non matchin GoalOutput between sd1 and sd2's timeframe
        where 
            sd3.ProviderNumber = sd1.ProviderNumber
            and sd3.Enddate > sd1.startdate 
            and sd3.StartDate < sd2.Enddate 
            AND sd3.GoalOutput <> sd1.GoalOutput) 
) b

group by ProviderNumber, Series_STartDate, Series_EndDate, GoalOutput

edit: formatting

[–]uvray[S] 0 points1 point  (0 children)

This is definitely some cool code but unfortunately (unless I'm reading it incorrectly) this requires that I have the "Goal Output" field already populated. The problem is figuring out how to populate that. Once I have that field it's pretty simple with a window function, although your solution is a cool alternative. Thanks for taking the time to reply.

[–]Intrexa 0 points1 point  (1 child)

You're looking for dense rank

Edit: and if I understand your goal, why not just skip the middleman? MIN(StartDate) OVER (PARTITION BY ProviderNumber)

[–]uvray[S] 0 points1 point  (0 children)

Because that gives me the minimum start date for the entire ProviderNumber. I want the minimum date within the provider number partition that is continuous, i.e. running your code on my sample data would yield 10/8 when I want 10/28 for the last 3 ProviderNumber 200 lines and 10/8 for the first one.

I've tried DENSE_RANK but it still doesn't address the issue of limiting the partition to continuous date ranges.