[ACCEPTED]-SQL Query to Collapse Duplicate Values By Date Range-oracle

Accepted answer
Score: 58

I'm going to develop my solution incrementally, decomposing 96 each transformation into a view. This both 95 helps explain what's being done, and helps 94 in debugging and testing. It's essentially 93 applying the principle of functional decomposition 92 to database queries.

I'm also going to do 91 it without using Oracle extensions, with 90 SQL that ought to run on any modern RBDMS. So 89 no keep, over, partition, just subqueries 88 and group bys. (Inform me in the comments 87 if it doesn't work on your RDBMS.)

First, the 86 table, which since I'm uncreative, I'll 85 call month_value. Since the id is not actually 84 a unique id, I'll call it "eid". The other 83 columns are "m"onth, "y"ear, and "v"alue:

create table month_value( 
   eid int not null, m int, y int,  v int );

After 82 inserting the data, for two eids, I have:

> select * from month_value;
+-----+------+------+------+
| eid | m    | y    | v    |
+-----+------+------+------+
| 100 |    1 | 2008 |   80 |
| 100 |    2 | 2008 |   80 |
| 100 |    3 | 2008 |   90 |
| 100 |    4 | 2008 |   80 |
| 200 |    1 | 2008 |   80 |
| 200 |    2 | 2008 |   80 |
| 200 |    3 | 2008 |   90 |
| 200 |    4 | 2008 |   80 |
+-----+------+------+------+
8 rows in set (0.00 sec)

Next, we 81 have one entity, the month, that's represented 80 as two variables. That should really be 79 one column (either a date or a datetime, or 78 maybe even a foreign key to a table of dates), so 77 we'll make it one column. We'll do that 76 as a linear transform, such that it sorts 75 the same as (y, m), and such that for any 74 (y,m) tuple there is one and only value, and 73 all values are consecutive:

> create view cm_abs_month as 
select *, y * 12 + m as am from month_value;

That gives us:

> select * from cm_abs_month;
+-----+------+------+------+-------+
| eid | m    | y    | v    | am    |
+-----+------+------+------+-------+
| 100 |    1 | 2008 |   80 | 24097 |
| 100 |    2 | 2008 |   80 | 24098 |
| 100 |    3 | 2008 |   90 | 24099 |
| 100 |    4 | 2008 |   80 | 24100 |
| 200 |    1 | 2008 |   80 | 24097 |
| 200 |    2 | 2008 |   80 | 24098 |
| 200 |    3 | 2008 |   90 | 24099 |
| 200 |    4 | 2008 |   80 | 24100 |
+-----+------+------+------+-------+
8 rows in set (0.00 sec)

Now 72 we'll use a self-join in a correlated subquery 71 to find, for each row, the earliest successor 70 month in which the value changes. We'll 69 base this view on the previous view we created:

> create view cm_last_am as 
   select a.*, 
    ( select min(b.am) from cm_abs_month b 
      where b.eid = a.eid and b.am > a.am and b.v <> a.v) 
   as last_am 
   from cm_abs_month a;

> select * from cm_last_am;
+-----+------+------+------+-------+---------+
| eid | m    | y    | v    | am    | last_am |
+-----+------+------+------+-------+---------+
| 100 |    1 | 2008 |   80 | 24097 |   24099 |
| 100 |    2 | 2008 |   80 | 24098 |   24099 |
| 100 |    3 | 2008 |   90 | 24099 |   24100 |
| 100 |    4 | 2008 |   80 | 24100 |    NULL |
| 200 |    1 | 2008 |   80 | 24097 |   24099 |
| 200 |    2 | 2008 |   80 | 24098 |   24099 |
| 200 |    3 | 2008 |   90 | 24099 |   24100 |
| 200 |    4 | 2008 |   80 | 24100 |    NULL |
+-----+------+------+------+-------+---------+
8 rows in set (0.01 sec)

last_am 68 is now the "absolute month" of the first 67 (earliest) month (after the month of the 66 current row) in which the value, v, changes. It's 65 null where there is no later month, for 64 that eid, in the table.

Since last_am is 63 the same for all months leading up to the 62 change in v (which occurs at last_am), we 61 can group on last_am and v (and eid, of 60 course), and in any group, the min(am) is 59 the absolute month of the first consecutive month 58 that had that value:

> create view cm_result_data as 
  select eid, min(am) as am , last_am, v 
  from cm_last_am group by eid, last_am, v;

> select * from cm_result_data;
+-----+-------+---------+------+
| eid | am    | last_am | v    |
+-----+-------+---------+------+
| 100 | 24100 |    NULL |   80 |
| 100 | 24097 |   24099 |   80 |
| 100 | 24099 |   24100 |   90 |
| 200 | 24100 |    NULL |   80 |
| 200 | 24097 |   24099 |   80 |
| 200 | 24099 |   24100 |   90 |
+-----+-------+---------+------+
6 rows in set (0.00 sec)

Now this is the result 57 set we want, which is why this view is called 56 cm_result_data. All that's lacking is something 55 to transform absolute months back to (y,m) tuples.

To 54 do that, we'll just join to the table month_value.

There 53 are only two problems: 1) we want the month 52 before last_am in our output, and 2) we have nulls 51 where there is no next month in our data; to 50 met the OP's specification, those should 49 be single month ranges.

EDIT: These could 48 actually be longer ranges than one month, but 47 in every case they mean we need to find 46 the latest month for the eid, which is:

(select max(am) from cm_abs_month d where d.eid = a.eid )

Because 45 the views decompose the problem, we could 44 add in this "end cap" month earlier, by 43 adding another view, but I'll just insert 42 this into the coalesce. Which would be most 41 efficient depends on how your RDBMS optimizes 40 queries.

To get month before, we'll join 39 (cm_result_data.last_am - 1 = cm_abs_month.am)

Wherever 38 we have a null, the OP wants the "to" month 37 to be the same as the "from" month, so we'll 36 just use coalesce on that: coalesce( last_am, am). Since 35 last eliminates any nulls, our joins don't 34 need to be outer joins.

> select a.eid, b.m, b.y, c.m, c.y, a.v 
   from cm_result_data a 
    join cm_abs_month b 
      on ( a.eid = b.eid and a.am = b.am)  
    join cm_abs_month c 
      on ( a.eid = c.eid and 
      coalesce( a.last_am - 1, 
              (select max(am) from cm_abs_month d where d.eid = a.eid )
      ) = c.am)
    order by 1, 3, 2, 5, 4;
+-----+------+------+------+------+------+
| eid | m    | y    | m    | y    | v    |
+-----+------+------+------+------+------+
| 100 |    1 | 2008 |    2 | 2008 |   80 |
| 100 |    3 | 2008 |    3 | 2008 |   90 |
| 100 |    4 | 2008 |    4 | 2008 |   80 |
| 200 |    1 | 2008 |    2 | 2008 |   80 |
| 200 |    3 | 2008 |    3 | 2008 |   90 |
| 200 |    4 | 2008 |    4 | 2008 |   80 |
+-----+------+------+------+------+------+

By joining back we 33 get the output the OP wants.

Not that we 32 have to join back. As it happens, our absolute_month 31 function is bi-directional, so we can just 30 recalculate the year and offset month from 29 it.

First, lets take care of adding the "end 28 cap" month:

> create or replace view cm_capped_result as 
select eid, am, 
  coalesce( 
   last_am - 1, 
   (select max(b.am) from cm_abs_month b where b.eid = a.eid)
  ) as last_am, v  
 from cm_result_data a;

And now we get the data, formatted 27 per the OP:

select eid, 
 ( (am - 1) % 12 ) + 1 as sm, 
 floor( ( am - 1 ) / 12 ) as sy, 
 ( (last_am - 1) % 12 ) + 1 as em, 
 floor( ( last_am - 1 ) / 12 ) as ey, v    
from cm_capped_result 
order by 1, 3, 2, 5, 4;

+-----+------+------+------+------+------+
| eid | sm   | sy   | em   | ey   | v    |
+-----+------+------+------+------+------+
| 100 |    1 | 2008 |    2 | 2008 |   80 |
| 100 |    3 | 2008 |    3 | 2008 |   90 |
| 100 |    4 | 2008 |    4 | 2008 |   80 |
| 200 |    1 | 2008 |    2 | 2008 |   80 |
| 200 |    3 | 2008 |    3 | 2008 |   90 |
| 200 |    4 | 2008 |    4 | 2008 |   80 |
+-----+------+------+------+------+------+

And there's the data the OP wants. All 26 in SQL that should run on any RDBMS, and 25 is decomposed into simple, easy to understand 24 and easy to test views.

Is is better to rejoin 23 or to recalculate? I'll leave that (it's 22 a trick question) to the reader.

(If your 21 RDBMS doesn't allow group bys in views, you'll 20 have to join first and then group, or group 19 and then pull in the month and year with 18 correlated subqueries. This is left as an 17 exercise for the reader.)


Jonathan Leffler 16 asks in the comments,

What happens with 15 your query if there are gaps in the data 14 (say there's an entry for 2007-12 with 13 value 80, and another for 2007-10, but 12 not one for 2007-11? The question isn't 11 clear what should happen there.

Well, you're 10 exactly right, the OP doesn't specify. Perhaps 9 there's an (unmentioned) pre-condition that 8 there are no gaps. In the absence of a requirement, we 7 shouldn't try to code around something that 6 might not be there. But, the fact is, gaps 5 make the "joining back" strategy fail; the 4 "recalculate" strategy doesn't fail under 3 those conditions. I'd say more, but that 2 would reveal the trick in the trick question 1 I alluded to above.

Score: 1

I got it to work as follows. It is heavy 20 on analytic functions and is Oracle specific.

select distinct id, value,
decode(startMonth, null,
  lag(startMonth) over(partition by id, value order by startMonth, endMonth),  --if start is null, it's an end so take from the row before
startMonth) startMonth,

  decode(endMonth, null,
  lead(endMonth) over(partition by id, value order by startMonth, endMonth),  --if end is null, it's an start so take from the row after
endMonth) endMonth    

from (
select id, value, startMonth, endMonth from(
select id, value, 
decode(month+1, lead(month) over(partition by id,value order by month), null, month)     
startMonth, --get the beginning month for each interval
decode(month-1, lag(month) over(partition by id,value order by month), null, month)     
endMonth --get the end month for each interval from Tbl
) a 
where startMonth is not null or endMonth is not null --remain with start and ends only
)b

It 19 might be possible to simplify some of the 18 inner queries somewhat

The inner query checks 17 if the month is a first/last month of the 16 interval as follows: if the month + 1 == the 15 next month (lag) for that grouping, then 14 since there is a next month, this month 13 is obviously not the end month. Otherwise, it 12 is the last month of the interval. The same 11 concept is used to check for the first month.

The 10 outer query first filters out all rows that 9 are not either start or end months (where startMonth is not null or endMonth is not null). Then, each 8 row is either a start month or an end month 7 (or both), determined by whether start or 6 end is not null). If the month is a start 5 month, get the corresponding end month by 4 getting the next (lead) endMonth for that 3 id,value ordered by endMonth, and if it 2 is an endMonth get the startMonth by looking 1 for the previous startMonth (lag)

Score: 1

This one uses only one table scan and works 3 across years. It's better though to model 2 your month and year column as only one date 1 datatype column:

SQL> create table tbl (id,month,year,value)
  2  as
  3  select 100,12,2007,80 from dual union all
  4  select 100,1,2008,80 from dual union all
  5  select 100,2,2008,80 from dual union all
  6  select 100,3,2008,90 from dual union all
  7  select 100,4,2008,80 from dual union all
  8  select 200,12,2007,50 from dual union all
  9  select 200,1,2008,50 from dual union all
 10  select 200,2,2008,40 from dual union all
 11  select 200,3,2008,50 from dual union all
 12  select 200,4,2008,50 from dual union all
 13  select 200,5,2008,50 from dual
 14  /

Tabel is aangemaakt.

SQL> select id
  2       , mod(min(year*12+month-1),12)+1 startmonth
  3       , trunc(min(year*12+month-1)/12) startyear
  4       , mod(max(year*12+month-1),12)+1 endmonth
  5       , trunc(max(year*12+month-1)/12) endyear
  6       , value
  7    from ( select id
  8                , month
  9                , year
 10                , value
 11                , max(rn) over (partition by id order by year,month) maxrn
 12             from ( select id
 13                         , month
 14                         , year
 15                         , value
 16                         , case lag(value) over (partition by id order by year,month)
 17                           when value then null
 18                           else rownum
 19                           end rn
 20                      from tbl
 21                  ) inner
 22         )
 23   group by id
 24       , maxrn
 25       , value
 26   order by id
 27       , startyear
 28       , startmonth
 29  /

        ID STARTMONTH  STARTYEAR   ENDMONTH    ENDYEAR      VALUE
---------- ---------- ---------- ---------- ---------- ----------
       100         12       2007          2       2008         80
       100          3       2008          3       2008         90
       100          4       2008          4       2008         80
       200         12       2007          1       2008         50
       200          2       2008          2       2008         40
       200          3       2008          5       2008         50

6 rijen zijn geselecteerd.

Regards, Rob.

Score: 0

I couldn't get the response from ngz to 19 work when the input table contains multiple 18 ids and date ranges that span years. I 17 have a solution that does work, but with 16 qualifications. It will only give you the 15 correct answers if you know that you have 14 a row for every month/year/id combination 13 within the range. If there are "holes" it 12 won't work. If you have holes, I know of 11 know good way to do it other than writing 10 some PL/SQL and using a cursor loop to create 9 a new table in the format you want.

By the 8 way, this is why data modeled this way is 7 an abomination. You should always store 6 stuff as start/from range records, not as 5 discrete time period records. It's trivial 4 to transform the former into the latter 3 with a "multiplier" table, but it's almost 2 impossible (as you've seen) to go the other 1 direction.

SELECT ID
     , VALUE
     , start_date
     , end_date
  FROM (SELECT ID
             , VALUE
             , start_date
             , CASE
                  WHEN is_last = 0
                     THEN LEAD(end_date) OVER(PARTITION BY ID ORDER BY start_date)
                  ELSE end_date
               END end_date
             , is_first
          FROM (SELECT ID
                     , VALUE
                     , TO_CHAR(the_date, 'YYYY.MM') start_date
                     , TO_CHAR(NVL(LEAD(the_date - 31) OVER(PARTITION BY ID ORDER BY YEAR
                                  , MONTH), the_date), 'YYYY.MM') end_date
                     , is_first
                     , is_last
                  FROM (SELECT ID
                             , YEAR
                             , MONTH
                             , TO_DATE(TO_CHAR(YEAR) || '.' || TO_CHAR(MONTH) || '.' || '15', 'YYYY.MM.DD') the_date
                             , VALUE
                             , ABS(SIGN(VALUE -(NVL(LAG(VALUE) OVER(PARTITION BY ID ORDER BY YEAR
                                                   , MONTH), VALUE - 1)))) is_first
                             , ABS(SIGN(VALUE -(NVL(LEAD(VALUE) OVER(PARTITION BY ID ORDER BY YEAR
                                                   , MONTH), VALUE - 1)))) is_last
                          FROM test_table)
                 WHERE is_first = 1
                    OR is_last = 1))
 WHERE is_first = 1

More Related questions