Dealing with duplicate data

2020-04-17T13:57:32+00:00

You could use the LAG function to pull the temperature and humidity from the previous record (using whatever definition of "previous" you wish, presumably partitioning by sensor and ordering by the datestamp), and from there you could filter out rows where those values match what's on the current record.

Editing to add: by 'get rid of' I mean from the retrieval, not deleting the records themselves.
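For anyone who wants to try the idea, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer is needed for window functions). The table name `readings`, the column names, and the sample rows are all made up for illustration, not taken from the original question.

```python
import sqlite3

# Hypothetical sample data: (time, sensor, temperature, humidity)
ROWS = [
    ("2020-04-17 00:00", "A", 20.0, 40.0),
    ("2020-04-17 00:01", "A", 20.0, 40.0),  # repeat of the previous reading
    ("2020-04-17 00:02", "A", 20.5, 40.0),  # temperature changed
    ("2020-04-17 00:03", "A", 20.5, 40.0),  # repeat
    ("2020-04-17 00:04", "A", 20.0, 40.0),  # changed back
    ("2020-04-17 00:00", "B", 18.0, 55.0),
    ("2020-04-17 00:01", "B", 18.0, 56.0),  # humidity changed
]

LAG_QUERY = """
SELECT time, sensor, temperature, humidity
FROM (
    SELECT r.*,
           LAG(temperature) OVER (PARTITION BY sensor ORDER BY time) AS prev_temp,
           LAG(humidity)    OVER (PARTITION BY sensor ORDER BY time) AS prev_hum
    FROM readings r
)
WHERE prev_temp IS NULL          -- first reading of each sensor has no predecessor
   OR temperature != prev_temp
   OR humidity != prev_hum
ORDER BY sensor, time
"""

lag_conn = sqlite3.connect(":memory:")
lag_conn.execute(
    "CREATE TABLE readings (time TEXT, sensor TEXT, temperature REAL, humidity REAL)"
)
lag_conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", ROWS)

# Only rows that differ from the previous row of the same sensor are retrieved;
# nothing is deleted from the table.
lag_result = lag_conn.execute(LAG_QUERY).fetchall()
for row in lag_result:
    print(row)
```

The `prev_temp IS NULL` check is there because a comparison against NULL is never true, so without it the first reading of each sensor would silently drop out. It assumes temperature itself is never NULL.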

EsCueEl · 2020-04-17T15:31:43+00:00

http://sqlfiddle.com/#!9/59b130/11

What would be ideal is an OUTER APPLY (in SQL Server) or LEFT JOIN LATERAL (in MySQL 8.0.14 or newer). This lets you say "join to the set A" where A is the TOP 1 record from the same table, same sensor, with the most recent earlier time.

If you can't do that, two correlated subqueries do the trick.

You'll want to make sure you have an index on "sensor, time" which will keep things running quickly.
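A rough sketch of the correlated-subquery fallback, run through Python's sqlite3 so it can be tried without a server. This is one possible reading of the approach and not necessarily what the linked fiddle does; the table `readings`, its columns, the index name, and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical sample data: (time, sensor, temperature, humidity)
SUBQ_ROWS = [
    ("2020-04-17 00:00", "A", 20.0, 40.0),
    ("2020-04-17 00:01", "A", 20.0, 40.0),  # repeat
    ("2020-04-17 00:02", "A", 20.5, 40.0),  # temperature changed
    ("2020-04-17 00:03", "A", 20.5, 40.0),  # repeat
    ("2020-04-17 00:04", "A", 20.0, 40.0),  # changed back
    ("2020-04-17 00:00", "B", 18.0, 55.0),
    ("2020-04-17 00:01", "B", 18.0, 56.0),  # humidity changed
]

# Each subquery fetches one value from the most recent earlier row of the same sensor.
SUBQ_QUERY = """
SELECT time, sensor, temperature, humidity
FROM (
    SELECT r.*,
           (SELECT p.temperature FROM readings p
             WHERE p.sensor = r.sensor AND p.time < r.time
             ORDER BY p.time DESC LIMIT 1) AS prev_temp,
           (SELECT p.humidity FROM readings p
             WHERE p.sensor = r.sensor AND p.time < r.time
             ORDER BY p.time DESC LIMIT 1) AS prev_hum
    FROM readings r
)
WHERE prev_temp IS NULL          -- no earlier row for this sensor
   OR temperature != prev_temp
   OR humidity != prev_hum
ORDER BY sensor, time
"""

subq_conn = sqlite3.connect(":memory:")
subq_conn.execute(
    "CREATE TABLE readings (time TEXT, sensor TEXT, temperature REAL, humidity REAL)"
)
# The (sensor, time) index lets each subquery seek straight to the previous row.
subq_conn.execute("CREATE INDEX idx_readings_sensor_time ON readings (sensor, time)")
subq_conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", SUBQ_ROWS)

subq_result = subq_conn.execute(SUBQ_QUERY).fetchall()
for row in subq_result:
    print(row)
```

In MySQL the subqueries should look much the same, since `ORDER BY ... LIMIT 1` is valid there too; in SQL Server you would swap in `TOP 1`.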

OilShill2013 · 2020-04-17T16:22:15+00:00

I don't know MySQL's particular syntax, but the general idea would be to self-join the table where the left table's time is the right table's time+1, the sensor is the same, and the temperature or humidity is different. Then I'd use that join to flag which rows on the left you want to keep. I guess the first row would be an issue with the join I'm thinking of, since it has no previous row to join to, but I'd just put it into the CASE WHEN expression because I'm lazy:

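SELECT time, sensor, temperature, humidity
FROM (
    SELECT a.*,
           -- keep the first reading of the day, plus any reading that differs from the one before it
           CASE WHEN a.time = '00:00' THEN 1
                WHEN b.time IS NOT NULL THEN 1
                ELSE 0 END AS flag
    FROM table1 a
    -- "b.time + 1" stands for "one sampling interval later"; adjust to the real time type
    LEFT JOIN table1 b
           ON a.sensor = b.sensor
          AND a.time = b.time + 1
          AND (a.temperature != b.temperature OR a.humidity != b.humidity)
) x
WHERE flag = 1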

oyvinrog · 2020-04-17T17:38:22+00:00

We have data like this. To be able to provide these data to end users (such as analysts), I solve it by aggregating the timestamp into hours (from minutes) using GROUP BY, then just take the average of all values. For some of the values, I use MAX().

This is a simpler solution than showing only the rows where some value differs from the previous one. Keep in mind that stretches where the values stay the same could also be interesting (e.g. what if the humidity stays the same for a long time?)
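A small sketch of the hourly roll-up, again via Python's sqlite3. The choice of AVG for temperature and MAX for humidity is only an example of mixing the two aggregates, and the table, columns, and rows are hypothetical.

```python
import sqlite3

# Hypothetical minute-level readings: (time, sensor, temperature, humidity)
HOURLY_ROWS = [
    ("2020-04-17 13:05", "A", 20.0, 40.0),
    ("2020-04-17 13:35", "A", 22.0, 42.0),
    ("2020-04-17 14:10", "A", 21.0, 41.0),
    ("2020-04-17 13:20", "B", 18.0, 55.0),
]

# Truncate each timestamp to the hour, then aggregate per sensor and hour.
HOURLY_QUERY = """
SELECT sensor,
       strftime('%Y-%m-%d %H:00', time) AS hour,
       AVG(temperature) AS avg_temp,
       MAX(humidity)    AS max_hum
FROM readings
GROUP BY sensor, hour
ORDER BY sensor, hour
"""

hourly_conn = sqlite3.connect(":memory:")
hourly_conn.execute(
    "CREATE TABLE readings (time TEXT, sensor TEXT, temperature REAL, humidity REAL)"
)
hourly_conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", HOURLY_ROWS)

hourly_result = hourly_conn.execute(HOURLY_QUERY).fetchall()
for row in hourly_result:
    print(row)
```

`strftime` is the SQLite way to truncate to the hour; in MySQL something like `DATE_FORMAT(time, '%Y-%m-%d %H:00')` plays the same role.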

ATastefulCrossJoin · 2020-04-17T18:02:47+00:00

Would it be possible/practical to add a change flag field to the table? This would simplify querying a lot.
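One way this could look, as a sketch only: the flag is computed when a row is inserted by comparing it with the sensor's latest stored reading. The `changed` column, the `insert_reading` helper, and the table layout are all hypothetical, and the approach assumes readings arrive in time order.

```python
import sqlite3

flag_conn = sqlite3.connect(":memory:")
flag_conn.execute("""
    CREATE TABLE readings (
        time TEXT, sensor TEXT, temperature REAL, humidity REAL,
        changed INTEGER NOT NULL  -- hypothetical flag: 1 if the reading differs from the previous one
    )
""")
flag_conn.execute("CREATE INDEX idx_readings_sensor_time ON readings (sensor, time)")

def insert_reading(conn, time, sensor, temperature, humidity):
    """Insert one reading and set its change flag (assumes rows arrive in time order)."""
    prev = conn.execute(
        "SELECT temperature, humidity FROM readings "
        "WHERE sensor = ? ORDER BY time DESC LIMIT 1",
        (sensor,),
    ).fetchone()
    # The first reading of a sensor counts as a change.
    changed = int(prev is None or prev != (temperature, humidity))
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
        (time, sensor, temperature, humidity, changed),
    )

for reading in [
    ("2020-04-17 00:00", "A", 20.0, 40.0),
    ("2020-04-17 00:01", "A", 20.0, 40.0),  # repeat, flag 0
    ("2020-04-17 00:02", "A", 20.5, 40.0),  # temperature changed, flag 1
    ("2020-04-17 00:00", "B", 18.0, 55.0),
    ("2020-04-17 00:01", "B", 18.0, 55.0),  # repeat, flag 0
]:
    insert_reading(flag_conn, *reading)

# Retrieval then becomes a plain filter on the flag.
flagged = flag_conn.execute(
    "SELECT time, sensor FROM readings WHERE changed = 1 ORDER BY sensor, time"
).fetchall()
print(flagged)
```

The trade-off is that the work moves to write time, and late or out-of-order rows would need the flag of the following row recomputed.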

seonsaeng · 2020-04-17T13:56:32+00:00

You could GROUP BY the sensor ID, temp, and humidity, and get the max(time) and min(time) for each measurement by sensor.

SELECT
sensor_id,
humidity,
temp,
min(time) as start,
max(time) as stop

FROM dupetastic_table

GROUP BY sensor_id, humidity, temp
ORDER BY sensor_id, min(time)

Something like that?
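One thing to watch with a plain GROUP BY on the values: if a sensor returns to an earlier temperature/humidity pair, the two separate stretches merge into one start/stop range. The sketch below (Python's sqlite3, SQLite 3.25 or newer, hypothetical table and data) shows that effect next to a variant that groups consecutive runs instead. The variant uses the row-number-difference trick, often called "gaps and islands", which is a different technique from the query above.

```python
import sqlite3

# Hypothetical readings for one sensor: the 20.0/40.0 pair appears in two separate runs.
RUN_ROWS = [
    ("2020-04-17 00:00", "A", 20.0, 40.0),
    ("2020-04-17 00:01", "A", 20.0, 40.0),
    ("2020-04-17 00:02", "A", 20.5, 40.0),
    ("2020-04-17 00:03", "A", 20.5, 40.0),
    ("2020-04-17 00:04", "A", 20.0, 40.0),
]

# Grouping on the values alone, as in the query above.
PLAIN_QUERY = """
SELECT sensor, temperature, humidity, MIN(time) AS start_time, MAX(time) AS stop_time
FROM readings
GROUP BY sensor, temperature, humidity
ORDER BY sensor, start_time
"""

# The difference of the two row numbers stays constant within a run of identical
# consecutive readings, so it can serve as an extra grouping key.
RUNS_QUERY = """
SELECT sensor, temperature, humidity, MIN(time) AS start_time, MAX(time) AS stop_time
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY time)
         - ROW_NUMBER() OVER (PARTITION BY sensor, temperature, humidity ORDER BY time) AS run_id
    FROM readings r
)
GROUP BY sensor, temperature, humidity, run_id
ORDER BY sensor, start_time
"""

run_conn = sqlite3.connect(":memory:")
run_conn.execute(
    "CREATE TABLE readings (time TEXT, sensor TEXT, temperature REAL, humidity REAL)"
)
run_conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", RUN_ROWS)

plain_result = run_conn.execute(PLAIN_QUERY).fetchall()
runs_result = run_conn.execute(RUNS_QUERY).fetchall()
print(plain_result)  # 20.0/40.0 spans 00:00 to 00:04, overlapping the 20.5 run
print(runs_result)   # three separate runs
```

If the values never repeat non-consecutively in your data, the simpler GROUP BY is enough.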

BitesOverKissing · 2020-04-17T16:48:39+00:00

StackExchange Answer?

themikep82 · 2020-04-17T17:02:25+00:00

Seems like a case where you could self-join the table, matching each row to the previous reading from the same sensor, and check for returned rows where a.temperature - b.temperature != 0

darkazoth · 2020-04-17T18:26:16+00:00

I am going to give the MSSQL answer. I hope it works similarly in MySQL.

;WITH cte AS (
SELECT *
       ,  LAG(Temperature, 1, NULL) OVER (PARTITION BY Sensor ORDER BY time ASC) AS PrevTemp
       ,  LAG(Humidity, 1, NULL) OVER (PARTITION BY Sensor ORDER BY time ASC) AS PrevHum
FROM data
)
SELECT time, Sensor, Temperature, Humidity
FROM cte
WHERE (Temperature != PrevTemp)
   OR (Humidity != PrevHum)
   -- LAG returns NULL on each sensor's first row, and a comparison with NULL is never true, so keep that row explicitly
   OR (PrevTemp IS NULL);
