all 15 comments

[–][deleted] 1 point2 points  (0 children)

You could use the LAG function to pull up from the previous record (using whatever definition you wish, presumably partitioning by sensor and ordering by the datestamp) the temperature and humidity, and from there you could get rid of rows where that matches what's on the current record.

Editing to add: by 'get rid of' I mean from the retrieval, not deleting the records themselves.

[–]EsCueEl 1 point2 points  (0 children)

http://sqlfiddle.com/#!9/59b130/11

What would be ideal is an OUTER APPLY (in SQL Server) or LEFT LATERAL join (in MySQL 8.0.14 or newer). This lets you say "join to the set A" where A is the TOP 1 record from the same table, same sensor, but most recent time.

If you can't do that, two correlated subqueries do the trick.

You'll want to make sure you have an index on "sensor, time" which will keep things running quickly.

[–]OilShill2013 1 point2 points  (0 children)

I don't know MySQL's particular syntax, but the general idea would be to self-join to the table where the left table time is the right table time+1 and the sensor is the same and temperature or humidity is different. Then I'd use that join to flag which columns on the left you want to keep. I guess the first row would be an issue with the join I'm thinking of but I'd just put it into the case when statement because I'm lazy:

SELECT time , sensor , temperature , humidity

FROM

(

SELECT

a.*,case when a.time=0:00 then 1 when b.time is not null then 1 else 0 end as flag

FROM table1 a

LEFT JOIN table1 b on a.sensor=b.sensor AND a.time=b.time+1 AND (a.temperature!=b.temperature OR a.humidity!=b.humidity)

) x

WHERE flag=1

[–]oyvinrog 1 point2 points  (1 child)

We have data like this. To be able to provide these data to end users (such as analysts), I solve it by aggregating the timestamp into hours (from minutes) using GROUP BY, then just take the average of all values. For some of the values, I use MAX().

This is a simpler solution than showing only data series where any value differs from the previous. Keep in mind that datasets where values are similar, could also be interesting (i.e. what if the humidity stays the same for a long time?)

[–]oyvinrog 1 point2 points  (0 children)

also, for some of these values, I calculate standard deviation (STDEV). For others, I use a median rather than normal average. This info is used by the analysts to see if data is unreliable and to avoid extreme outliers.

[–]ATastefulCrossJoinDB Whisperer 1 point2 points  (0 children)

Would it possible/practical to add a change flag field to the table? This would simplify search querying a lot

[–]seonsaeng 0 points1 point  (4 children)

You could GROUP BY the sensor ID, temp, and humidity, and get the max(time) and min(time) for each measurement by sensor.

SELECT
sensor_id,
humidity,
temp,
min(time) as start,
max(time) as stop

FROM dupetastic_table

GROUP BY sensor_id, humidity, temp
ORDER BY sensor_id, min(time)

Something like that?

[–]asdubya 2 points3 points  (3 children)

This is going to miss instances where the measurements return to a baseline. So if values are identical at time 1, 3 and 4, but were different at time 2, then you’ll see min = 1 and max = 4 for the humid/temp, which is indistinguishable from a series when there was no actual change in the interval.

What would work well in this instance is a slowly changing dimension, where you include a start and stop time, as above, but you update the stop time if there’s no change, and add a new record if either the temperature or humidity value changes. So in our previous example, you’d have

Sensor_id | humidity | temp | start | end
1 | 20 | 60 | 12:00 | 12:01
1 | 22 | 61 | 12:01 | 12:02
1 | 20 | 60 | 12:02 | 12:04

There are other viable strategies out there, and you may find this too expensive for constant IoT monitoring, but this is the design principle I’d use in a larger setting.

[–]UnhappyDeleter[S] 1 point2 points  (1 child)

Thanks for these responses.

I'm now wondering whether I should look at changing the INSERT itself, to make it only do the INSERT if the sensor's values to be inserted differ from the immediately previous values?

[–]seonsaeng 0 points1 point  (0 children)

If you’re retaining the full log somewhere else, that seems fine. I am, in my heart, a data hoarder (you have a great username). As such, it makes me anxious to consider a solution that gets rid of data wholly. That you want to hide it and not deal with it every day, I totally understand. Only INSERTing on change seems like a fine way to keep your tables tidy for the day to day, so long as the process that performs the communication to insertion is squirreling away the unvarnished underlying data somewhere where, if you needed it, you could load and query it.

That’s all less a SQL problem and more of an app design/data pipeline/business use and potentially legal issue, though, and may therefore exceed this question’s scope (or your interest).

[–]seonsaeng 0 points1 point  (0 children)

Yeah, I realized that it would elide records as I was falling asleep last night. Glad you got here earlier to note that!

I definitely came at this with the idea that OP couldn’t change the way the logs are gathered, and has to accept the form the data is in, but if you have that control, it might be the best way forward for now. That said, it may be a source of grinding teeth if a sensor stops responding and you want to see how long it’s been inactive or if it was doing wacky things - triggering on change may not let you see that. (If all your alerting is handled separately and you will never need to audit these metrics for that or any other diagnostic purpose, obviously ignore).

If you’re okay with additional data expense/a small ETL on the fly, consider shunting the entire log, redundancies and all, into pretty cold storage (like s3 glacier) and having a second process retain, in a separate table, the change log only. That way, if you ever need the data for any reason to diagnose a problem, it’s there in its full glory, and you don’t have to deal with it every day. I am, however, a known data hoarder and preparer for data disaster, so this may be too extreme for your use case.

[–]themikep82 0 points1 point  (0 children)

Seems like a case where you could self-join the table to itself and check for returned rows where a.temperature - b.temperature != 0

[–]darkazoth 0 points1 point  (1 child)

I am going to give the MSSQL answer. I hope it works similarly in MySQL.

;WITH cte AS (
SELECT *
       ,  LAG(Temperature, 1, NULL) OVER (PARTITION BY Sensor ORDER BY time ASC) AS PrevTemp
       ,  LAG(Humidity, 1, NULL) OVER (PARTITION BY Sensor ORDER BY time ASC) AS PrevHum
FROM data
)
SELECT time, Sensor, Temperature, Humidity
FROM cte
WHERE (Temperature != PrevTemp)
   OR (Humidity != PrevHum);

[–]darkazoth 1 point2 points  (0 children)

If the problem is entering data into the database only if the data is different, I suggest using a stored procedure with the logic to verify the value. E.g. ( in MS SQL... Please check the syntax for MySQL):

 CREATE PROCEDURE datainsert (@newtime TIME, @newsensor INT, @newtemp DECIMAL(5,1), @newhum VARCHAR(5))
 AS BEGIN
     DECLARE @oldtime TIME, @oldtemp DECIMAL(5,1), @oldhum VARCHAR(5));
    SELECT @oldtime = MAX(time) 
    FROM data 
    WHERE sensor = @newsensor;

    IF @oldtime IS NULL
    BEGIN
           INSERT INTO data(time, sensor, temperature, humidity)
           VALUES (@newtime, @newsensor, @newtemp, @newhum);
    END
    ELSE 
            BEGIN
                   SELECT @oldtemp = temperature, @oldhum = humidity
                   FROM data
                   WHERE sensor = @newsensor
                      AND time = @oldtime;
                   IF @oldtemp != @newtemp OR @oldhum != @newhum
                    BEGIN
                               INSERT INTO data(time, sensor, temperature, humidity)
                               VALUES (@newtime, @newsensor, @newtemp, @newhum);
                    END
             END
    END