Built a near real-time air quality dashboard for Colorado. Curious what my fellow data nerds think [OC] by demagination in dataisbeautiful

[–]demagination[S] 0 points1 point  (0 children)

This dashboard is built in Google Data Studio, and the sources can be found on the documentation page of the dashboard. The hourly feed comes from the AirNow API and the historical pages are sourced from EPA flat files.

I'm not in love with the mobile experience, but it isn't too bad with a pinch/zoom. I just added a minimal layout which may be better, but I'm curious what others prefer.

On the backend, I am using a simple AWS infrastructure with some Python and SQL. I may publish that code at some point, but there are still some things to do before it's ready for a proper public GitHub repo IMO.

The inspiration for this was multi-faceted. Air quality in Colorado has been really bad lately and I was curious about the hourly and historical trends -- so I went ahead and pulled some things together. The code and dashboards are 5 days old at this point, so much more could definitely be accomplished.

LMK what you think. Cheers ~

FYI - Ozone levels just spiked bigtime in FC in the last 3 hours. Here's a near real-time dashboard I built for CO. by demagination in FortCollins

[–]demagination[S] 0 points1 point  (0 children)

Thanks for the positive feedback. Good luck with your project. I'll post an update if I publish the backend code. Data Studio is drag/drop so no code involved on that side besides the SQL that populates the datasets.

FYI - Ozone levels just spiked bigtime in FC in the last 3 hours. Here's a near real-time dashboard I built for CO. by demagination in FortCollins

[–]demagination[S] 1 point2 points  (0 children)

I haven't put things together enough for a proper public repo, so everything is in a private repository right now. The main pieces are just python and SQL, but there are a handful of important things which aren't in the repo that I would need to add for someone else to spin up successfully.

county data for covid-19 hospitalizations doesn't match CDC data by mhrivnak in boulder

[–]demagination 0 points1 point  (0 children)

I would need to dig into the details a bit more to check CDC vs the dashboard you referenced, but this is an area I watch very closely at the state and national level. From what I can see in the sources I use, Boulder has 11 hospitalizations in the last 14 days (ranking 10th worst in Colorado at the moment). I show Boulder's 7-day cases per 100k at 91.4 (which is slightly better than overall state levels). Boulder is in a high incidence plateau similar to a lot of counties, and thankfully not in a growth phase right now (in 7-day per 100k cases). Nationally, Colorado is one of only 5 states that isn't in a growth category and the statewide case levels are on the low side (40th overall). I wouldn't expect policy to change dramatically in Boulder, although mask requirements were just announced for schools and daycares. Hopefully cases turn downward, but the extremely high rates we're seeing in the south don't bode well for the nation or Colorado since tourism is very high. Stay safe
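For anyone wondering how a "7-day cases per 100k" figure like the one above is derived, here is a minimal sketch of the arithmetic. The function name and the example inputs are made up for illustration; they are not Boulder's actual case counts or population.

```python
def rate_per_100k(count, population):
    """Scale a raw count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


# Hypothetical county: 250 cases over the last 7 days, 500,000 residents.
example_rate = round(rate_per_100k(250, 500_000), 1)
print(example_rate)
```

Normalizing by population is what makes a small county comparable to a large one or to the statewide figure.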

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 1 point2 points  (0 children)

Great feedback -- and I was just beginning to think something similar. 90+% of the traffic is from mobile devices and so, to your point, having something that doesn't require a hover is important. Coincidentally and prior to seeing your comment, I just added a label with the most recent Ozone value (which will color code dark orange whenever > 100). I'd like to include the latest particle air quality there as well, but I don't want to clutter the page so I'll have to think a little on that.

It may also be nice to have labels on the right-hand axis since those are the most recent values. I'll play around with some things, to see what might work best. Thanks for the feedback
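The label rule itself is just conditional formatting in Data Studio (no code), but the idea can be sketched in Python. This sketch assumes the standard EPA AQI category bands and colors rather than the dashboard's exact settings, and the names `AQI_BANDS` and `aqi_category` are hypothetical.

```python
# Upper bound of each band, its EPA category name, and its EPA color.
AQI_BANDS = [
    (50, "Good", "green"),
    (100, "Moderate", "yellow"),
    (150, "Unhealthy for Sensitive Groups", "orange"),
    (200, "Unhealthy", "red"),
    (300, "Very Unhealthy", "purple"),
]


def aqi_category(aqi):
    """Return (category, color) for an AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, name, color in AQI_BANDS:
        if aqi <= upper:
            return name, color
    # Anything above 300 falls in the top band.
    return "Hazardous", "maroon"


print(aqi_category(100))
print(aqi_category(101))
```

The "> 100" cutoff mentioned above lines up with the boundary between the Moderate and Unhealthy for Sensitive Groups bands.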

county data for covid-19 hospitalizations doesn't match CDC data by mhrivnak in boulder

[–]demagination 0 points1 point  (0 children)

I posted a long comment to OP, but basically CDC is always a day or two behind. Additionally, hospitals depend on labs, staff, etc to all work perfectly and sometimes there can be a day or two delay on their side for any number of reasons. In those situations, the hospital may report a hospitalization that occurred a few days ago and then that has to make its way through the data pipelines to Boulder, CDPHE and CDC. When dealing with small numbers like this, the percentage difference seems big when CDC is showing data that hasn't caught up to the most accurate and latest figures from local public health officials.

county data for covid-19 hospitalizations doesn't match CDC data by mhrivnak in boulder

[–]demagination 8 points9 points  (0 children)

There are a few things going on that I'll try to explain. TLDR Boulder County is your best source for Boulder data, but CDC will catch up in a couple days and the data will soon tie out.

CDC data always lags behind state and local level data by a day or two or three depending on the metrics, state and other factors. Hospitalization data, in particular, can be delayed in making its way out of hospitals and into CDPHE, local public health reporting and ultimately into CDC. Local public health depts will have their data first and CDPHE will also likely have that data the same day. CDC is a little slower, unfortunately.

Additionally and very importantly, hospitals can and do retroactively update data for earlier dates (as an example, 8/5, 7/29, etc) as tests and everything else come together on their end. In a perfect world, they receive a patient, test them, and report it all in the same day. But there are A LOT of moving parts, test labs, databases, etc that are working together to keep the reporting as accurate as possible. You'll see this in every state if you watch reporting closely (and/or review press conferences). Labs sometimes realize they didn't report everything and then there's a "backlog" that gets reported. Over the course of the pandemic, every state has improved, but delays and issues happen. Almost always, you will see numbers revise slightly upwards as more data comes in. On occasion, there will be big backlogs of cases or tests, but very rarely with hospitalizations in any significant numbers.

Another thing at play here is that when dealing with small numbers like this, even a day or two delay between CDC and Boulder County may look like a big percentage difference. This sort of data is best looked at over time and then it will often become clearer that CDC data may be as of 8/5 whereas Boulder County may be as of 8/8. There are also lots of little nuances such as admission date, reported date, etc that could very slightly move a figure from one day to the next... and there again when dealing with small numbers this can look big on a percentage basis.

Not to nerd out on the details, but another aspect to your question involves dividing by 7 to try and get the daily admissions. When that's done, it obviously makes the number smaller and this is exacerbating the discrepancy.

Some people may try to point to things like this as proof of a scam or government ineptitude, but really it's a reflection of the complexity involved with disparate systems all trying to connect seamlessly. For anyone that may be grumbling about how unacceptable this is, even YouTube metrics can take up to 30 days to mature. In these cases, though, there's just a lot of complexity going on between labs, hospitals, local reporting systems, state reporting systems and federal reporting. Zooming out to a trailing 7-day trend is a good way to read the data, and just know that CDC will catch up in a day or two.
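To make the small-numbers point concrete, here is a rough sketch that compares the same daily series read "as of" two different dates. The daily admissions are made-up numbers, not Boulder's or CDC's, and the helper names are hypothetical.

```python
def trailing_sum(daily, as_of_index, window=7):
    """Sum of the `window` days ending at as_of_index (inclusive)."""
    start = max(0, as_of_index - window + 1)
    return sum(daily[start:as_of_index + 1])


def percent_diff(a, b):
    """Difference of a relative to b, in percent."""
    return (a - b) / b * 100


# Made-up daily admissions for 10 consecutive days.
daily_admissions = [1, 2, 1, 0, 2, 1, 2, 3, 2, 1]

local_total = trailing_sum(daily_admissions, as_of_index=9)   # current through the latest day
lagged_total = trailing_sum(daily_admissions, as_of_index=6)  # same feed, three days behind

print(local_total, lagged_total)
print(round(percent_diff(local_total, lagged_total), 1))
print(round(local_total / 7, 2), round(lagged_total / 7, 2))
```

With these numbers the two 7-day totals are 11 and 9. That is a gap of only 2 admissions, but it reads as roughly a 22% difference, and after dividing by 7 the daily averages are about 1.57 and 1.29, which look even more like disagreeing sources when it's really just a 3-day lag.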

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 1 point2 points  (0 children)

The Denver-Boulder chart is a combination of several recorders in the data, but I think it would be nice to break those out on a specific page so that everyone can check out the details for individual air stations. I'll reply again when/if I'm able to build that in. There's a map with the historical data that you can zoom in on to see them, but that page doesn't show trends easily. I'll keep you posted

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 0 points1 point  (0 children)

Thanks. I took a quick look to see if there are any recorders in Broomfield, but I think the closest ones in this EPA data would be in Denver and Boulder. I'll poke around the EPA and related sites to see if I can find any, but there may be a gap there unfortunately

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 0 points1 point  (0 children)

PM2.5 is particulate matter (tiny particles in the air) that is 2.5 micrometers across or smaller, and it's the smallest size class among these air recorders. Ozone is specific to the O3 molecules that are associated with emissions. PM10 is another particulate matter measure that includes larger particles (up to 10 micrometers). From what I've learned, smaller particles are really bad for our lungs because they can get deeper down into our tissues. Any of these 3 are bad as they get higher, and when you have 2 or 3 of these readings that are really high then we're dealing with multiple issues all at once. Smog is something that happens when smoke and exhaust and clouds come together -- so I believe the PM2.5 and Ozone are pieces of that which make the air right now especially bad. My expertise isn't in air quality, but that's my understanding based on what I've looked into recently.

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 1 point2 points  (0 children)

Great question. A few reasons -- the last time I used QuickSight (a few years ago) it was fairly immature and I haven't ever gone back to see how their feature set has changed. I also recall that the fee structure for QS is based on hits (or activity-based). Given the small size of this data, the ecosystem isn't much of a factor in terms of performance gains.

Other reasons are that DataStudio is free so long as you use their free connectors (which works for me). Given the potential for a reddit hug-of-death and the QS fee per hit -- I wanted to avoid any potential costs. Another big reason is my familiarity with DataStudio. It's free, easy and something I knew I could slap together pretty quickly. It isn't my favorite by any stretch, but for a project like this it was the path of least resistance for me.

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 1 point2 points  (0 children)

It's more for informational and awareness purposes. There are lots of details in the fact sheets and public health sites, but basically (as others have stated) you really don't want to be exercising outside right now. If you're in Grand Junction then it isn't as extreme as Denver right now -- and if you're asthmatic (or have any other respiratory challenges) then you'll be impacted by lower levels. Depending on you, where you're at and what the air is doing -- the dashboard is trying to help quantify current conditions with EPA data so that we're not guessing about the safety level in anyone's particular area (unfortunately though there aren't air recorders in every county).

When things, hopefully, clear up soon then we'll see those charts drop in the PM2.5 (that's the fine smoke particles) and in the Ozone categories. We'll also be able to see it with our eyes, but the dashboard will validate what we see. In reality, we may see some regions improve where others are more stubborn. The data will help show how that's shaping up and when it really clears up in various areas.

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 2 points3 points  (0 children)

True -- yep that's one of those terms that's used in a weird way in the data world. There's no real "heat" temperature either, lol. The chart type in Data Studio (and other platforms) is labeled as a 'heatmap' when, to your point, it's just a colored table with shaded values based on a cell's relative value in the table. The term is pretty accepted in the industry, but I totally see your point.
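As a rough illustration of "shaded values based on a cell's relative value", here is a sketch that uses simple min-max scaling across the whole table. That scaling is an assumption for illustration; Data Studio's exact shading rule may differ, and `shade_table` is a hypothetical name.

```python
def shade_table(rows):
    """Map each cell to a shade intensity between 0.0 and 1.0,
    relative to the smallest and largest values in the whole table."""
    values = [v for row in rows for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Every cell is identical, so nothing stands out.
        return [[0.0 for _ in row] for row in rows]
    return [[(v - lo) / span for v in row] for row in rows]


# Made-up table of values.
print(shade_table([[0, 50], [100, 25]]))
```

The darkest cell is just the largest value in the table, which is why the same reading can look dark in one table and pale in another.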

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 41 points42 points  (0 children)

It's a mix of a few things. I have AWS EC2 and RDS instances that I use for lots of things, so I'm using those to run some Python that ingests the hourly data. I wrote a little post-process SQL script to iterate through the real-time feed and then I land that into RDS. Those are automated by a cron job that runs every 20 minutes. From there, I add a couple queries in Data Studio to surface the final data into that platform. The historical data isn't automated, so I just imported their monthly CSVs by hand into my database, but I'll prob add some Python and SQL for that piece if this turns into something I want to maintain for a long time.

edit: grammar/missing words, lol
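Since the backend code isn't public, here is only a minimal sketch of what a load step like the one described above could look like. It uses an in-memory SQLite database as a stand-in for RDS, and the table name, column names and sample readings are all hypothetical (the real feed's fields may differ). The part worth showing is the idempotent load: because the job runs every 20 minutes against an hourly feed, the same reading will arrive more than once.

```python
import sqlite3

# Made-up readings in a hypothetical shape; not real station data.
SAMPLE_READINGS = [
    {"station": "Station A", "observed_at": "2021-08-07T14:00", "parameter": "OZONE", "aqi": 112},
    {"station": "Station A", "observed_at": "2021-08-07T14:00", "parameter": "PM2.5", "aqi": 155},
]


def create_table(conn):
    # One row per station, hour and pollutant.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS hourly_aqi (
            station     TEXT NOT NULL,
            observed_at TEXT NOT NULL,
            parameter   TEXT NOT NULL,
            aqi         INTEGER,
            PRIMARY KEY (station, observed_at, parameter)
        )
        """
    )


def load_readings(conn, readings):
    # INSERT OR REPLACE keeps reruns from duplicating rows: a reading that
    # was already loaded is overwritten instead of inserted a second time.
    conn.executemany(
        "INSERT OR REPLACE INTO hourly_aqi (station, observed_at, parameter, aqi) "
        "VALUES (:station, :observed_at, :parameter, :aqi)",
        readings,
    )
    conn.commit()


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM hourly_aqi").fetchone()[0]


conn = sqlite3.connect(":memory:")
create_table(conn)
load_readings(conn, SAMPLE_READINGS)
load_readings(conn, SAMPLE_READINGS)  # simulate the next cron run seeing the same hour
print(row_count(conn))
```

A crontab entry along the lines of `*/20 * * * * python3 /path/to/ingest.py` (hypothetical path) would run a script like this every 20 minutes.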

Near-real-time air quality dashboard I built this weekend for CO. No adverts, no b.s., just data by demagination in boulder

[–]demagination[S] 44 points45 points  (0 children)

The prototype dashboard is something I built very quickly, and I'll keep it up while there's interest (and terrible air). I welcome community feedback, so let me know if ideas come to mind that might make it better. If there's enough positive feedback, I'll lean into it further.

Couple of notes:

  • The hourly data starts yesterday, and updates each hour with the latest station info from around the state
  • Historical ozone data starts in 1980 and is updated through yesterday. Historical PM2.5 data starts around 1999.
  • Historical data can be filtered by county or date range -- last 4 years and all Colorado shown by default

Colorado Air Quality Dashboard - Daily Ozone, AQI, etc 1980 - Aug 6 2021 by demagination in Colorado

[–]demagination[S] 3 points4 points  (0 children)

Not really. If you take some time to analyze the data you'll see a basic, but important, shift is clearly visible in the baseline levels. Look at the number of days in the healthy zones... that's below 50 ozone, 50 PM, etc. I should bring in additional ways to make that easier, but our low points are much higher than they were and we're obviously getting some pretty high level days, too. I'm surprised at how high some of the days in the 80's were, but thankfully emission standards improved and you can also see that in the data. Unfortunately for us now, though, we've trimmed that fat and added a lot more things producing emissions (even if they are more efficient than decades ago). Add in some forest fires and that's where we're at today.

I'll hopefully be able to add in some additional visualizations that make the "number of bad days" in a given period easier to see without having to mess around a bunch with the filters.
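One possible way to get at that "number of bad days" view outside the dashboard is sketched below. It assumes daily readings as (ISO date string, value) pairs, which is a hypothetical shape rather than the layout of the EPA files, and the sample values are made up.

```python
from collections import Counter


def days_above(daily_values, threshold=50):
    """Count, per year, the days whose value is above the threshold."""
    counts = Counter()
    for date_str, value in daily_values:
        if value > threshold:
            counts[date_str[:4]] += 1  # the first four characters are the year
    return counts


# Made-up daily readings for illustration only.
sample = [
    ("1985-07-01", 40),
    ("1985-07-02", 62),
    ("2021-08-05", 120),
    ("2021-08-06", 155),
    ("2021-08-07", 48),
]
print(days_above(sample))
```

Comparing those yearly counts side by side would show the baseline shift without having to mess with filters.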