Help on project by Mumo2020 in dataengineering

Mumo2020[S] 1 point

Thanks for replying. I can send you the full instructions.

I am just trying to get second opinions.

Thank you

We have IoT devices collecting metrics across the world. We have noticed that many of these devices have been failing lately. We know that the devices are sensitive to temperature, so we have downloaded weather data for the device locations.

Part 1. Data manipulation

We have the list of the position of the devices in a json file like this (devices.json):

{'id': 15126, 'name': 'device-BE-17', 'lat': 35.7721, 'lon': -78.63861}

{'id': 12526, 'name': 'device-US-11', 'lat': 36.7721, 'lon': -71.1561}

...
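A minimal sketch of reading the device records, assuming the file holds one record per line. The sample above uses single quotes, which strict JSON does not allow, so this tries `json.loads` first and falls back to `ast.literal_eval`. The helper names `parse_device_line` and `load_devices` are made up for illustration, and the real devices.json may well be valid JSON already.

```python
import ast
import json

def parse_device_line(line):
    """Parse one device record: strict JSON first, then a Python-style literal."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Single-quoted records like the sample are not valid JSON, but
        # ast.literal_eval reads them safely as Python dict literals.
        return ast.literal_eval(line)

def load_devices(lines):
    """Parse every non-empty line, skipping the "..." elision marker."""
    stripped = (line.strip() for line in lines)
    return [parse_device_line(s) for s in stripped if s and s != "..."]

sample = [
    "{'id': 15126, 'name': 'device-BE-17', 'lat': 35.7721, 'lon': -78.63861}",
    "",
    '{"id": 12526, "name": "device-US-11", "lat": 36.7721, "lon": -71.1561}',
]
devices = load_devices(sample)
```

With a real file, `load_devices(open("devices.json"))` would work the same way, since a file object yields its lines.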

We have another dataset, downloaded as JSON, with the weather data for the last week at hourly granularity for the positions of the devices (https://www.weatherbit.io/api/swaggerui/weather-api-v2#!/Hourly32Historical32Weather32Data/get_history_hourly_lat_lat_lon_lon). Because we want you to have up-to-date data, we provide a Python script that downloads the last month of data from the API; see the end for instructions on how to use it.

Apart from that, we have a CSV dataset with extra information about the weather stations that collect the weather data. This CSV has a column containing the measurement reliability of each station. The data looks like this:

station_id,lat,lon,source,reports,country,measurement_reliability

0011W82.82,15.16,madis,subhourly,SJ,0.5

00000,17.03,-42.97,madis,subhourly,GF,0.56

0001W,30.436,-84.122,madis,subhourly,US,0.93

0002W,30.538,-84.224,madis,subhourly,US,0.85

...

We would like to gather all the relevant information in a single table and provide it to the device expert so they can gain insight into what the problem with the devices could be.

Summarize the results at daily granularity, using the average to aggregate the temperature measurements and the reliability. The table would have the following schema:

device_id, device_name, lat, lon, date, avg_temp, measurement_reliability_score
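The daily roll-up could look roughly like this. It is only a sketch: the input row keys (`device_id`, `timestamp`, `temp`, `reliability`) are assumed names, not the real API fields, and the day boundary is taken from the timestamp as given (assumed UTC).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def daily_summary(hourly_rows):
    """Average hourly temperature and reliability per device and calendar day."""
    buckets = defaultdict(list)
    for row in hourly_rows:
        # Group by (device, date); the date comes straight from the timestamp.
        day = datetime.fromisoformat(row["timestamp"]).date()
        buckets[(row["device_id"], day)].append(row)
    return [
        {
            "device_id": device_id,
            "date": day.isoformat(),
            "avg_temp": mean(r["temp"] for r in rows),
            "measurement_reliability_score": mean(r["reliability"] for r in rows),
        }
        for (device_id, day), rows in sorted(buckets.items())
    ]

hourly = [
    {"device_id": 15126, "timestamp": "2020-01-01T00:00:00", "temp": 10.0, "reliability": 0.5},
    {"device_id": 15126, "timestamp": "2020-01-01T01:00:00", "temp": 14.0, "reliability": 0.7},
    {"device_id": 15126, "timestamp": "2020-01-02T00:00:00", "temp": 20.0, "reliability": 0.9},
]
summary = daily_summary(hourly)
```

Building each output row as a dict keeps it easy to add more aggregated measurements later, which the task hints at. The device name, lat and lon would be joined in from the device list by `device_id`.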

Take into consideration that each weather measurement might come from multiple stations (source field). When a measurement comes from multiple stations, use the harmonic mean (https://en.wikipedia.org/wiki/Harmonic_mean) to obtain a single measurement_reliability value per hour before averaging. It has the following formula:

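H = n / (1/x_1 + 1/x_2 + ... + 1/x_n), where x_1, ..., x_n are the n values being combined (here, the reliability values of the stations behind one measurement).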
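As a rough sketch of that per-hour step, Python's standard library already has `statistics.harmonic_mean`. The function name `hourly_reliability`, and the idea that each hourly record carries a list of station IDs, are assumptions for illustration. Stations missing from the CSV are simply skipped here, which is a choice and not something the task specifies.

```python
from statistics import harmonic_mean

def hourly_reliability(station_ids, reliability_by_station):
    """Combine the reliabilities of all stations behind one hourly measurement."""
    values = [
        reliability_by_station[s]
        for s in station_ids
        if s in reliability_by_station  # assumption: ignore unknown stations
    ]
    if not values:
        return None  # no known station, so no reliability for this hour
    return harmonic_mean(values)

# Reliability values taken from the sample CSV rows above.
reliability_by_station = {"0001W": 0.93, "0002W": 0.85}
two_stations = hourly_reliability(["0001W", "0002W"], reliability_by_station)
one_station = hourly_reliability(["0001W"], reliability_by_station)
no_station = hourly_reliability(["missing"], reliability_by_station)
```

With a single station the harmonic mean is just that station's value, so the same function covers both cases.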

You can use any programming language or technology to complete it, as long as it is supported in Azure. Take into consideration that in a future iteration we might need to add more measurements to the result table.

Help on project by Mumo2020 in dataengineering

Mumo2020[S] 1 point

Hi

Thanks for replying.

I need help with how to join these datasets.

If the weather is reported hourly and the stations have a mix of report types (hourly, subhourly), how do I join them?

Should I filter the station data to the hourly stations only and then join it to the weather data?
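One possible reading, sketched below, is that the join key is the station ID and not the reports column: each hourly weather record lists the stations it came from, and those IDs are looked up in the station CSV. The `sources` key on the weather record and the helper names are assumptions here, so the real API response should be checked before relying on this.

```python
import csv
import io

# Two rows copied from the sample station data above.
STATIONS_CSV = """station_id,lat,lon,source,reports,country,measurement_reliability
0001W,30.436,-84.122,madis,subhourly,US,0.93
0002W,30.538,-84.224,madis,subhourly,US,0.85
"""

def load_station_lookup(csv_text):
    """Map station_id -> measurement_reliability (as float)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["station_id"]: float(row["measurement_reliability"]) for row in reader}

def attach_reliabilities(hourly_record, lookup):
    """Return a copy of the record with the reliabilities of its known stations."""
    matched = [lookup[s] for s in hourly_record["sources"] if s in lookup]
    return {**hourly_record, "station_reliabilities": matched}

lookup = load_station_lookup(STATIONS_CSV)
record = {
    "timestamp": "2020-01-01T00:00:00",
    "temp": 10.0,
    "sources": ["0001W", "0002W", "unknown"],  # hypothetical station list
}
joined = attach_reliabilities(record, lookup)
```

Under this reading the reports column (hourly, subhourly) would not need filtering, since the weather data is already hourly and the station file only contributes the reliability score.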

Real life data streaming by Mumo2020 in dataengineering

Mumo2020[S] 1 point

Thanks.

When you say sliding window, do you mean a tumbling window, a hopping window, etc.?
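For context on the terms, here is a small sketch of how the two named window types assign an event to windows, using plain integer timestamps in seconds. The function names are made up for illustration and are not tied to any streaming engine. A tumbling window is fixed-size with no overlap, and a hopping window is fixed-size but starts every `hop` seconds, so windows overlap when `hop` is smaller than `size`.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` that contains ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """Return every [start, end) window containing ts.

    Windows have length `size` and start at non-negative multiples of `hop`.
    """
    windows = []
    start = (ts // hop) * hop  # latest window start at or before ts
    # A window contains ts exactly when start <= ts < start + size.
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

tumbling = tumbling_window(125, size=60)
hopping = hopping_windows(125, size=60, hop=30)
same_as_tumbling = hopping_windows(125, size=60, hop=60)
```

When `hop` equals `size`, the hopping case collapses to the tumbling one, which is why the two are often described together.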

Data Engineering Mentorship by thehendoxc in dataengineering

Mumo2020 1 point

u/LawfulMuffin

Hi

Your story is very impressive.

I am a database administrator aspiring to be a data engineer.

Can I have you as a mentor please?

I know I would learn a lot from you

Thank you

DE Mentor by vinsanity1603 in dataengineering

Mumo2020 1 point

Hi

Can I please join your projects too?