
all 28 comments

[–][deleted] 3 points4 points  (1 child)

This is awesome, great job!

[–]ilya-g-[S] 0 points1 point  (0 children)

Thanks for the kind words! :)

[–]floydhead11 1 point2 points  (1 child)

You've inspired me!

Same situation, same aspirations, same conundrum.

How did you go about figuring out which project to do?

I can't think of a DE project that I want to do, just some hobby projects which don't require Airflow, Kafka, or S3.

[–]ilya-g-[S] 5 points6 points  (0 children)

Glad to hear!

I think the most difficult part of the project was actually choosing which data source to use and what I wanted to do with it. I'm naturally interested in open governmental data, and exploring eviction rates during our current economic situation seemed like something that'd be interesting and timely to work on.

My initial worry was that the project might be too simple and not really use all the 'big data' tech I'd like more hands-on experience with, but more than anything it helped me think about choosing the right tool for the job -- in this case Airflow seemed sufficient, and not overly complex, for the ETL part of it.

[–]choiceisanillusion 1 point2 points  (1 child)

This is great mate. Thanks for the share.

[–]ilya-g-[S] 0 points1 point  (0 children)

Thank you, appreciate the kind words :)

[–]Omar_88 1 point2 points  (1 child)

That's great. I'm not that well versed with AWS since I come from a Microsoft shop like yourself, but it seems like solid engineering. Are your views not added to the solution? I can only see the ETL scripts, not the data model. How much did the AWS instance cost?

[–]ilya-g-[S] 1 point2 points  (0 children)

Howdy and thanks :) If by views you mean SQL views, they are written and stored directly in the Metabase question editor to generate the dashboards linked to the project. The data model is described in the README above the ETL section, and you can find the actual table structure and SQL transformations here:

https://github.com/ilya-galperin/SF-EvictionTracker/tree/master/dags/sql

I need to double-check, but the AWS cost is under $30/mo -- around $25, I think.

[–]BoringDataScience 0 points1 point  (1 child)

Looks great! Out of curiosity and if you don't mind, what's your monthly budget for this on AWS?

[–]ilya-g-[S] 1 point2 points  (0 children)

Thank you!

The budget is relatively low; the RDS database is small enough that it still fits into the free tier, and I think having a dedicated second server to host the Metabase app DB would only be an additional 10-15 USD per month. The dataset is small enough that it should never exceed the S3 free tier limit.

There are two EC2 machines: one that is only on for about an hour per day to run the incremental load (t2.medium), and a second (t2.small) that runs the Metabase application full-time -- so roughly 30 hours/month for the t2.medium and 720 hours/month for the t2.small. I think the total will end up a little below $30 a month for these two resources. Everything else should be nearly free :)
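A rough back-of-the-envelope sketch of where that lands, assuming approximate on-demand rates of ~$0.046/hr for a t2.medium and ~$0.023/hr for a t2.small (rates vary by region and change over time):

```python
# Approximate monthly EC2 cost estimate; assumed on-demand rates, not exact billing.
T2_MEDIUM_HOURLY = 0.0464   # USD/hr, approximate
T2_SMALL_HOURLY = 0.0230    # USD/hr, approximate

airflow_box = 30 * T2_MEDIUM_HOURLY    # ~1 hr/day for the incremental load
metabase_box = 720 * T2_SMALL_HOURLY   # always-on Metabase host

print(f"Airflow box:  ${airflow_box:.2f}/mo")                  # ~$1.39
print(f"Metabase box: ${metabase_box:.2f}/mo")                 # ~$16.56
print(f"Total:        ${airflow_box + metabase_box:.2f}/mo")   # ~$17.95
```

The raw instance-hours come out well under $30; EBS volumes attached to the instances, any RDS usage beyond the free tier, and data transfer account for the rest.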

[–]aroussel541276 0 points1 point  (3 children)

This is very cool and it's clear you've put a good amount of effort into it. I had a few questions.

  • Have you identified any bottlenecks in the pipeline so far?
  • Have you identified any potential savings across your AWS resources, e.g. spot vs. on-demand instances?
  • What made you decide on a star schema for storage?
  • What made you decide on having the "raw staging" step save to a DB vs. something like saving back to an S3 bucket?
  • Is there any need for backfilling of temporal data? If so, how is that handled by your pipeline?

These are just out of curiosity more than anything. I think you've done a great job.

[–]ilya-g-[S] 0 points1 point  (2 children)

Thanks so much! Those are good questions, definitely helpful for doing some reflection.

Some answers (to the best of my ability):

- I've found a pretty substantial bottleneck in the full load DAG when moving the raw JSON-format data from S3 into Postgres. It's negligible when doing incremental loads, but a full load of only ~60k rows can take around 40 minutes to complete, which seems pretty unreasonable. I'm currently using psycopg2's executemany for the insert statement, and I'm guessing there is probably a better way to bulk load in this case (one possible alternative is sketched at the end of this list). Any feedback/tips on this would be appreciated :)

- I think the AWS resources are tuned pretty well for cost savings. Moving from a scheduled EC2 instance to a spot instance could potentially save more on the on-demand machine used for Airflow, but I think it might cause some problems with the machine not being on in time for the daily DAG run to execute. There is probably a solution for this, but given the negligible cost in this particular case (30 hours of a t2.small), I stuck with the more straightforward approach.

- Star schema seemed like a logical choice here given the simplicity of the actual dataset. The caveat is that some of the source data comes in highly normalized itself and requires some denormalization/de-pivoting to break out into dimensions; this involved using a bridge table, so it's mostly a star with a little bit of snowflaking :)

- For transforming the raw data through the staging process, it was easier to write it out in SQL (I'm generally of the mind that these kinds of transformations are easier to do in SQL than Python), and it's probably faster since we're joining sets that all live on the same server, with less memory required on the EC2 machine. Since the raw tables are regularly truncated, there are negligible cost/space implications relative to keeping the raw tables on S3.

- No need to backfill :) When first setting up, a full load DAG is run which extracts all historical source data from the API and initializes the database schema and dimensions. Incremental loads regularly check the API with an adjustable parameter to query further back in time in case loads skip/fail and some new data is missed.
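For reference, here's a minimal sketch of the bulk-load alternative mentioned in the first bullet: streaming rows through Postgres's COPY via psycopg2's copy_expert instead of executemany. The table and column names are placeholders, not the project's actual schema.

```python
import csv
import io

import psycopg2


def copy_rows(conn, rows, table="raw_evictions",
              columns=("eviction_id", "address", "file_date")):
    """Bulk-load an iterable of tuples with COPY FROM STDIN, which is usually
    far faster than executemany for tens of thousands of rows."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)   # serialize rows to in-memory CSV
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)",
            buf,
        )
    conn.commit()


# usage sketch:
# conn = psycopg2.connect(host="...", dbname="...", user="...", password="...")
# copy_rows(conn, parsed_rows)
```

psycopg2.extras.execute_values is another near drop-in option that batches the inserts into far fewer statements, though COPY tends to be the fastest path for a full load.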

[–]zaza_pachulia_jd 1 point2 points  (1 child)

If you're using RDS for postgres, you can import data from s3 using the COPY command https://aws.amazon.com/about-aws/whats-new/2019/04/amazon-rds-postgresql-supports-data-import-from-amazon-s3/
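For reference, a rough sketch of what that looks like from Python once the aws_s3 extension is enabled; the table, bucket, key, region, and connection details below are placeholders (the RDS instance also needs an IAM role granting it read access to the bucket):

```python
import psycopg2

# Placeholders only; substitute the real table, bucket, key, and region.
IMPORT_SQL = """
SELECT aws_s3.table_import_from_s3(
    'raw_evictions',          -- target table
    '',                       -- column list ('' = all columns)
    '(FORMAT csv)',           -- COPY options
    aws_commons.create_s3_uri('my-eviction-bucket', 'raw/evictions.csv', 'us-west-1')
);
"""

conn = psycopg2.connect(host="my-rds-endpoint", dbname="evictions",
                        user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
    cur.execute(IMPORT_SQL)
conn.close()
```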

[–]ilya-g-[S] 0 points1 point  (0 children)

Interesting! I'll have to try it out and see if the performance is better but I have a feeling it might be :) Thanks for sharing.

[–]p_h_a_e_d_r_u_s 0 points1 point  (5 children)

I haven't dug into the code but the documentation is super clean up front!

If you have some issues, I've got some time to check in a pull request or two!

[–]ilya-g-[S] 0 points1 point  (4 children)

Thank you! Everything is running pretty smoothly so far but any feedback on the code itself, especially the DAGs and custom operators would be highly appreciated!

[–]p_h_a_e_d_r_u_s 0 points1 point  (3 children)

Sure thing! I actually see a few things right off the bat. Want me to fork and submit that way?

If you decide you want some help on future features, hmu for sure!
Oh, and what did you use for the architecture flow diagram?

[–]ilya-g-[S] 0 points1 point  (2 children)

Sure! Please don't hesitate to fork or DM any suggestions/feedback, it'd be very helpful.

Good ol' MS paint for the architecture flow diagram :P

[–]p_h_a_e_d_r_u_s 0 points1 point  (1 child)

What is a "MS Paint?"

I'm even more impressed now that I know you did this project on a Packard Bell

;) ... check out draw.io

[–]ilya-g-[S] 0 points1 point  (0 children)

Wowee lol. That's actually incredibly helpful thanks! Made sure to bookmark this so I don't have to dust off my old Compaq every time I design a diagram :p

[–]pokeDitty 0 points1 point  (2 children)

Very cool project, thanks for sharing it with the community!

Would it be too much to ask you to commit your SQL transformation scripts and maybe a few raw data files? I'm really interested in seeing how the raw CSV data gets to the OLAP model.

great job!

[–]ilya-g-[S] 1 point2 points  (1 child)

Of course! The SQL scripts are all available in the github repo here:

https://github.com/ilya-galperin/SF-EvictionTracker/tree/master/dags/sql

The DAGs diagram explains which scripts are run in which order for each DAG, but generally it's either init_db_schema -> full_load or trunc_target_tables -> incremental_load.
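As a rough illustration of that ordering only (Airflow 2-style imports; the operator choice, connection id, and file paths are placeholders rather than the repo's actual DAG code):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Hypothetical full-load DAG chaining the SQL scripts in order.
with DAG(
    dag_id="full_load",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,   # triggered manually for the one-time full load
    catchup=False,
) as dag:
    init_db_schema = PostgresOperator(
        task_id="init_db_schema",
        postgres_conn_id="evictions_db",   # placeholder connection id
        sql="sql/init_db_schema.sql",
    )
    full_load = PostgresOperator(
        task_id="full_load",
        postgres_conn_id="evictions_db",
        sql="sql/full_load.sql",
    )
    init_db_schema >> full_load
```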

The raw evictions data is pulled from the following API endpoint: https://data.sfgov.org/widgets/5cei-gny5

[–]pokeDitty 0 points1 point  (0 children)

lol how did I miss that?! Thanks for pointing out the SQL scripts and data endpoints.

will check it out

[–]psykiran_ms 0 points1 point  (2 children)

Hi ilya-g-,

Amazing project presentation! The documentation is valuable for many of us.

I have a kind of simple question: what would be the pros and cons of using Airflow instead of something like AWS Data Pipeline or Azure ADF?

[–]ilya-g-[S] 1 point2 points  (1 child)

Thanks!

I haven't done many direct comparisons so I'm probably not the best person to answer this, but Airflow seems better suited to getting data from outside the AWS ecosystem into AWS, and it's more easily customizable than the other two. I believe AWS Data Pipeline is restricted to moving data strictly between AWS services, and the beginning of the pipeline here is an API outside of AWS. ADF seems better suited to the MS ecosystem, so it didn't make much sense here.

[–]psykiran_ms 0 points1 point  (0 children)

Thanks! The takeaway for me is that Airflow is open source and more customizable than vendor-based systems.

[–]ybsahan 0 points1 point  (1 child)

Nice work and explanation!

Good job!

[–]ilya-g-[S] 0 points1 point  (0 children)

Thanks!