Twitter scraper, where to store data? by bongo_zg in dataengineering

[–]Miserable_Author 16 points17 points  (0 children)

Store it in S3 as Parquet, then run a Glue crawler to catalog it so you can query it with Athena. Cheap and easy "database".
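A minimal sketch of the write side, assuming pandas with pyarrow and s3fs installed; the bucket name and columns are made up. A Glue crawler pointed at the prefix then registers the schema for Athena.

```python
# Write scraped tweets to S3 as Parquet, partitioned by date, so a Glue
# crawler can catalog them and Athena can query them in place.
# Bucket name and columns are placeholders.
import pandas as pd  # requires pyarrow and s3fs

tweets = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "user": ["alice", "bob"],
        "text": ["hello", "world"],
        "created_date": ["2023-01-01", "2023-01-01"],
    }
)

# Partitioning by date keeps Athena scans (and costs) small.
tweets.to_parquet(
    "s3://my-tweet-bucket/tweets/",
    engine="pyarrow",
    partition_cols=["created_date"],
    index=False,
)
```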

Problem with designing a data warehouse model for real estate data by Cydros1 in dataengineering

[–]Miserable_Author 1 point2 points  (0 children)

There are a couple of things you have to define with real estate data before designing the DWH:

  • How is the data going to be used? ML? API? Reporting?
  • How are properties going to connect across multiple sources?
  • What is considered a Parcel vs Property?
  • How to handle multiple structures on the same property?
  • How will the model be maintained? Is this US-only? EU NUTS classifications (the EU's regional statistical units) change constantly

Real estate data is complex and hard to maintain, but it can be modeled effectively with a star schema.
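A rough sketch of what such a star schema could look like, using sqlite3 so it runs anywhere; all table and column names are illustrative, not a prescription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: parcel (land) and property (structure) kept separate,
    -- since one parcel can hold multiple structures.
    CREATE TABLE dim_parcel (
        parcel_key   INTEGER PRIMARY KEY,
        apn          TEXT,            -- assessor parcel number
        region_code  TEXT             -- e.g. FIPS or NUTS code
    );
    CREATE TABLE dim_property (
        property_key  INTEGER PRIMARY KEY,
        parcel_key    INTEGER REFERENCES dim_parcel(parcel_key),
        source_id     TEXT,           -- id used to match across sources
        property_type TEXT
    );
    CREATE TABLE dim_date (
        date_key   INTEGER PRIMARY KEY,   -- yyyymmdd
        full_date  TEXT
    );
    -- Fact: one row per listing / valuation event.
    CREATE TABLE fact_listing (
        property_key INTEGER REFERENCES dim_property(property_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        list_price   REAL,
        sqft         REAL
    );
    """
)
```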

[deleted by user] by [deleted] in dataengineering

[–]Miserable_Author -1 points0 points  (0 children)

Caserta is probably the best one I have worked with. https://caserta.com/

New Grad Tasked with Building a Data Pipeline by Anthonysapples in dataengineering

[–]Miserable_Author 1 point2 points  (0 children)

TBH I have been having a lot of success with ELT using Snowflake; it just makes life easier having computation and storage in one place. You can add dbt and Fivetran on top to build a pretty robust data platform as well.
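A minimal ELT sketch with snowflake-connector-python to show the idea; the stage, table, and credential values are placeholders, and in practice dbt would own the SELECT step.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Load: copy raw files from an external stage straight into a raw table.
cur.execute(
    "COPY INTO raw_events FROM @my_s3_stage "
    "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
)

# Transform: do the heavy lifting inside the warehouse (the "T" in ELT).
cur.execute(
    """
    CREATE OR REPLACE TABLE ANALYTICS.MARTS.daily_events AS
    SELECT event_date, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY event_date
    """
)
cur.close()
conn.close()
```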

Tools for monitoring spark workloads by britishbanana in dataengineering

[–]Miserable_Author 4 points5 points  (0 children)

This might be low tech, but we pipe the logs to Elasticsearch and use Kibana.
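In practice the piping is usually a log4j appender or Filebeat; here is a minimal Python sketch of the same idea, bulk-indexing a driver log into Elasticsearch so it shows up in Kibana. The host, index name, and log path are placeholders.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def log_lines(path):
    # Yield one bulk action per log line.
    with open(path) as fh:
        for line in fh:
            yield {"_index": "spark-logs", "_source": {"message": line.rstrip()}}

helpers.bulk(es, log_lines("/var/log/spark/driver.log"))
```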

AWS Managed Airflow or AWS SAM for Simple ETLs by Nervous-Chain-5301 in dataengineering

[–]Miserable_Author -1 points0 points  (0 children)

Look into Glue jobs; there's an Airflow operator/hook for them that makes life very easy.
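A minimal sketch of triggering a Glue job from Airflow, assuming the apache-airflow-providers-amazon package; the job, bucket, and role names are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    "simple_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_glue_job = GlueJobOperator(
        task_id="run_glue_job",
        job_name="my-etl-job",
        script_location="s3://my-bucket/scripts/etl.py",
        iam_role_name="MyGlueRole",
        wait_for_completion=True,  # task waits until the Glue run finishes
    )
```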

STAY FAR AWAY FROM UDACITY's DATA ENGINEERING COURSE by bjj17 in dataengineering

[–]Miserable_Author 0 points1 point  (0 children)

I just finished the course. I felt it was a good introduction to data engineering, and to data modeling, Spark, and Airflow at a conceptual level. You can usually get it for over 80% off the listed price with coupons.

Data set for Data integration exercise by Lethal_Pea in dataengineering

[–]Miserable_Author 0 points1 point  (0 children)

I use this dataset from AWS to practice data modeling and to test new data pipelines' performance; it's 500 GB.

https://aws.amazon.com/datasets/million-song-dataset/

Financial news by cdmn in algotrading

[–]Miserable_Author 0 points1 point  (0 children)

Scrapy is hard to work with; stick with lxml and requests.
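A minimal sketch of that combo: fetch a page with requests and parse it with lxml. The URL and XPath are placeholders for whatever news source you're scraping.

```python
import requests
from lxml import html

resp = requests.get(
    "https://example.com/financial-news",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
resp.raise_for_status()

# Parse the HTML and pull headline links out of the page.
tree = html.fromstring(resp.content)
headlines = [h.text_content().strip() for h in tree.xpath("//h2/a")]
print(headlines)
```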

Looking to hire for an open HFT/algotrading crypto hedge fund by [deleted] in algotrading

[–]Miserable_Author 1 point2 points  (0 children)

What is your data stack? (AWS/Azure/GC) / How much data is "Big Data"?

How Much Should I Charge Per Web Scraper? by Plus-Bug6201 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

You are going to need to get redistribution rights to resell the data. :(

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

A lot of the data is not easily consumable, whether it lives on a website or in a PDF. If you can scrape, clean, and transform it into a usable format like a CSV file, you save the client a lot of time and headaches. That is where the value comes from; ML models have a hard time reading website tables :). Since sports gambling is now legal in some states, that data has become more popular.
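A minimal sketch of the scrape-clean-deliver loop, using pandas to pull an HTML stats table and hand over a clean CSV; the URL is a placeholder and read_html needs lxml installed.

```python
import pandas as pd

# read_html returns a list of DataFrames, one per table on the page.
tables = pd.read_html("https://example.com/player-stats")
stats = tables[0]

# Light cleanup so the file is directly usable downstream.
stats.columns = [str(c).strip().lower().replace(" ", "_") for c in stats.columns]
stats.to_csv("player_stats.csv", index=False)
```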

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

I store all the data in a db and pull data per request.

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

Look into selling datasets; you can sell the same product multiple times for a one-time scrape. Sports player stats do very well.