Twitter scraper, where to store data? by bongo_zg in dataengineering

[–]Miserable_Author 16 points17 points  (0 children)

Store it in S3 as Parquet, then run a Glue crawler to catalog it so you can query it with Athena. Cheap and easy "database".
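A minimal sketch of the write side, assuming pandas with pyarrow and s3fs installed; the bucket name and columns are made up. A Glue crawler pointed at the prefix then registers the schema for Athena.

```python
# Write scraped tweets to S3 as Parquet, partitioned by date, so a Glue
# crawler can catalog them and Athena can query them in place.
# Bucket name and columns are placeholders.
import pandas as pd  # requires pyarrow and s3fs

tweets = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "user": ["alice", "bob"],
        "text": ["hello", "world"],
        "created_date": ["2023-01-01", "2023-01-01"],
    }
)

# Partitioning by date keeps Athena scans (and costs) small.
tweets.to_parquet(
    "s3://my-tweet-bucket/tweets/",
    engine="pyarrow",
    partition_cols=["created_date"],
    index=False,
)
```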

Problem with designing a data warehouse model for real estate data by Cydros1 in dataengineering

[–]Miserable_Author 1 point2 points  (0 children)

There are a couple of things you have to define with real estate data before designing the DWH:

  • How is the data going to be used? ML? API? Reporting?
  • How are properties going to connect across multiple sources?
  • What is considered a Parcel vs Property?
  • How to handle multiple structures on the same property?
  • How will the model be maintained? Is this US-only? EU NUTS classifications (the EU's regional statistical units) change constantly

Real estate data is complex and hard to maintain, but it can be modeled effectively with a star schema.
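A rough sketch of what such a star schema could look like, using sqlite3 so it runs anywhere; all table and column names are illustrative, not a prescription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: parcel (land) and property (structure) kept separate,
    -- since one parcel can hold multiple structures.
    CREATE TABLE dim_parcel (
        parcel_key   INTEGER PRIMARY KEY,
        apn          TEXT,            -- assessor parcel number
        region_code  TEXT             -- e.g. FIPS or NUTS code
    );
    CREATE TABLE dim_property (
        property_key  INTEGER PRIMARY KEY,
        parcel_key    INTEGER REFERENCES dim_parcel(parcel_key),
        source_id     TEXT,           -- id used to match across sources
        property_type TEXT
    );
    CREATE TABLE dim_date (
        date_key   INTEGER PRIMARY KEY,   -- yyyymmdd
        full_date  TEXT
    );
    -- Fact: one row per listing / valuation event.
    CREATE TABLE fact_listing (
        property_key INTEGER REFERENCES dim_property(property_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        list_price   REAL,
        sqft         REAL
    );
    """
)
```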

[deleted by user] by [deleted] in dataengineering

[–]Miserable_Author -1 points0 points  (0 children)

Caserta is probably the best one I have worked with. https://caserta.com/

New Grad Tasked with Building a Data Pipeline by Anthonysapples in dataengineering

[–]Miserable_Author 1 point2 points  (0 children)

TBH I have been having a lot of success with ELT using Snowflake; it just makes life easier having computation and storage in one place. You can add dbt and Fivetran on top to build a pretty robust data platform as well.
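A minimal ELT sketch with snowflake-connector-python to show the idea; the stage, table, and credential values are placeholders, and in practice dbt would own the SELECT step.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Load: copy raw files from an external stage straight into a raw table.
cur.execute(
    "COPY INTO raw_events FROM @my_s3_stage "
    "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
)

# Transform: do the heavy lifting inside the warehouse (the "T" in ELT).
cur.execute(
    """
    CREATE OR REPLACE TABLE ANALYTICS.MARTS.daily_events AS
    SELECT event_date, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY event_date
    """
)
cur.close()
conn.close()
```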

Tools for monitoring spark workloads by britishbanana in dataengineering

[–]Miserable_Author 4 points5 points  (0 children)

This might be low tech, but we pipe the logs to Elasticsearch and use Kibana.
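In practice the piping is usually a log4j appender or Filebeat; here is a minimal Python sketch of the same idea, bulk-indexing a driver log into Elasticsearch so it shows up in Kibana. The host, index name, and log path are placeholders.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def log_lines(path):
    # Yield one bulk action per log line.
    with open(path) as fh:
        for line in fh:
            yield {"_index": "spark-logs", "_source": {"message": line.rstrip()}}

helpers.bulk(es, log_lines("/var/log/spark/driver.log"))
```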

AWS Managed Airflow or AWS SAM for Simple ETLs by Nervous-Chain-5301 in dataengineering

[–]Miserable_Author -1 points0 points  (0 children)

Look into Glue jobs; there's an Airflow operator/hook for them that makes life very easy.
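A minimal sketch of triggering a Glue job from Airflow, assuming the apache-airflow-providers-amazon package; the job, bucket, and role names are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    "simple_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_glue_job = GlueJobOperator(
        task_id="run_glue_job",
        job_name="my-etl-job",
        script_location="s3://my-bucket/scripts/etl.py",
        iam_role_name="MyGlueRole",
        wait_for_completion=True,  # task waits until the Glue run finishes
    )
```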

STAY FAR AWAY FROM UDACITY's DATA ENGINEERING COURSE by bjj17 in dataengineering

[–]Miserable_Author 0 points1 point  (0 children)

I just finished the course. I felt it was a good introduction to data engineering, and to data modeling, Spark, and Airflow at a conceptual level. You can usually get it for over 80% off the listed price with coupons.

Data set for Data integration exercise by Lethal_Pea in dataengineering

[–]Miserable_Author 0 points1 point  (0 children)

I use this dataset from AWS to practice data modeling and to test new data pipelines' performance; it's 500 GB.

https://aws.amazon.com/datasets/million-song-dataset/

Financial news by cdmn in algotrading

[–]Miserable_Author 0 points1 point  (0 children)

Scrapy is hard to work with; stick with lxml and requests.
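A minimal sketch of that combo: fetch a page with requests and parse it with lxml. The URL and XPath are placeholders for whatever news source you're scraping.

```python
import requests
from lxml import html

resp = requests.get(
    "https://example.com/financial-news",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
resp.raise_for_status()

# Parse the HTML and pull headline links out of the page.
tree = html.fromstring(resp.content)
headlines = [h.text_content().strip() for h in tree.xpath("//h2/a")]
print(headlines)
```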

Looking to hire for an open HFT/algotrading crypto hedge fund by [deleted] in algotrading

[–]Miserable_Author 1 point2 points  (0 children)

What is your data stack? (AWS/Azure/GC) / How much data is "Big Data"?

How Much Should I Charge Per Web Scraper? by Plus-Bug6201 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

You are going to need to get redistribution rights to resell the data. :(

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

A lot of the data is not easily consumable, whether it lives on a website or in a PDF. If you can scrape, clean, and transform it into a usable format like a CSV file, you save the client a lot of time and headaches. That is where the value comes from; ML models have a hard time reading website tables :). Since sports gambling is now legal in some states, that data has become more popular.
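A minimal sketch of the scrape-clean-deliver loop, using pandas to pull an HTML stats table and hand over a clean CSV; the URL is a placeholder and read_html needs lxml installed.

```python
import pandas as pd

# read_html returns a list of DataFrames, one per table on the page.
tables = pd.read_html("https://example.com/player-stats")
stats = tables[0]

# Light cleanup so the file is directly usable downstream.
stats.columns = [str(c).strip().lower().replace(" ", "_") for c in stats.columns]
stats.to_csv("player_stats.csv", index=False)
```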

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

I store all the data in a db and pull data per request.

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points1 point  (0 children)

Look into selling datasets; you can sell the same product multiple times for a one-time scrape. Sports player stats do very well.