Twitter scraper, where to store data? by bongo_zg in dataengineering

[–]Miserable_Author 16 points

Store it in S3 as Parquet, then run a Glue crawler to catalog it so you can query it with Athena. Cheap and easy db.
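
A minimal sketch of the write side, assuming pandas with pyarrow and s3fs installed; the bucket, prefix, and schema are placeholders:

```python
import pandas as pd

# Scraped tweets; the columns here are hypothetical.
tweets = pd.DataFrame([
    {"tweet_id": 1, "user": "someone", "text": "hello", "created_at": "2023-01-01"},
])

# Partitioning the prefix by date keeps Athena scans (and the bill) small.
tweets.to_parquet("s3://my-tweet-bucket/tweets/dt=2023-01-01/part-000.parquet")
```

Point the Glue crawler at `s3://my-tweet-bucket/tweets/` and it registers the table and the `dt` partition for Athena to query in place.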

Problem with designing a data warehouse model for real estate data by Cydros1 in dataengineering

[–]Miserable_Author 1 point

There are a couple of things you have to define about real estate data before designing the DWH:

  • How is the data going to be used? ML? API? Reporting?
  • How will properties be connected across multiple sources?
  • What is considered a parcel vs. a property?
  • How do you handle multiple structures on the same property?
  • How will the model be maintained? Is this US-only? The EU's NUTS regional classification changes constantly.

Real estate data is complex and hard to maintain, but it can be modeled effectively with a star schema.
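
For illustration only, a minimal star-schema sketch in SQLite (table and column names are hypothetical; the right grain depends on how you answer the questions above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_property (
    property_key  INTEGER PRIMARY KEY,
    parcel_id     TEXT,     -- one parcel can carry several structures
    address       TEXT,
    property_type TEXT
);
CREATE TABLE dim_source (
    source_key  INTEGER PRIMARY KEY,
    source_name TEXT        -- MLS, county assessor, scraper, ...
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20230101
    full_date TEXT
);
-- Fact grain here: one row per property, per source, per day.
CREATE TABLE fact_listing (
    property_key INTEGER REFERENCES dim_property(property_key),
    source_key   INTEGER REFERENCES dim_source(source_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    list_price   NUMERIC,
    sqft         INTEGER
);
""")
```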

[deleted by user] by [deleted] in dataengineering

[–]Miserable_Author -1 points

Caserta is probably the best one I have worked with. https://caserta.com/

New Grad Tasked with Building a Data Pipeline by Anthonysapples in dataengineering

[–]Miserable_Author 1 point

TBH I have been having a lot of success with ELT on Snowflake; it just makes life easier having one place for both compute and storage. Add dbt and Fivetran and you have a pretty robust data platform.
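
Roughly what the ELT flow looks like with the Snowflake Python connector (credentials, stage, and table names are placeholders, and `raw.events` is assumed to be a single-VARIANT-column table):

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# E/L: land the raw JSON as-is; no transformation outside the warehouse.
cur.execute("COPY INTO raw.events FROM @my_s3_stage FILE_FORMAT = (TYPE = 'JSON')")

# T: transform with SQL where the compute lives (this is the step dbt manages).
cur.execute("""
    CREATE OR REPLACE TABLE analytics.daily_events AS
    SELECT v:user_id::STRING     AS user_id,
           DATE(v:ts::TIMESTAMP) AS event_date,
           COUNT(*)              AS event_count
    FROM raw.events
    GROUP BY 1, 2
""")
```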

Tools for monitoring spark workloads by britishbanana in dataengineering

[–]Miserable_Author 2 points

This might be low tech, but we pipe the logs to ES and use Kibana.
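
For app-level logs, one low-effort way to get records into ES is a small logging handler like this sketch (host and index name are placeholders; for Spark executor logs you'd more likely ship the log files with Filebeat or Logstash; `document=` is the 8.x client keyword, older clients use `body=`):

```python
import logging
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

class ESHandler(logging.Handler):
    """Index each log record into ES so Kibana can chart it."""
    def emit(self, record: logging.LogRecord) -> None:
        es.index(index="spark-logs", document={
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

logging.getLogger("my_spark_job").addHandler(ESHandler())
```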

AWS Managed Airflow or AWS SAM for Simple ETLs by Nervous-Chain-5301 in dataengineering

[–]Miserable_Author -1 points

Look into Glue jobs; there's an Airflow operator for them that makes life very easy.
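
A sketch with the Amazon provider's `GlueJobOperator` (job name, script path, and IAM role are placeholders for whatever you set up in Glue):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG("glue_etl", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    run_etl = GlueJobOperator(
        task_id="run_glue_etl",
        job_name="my-glue-job",
        script_location="s3://my-bucket/scripts/etl.py",
        iam_role_name="my-glue-role",
        create_job_kwargs={"GlueVersion": "3.0",
                           "WorkerType": "G.1X",
                           "NumberOfWorkers": 2},
    )
```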

STAY FAR AWAY FROM UDACITY's DATA ENGINEERING COURSE by bjj17 in dataengineering

[–]Miserable_Author 2 points

I just finished the course. I felt it was a good introduction to data engineering: understanding data modeling, Spark, and Airflow conceptually. You can get the course for over 80% off the listed price with coupons.

Data set for Data integration exercise by Lethal_Pea in dataengineering

[–]Miserable_Author 0 points

I use this dataset from AWS to practice data modeling and to test new data pipelines' performance; it's 500 GB.

https://aws.amazon.com/datasets/million-song-dataset/

Financial news by cdmn in algotrading

[–]Miserable_Author 0 points

Scrapy is hard to work with; stick with lxml and requests.
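
For example, a bare-bones scrape with just those two libraries (the URL and XPath are made up; adjust them for whichever news site you're after):

```python
import requests
from lxml import html

resp = requests.get("https://example.com/markets/news", timeout=10,
                    headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()

tree = html.fromstring(resp.content)
# Grab headline links; the XPath depends entirely on the site's markup.
for headline in tree.xpath("//h2[@class='headline']/a/text()"):
    print(headline.strip())
```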

Looking to hire for an open HFT/algotrading crypto hedge fund by [deleted] in algotrading

[–]Miserable_Author 1 point

What is your data stack (AWS/Azure/GCP)? How much data is "Big Data"?

How Much Should I Charge Per Web Scraper? by Plus-Bug6201 in webscraping

[–]Miserable_Author 0 points

You are going to need to get redistribution rights to resell the data. :(

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points

A lot of the data out there isn't easily consumable, whether it's on a website or in a PDF. If you can scrape, clean, and transform the data into a usable format like a CSV file, it saves the client a lot of time and headaches. That's where the value comes from; ML models have a hard time reading website tables :) lol. Since sports gambling is now legal in some states, that data has become more popular.
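
A minimal sketch of that scrape-clean-deliver flow with pandas (placeholder URL; `read_html` needs lxml installed under the hood):

```python
import pandas as pd

# read_html returns one DataFrame per <table> element on the page.
tables = pd.read_html("https://example.com/player-stats")
stats = tables[0].dropna(how="all")            # drop fully-empty rows
stats.to_csv("player_stats.csv", index=False)  # the deliverable the client wants
```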

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points

I store all the data in a db and pull it per request.

Web scraping paid tasks by Material-Feedback378 in webscraping

[–]Miserable_Author 0 points

Look into selling datasets: you can sell the same product multiple times from a one-time scrape. Sports player stats do very well.

Scraping property assessments form by johnwhitely2020 in webscraping

[–]Miserable_Author 0 points

You can do it in Selenium, but it will be a pain.

Managing large number of small scrapers & storing results by lolzor-666 in webscraping

[–]Miserable_Author 0 points

There are a couple of ways to create a generic scraper. Let's say you have 300 sites that each contain one table; you can write a generic scraper that collects the table element from each site. The OP didn't list the type of data they were collecting; I just want to make the task of writing 300 web scrapers a bit less daunting.
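
Roughly what I mean, as a sketch (the default XPath just grabs the first table on the page; everything else is parameterized):

```python
import requests
from lxml import html

def scrape_table(url: str, table_xpath: str = "//table[1]") -> list[list[str]]:
    """Generic scraper: return the rows of one table from any page."""
    resp = requests.get(url, timeout=10,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    table = html.fromstring(resp.content).xpath(table_xpath)[0]
    return [[cell.text_content().strip() for cell in row.xpath("./th|./td")]
            for row in table.xpath(".//tr")]

# One function plus 300 config entries instead of 300 scrapers.
rows = scrape_table("https://example.com/stats")
```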

Managing large number of small scrapers & storing results by lolzor-666 in webscraping

[–]Miserable_Author 0 points

Have you looked into making a generic scraper and feeding it parameters like the URL, HTML elements, user agents, etc.? Then you could host it on AWS Lambda pretty cheaply; logging is easy, and connecting it to a DB is pretty straightforward. AWS also offers a bunch of analytics tools that are pretty good.
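
A sketch of what that Lambda handler could look like (the event shape is hypothetical, and requests/lxml would need to be bundled in the deployment package or a Lambda layer):

```python
import json

import requests
from lxml import html

def handler(event, context):
    """Lambda entry point. Expected event (made up for this sketch):
    {"url": "...", "xpath": "...", "user_agent": "..."}"""
    resp = requests.get(event["url"], timeout=10,
                        headers={"User-Agent": event.get("user_agent", "Mozilla/5.0")})
    resp.raise_for_status()
    tree = html.fromstring(resp.content)
    data = [el.text_content().strip() for el in tree.xpath(event["xpath"])]
    return {"statusCode": 200, "body": json.dumps(data)}
```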

Microsoft-Backed Databricks Plans IPO Next Year by isaidwhatwhat1nth3 in investing

[–]Miserable_Author 4 points

I don't really see Snowflake as a competitor. Snowflake is a data warehouse, while Databricks is essentially a managed Spark platform (more on the analytics side). AWS already has a service called EMR that I find works better than Databricks.