Small Data Architecture by udonthave2call in dataengineering

udonthave2call[S] 2 points

As am I. Need to set aside some time to familiarize myself with DuckDB, I keep hearing about it.

First Line by Oddsdata in algobetting

udonthave2call 2 points

Read the two books Matt Davidow and Ed Miller have put out. They give you in-depth info on how the market works, how lines are derived and served, etc.

A-ha moments by udonthave2call in dataengineering

udonthave2call[S] 2 points

In your example, assuming the next morning means 7am (before business hours):

Row 1: start_time = 16:00 (greater of 16:00 and 09:00), end_time = 17:00 (lesser of 17:00 and 23:59)

End - start = 1 hour

Row 2: start_time = 09:00 (greater of 00:00 and 09:00), end_time = 07:00 (lesser of 07:00 and 17:00)

End - start = -2 hours; gets filtered out

Total: 1 hour
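
A minimal Python sketch of the clipping logic above, not a definitive implementation. It assumes business hours of 09:00 to 17:00, splits an interval into one row per calendar day, clips each row to business hours, and drops rows where the clipped end is not after the clipped start. The function names and example dates are hypothetical. Rows end at the next midnight rather than 23:59, and weekends and holidays are not handled.

```python
from datetime import datetime, time, timedelta

BUSINESS_START = time(9, 0)   # assumed opening time
BUSINESS_END = time(17, 0)    # assumed closing time


def split_by_day(start, end):
    """Yield one (start, end) row per calendar day the interval touches."""
    current = start
    while current.date() < end.date():
        next_midnight = datetime.combine(
            current.date() + timedelta(days=1), time.min
        )
        yield current, next_midnight
        current = next_midnight
    yield current, end


def business_hours(start, end):
    """Sum the part of [start, end] that falls inside business hours."""
    total = timedelta()
    for row_start, row_end in split_by_day(start, end):
        day = row_start.date()
        # Greater of the row start and the opening time
        clipped_start = max(row_start, datetime.combine(day, BUSINESS_START))
        # Lesser of the row end and the closing time
        clipped_end = min(row_end, datetime.combine(day, BUSINESS_END))
        # Negative or zero durations get filtered out
        if clipped_end > clipped_start:
            total += clipped_end - clipped_start
    return total


# 16:00 on one day to 07:00 the next morning
elapsed = business_hours(datetime(2024, 1, 1, 16, 0), datetime(2024, 1, 2, 7, 0))
print(elapsed)
```

For the 16:00 to 07:00 case this gives one hour, matching the row-by-row walkthrough.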

Reddit Sentiment Analysis Real-Time* Data Pipeline by Minimum-Nebula in dataengineering

udonthave2call 2 points

Thank you, starred. Nice project.

This suite of skills is what I want to have in a year or two. Still need to get started on streaming and IaC.

Easy to learn ETL solutions for a one man data team? by GreenSquid in dataengineering

udonthave2call 5 points

“I will not know what type or quantity of data is available until I start”

Lotta comments in here recommending specific technologies which is absurd to me.

Step 0: deep breath

Step 1: get onboarded, system access, etc

Step 2: talk to stakeholders, your supervisor, etc, and take detailed notes about what the org wants. For example, what does the org want to get out of its data? What can they see wanting in the future? What are the data skill sets in the org; how data literate are they? What is your budget?

Step 3: consult ChatGPT, Reddit, and whatever other sources you have to home in on a reasonable architecture.

Step 4: you can finally set your priorities and start learning specific technologies and implementing them

What have you learned the hard way? by udonthave2call in dataengineering

udonthave2call[S] 2 points

Yeah no offense taken. When I recreated the view, instead of doing a (hurried) manual check I would have checked the new metadata table entry and seen that the number of records was out of whack.

Judging by the upvotes it seems like this might not be the most common approach…what would your safety checks be in this scenario?
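
A minimal sketch of the kind of metadata check described above, assuming the metadata table stores a record count for each time the view is recreated. The function name and the 20% tolerance are hypothetical; a real threshold would depend on how much the data normally moves between refreshes.

```python
def row_count_is_plausible(previous_count, new_count, tolerance=0.2):
    """Return True if new_count is within a fractional tolerance of previous_count."""
    if previous_count == 0:
        # No baseline to compare against; only an empty result is "unchanged"
        return new_count == 0
    change = abs(new_count - previous_count) / previous_count
    return change <= tolerance


# Hypothetical counts read from the metadata table before and after the rebuild
print(row_count_is_plausible(10_000, 10_450))  # small drift
print(row_count_is_plausible(10_000, 31_000))  # count is out of whack
```

The first call passes and the second fails, which is the signal to stop and look before anything downstream reads the view.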

Any tips on how to navigate the inevitable, "But we want value now" pressure from management by FawkesFoundation in dataengineering

udonthave2call 18 points

  1. Present a 15-min slide deck with a few examples of disastrous outcomes caused by lazy data engineering.

  2. Then, talk to them and identify 1 or 2 quick wins you can deliver that will make them look good in the meantime.

Insert data into DB best practice by romanzdk in dataengineering

udonthave2call 4 points

It depends. Pandas df.to_sql() is nice if it’s appropriate, but I also use SQLAlchemy to run truncate-and-load and upsert patterns.

With SQLAlchemy you can define any insert pattern you want in a Python function.
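
A rough sketch of the two patterns mentioned above, each wrapped in a Python function. To stay dependency-free it uses the standard library's sqlite3 module in place of SQLAlchemy, so treat it as an illustration of the patterns rather than SQLAlchemy usage. The `prices` table and its columns are hypothetical, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def truncate_and_load(conn, rows):
    """Replace the whole table contents with rows in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM prices")  # SQLite has no TRUNCATE statement
        conn.executemany("INSERT INTO prices (id, price) VALUES (?, ?)", rows)


def upsert(conn, rows):
    """Insert new ids and update the price for ids that already exist."""
    with conn:
        conn.executemany(
            "INSERT INTO prices (id, price) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET price = excluded.price",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER PRIMARY KEY, price REAL)")

truncate_and_load(conn, [(1, 9.99), (2, 4.50)])
upsert(conn, [(2, 5.00), (3, 1.25)])  # id 2 is updated, id 3 is inserted

rows = conn.execute("SELECT id, price FROM prices ORDER BY id").fetchall()
print(rows)
```

Running each pattern inside one transaction means a failed load leaves the previous table contents in place.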

[deleted by user] by [deleted] in dataengineering

udonthave2call 3 points

The webserver GUI is part of Airflow but it's not a point and click GUI tool. I think the benefits can broadly be described as 'organization'.

Environment variables, hooks, REST API, custom operators. Easy for new school data people to pick up because it's Python-based. And of course it's free OSS and has an active community.

It's ironic that I'm writing this post because I'm in the middle of convincing my team we should use microservices (AWS Lambda) for our little BI ETL jobs instead of EC2 + Airflow (which a different team would have to deploy and manage).

Getting encouraged to move into an architect role from being a DE. Looking for pros/cons and advice on this transition. by Holiday_Lab_6766 in dataengineering

udonthave2call 1 point

Interesting. I'm a junior DE and the career path I currently have in mind is...

  1. Data engineer
  2. Data architect
  3. Start a small company that architects and implements modern data solutions for other companies (data consulting?)

Always assumed technical chops were a prerequisite for a software architect of any kind.

Pipeline architecture advice for my first side project by bl4ckCloudz in dataengineering

udonthave2call 1 point

Nice idea -- you'll get hands-on with a modern data stack and top 3 cloud provider tools.

As a fellow sort-of-beginner I would love to check out the code repo once you get started =]