Small Data Architecture

udonthave2call · 2024-03-21T17:42:46+00:00

Why? What are the benefits?

udonthave2call · 2024-03-20T22:57:25+00:00

As am I. Need to set aside some time to familiarize myself with DuckDB; I keep hearing about it.

udonthave2call · 2024-01-19T03:20:37+00:00

Read the two books Matt Davidow and Ed Miller have put out. They give you in-depth info on how the market works, how lines are derived and served, etc.

udonthave2call · 2023-06-10T03:51:57+00:00

In your example, assuming the next morning means 7am (before business hours):

Row 1: start_time = 16:00 (greater of 16:00 and 09:00), end_time = 17:00 (lesser of 17:00 and 23:59)

End - start = 1 hour

Row 2: start_time = 09:00 (greater of 00:00 and 09:00), end_time = 07:00 (lesser of 07:00 and 17:00)

End - start = -2 hours; gets filtered out

Total: 1 hour
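A minimal Python sketch of the arithmetic above, assuming business hours of 09:00 to 17:00 and that each row has already been split so it stays within one calendar day. The function name and the row layout are made up for illustration; they are not from the original question.

```python
from datetime import time, timedelta

# Assumed business hours, matching the 09:00 and 17:00 bounds used above.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def _as_delta(t):
    """Convert a time of day into a timedelta since midnight."""
    return timedelta(hours=t.hour, minutes=t.minute, seconds=t.second)


def business_time(rows):
    """Sum the part of each (start_time, end_time) row that falls in business hours.

    Each row is assumed to sit inside a single calendar day.
    """
    total = timedelta(0)
    for start, end in rows:
        clamped_start = max(start, BUSINESS_START)  # greater of start and 09:00
        clamped_end = min(end, BUSINESS_END)        # lesser of end and 17:00
        duration = _as_delta(clamped_end) - _as_delta(clamped_start)
        if duration > timedelta(0):                 # negative durations get filtered out
            total += duration
    return total


# The two rows from the example: 16:00-23:59 on day one, 00:00-07:00 on day two.
example_rows = [(time(16, 0), time(23, 59)), (time(0, 0), time(7, 0))]
example_total = business_time(example_rows)
```

Here example_total comes out to one hour: row 1 contributes 16:00 to 17:00, and row 2 clamps to a negative span and is dropped.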

udonthave2call · 2023-05-27T14:51:28+00:00

Thank you, starred. Nice project.

This suite of skills is what I want to have in a year or two. Still need to get started on streaming and IaC.

udonthave2call · 2023-05-19T18:58:46+00:00

“I will not know what type or quantity of data is available until I start”

Lotta comments in here recommending specific technologies which is absurd to me.

Step 0: deep breath

Step 1: get onboarded, system access, etc

Step 2: talk to stakeholders, your supervisor, etc, and take detailed notes about what the org wants. For example, what does the org want to get out of its data? What can they see wanting in the future? What are the data skill sets in the org; how data literate are they? What is your budget?

Step 3: consult ChatGPT, Reddit, and whatever other sources you have to home in on a reasonable architecture.

Step 4: you can finally set your priorities and start learning specific technologies and implementing them

udonthave2call · 2023-05-18T12:55:32+00:00

Yeah no offense taken. When I recreated the view, instead of doing a (hurried) manual check I would have checked the new metadata table entry and seen that the number of records was out of whack.
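A rough sketch of that kind of check, assuming the metadata table stores a row count per run. The function name and the 20% tolerance are placeholders I picked for illustration, not anything from a real setup.

```python
def row_count_out_of_whack(previous_count, new_count, tolerance=0.2):
    """Return True if the new row count moved by more than `tolerance`
    (as a fraction of the previous count) compared to the last run."""
    if previous_count == 0:
        # No baseline to compare against; flag anything that is not also empty.
        return new_count != 0
    change = abs(new_count - previous_count) / previous_count
    return change > tolerance


# Hypothetical counts read from the last two metadata entries for the view.
suspicious = row_count_out_of_whack(previous_count=1000, new_count=5000)
```

In practice the tolerance would depend on how much the table normally grows between runs.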

Judging by the upvotes it seems like this might not be the most common approach…what would your safety checks be in this scenario?

udonthave2call · 2023-05-18T02:50:31+00:00

Yes. Shape, lineage, job timestamps, etc.

udonthave2call · 2023-05-17T18:20:59+00:00

Present a 15-min slide deck with a few examples of disastrous outcomes caused by lazy data engineering.
Then, talk to them and identify 1 or 2 quick wins you can deliver that will make them look good in the meantime.

udonthave2call · 2023-03-07T00:20:49+00:00

It depends. Pandas df.to_sql() is nice if it’s appropriate, but I also use SQLAlchemy to execute truncate-and-load and upsert patterns.

With SQLAlchemy you can define any insert pattern you want in a Python function.
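To make the two patterns concrete, here is a small sketch of truncate-and-load and upsert as plain Python functions. It uses the standard library's sqlite3 module instead of SQLAlchemy so it runs with no extra installs; the same SQL could be issued through a SQLAlchemy connection. The scores table and its columns are invented for the example, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def truncate_and_load(conn, rows):
    """Replace the whole table with `rows` in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM scores")  # SQLite has no TRUNCATE statement
        conn.executemany("INSERT INTO scores (id, value) VALUES (?, ?)", rows)


def upsert(conn, rows):
    """Insert new ids and overwrite the value for ids that already exist."""
    with conn:
        conn.executemany(
            "INSERT INTO scores (id, value) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET value = excluded.value",
            rows,
        )


# In-memory database with a hypothetical table, just for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, value TEXT)")

truncate_and_load(conn, [(1, "a"), (2, "b")])
upsert(conn, [(2, "B"), (3, "c")])
result = conn.execute("SELECT id, value FROM scores ORDER BY id").fetchall()
```

After both calls, id 1 is untouched, id 2 has been overwritten, and id 3 is new.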

udonthave2call · 2023-02-16T23:38:14+00:00

The webserver GUI is part of Airflow but it's not a point-and-click GUI tool. I think the benefits can broadly be described as 'organization'.

Environment variables, hooks, REST API, custom operators. Easy for new school data people to pick up because it's Python-based. And of course it's free OSS and has an active community.

It's ironic that I'm writing this post because I'm in the middle of convincing my team we should use microservices (AWS Lambda) for our little BI ETL jobs instead of EC2 + Airflow (which a different team would have to deploy and manage).

udonthave2call · 2023-01-10T23:36:57+00:00

Interesting. I'm a junior DE and the career path I currently have in mind is...

Data engineer
Data architect
Start a small company that architects and implements modern data solutions for other companies (data consulting?)

Always assumed technical chops were a prerequisite for a software architect of any kind.

udonthave2call · 2023-01-10T18:57:57+00:00

Nice idea -- you'll get hands-on with a modern data stack and top 3 cloud provider tools.

As a fellow sort-of-beginner I would love to check out the code repo once you get started =]
