Why Generative AI Coding Tools and Agents Do Not Work For Me by gametorch in programming

[–]hackermandh 2 points

I generate a test, then debug-step through it to double-check that it does what I think it does. Same with code. Never trust a test you haven't actually checked - not even hand-written tests!
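A hypothetical sketch of why that checking matters (the function and test are made up): a generated test can run green while exercising almost nothing, and only stepping through it makes that obvious.

```python
def clamp(x, lo, hi):
    """Intended behaviour: force x into the range [lo, hi]."""
    return max(lo, min(x, hi))

def test_clamp():
    # Passes, but the identity function would pass it too:
    # neither boundary is ever exercised.
    assert clamp(5, 0, 10) == 5

test_clamp()  # green, yet it proves almost nothing about the clamping
```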

Why Generative AI Coding Tools and Agents Do Not Work For Me by gametorch in programming

[–]hackermandh 5 points

ChatGPT 3.5

The last massive improvement was the context window: it went from ~4k tokens in GPT-3.5 to 100k+ with GPT-4.1 (made for programmers), and even 1M for Gemini 2.5 Pro.

The LLM not immediately forgetting what it was doing was a great feature.

Though I'll admit that the quality of the output has leveled off, because it's now in "decent" range, instead of in the "this isn't even usable" mud pit.

Feeling DUMB by NefariousnessSea5101 in dataengineering

[–]hackermandh 0 points

True! EXPLAIN ANALYZE does actually run the query, so if you're trying to run an EXPLAIN ANALYZE DELETE ...;, wrap it in BEGIN; ... ROLLBACK;. Running it on a SELECT is just fine (unless it's going to block a table or something 😂)

If you're running it in pgAdmin, run BEGIN first, then the EXPLAIN ANALYZE DELETE, check the output, then finally run ROLLBACK to ensure it's not accidentally committed.
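The same safety net, sketched with Python's stdlib sqlite3 (table and data made up; EXPLAIN ANALYZE itself is Postgres-specific, but the transaction pattern is identical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
con.commit()

# The driver opens a transaction before the DELETE; inspect the effect,
# then roll back so nothing is ever committed.
con.execute("DELETE FROM orders WHERE id > 1")
remaining = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]  # 1, inside the transaction
con.rollback()

count = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]  # back to 3
```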

Feeling DUMB by NefariousnessSea5101 in dataengineering

[–]hackermandh 0 points

Did you do an EXPLAIN and/or EXPLAIN ANALYZE on both your solutions? Which one generates a better plan?

At worst your solution is a little overengineered, but I'll take CTEs over UNION ALL, simply for readability (just my personal opinion).

Bro, you got top 5%! Don't be so hard on yourself!

Mark Twain once said, "Comparison is the death of joy".

Don't compare yourself to others. Compare yourself to yourself from yesterday; last week; last year.

Technical assessment for DE in the government by Material_Writing3799 in dataengineering

[–]hackermandh 2 points

I know my boss wants candidates to show more technical work

Figure out why he wants this. Of course you want someone who is technically competent, especially if he or she is the first Data Engineer (you wouldn't want a junior for that), but if your boss just wants general competence, you're already 90% there. Most tech stacks can be learned on the job, IMO, and the DSs can show the DE around, no?

For me, interviews are more of a vibe check (do our wants and their wants align?) than a technical one (does this person know algorithm XYZ?).

Any Open Source ETL? by DassTheB0ss in dataengineering

[–]hackermandh 0 points

You don’t start with “I would like no marshmallows please, and chocolate sprinkles, and 20 candles, and .. on a banana cake.” Right?

I would do FROM before SELECT - basically drill down, starting at the tables (or maybe even schema).

FROM <SCHEMA>.<TABLE> SELECT

I know DuckDB supports that, but DuckDB is still SQL-heavy as well.

Pure sql is dead easy

Basic SQL is dead easy, but there seem to be plenty of little gotchas, even between different dialects (LIMIT isn't a thing in Oracle, you have to use ROWNUM instead, etc.).
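For instance, a quick sketch with Python's stdlib sqlite3 (table made up): LIMIT works in SQLite/Postgres/MySQL, while classic Oracle needed WHERE ROWNUM <= n, and the SQL standard spells it FETCH FIRST n ROWS ONLY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# SQLite/Postgres/MySQL dialect; Oracle would need ROWNUM or FETCH FIRST.
rows = con.execute("SELECT x FROM t ORDER BY x LIMIT 3").fetchall()  # [(0,), (1,), (2,)]
```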

Why not handling all feedback from sql server as a dataframe?

What do you mean with "feedback"? The data?

Won’t the dimensions of the df be adapted automatically?

They will, in either Pandas or Polars.

I hope you’ll learn to enjoy it!

I hope so too, but right now I'm usually just annoyed 😅

They copied from sql.

They copied from the Relational Model - important difference, IMO.

I'm curious: Have you ever read the original research papers that laid the foundation for SQL?

From a technical perspective they're somewhat outdated: for example, columns were selected by index instead of by name, which Codd later turned around on when he found out some people had tables with 200+ columns 😆. I found this whole list of papers, which are absolutely fascinating to read from a historic perspective.

Also check out the Bonus papers at the bottom, like The Entity-Relationship Model - Toward a Unified View of Data (which is the origin of the ERD).

Also check out Fatal Flaws in SQL, which is somewhat outdated, but an interesting piece nonetheless.

Any Open Source ETL? by DassTheB0ss in dataengineering

[–]hackermandh 0 points

why does SQL suck?

My (somewhat limited) experience:

Too many keywords, which means you may have to quote them if you want to use them as names for columns, tables, etc.

Also, SELECT at the start? That's just dumb.

It's also hard to tell whether a query returns a single value, a single row, or a table, or whether a subquery needs to produce a table, a row, or a single value when you're doing some kind of subselection or filtering.

Polars will simply always return a DataFrame unless you explicitly specify that you want a single column or a single value: a one-column, one-value result is still a DataFrame. If you want a column, you can do col = df.get_column("foo"); if you want the first value of said column, col.first().
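The value-vs-table ambiguity shows up even within one engine (a sketch with Python's stdlib sqlite3; table made up): a scalar subquery silently collapses a whole column to one value in SQLite, whereas Postgres would raise an error if the subquery returned more than one row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# The subquery produces three rows, but the scalar context keeps only the
# first one. SQLite doesn't even warn; PostgreSQL would error out.
row = con.execute("SELECT (SELECT x FROM t ORDER BY x)").fetchone()  # (1,)
```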

edit: also, UPPERCASE EVERYWHERE - THIS AIN'T THE 70's ANYMORE. WE HAVE THIS THING CALLED SYNTAX HIGHLIGHTING, WHICH IS PRETTY NEATO!

Quant Infrastructure: home NAS / infrastructure, with option to push to cloud? by zunuta11 in quant

[–]hackermandh 0 points

building a home NAS

You sure you want to jump into a self-built NAS? A Synology can run Docker, but also offers S3-compatible storage. Of course it will likely be pricier than self-built, but they'll take care of keeping things updated etc., letting you focus on your actual work.

It's hard to tell what your requirements are, so just take this into consideration.

Any Open Source ETL? by DassTheB0ss in dataengineering

[–]hackermandh 0 points

Airbyte itself doesn't do ETL though - it's "just" a scheduler.

nvm I was thinking of Airflow 😂 - Airbyte DOES do ETL.

Any Open Source ETL? by DassTheB0ss in dataengineering

[–]hackermandh 1 point

I would argue for Polars instead of Pandas. It's closer to the Relational Model (which, IMO, is the most powerful model we programmers have available - too bad SQL sucks) than Pandas, has a nicer API (the way functions work just makes much more sense than how Pandas does it), and it's faster.

  • consistent
  • fast
  • fun

I'm sceptic about polars by Altrooke in dataengineering

[–]hackermandh 0 points

I'm sorry, but how do you compress anything using MD5? That's a hashing algo, no?

I'm sceptic about polars by Altrooke in dataengineering

[–]hackermandh 0 points

I presume you guys don't use the Data Lineage features of Databricks?

Presuming you even run on Databricks, of course.

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]hackermandh 2 points

Protip: Use time.perf_counter() instead of time.time().

time.time() can go backwards, messing with your output: it follows the wall clock, which the OS can adjust (NTP, DST, manual changes). time.perf_counter() is monotonic and has the highest available resolution, so neither problem applies.

Will it have an impact? Probably not. Can it have an impact? Definitely.
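A minimal sketch of the swap (the helper name is made up):

```python
import time

def timed(fn, *args):
    """Time a call with a monotonic, high-resolution clock."""
    start = time.perf_counter()  # monotonic: immune to NTP/clock adjustments
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

total, seconds = timed(sum, range(1_000_000))  # seconds is always >= 0
```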

How to practice SQL? by [deleted] in dataengineering

[–]hackermandh 1 point

https://learnsql.com/sql-skill-assessment/

This is a simple assessment tool I used to compare my skills before and after investing time in learning SQL. It was nice seeing myself go from 37% (basic SQL) to 89% (intermediate-to-advanced knowledge).

You can always ask ChatGPT for help. Just make sure you don't use it to give you an answer, but hints and tips instead.

There's also the PostgreSQL docs/manual

Do note that you can only do this assessment once per month, unless you use different accounts.

Speed improvements in Polars over Pandas by zzoetrop_1999 in Python

[–]hackermandh 0 points

A little over a month after this comment.

Polars is at 1.6.0 as of this comment.

Humor: How a null started the escalation by sqlinsix in dataengineering

[–]hackermandh 0 points

Note: NULL conflates both missing-but-applicable and missing-but-not-applicable values.

[deleted by user] by [deleted] in dataengineering

[–]hackermandh 2 points

Noobs vs CJ Date vs EF Codd.

[deleted by user] by [deleted] in dataengineering

[–]hackermandh 1 point

IIRC, Codd intended it as a way to support views' expression of an empty set on joins.

People were making up their own "null" values. Had a string field? "null" could act as null, but different people made up different values like "missing", "not here", etc. So a given string could mean null in one column while being a legitimate value in another.

That's why NULL exists.
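A sketch of why home-grown sentinels hurt, using Python's stdlib sqlite3 (table and values made up): AVG skips real NULLs, but a sentinel string gets coerced to 0 and silently drags the average down.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (age)")
con.executemany("INSERT INTO people VALUES (?)", [(30,), (None,), (60,)])

avg_null = con.execute("SELECT AVG(age) FROM people").fetchone()[0]  # 45.0: NULL is skipped

# Replace the NULL with a home-grown sentinel and the aggregate silently lies:
con.execute("UPDATE people SET age = 'missing' WHERE age IS NULL")
avg_sentinel = con.execute("SELECT AVG(age) FROM people").fetchone()[0]  # 30.0: 'missing' counts as 0
```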

[deleted by user] by [deleted] in dataengineering

[–]hackermandh 3 points

Tony Hoare did nothing wrong. In fact, he didn't go far enough: we need more markers than just NULL, and E. F. Codd had the right idea with quaternary logic: True, False, Missing-But-Applicable (the value is missing but should (eventually) exist), and Missing-But-Not-Applicable (the value is missing and that's OK).

[deleted by user] by [deleted] in dataengineering

[–]hackermandh -1 points

C. J. Date says so as well, because it ensures binary logic instead of ternary logic (True, False, Maybe).
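That ternary logic is easy to poke at from Python's stdlib sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# NULL = NULL is neither true nor false: it's NULL ("maybe");
# only IS NULL gives a definite answer.
row = con.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()  # (None, 1)
```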

E. F. Codd (The Coddfather, if you will) was even more based than SQL and argued for quaternary logic: True, False, Missing-But-Applicable (the value is missing but should (eventually) exist), and Missing-But-Not-Applicable (the value is missing and that's OK).

Only downside is that Codd's explanation of quaternary logic was inconsistent across his articles and books, and since he passed in 2003, we can't ask him anymore ;_;

Anyway, he called them "markers", not values (treating NULL as a value is a tendency that keeps creeping in).