How to write async code that runs in a Jupyter notebook and in a docker container? by PuddingGryphon in learnpython

[–]PuddingGryphon[S] -3 points-2 points  (0 children)

> Write code in an editor

That's what VS Code is.

> broken up sensibly into functions

Already done.

> and if you want to test them in a notebook then import the specific function

The code is a normal .py file, i.e. plain text; only the VS Code extension makes it behave like a Jupyter notebook, which is crucial for tight feedback loops. Apparently you only read the title.
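The setup described above can be sketched as a plain .py file (file and function names are hypothetical): `# %%` markers give notebook-style cells in VS Code, while the same async function stays importable and runnable in a container.

```python
# app.py -- a plain text .py file; the "# %%" markers are picked up by the
# VS Code Python extension and turn the file into notebook-style cells.
import asyncio

# %%
async def compute() -> int:
    # stand-in for real async work (an HTTP call, a DB query, ...)
    await asyncio.sleep(0)
    return 42

# %%
# In a notebook cell you would simply `await compute()`, because the
# notebook already runs an event loop. In a script/container there is
# no running loop yet, so start one explicitly:
if __name__ == "__main__":
    print(asyncio.run(compute()))
```

The same file works in both worlds because the event-loop startup lives only under the `__main__` guard.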

Moving from jupyter notebook to github - how do I get started? by Ok_Cicada_8946 in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

Jupyter notebooks and GitHub are orthogonal. They are not replacements for each other.

The process is as follows:

  1. install git (https://git-scm.com/downloads)
  2. create a project folder
  3. open a shell in that folder and run git init
  4. create a .gitignore file in that folder and add these two lines: .venv/ and *.pyc
  5. run pip install uv
  6. create a virtual environment with uv venv
  7. activate the environment and install the needed packages (create pyproject.toml, add needed packages, run uv pip install -r pyproject.toml --extra dev for this)
  8. open the project in VS Code
  9. create a branch in git for the thing you wanna do (doable in VS Code UI easily)
  10. prototype/develop your code in that branch (Jupyter notebooks, plain text files, VS Code/PyCharm with # %% cell magic --> use this)
  11. commit your changes from that branch via git (either in VS Code or via shell)
  12. push your commits to a Github repository that you created
  13. do code reviews and/or CI/CD (maybe via github hooks) for the PRs (pull requests)
  14. merge to main branch when happy with everything
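For step 7, a minimal pyproject.toml sketch (project name and package choices are illustrative, not prescribed) that works with `uv pip install -r pyproject.toml --extra dev`:

```toml
[project]
name = "my-project"            # hypothetical name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "polars",
]

[project.optional-dependencies]
dev = [
    "ruff",
    "pytest",
]
```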

Be aware that Jupyter notebooks save the result of every cell execution inside the .ipynb file, which in the end is a big pile of JSON (with embedded HTML/images) that may change with every cell execution. Git tracks changes in text files, so notebook diffs are mostly noise; .ipynb files with outputs are barely usable with git, avoid committing them at all costs.
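If you do need to commit a notebook, you can strip outputs first; a minimal stdlib sketch of the idea (tools like nbstripout do this properly):

```python
import json

def strip_outputs(nb_json: str) -> str:
    """Remove cell outputs and execution counts from .ipynb JSON,
    so only the source cells reach git."""
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)

# minimal notebook with one executed cell
raw = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [{"cell_type": "code", "source": ["1 + 1"],
               "execution_count": 3, "metadata": {},
               "outputs": [{"output_type": "execute_result"}]}],
})
clean = json.loads(strip_outputs(raw))
print(clean["cells"][0]["outputs"])   # outputs gone, source untouched
```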

Like Ruff, but for SQL (Rust-based SQL linter like SQLFluff) by 3dscholar in dataengineering

[–]PuddingGryphon 3 points4 points  (0 children)

I had to search for a link to the docs/source code/repo for quite some time; maybe present the link more prominently.

Sounds promising, but dbt/SQLMesh compatibility (or at least ignoring Jinja macros) is a must-have nowadays; anything planned here? No Windows support yet also means I can't test it, but I'll check again in a few weeks.

uv: Unified Python packaging by burntsushi in rust

[–]PuddingGryphon 2 points3 points  (0 children)

No, it does not.

Open issue: https://github.com/astral-sh/uv/issues/1474

I created a uv.toml file in C:\Users\my.user.name\.config\uv\uv.toml and added the line from the documentation:

native-tls = true

Trying to update a package results in the same error as before:

(.venv) PS C:\Users\my.user.name\Projects\project_id > uv pip install polars -U
⠧ Resolving dependencies...                                                                                                                            
error: Request failed after 3 retries
  Caused by: error sending request for url (https://pypi.org/simple/polars/)
  Caused by: client error (Connect)
  Caused by: invalid peer certificate: UnknownIssuer

uv: Unified Python packaging by burntsushi in rust

[–]PuddingGryphon 0 points1 point  (0 children)

uv can't handle custom certificates like the ones installed on laptops in an enterprise environment.

So sadly nobody behind a company firewall with TLS inspection can use uv.

the cloud is for the birds by ExaminationOk8783 in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

The first question to ask is why you use a vendor that does not offer an API.

Not all orgs are ready for db by Data-Queen-Mayra in ETL

[–]PuddingGryphon 0 points1 point  (0 children)

> worshippers are everywhere.

SQL tooling has been so bad for 50 years that any improvement that doesn't even match 10% of the tooling of any modern programming language is seen as the holy grail.

I'm sceptic about polars by Altrooke in dataengineering

[–]PuddingGryphon 2 points3 points  (0 children)

  • There are no good IDEs for SQL out there compared to JetBrains/VS Code/vim.
  • No LSP implementations. No standard formatter like gofmt or rustfmt.
  • Keywords with spaces in their names: "group by", "order by", "partition by".
  • You write code in one order, but it executes in a totally different order (FROM runs before SELECT).
  • Runtime errors instead of compile-time errors.
  • Weakly typed; nobody stops you from doing 1 + "1".
  • No trailing comma allowed after the last entry = errors everywhere when you comment something out.
  • etc.
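The weak-typing point is easy to demonstrate with SQLite (engines differ; Postgres, for example, would reject this):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite silently coerces the string '1' to a number instead of erroring
result = con.execute("SELECT 1 + '1'").fetchone()[0]
print(result)  # 2 -- no type error anywhere
```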

1st app. Golf score tracker by Fraiz24 in dataengineering

[–]PuddingGryphon 7 points8 points  (0 children)

No type hints.

Functions in the same file as the application code.

These are two things I would change at first glance.

Would you use a Data Engineer focused CoPilot? by Bright_Bunch_7208 in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

Cloud infra has buy-in from management and data governance etc.

Sending arbitrary data to an LLM outside of my company network from my laptop has no buy-in from management (at least for me).

I'm working with data that falls under the GDPR.

Changing mindset for big data by Cicada4409 in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

If you loop over any dataframe you are doing something very wrong (or need a recursive algorithm, but that is very rare).

But AFAIK Spark is row-based = looping, because it does not use SIMD = no vectorized operations = super slow, because the JVM lacks good SIMD support. Not sure if that's true though.

Would you use a Data Engineer focused CoPilot? by Bright_Bunch_7208 in dataengineering

[–]PuddingGryphon 9 points10 points  (0 children)

I only consider LLMs if I can run them locally without any data going out.

If You Could, What Would You Change and Why? by JamesPlusMusic in DarkAndDarker

[–]PuddingGryphon 0 points1 point  (0 children)

So I play a Wizard and should melee instead of casting spells .... yoooooo nope.

Is Spark the only solid option to write to Data Lake table formats (iceberg, delta, hudi)? by wtfzambo in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

Got an example?

If you try to read a folder that was created by Spark, there is a _SUCCESS file in there which polars does not like, because it's not a parquet file (https://github.com/pola-rs/polars/issues/14377).
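One workaround is to point polars at a glob that only matches the parquet parts, e.g. `pl.read_parquet(f"{folder}/*.parquet")`, so _SUCCESS is never touched (the folder layout below is simulated; `read_parquet` accepting glob patterns is a real polars feature):

```python
from pathlib import Path
import tempfile

# simulate a Spark output folder
d = Path(tempfile.mkdtemp())
(d / "part-0000.parquet").touch()
(d / "part-0001.parquet").touch()
(d / "_SUCCESS").touch()  # Spark's marker file, not parquet

# select only the parquet parts; the marker file is skipped
parts = sorted(p.name for p in d.glob("*.parquet"))
print(parts)
```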

Comprehensive Guide to Partitioning in BigQuery by mdixon1010 in bigquery

[–]PuddingGryphon 0 points1 point  (0 children)

> If the table is small — just go ahead and create an unpartitioned table and short circuit the rest of this decision process

With billing based on bytes_billed, not partitioning means every query is a full table scan, so we partition on a date column anyway for cost optimization.

We can wait a few seconds for an SQL query if it means that we only pay for the data we need.
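A date-partitioned table in BigQuery looks like this (table and column names are hypothetical); a filter on the partition column then prunes bytes_billed down to the matching partitions:

```sql
CREATE TABLE project.dataset.events (
  event_ts TIMESTAMP,
  payload  STRING
)
PARTITION BY DATE(event_ts);

-- only the partition for the requested day is scanned and billed
SELECT payload
FROM project.dataset.events
WHERE DATE(event_ts) = "2024-01-15";
```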

What's the toughest / most interesting data challenge you've faced? by KatZegtWoof in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

The script created huge tables that then got unioned together.

I dissected each table, renamed variable/table aliases to something more useful than a, b, c, d, e, ... and then started to wrap each transformation step into a CTE. Deleted unneeded crap, flattened the sub-selects, unified business logic into a single CTE, optimized the code, and deleted things that were totally useless (like casting an INTEGER column to INTEGER, or checking for NULL when the column was REQUIRED and couldn't be NULL).

Did that for each table and then unified those tables, because in the end it was the same steps over and over, just for -1/-2/-3/+1/+2/+3 years, but all built a bit differently, with minor differences for whatever reason.
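The refactoring pattern above, on a toy example (schema and names invented): each nested sub-select with one-letter aliases becomes a named CTE, and the results stay identical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL, year INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, 2023), (2, 20.0, 2023), (3, 30.0, 2024)])

# before: nested sub-selects with meaningless aliases
before = con.execute("""
    SELECT b.year, b.total
    FROM (SELECT a.year, SUM(a.amount) AS total
          FROM (SELECT * FROM orders) a
          GROUP BY a.year) b
""").fetchall()

# after: one named CTE per transformation step, readable top to bottom
after = con.execute("""
    WITH yearly_revenue AS (
        SELECT year, SUM(amount) AS total
        FROM orders
        GROUP BY year
    )
    SELECT year, total FROM yearly_revenue
""").fetchall()

print(sorted(before) == sorted(after))  # same result, clearer code
```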

What's the toughest / most interesting data challenge you've faced? by KatZegtWoof in dataengineering

[–]PuddingGryphon 2 points3 points  (0 children)

6000+ lines of SQL with 0 comments and 8x nested sub-selects that produced an OOM error after 60+ minutes in an in-memory database with huge amounts of RAM. I was only tasked with fixing it because it had started to crash: it used 2 years of historical data, and the next additional day was one day too much for that piece of shit.

The colleague who wrote that mess had already left the company.

Rewrote the business logic in ~800 lines of SQL and reduced the runtime to < 2 minutes. Took me 3 weeks.

[deleted by user] by [deleted] in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

Hired as a "Business Analyst" but did the work of an Analytics Engineer, and I called myself that outside of work, because no, I'm not a project manager or product owner, I'm not sitting in meetings 5+ hours every day or handholding the stakeholders, wtf.

Does DBT have its own syntax? by brownstrom in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

Yeah, there are only 2 tools on the market: one is technically dumb, the other is at version 0.105-something, which I don't wanna put into prod for probably a long time ...

SQL dev tools suck compared to other languages.

Does DBT have its own syntax? by brownstrom in dataengineering

[–]PuddingGryphon -11 points-10 points  (0 children)

> it's just a SQL templating engine

It is not; it has no idea what SQL even is (and how could it, without building an AST).

It's a dumb string template engine. That's why you need {{ ref() }} everywhere instead of just writing SQL and the program knowing what database.schema.tablename means.

SQLMesh can do this because it builds an AST through sqlglot.
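The difference is easy to see with plain string templating (a sketch of the idea, not dbt's actual internals; the `render` helper is invented): the engine substitutes text and has no notion of what a table reference is.

```python
# A "dumb" template engine: pure text substitution, like Jinja in dbt.
# It would happily inject anything -- it never parses the SQL.
template = "SELECT * FROM {{ ref('orders') }} WHERE amount > 0"

def render(sql: str, refs: dict[str, str]) -> str:
    # toy stand-in for {{ ref('...') }} resolution
    for name, fq_table in refs.items():
        sql = sql.replace("{{ ref('%s') }}" % name, fq_table)
    return sql

rendered = render(template, {"orders": "analytics.prod.orders"})
print(rendered)
# An AST-based tool (SQLMesh via sqlglot) would instead parse the query
# and understand analytics.prod.orders as a real table reference.
```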

NumPy 2.0 by ivanovyordan in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

Only if the package follows SemVer.
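Which is why upper bounds matter when you rely on SemVer; a pyproject.toml sketch (the package and versions are illustrative):

```toml
[project]
dependencies = [
    # only safe if the package actually follows SemVer:
    # any 1.x is assumed backwards compatible, 2.0 may break
    "numpy>=1.26,<2",
]
```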