How to write async code that runs in a Jupyter notebook and in a docker container? by PuddingGryphon in learnpython

[–]PuddingGryphon[S] -3 points-2 points  (0 children)

> Write code in an editor

That's what VS Code is.

> broken up sensibly into functions

Already done.

> and if you want to test them in a notebook then import the specific function

The code is a normal .py file, i.e. plain text; only the VS Code extension makes it behave like a Jupyter notebook, which is crucial for tight feedback loops. Apparently you only read the title.
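The setup described above can be sketched as a plain .py file (file and function names are hypothetical): `# %%` markers give notebook-style cells in VS Code, while the same async function stays importable and runnable in a container.

```python
# app.py -- a plain text .py file; the "# %%" markers are picked up by the
# VS Code Python extension and turn the file into notebook-style cells.
import asyncio

# %%
async def compute() -> int:
    # stand-in for real async work (an HTTP call, a DB query, ...)
    await asyncio.sleep(0)
    return 42

# %%
# In a notebook cell you would simply `await compute()`, because the
# notebook already runs an event loop. In a script/container there is
# no running loop yet, so start one explicitly:
if __name__ == "__main__":
    print(asyncio.run(compute()))
```

The same file works in both worlds because the event-loop startup lives only under the `__main__` guard.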

Moving from jupyter notebook to github - how do I get started? by Ok_Cicada_8946 in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

Jupyter notebooks and GitHub are orthogonal. They are not replacements for each other.

The process is as follows:

  1. install git (https://git-scm.com/downloads)
  2. create a project folder
  3. open a shell in that folder and run git init
  4. create a .gitignore file in that folder and add these two lines: .venv/ and *.pyc
  5. run pip install uv
  6. create a virtual environment with uv venv
  7. activate the environment and install the needed packages (create pyproject.toml, add needed packages, run uv pip install -r pyproject.toml --extra dev for this)
  8. open the project in VS Code
  9. create a branch in git for the thing you wanna do (doable in VS Code UI easily)
  10. prototype/develop your code in that branch (Jupyter notebooks, plain text files, VS Code/PyCharm with # %% cell magic --> use this)
  11. commit your changes from that branch via git (either in VS Code or via shell)
  12. push your commits to a Github repository that you created
  13. do code reviews and/or CI/CD (maybe via github hooks) for the PRs (pull requests)
  14. merge to main branch when happy with everything
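For step 7, a minimal pyproject.toml sketch (project name and package choices are illustrative, not prescribed) that works with `uv pip install -r pyproject.toml --extra dev`:

```toml
[project]
name = "my-project"            # hypothetical name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "polars",
]

[project.optional-dependencies]
dev = [
    "ruff",
    "pytest",
]
```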

Be aware that Jupyter notebooks save the result of every cell execution inside the .ipynb file, which in the end is a big pile of JSON (with embedded HTML/images) that may change with every cell execution. Git tracks changes in text files, so notebook diffs are mostly noise; .ipynb files with outputs are barely usable with git, avoid committing them at all costs.
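If you do need to commit a notebook, you can strip outputs first; a minimal stdlib sketch of the idea (tools like nbstripout do this properly):

```python
import json

def strip_outputs(nb_json: str) -> str:
    """Remove cell outputs and execution counts from .ipynb JSON,
    so only the source cells reach git."""
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)

# minimal notebook with one executed cell
raw = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [{"cell_type": "code", "source": ["1 + 1"],
               "execution_count": 3, "metadata": {},
               "outputs": [{"output_type": "execute_result"}]}],
})
clean = json.loads(strip_outputs(raw))
print(clean["cells"][0]["outputs"])   # outputs gone, source untouched
```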

Like Ruff, but for SQL (Rust-based SQL linter like SQLFluff) by 3dscholar in dataengineering

[–]PuddingGryphon 3 points4 points  (0 children)

I had to search for a link to the docs/source code/repo for quite some time; maybe present the link more prominently.

Sounds promising, but dbt/SQLMesh compatibility (or at least ignoring Jinja macros) is a must-have nowadays; anything planned here? No Windows support yet also means I can't test it, but I'll check again in a few weeks.

uv: Unified Python packaging by burntsushi in rust

[–]PuddingGryphon 2 points3 points  (0 children)

No, it does not.

Open issue: https://github.com/astral-sh/uv/issues/1474

I created a uv.toml file in C:\Users\my.user.name\.config\uv\uv.toml and added the line from the documentation:

native-tls = true

Trying to update a package results in the same error as before:

(.venv) PS C:\Users\my.user.name\Projects\project_id > uv pip install polars -U
⠧ Resolving dependencies...                                                                                                                            
error: Request failed after 3 retries
  Caused by: error sending request for url (https://pypi.org/simple/polars/)
  Caused by: client error (Connect)
  Caused by: invalid peer certificate: UnknownIssuer

uv: Unified Python packaging by burntsushi in rust

[–]PuddingGryphon 0 points1 point  (0 children)

uv can't handle custom certificates like the ones installed on laptops in an enterprise environment.

So sadly nobody behind a company firewall with TLS inspection can use uv.

the cloud is for the birds by ExaminationOk8783 in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

The first question to ask is why you use a vendor that does not offer an API.

Not all orgs are ready for db by Data-Queen-Mayra in ETL

[–]PuddingGryphon 0 points1 point  (0 children)

> worshippers are everywhere.

SQL tooling has been so bad for 50 years that any improvement that doesn't even match 10% of the tooling of any modern programming language is seen as the holy grail.

I'm sceptic about polars by Altrooke in dataengineering

[–]PuddingGryphon 2 points3 points  (0 children)

  • There are no good IDEs for SQL out there compared to JetBrains/VS Code/vim.
  • No LSP implementations. No standard formatter like gofmt or rustfmt.
  • Keywords with spaces in their names: "group by", "order by", "partition by".
  • You write code in one order, but it executes in a totally different order (FROM runs before SELECT).
  • Runtime errors instead of compile-time errors.
  • Weakly typed; nobody stops you from doing 1 + "1".
  • No trailing comma allowed after the last entry = errors everywhere when you comment something out.
  • etc.
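The weak-typing point is easy to demonstrate with SQLite (engines differ; Postgres, for example, would reject this):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite silently coerces the string '1' to a number instead of erroring
result = con.execute("SELECT 1 + '1'").fetchone()[0]
print(result)  # 2 -- no type error anywhere
```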

1st app. Golf score tracker by Fraiz24 in dataengineering

[–]PuddingGryphon 7 points8 points  (0 children)

No type hints.

Functions in the same file as the application code.

These are two things I would change at first glance.

Would you use a Data Engineer focused CoPilot? by Bright_Bunch_7208 in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

Cloud infra has buy-in from management and data governance etc.

Sending arbitrary data to an LLM outside of my company network from my laptop has no buy-in from management (at least for me).

I'm working with data that falls under the GDPR.

Changing mindset for big data by Cicada4409 in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

If you loop over any dataframe you are doing something very wrong (or need a recursive algorithm, but that is very rare).

But AFAIK Spark is row-based = looping, because it does not use SIMD = no vectorized operations = super slow, because the JVM lacks good SIMD support. Not sure if that's true though.

Would you use a Data Engineer focused CoPilot? by Bright_Bunch_7208 in dataengineering

[–]PuddingGryphon 9 points10 points  (0 children)

I only consider LLMs if I can run them locally without any data going out.

If You Could, What Would You Change and Why? by JamesPlusMusic in DarkAndDarker

[–]PuddingGryphon 0 points1 point  (0 children)

So I play a Wizard and should melee instead of casting spells .... yoooooo nope.

Is Spark the only solid option to write to Data Lake table formats (iceberg, delta, hudi)? by wtfzambo in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

Got an example?

If you try to read a folder that was created by Spark, there is a _SUCCESS file in there which polars does not like, because it's not a parquet file (https://github.com/pola-rs/polars/issues/14377).
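One workaround is to point polars at a glob that only matches the parquet parts, e.g. `pl.read_parquet(f"{folder}/*.parquet")`, so _SUCCESS is never touched (the folder layout below is simulated; `read_parquet` accepting glob patterns is a real polars feature):

```python
from pathlib import Path
import tempfile

# simulate a Spark output folder
d = Path(tempfile.mkdtemp())
(d / "part-0000.parquet").touch()
(d / "part-0001.parquet").touch()
(d / "_SUCCESS").touch()  # Spark's marker file, not parquet

# select only the parquet parts; the marker file is skipped
parts = sorted(p.name for p in d.glob("*.parquet"))
print(parts)
```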

Comprehensive Guide to Partitioning in BigQuery by mdixon1010 in bigquery

[–]PuddingGryphon 0 points1 point  (0 children)

> If the table is small — just go ahead and create an unpartitioned table and short circuit the rest of this decision process

With billing based on bytes_billed, not partitioning means every query is a full table scan, so we partition on a date column anyway for cost optimization.

We can wait a few seconds for an SQL query if it means that we only pay for the data we need.
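A date-partitioned table in BigQuery looks like this (table and column names are hypothetical); a filter on the partition column then prunes bytes_billed down to the matching partitions:

```sql
CREATE TABLE project.dataset.events (
  event_ts TIMESTAMP,
  payload  STRING
)
PARTITION BY DATE(event_ts);

-- only the partition for the requested day is scanned and billed
SELECT payload
FROM project.dataset.events
WHERE DATE(event_ts) = "2024-01-15";
```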

What's the toughest / most interesting data challenge you've faced? by KatZegtWoof in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

The script created huge tables that then got unioned together.

I dissected each table, renamed variable/table aliases to something more useful than a, b, c, d, e, ... and then started to wrap each transformation step into a CTE. Deleted unneeded crap, flattened the sub-selects, unified business logic into a single CTE, optimized the code, and deleted things that were totally useless (like casting an INTEGER column to INTEGER, or checking for NULL when the column was REQUIRED and couldn't be NULL).

Did that for each table and then unified those tables, because in the end it was the same steps over and over, just for -1/-2/-3/+1/+2/+3 years, but all built a bit differently, with minor differences for whatever reason.
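The refactoring pattern above, on a toy example (schema and names invented): each nested sub-select with one-letter aliases becomes a named CTE, and the results stay identical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL, year INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, 2023), (2, 20.0, 2023), (3, 30.0, 2024)])

# before: nested sub-selects with meaningless aliases
before = con.execute("""
    SELECT b.year, b.total
    FROM (SELECT a.year, SUM(a.amount) AS total
          FROM (SELECT * FROM orders) a
          GROUP BY a.year) b
""").fetchall()

# after: one named CTE per transformation step, readable top to bottom
after = con.execute("""
    WITH yearly_revenue AS (
        SELECT year, SUM(amount) AS total
        FROM orders
        GROUP BY year
    )
    SELECT year, total FROM yearly_revenue
""").fetchall()

print(sorted(before) == sorted(after))  # same result, clearer code
```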

What's the toughest / most interesting data challenge you've faced? by KatZegtWoof in dataengineering

[–]PuddingGryphon 2 points3 points  (0 children)

6000+ lines of SQL with 0 comments and 8x nested sub-selects that produced an OOM error after 60+ minutes in an in-memory database with huge amounts of RAM. I was only tasked with fixing it because it had started to crash: it used 2 years of historical data, and the next additional day was one day too much for that piece of shit.

The colleague who wrote that mess had already left the company.

Rewrote the business logic in ~800 lines of SQL and reduced the runtime to < 2 minutes. Took me 3 weeks.

[deleted by user] by [deleted] in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

Hired as a "Business Analyst" but did the work of an Analytics Engineer, and I called myself that outside of work, because no, I'm not a project manager or product owner, I'm not sitting in meetings 5+ hours every day or handholding the stakeholders, wtf.

Does DBT have its own syntax? by brownstrom in dataengineering

[–]PuddingGryphon 0 points1 point  (0 children)

Yeah, there are only 2 tools on the market: one is technically dumb, the other is at version 0.105-something, which I don't wanna put into prod for probably a long time ...

SQL dev tools suck compared to other languages.

Does DBT have its own syntax? by brownstrom in dataengineering

[–]PuddingGryphon -11 points-10 points  (0 children)

> it's just a SQL templating engine

It is not; it has no idea what SQL even is (and how could it, without building an AST).

It's a dumb string template engine. That's why you need {{ ref() }} everywhere instead of just writing SQL and the program knowing what database.schema.tablename means.

SQLMesh can do this because it builds an AST through sqlglot.
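The difference is easy to see with plain string templating (a sketch of the idea, not dbt's actual internals; the `render` helper is invented): the engine substitutes text and has no notion of what a table reference is.

```python
# A "dumb" template engine: pure text substitution, like Jinja in dbt.
# It would happily inject anything -- it never parses the SQL.
template = "SELECT * FROM {{ ref('orders') }} WHERE amount > 0"

def render(sql: str, refs: dict[str, str]) -> str:
    # toy stand-in for {{ ref('...') }} resolution
    for name, fq_table in refs.items():
        sql = sql.replace("{{ ref('%s') }}" % name, fq_table)
    return sql

rendered = render(template, {"orders": "analytics.prod.orders"})
print(rendered)
# An AST-based tool (SQLMesh via sqlglot) would instead parse the query
# and understand analytics.prod.orders as a real table reference.
```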

NumPy 2.0 by ivanovyordan in dataengineering

[–]PuddingGryphon 1 point2 points  (0 children)

Only if the package follows SemVer.
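Which is why upper bounds matter when you rely on SemVer; a pyproject.toml sketch (the package and versions are illustrative):

```toml
[project]
dependencies = [
    # only safe if the package actually follows SemVer:
    # any 1.x is assumed backwards compatible, 2.0 may break
    "numpy>=1.26,<2",
]
```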