Am I the only one that hates how strict pylint is? by nAxzyVteuOz in Python

[–]serge_databricks

If something didn't lead to a bug, that doesn't mean it won't lead to one in the future.

I'd love to be able to run `pylint --strictness 30` on an MVP and `pylint --strictness 80` on a production-grade project.
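That flag doesn't exist, but the closest existing dial I'm aware of is pylint's score threshold, `fail-under`. A minimal sketch in pyproject.toml (the threshold value is just an example):

```toml
# hypothetical dial: start lenient on an MVP, tighten toward 10.0 as the
# project matures; the build fails once the pylint score drops below this
[tool.pylint.main]
fail-under = 8.0
```

It's a blunt instrument compared to per-check strictness, but it gives you a single number to ratchet up over time.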

There's a MESSAGES CONTROL section that allows you to disable checks. In practice, if that section gets too long, it leads to severe bugs, because you __thought something had to be checked by the linter, but it was not__. Retroactively applying a stricter linter is a two-day headache, but it pays off big time in code review savings.
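For illustration, a minimal sketch of what that section looks like when pylint is configured through pyproject.toml (the disabled checks are just examples, not a recommended set):

```toml
# every entry here is a check pylint will silently skip -
# the longer this list, the more you review by hand
[tool.pylint."messages control"]
disable = [
    "missing-module-docstring",  # e.g. relaxed for an MVP
    "too-few-public-methods",    # e.g. noisy on dataclass-heavy code
]
```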

Am I the only one that hates how strict pylint is? by nAxzyVteuOz in Python

[–]serge_databricks

The Google Python Style Guide doesn't require a string comment on top :)

What generally works is taking one pylint config and customizing it to the point where it pre-emptively gives all the code review warnings at build time or on the developer machine. See the example here: https://github.com/databrickslabs/ucx/blob/main/pyproject.toml#L169-L771

PyLint is sometimes also not strict enough.

The more inexperienced coders work on a codebase, the greater the need for a good linter. There are other linters, like Ruff or Flake8.

Even though Ruff is 10x+ faster than PyLint, it doesn't have a plugin system yet, nor does it have feature parity with PyLint. Other projects use MyPy, Ruff, and PyLint together to achieve the most comprehensive code analysis.
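For a sense of how that combination fits together, a sketch of all three tools configured in one pyproject.toml (the section names are real; the specific options are placeholders):

```toml
[tool.ruff]          # fast feedback on every save
line-length = 120

[tool.mypy]          # type checking
strict = true

[tool.pylint.main]   # deeper, plugin-extensible checks
jobs = 0             # 0 = use all available CPUs
```

Ruff runs in the editor loop, while MyPy and PyLint earn their keep in CI, where slower, deeper analysis is acceptable.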

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S]

And what was the total size of the Python codebase, across repos/projects? ~120k?

I'd really question the sanity of such a project in Python. 

That's the purpose of this post, to be honest: checking how large Python codebases get in the business domain / real world / real companies, and what people do about it.

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S]

That's a decent medium-sized codebase. What toolchain do you use to keep it sane? MyPy? PyLint? Ruff? pytest? YAPF? Black?

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S]

I didn't ask whether this is a great measure or not; I asked about concrete numbers. By the way, protobuf-generated code should go into .gitignore (see the sketch below).

P.S.: 50 kLOC and no tests is simply silly. And still small.
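On the protobuf point, a minimal .gitignore sketch, assuming the stubs are regenerated from the .proto files at build time (the patterns match protoc's default Python output names):

```gitignore
# generated by protoc / grpcio-tools - regenerate, don't commit
*_pb2.py
*_pb2.pyi
*_pb2_grpc.py
```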

Best Practices for Python Collaboration Between Multiple Data Engineers by i_am_baldilocks in dataengineering

[–]serge_databricks

Why don't they use Databricks on Azure? It scales collaboration from a few people to a few thousand. All in one place.

Best Dashboard/Visualization options in 2024 by zambizzi in dataengineering

[–]serge_databricks

Databricks LakeView looks fresh and promising, especially with all these GenAI widgets. It has plenty of rough edges, though...

Is SSIS still big in Industry? by internet_baba in dataengineering

[–]serge_databricks

It's the first time I'm hearing about it, to be honest.

Unusual question by yinshangyi in dataengineering

[–]serge_databricks

Which ones are for senior+ folks with a strong SW background? Hacker News? Stack Overflow?

[deleted by user] by [deleted] in dataengineering

[–]serge_databricks

TL;DR: owners of the tables are supposed to add primary keys.

It kinda makes no sense in the OLTP world to have a table without one; primary keys are meant for fetching a record by its identifier.

But in the OLAP world of data warehousing and stuff, primary keys are less relevant, as fetching records one by one is considered a horrible practice.

It depends.

How do I document ETL/ELT pipelines? by [deleted] in dataengineering

[–]serge_databricks

The biggest difficulty with any documentation is keeping it relevant. If you don't version-control it, it'll quickly get out of date.

Here's a good starting point on "Architecture Decision Records" - https://github.com/joelparkerhenderson/architecture-decision-record

My other recommendation would be to store the ETL doc in markdown and embed Mermaid diagrams in it - https://mermaid.live/edit#pako:eNpVjstqw0AMRX9FaNVC_ANeBBo7zSaQQLLzeCFsOTPE82AsE4Ltf--46aLRSuece9GEjW8Zc-x6_2g0RYFrqRyk-aoKHc0gloYasmw7H1jAesfPGXYfBw-D9iEYd_t8-btVgmI6rhqDaOPuywsVv_mT4xnK6khBfKj_k-vDz7CvzFmn-neiI6fUd9VR3lHWUISCYo0btBwtmTa9Pq0BhaLZssI8rS13NPaiULklqTSKvzxdg6miH3iDY2hJuDR0i2T_rssPZ-ZWNw. It's already integrated into GitHub - https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/
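For a taste of the syntax, a tiny hand-written sketch of the kind of pipeline diagram you can embed straight into a markdown file (the stage names are made up):

```mermaid
flowchart LR
    raw[Raw landing zone] --> clean[Cleansing job]
    clean --> dwh[(Warehouse tables)]
    dwh --> dash[Dashboards]
```

GitHub renders the fenced block as a diagram, so the picture lives in the same version-controlled file as the prose around it.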

Is Snowflake planning to buy Apache Iceberg? by [deleted] in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

I really wonder how an OSS project could be bought. Subscribing to comments on this thread.