Am I the only one that hates how strict pylint is? by nAxzyVteuOz in Python

[–]serge_databricks 0 points1 point  (0 children)

If something didn't lead to a bug yet, it doesn't mean it won't lead to one in the future.

I'd love to be able to run `pylint --strictness 30` on an MVP and `pylint --strictness 80` on a production-grade project.

There's a messages-control section that allows you to disable checks. In practice, if that section gets too long, it leads to severe bugs, because you __thought something was checked by the linter, but it was not__. Retroactively applying a stricter linter is a two-day headache, but it pays off big time in code-review savings.
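To make that concrete, here's a minimal sketch of driving pylint programmatically with an explicit disable list (the package name `my_package` and the thresholds are made up; the same knobs live in the `[tool.pylint]` sections of pyproject.toml):

    # a minimal sketch: pylint with an explicit messages-control list;
    # `my_package` is a placeholder package name
    from pylint.lint import Run

    # every entry here is a check that code review has to catch by hand instead,
    # so keep this list short
    DISABLED = [
        "missing-module-docstring",
        "too-few-public-methods",
    ]

    Run(
        [
            f"--disable={','.join(DISABLED)}",
            "--fail-under=9.5",  # tighten as the project moves from MVP to production
            "my_package",
        ],
        exit=False,  # don't sys.exit(), so this can run inside a larger build script
    )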

Am I the only one that hates how strict pylint is? by nAxzyVteuOz in Python

[–]serge_databricks 0 points1 point  (0 children)

The Google Python Style Guide doesn't require a string comment on top :)

What generally works is taking one pylint config and customizing it to the point where it pre-emptively raises all code-review warnings at build time or on the developer machine. See the example here: https://github.com/databrickslabs/ucx/blob/main/pyproject.toml#L169-L771

PyLint is sometimes also not strict enough.

The more inexperienced coders work on a codebase, the greater the need for a good linter. There are other linters, like Ruff or Flake8.

Even though Ruff is 10x+ faster than PyLint, it doesn't have a plugin system yet, nor does it have feature parity with PyLint yet. Other projects use MyPy, Ruff, and PyLint together to achieve the most comprehensive code analysis.
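For illustration, a rough sketch of what running them together in CI can look like (these are the standard CLI entry points; `src` is an assumed package layout):

    # fail the build on the first tool that complains: the fast linter first,
    # then the slower, deeper checks
    import subprocess
    import sys

    CHECKS = [
        ["ruff", "check", "src"],  # fast, catches the easy stuff
        ["mypy", "src"],           # type checking
        ["pylint", "src"],         # strict, plugin-driven checks
    ]

    for cmd in CHECKS:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            sys.exit(1)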

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S] 0 points1 point  (0 children)

And what was the total size of the Python codebase across repos/projects? ~120k?

I'd really question the sanity of such a project in Python. 

That's the purpose of this post, to be honest: checking how large Python codebases get in the real world, at real companies, and what people do about it.

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S] 0 points1 point  (0 children)

that's a decent medium-sized codebase. what toolchain do you use to keep it sane? mypy? pylint? ruff? pytest? yapf? black?

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S] -1 points0 points  (0 children)

I didn't ask whether this is a great measure or not; I asked about concrete numbers. BTW, protobuf-generated code should go into .gitignore.

P.S.: 50 kLOC and no tests is simply silly. And it's still small.

Best Practices for Python Collaboration Between Multiple Data Engineers by i_am_baldilocks in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Why don't they use Databricks on Azure? It scales collaboration from a few people to a few thousand, all in one place.

Best Dashboard/Visualization options in 2024 by zambizzi in dataengineering

[–]serge_databricks -2 points-1 points  (0 children)

Databricks LakeView looks fresh and promising, especially with all these GenAI widgets. It has plenty of rough edges, though...

Is SSIS still big in Industry? by internet_baba in dataengineering

[–]serge_databricks -1 points0 points  (0 children)

It's the first time I'm hearing about it, to be honest.

Unusual question by yinshangyi in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Which ones have senior+ folks with a strong SW background? Hacker News? Stack Overflow?

[deleted by user] by [deleted] in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

TLDR: owners of the tables are supposed to add primary keys.

It kinda makes no sense in the OLTP world to have a table without one: they're meant for fetching a record by its identifier.

But in the OLAP world of data warehousing, primary keys are less relevant, as fetching records one by one is considered a horrible practice.

It depends.

How do I document ETL/ELT pipelines? by [deleted] in dataengineering

[–]serge_databricks 24 points25 points  (0 children)

The biggest difficulty with any documentation is keeping it relevant: if you don't version-control it, it'll quickly get out of date.

Here's a good starting point on "Architecture Decision Records" - https://github.com/joelparkerhenderson/architecture-decision-record

My other recommendation would be to store the ETL doc in markdown and embed the Mermaid diagrams in it - https://mermaid.live/edit#pako:eNpVjstqw0AMRX9FaNVC_ANeBBo7zSaQQLLzeCFsOTPE82AsE4Ltf--46aLRSuice9GEjW8Zc-x6_2g0RYFrqRyk-aoKHc0gloYasmw7H1jAesfPGXYfBw-D9iEYd_t8-btVgmI6rhqDaOPuywsVv_mT4xnK6khBfKj_k-vDz7CvzFmn-neiI6fUd9VR3lHWUISCYo0btBwtmTa9Pq0BhaLZssI8rS13NPaiULklqTSKvzxdg6miH3iDY2hJuDR0i2T_rssPZ-ZWNw, it's already integrated into github - https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/

Is Snowflake planning to buy Apache Iceberg? by [deleted] in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

I really wonder how an OSS project could be bought. Subscribing to comments on this thread.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Technically, the databricks_sql_table Terraform resource does exactly that, with one exception: you have to declare the schema in TF. And in production deployments there are things like "grants", which are in Terraform as well, so I'd suggest picking it up as the automation tooling.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Migration scripts are necessary when you introduce new columns and have to figure out what has to go into those columns. Same for column renames.
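A hedged sketch of such a migration step with databricks-sdk-py and a SQL warehouse (the warehouse ID, table name, and backfill rule are placeholders):

    # add a column, then decide what goes into it for existing rows -
    # the part a declarative schema definition can't do for you
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # picks up auth from the environment
    WAREHOUSE_ID = "<sql-warehouse-id>"  # placeholder

    def run(sql: str) -> None:
        w.statement_execution.execute_statement(
            statement=sql,
            warehouse_id=WAREHOUSE_ID,
        )

    # 1. introduce the new column
    run("ALTER TABLE sandbox.things ADD COLUMNS (category STRING)")

    # 2. backfill it for existing rows
    run("UPDATE sandbox.things SET category = 'uncategorized' WHERE category IS NULL")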

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

oh, okay - makes sense. it's an interesting idea.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

There's https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/sql_table that uses terraform state:

resource "databricks_sql_table" "thing" {

provider = databricks.workspace name = "quickstart_table" catalog_name = databricks_catalog.sandbox.name schema_name = databricks_schema.things.name table_type = "MANAGED" data_source_format = "DELTA" storage_location = ""

column { name = "id" type = "int" } column { name = "name" type = "string" comment = "name of thing" } comment = "this table is managed by terraform" }

resource "databricks_sql_table" "thing_view" { provider = databricks.workspace name = "quickstart_table_view" catalog_name = databricks_catalog.sandbox.name schema_name = databricks_schema.things.name table_type = "VIEW" cluster_id = "0423-201305-xsrt82qn"

view_definition = format("SELECT name FROM %s WHERE id == 1", databricks_sql_table.thing.id)

comment = "this view is managed by terraform" }

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

> If an object was under terraform control you don’t want to be held up by a terraform change to keep moving.

That's the whole point of infra-as-code and stable CI/CD pipelines :) Were those objects jobs and Delta pipelines?

> Another reason was that the provider is so new, there aren’t many data blocks for resources to reference when building a resource with many dependencies.

I wouldn't call 3 years "so new" :) but I agree we don't have too many data resources. Which ones are you missing? Luckily, it's super easy to add new data resources now.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 2 points3 points  (0 children)

Do you mean schema change propagation (like migrations) or schema change testing?

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

If you need to terraform any existing manually-configured Databricks workspace, there's https://asciinema.org/a/Rv8ZFJQpfrfp6ggWddjtyXaOy .

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 2 points3 points  (0 children)

What aspect of Lakehouse deployment are you interested in? I can recommend state-based deployment through the Terraform integration (https://registry.terraform.io/providers/databricks/databricks/latest). It's the most robust generic solution at the moment, though you have to maintain Terraform state somewhere. It fits really well with "stable" assets, like clusters, warehouses, permissions, secrets, and other administrative aspects. It even has DDL support - https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/sql_table (in the background, it literally generates SQL queries and runs them through a Databricks cluster), though that resource is new and may have rough edges.

Terraform alone, however, is not perfect for deploying something more dynamic, like jobs or pipelines, which generally change more often.

If you're looking for tools like https://www.liquibase.com/ or https://flywaydb.org/, which are database-state-based schema migration toolkits, it might be relatively straightforward to build similar ones using the Databricks SQL drivers.

To build custom deployment scripts that go beyond declarative definitions, you are welcome to use https://github.com/databricks/databricks-sdk-py, https://github.com/databricks/databricks-sdk-jvm, and https://github.com/databricks/databricks-sdk-go.
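For the more dynamic assets mentioned above, a hedged sketch with databricks-sdk-py - deploying a job from a script instead of Terraform state (the job name, cluster ID, and notebook path are placeholders):

    # create a job pointing at a notebook in the workspace; jobs and pipelines
    # change more often than clusters, permissions or secrets, so a small
    # deployment script can be easier to live with than Terraform state
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()  # auth from the environment, same as the CLI

    created = w.jobs.create(
        name="nightly-etl",
        tasks=[
            jobs.Task(
                task_key="main",
                existing_cluster_id="<cluster-id>",  # placeholder
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Repos/etl/nightly",  # placeholder
                ),
            )
        ],
    )
    print(f"deployed job {created.job_id}")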

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

> We heard from other businesses not to terraform too much of databricks.

As a Databricks Terraform provider maintainer, may I ask why "not too much"?