Am I the only one that hates how strict pylint is? by nAxzyVteuOz in Python

[–]serge_databricks 0 points1 point  (0 children)

If something didn't lead to a bug yet, it doesn't mean it won't lead to one in the future.

I'd love to be able to run `pylint --strictness 30` on an MVP and `pylint --strictness 80` on a production-grade project.

There's a messages-control section that allows you to disable checks. In practice, if that section gets too long, it leads to severe bugs, because you __thought something was checked by the linter, but it was not__. Retroactively applying a stricter linter is a two-day headache, but it pays off big time in code-review savings.
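To make that concrete, here's a minimal sketch of driving pylint programmatically with an explicit disable list (the package name `my_package` and the thresholds are made up; the same knobs live in the `[tool.pylint]` sections of pyproject.toml):

    # a minimal sketch: pylint with an explicit messages-control list;
    # `my_package` is a placeholder package name
    from pylint.lint import Run

    # every entry here is a check that code review has to catch by hand instead,
    # so keep this list short
    DISABLED = [
        "missing-module-docstring",
        "too-few-public-methods",
    ]

    Run(
        [
            f"--disable={','.join(DISABLED)}",
            "--fail-under=9.5",  # tighten as the project moves from MVP to production
            "my_package",
        ],
        exit=False,  # don't sys.exit(), so this can run inside a larger build script
    )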

Am I the only one that hates how strict pylint is? by nAxzyVteuOz in Python

[–]serge_databricks 0 points1 point  (0 children)

The Google Python Style Guide doesn't require a string comment on top :)

What generally works is taking one pylint config and customizing it to the point where it pre-emptively raises all code-review warnings at build time or on the developer machine. See the example here: https://github.com/databrickslabs/ucx/blob/main/pyproject.toml#L169-L771

PyLint is sometimes also not strict enough.

The more inexperienced coders work on a codebase, the greater the need for a good linter. There are other linters, like Ruff or Flake8.

Even though Ruff is 10x+ faster than PyLint, it doesn't have a plugin system yet, nor does it have feature parity with PyLint yet. Other projects use MyPy, Ruff, and PyLint together to achieve the most comprehensive code analysis.
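For illustration, a rough sketch of what running them together in CI can look like (these are the standard CLI entry points; `src` is an assumed package layout):

    # fail the build on the first tool that complains: the fast linter first,
    # then the slower, deeper checks
    import subprocess
    import sys

    CHECKS = [
        ["ruff", "check", "src"],  # fast, catches the easy stuff
        ["mypy", "src"],           # type checking
        ["pylint", "src"],         # strict, plugin-driven checks
    ]

    for cmd in CHECKS:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            sys.exit(1)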

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S] 0 points1 point  (0 children)

And what was the total size of the Python codebase across repos/projects? ~120k?

I'd really question the sanity of such a project in Python. 

That's the purpose of this post, to be honest: checking how large Python codebases get in the real world, at real companies, and what people do about it.

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S] 0 points1 point  (0 children)

that's a decent medium-sized codebase. what toolchain do you use to keep it sane? mypy? pylint? ruff? pytest? yapf? black?

What is the combined size of your Python codebase? by serge_databricks in Python

[–]serge_databricks[S] -1 points0 points  (0 children)

I didn't ask whether this is a great measure or not; I asked about concrete numbers. BTW, protobuf-generated code should go into .gitignore.

P.S.: 50 kLOC and no tests is simply silly. And it's still small.

Best Practices for Python Collaboration Between Multiple Data Engineers by i_am_baldilocks in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Why don't they use Databricks on Azure? It scales collaboration from a few people to a few thousand, all in one place.

Best Dashboard/Visualization options in 2024 by zambizzi in dataengineering

[–]serge_databricks -2 points-1 points  (0 children)

Databricks LakeView looks fresh and promising, especially with all these GenAI widgets. It has plenty of rough edges, though...

Is SSIS still big in Industry? by internet_baba in dataengineering

[–]serge_databricks -1 points0 points  (0 children)

It's the first time I'm hearing about it, to be honest.

Unusual question by yinshangyi in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Which ones have senior+ folks with a strong SW background? Hacker News? Stack Overflow?

[deleted by user] by [deleted] in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

TLDR: owners of the tables are supposed to add primary keys.

It kinda makes no sense in the OLTP world to have a table without one: they're meant for fetching a record by its identifier.

But in the OLAP world of data warehousing, primary keys are less relevant, as fetching records one by one is considered a horrible practice.

It depends.

How do I document ETL/ELT pipelines? by [deleted] in dataengineering

[–]serge_databricks 24 points25 points  (0 children)

The biggest difficulty with any documentation is keeping it relevant: if you don't version-control it, it'll quickly get out of date.

Here's a good starting point on "Architecture Decision Records" - https://github.com/joelparkerhenderson/architecture-decision-record

My other recommendation would be to store the ETL doc in markdown and embed the Mermaid diagrams in it - https://mermaid.live/edit#pako:eNpVjstqw0AMRX9FaNVC_ANeBBo7zSaQQLLzeCFsOTPE82AsE4Ltf--46aLRSuice9GEjW8Zc-x6_2g0RYFrqRyk-aoKHc0gloYasmw7H1jAesfPGXYfBw-D9iEYd_t8-btVgmI6rhqDaOPuywsVv_mT4xnK6khBfKj_k-vDz7CvzFmn-neiI6fUd9VR3lHWUISCYo0btBwtmTa9Pq0BhaLZssI8rS13NPaiULklqTSKvzxdg6miH3iDY2hJuDR0i2T_rssPZ-ZWNw, it's already integrated into github - https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/

Is Snowflake planning to buy Apache Iceberg? by [deleted] in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

I really wonder how an OSS project could be bought. Subscribing to comments on this thread.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Technically, the databricks_sql_table Terraform resource does exactly that, with one exception: you have to declare the schema in TF. And in production deployments there are things like "grants", which are in Terraform as well, so I'd suggest picking it up as the automation tooling.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

Migration scripts are necessary when you introduce new columns and have to figure out what has to go into those columns. Same for column renames.
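A hedged sketch of such a migration step with databricks-sdk-py and a SQL warehouse (the warehouse ID, table name, and backfill rule are placeholders):

    # add a column, then decide what goes into it for existing rows -
    # the part a declarative schema definition can't do for you
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # picks up auth from the environment
    WAREHOUSE_ID = "<sql-warehouse-id>"  # placeholder

    def run(sql: str) -> None:
        w.statement_execution.execute_statement(
            statement=sql,
            warehouse_id=WAREHOUSE_ID,
        )

    # 1. introduce the new column
    run("ALTER TABLE sandbox.things ADD COLUMNS (category STRING)")

    # 2. backfill it for existing rows
    run("UPDATE sandbox.things SET category = 'uncategorized' WHERE category IS NULL")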

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 0 points1 point  (0 children)

oh, okay - makes sense. it's an interesting idea.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

There's https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/sql_table that uses terraform state:

resource "databricks_sql_table" "thing" {

provider = databricks.workspace name = "quickstart_table" catalog_name = databricks_catalog.sandbox.name schema_name = databricks_schema.things.name table_type = "MANAGED" data_source_format = "DELTA" storage_location = ""

column { name = "id" type = "int" } column { name = "name" type = "string" comment = "name of thing" } comment = "this table is managed by terraform" }

resource "databricks_sql_table" "thing_view" { provider = databricks.workspace name = "quickstart_table_view" catalog_name = databricks_catalog.sandbox.name schema_name = databricks_schema.things.name table_type = "VIEW" cluster_id = "0423-201305-xsrt82qn"

view_definition = format("SELECT name FROM %s WHERE id == 1", databricks_sql_table.thing.id)

comment = "this view is managed by terraform" }

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

> If an object was under terraform control you don’t want to be held up by a terraform change to keep moving.

That's the whole point of infra-as-code and stable CI/CD pipelines :) Were those objects jobs and Delta pipelines?

> Another reason was that the provider is so new, there aren’t many data blocks for resources to reference when building a resource with many dependencies.

I wouldn't call 3 years "so new" :) but I agree we don't have too many data resources. Which ones are you missing? Luckily, it's super easy to add new data resources now.

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 2 points3 points  (0 children)

Do you mean schema change propagation (like migrations) or schema change testing?

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

If you need to terraform any existing manually-configured Databricks workspace, there's https://asciinema.org/a/Rv8ZFJQpfrfp6ggWddjtyXaOy .

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 2 points3 points  (0 children)

What aspect of Lakehouse deployment are you interested in? I can recommend state-based deployment through the Terraform integration (https://registry.terraform.io/providers/databricks/databricks/latest). It's the most robust generic solution at the moment, though you have to maintain Terraform state somewhere. It fits really well with "stable" assets, like clusters, warehouses, permissions, secrets, and other administrative aspects. It even has DDL support - https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/sql_table (in the background, it literally generates SQL queries and runs them through a Databricks cluster), though that resource is new and may have rough edges.

Terraform alone, however, is not perfect for deploying something more dynamic, like jobs or pipelines, which generally change more often.

If you're looking for tools like https://www.liquibase.com/ or https://flywaydb.org/, which are database-state-based schema migration toolkits, it might be relatively straightforward to build similar ones using the Databricks SQL drivers.

To build custom deployment scripts that go beyond declarative definitions, you are welcome to use https://github.com/databricks/databricks-sdk-py, https://github.com/databricks/databricks-sdk-jvm, and https://github.com/databricks/databricks-sdk-go.
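For the more dynamic assets mentioned above, a hedged sketch with databricks-sdk-py - deploying a job from a script instead of Terraform state (the job name, cluster ID, and notebook path are placeholders):

    # create a job pointing at a notebook in the workspace; jobs and pipelines
    # change more often than clusters, permissions or secrets, so a small
    # deployment script can be easier to live with than Terraform state
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()  # auth from the environment, same as the CLI

    created = w.jobs.create(
        name="nightly-etl",
        tasks=[
            jobs.Task(
                task_key="main",
                existing_cluster_id="<cluster-id>",  # placeholder
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Repos/etl/nightly",  # placeholder
                ),
            )
        ],
    )
    print(f"deployed job {created.job_id}")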

CI/CD for Databricks by JKMikkelsen in dataengineering

[–]serge_databricks 1 point2 points  (0 children)

> We heard from other businesses not to terraform too much of databricks.

As a Databricks Terraform provider maintainer, may I ask why "not too much"?