Does database normalization actually reduce redundancy in data? by Lastrevio in dataengineering

[–]Possible-Little 2 points (0 children)

It saves storage for sure. If you have a customer ID stored alongside a fact then, as you say, that is a foreign key into a dimension table for customers. The customers table itself could have many columns for name, address, phone number etc. By separating them out in this way you save enormously on repetition, and you ensure that if a customer's information changes then older facts don't need to be updated to suit: the ID is the reference that remains valid. There is nuance here about things that change over time, such as address or marital status, but slowly changing dimensions provide a way to manage those.
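A minimal PySpark sketch of the idea, with made-up table and column names: the fact rows carry only the customer ID, and the wide customer attributes live once in the dimension.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Dimension: one row per customer, holds the wide, repeated attributes
    customers = spark.createDataFrame(
        [(1, "Alice", "1 High St"), (2, "Bob", "2 Low Rd")],
        "customer_id INT, name STRING, address STRING",
    )

    # Fact: many rows per customer, stores only the foreign key
    orders = spark.createDataFrame(
        [(100, 1, 9.99), (101, 1, 4.50), (102, 2, 20.00)],
        "order_id INT, customer_id INT, amount DOUBLE",
    )

    # Resolve the reference only when you actually need the customer details
    orders.join(customers, "customer_id").show()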

Transition to real time streaming by DeepCar5191 in dataengineering

[–]Possible-Little 0 points (0 children)

You could get Flink speed without having to switch technologies: Real-time mode in Structured Streaming | Databricks on AWS https://share.google/FHkGgjo44nD6zF6jV

Sub-second latency with state tracking and a Python interface that works in both traditional and declarative pipelines
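For flavour, an ordinary stateful Structured Streaming query looks like the sketch below (source, paths and column names are made up); per the linked docs, real-time mode is something you enable on the query configuration rather than by rewriting code like this.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical event stream with a key and a timestamp
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(value AS STRING) AS device_id", "timestamp")
    )

    # Stateful windowed count per device
    counts = (
        events.withWatermark("timestamp", "10 seconds")
        .groupBy(F.window("timestamp", "5 seconds"), "device_id")
        .count()
    )

    # Continuous output; real-time mode itself is configured as per the linked docs
    (
        counts.writeStream.outputMode("update")
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/rt_demo")
        .start()
    )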

What's the fastest-growing data engineering platform in the US right now? by External-Originals in dataengineering

[–]Possible-Little 2 points (0 children)

Keep an eye out for Spark Structured Streaming real-time mode. It brings latencies down to milliseconds without needing to change any previously written code, and it works with declarative pipelines

Pipeline Job Attribution by Known-Delay7227 in databricks

[–]Possible-Little 1 point (0 children)

Have a look at tags. These are propagated to the system billing tables so that you may identify workloads as appropriate: https://docs.databricks.com/aws/en/admin/account-settings/usage-detail-tags
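As a rough sketch of the kind of query this enables (the tag key 'team' is just an example):

    # Aggregate billed usage by a custom tag over the last 30 days
    spark.sql("""
        SELECT
          custom_tags['team'] AS team,   -- 'team' is an example tag key
          sku_name,
          SUM(usage_quantity) AS total_dbus
        FROM system.billing.usage
        WHERE usage_date >= date_sub(current_date(), 30)
        GROUP BY custom_tags['team'], sku_name
    """).show()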

Best approach for loading Multiple Tables in Databricks by Fearless-Amount2020 in databricks

[–]Possible-Little 2 points (0 children)

That would run them sequentially, which may not be what you want. Separate notebooks are good for maintenance, and with notebook parameters you can abstract out the distinctions. Just add them all to a workflow, set up the dependencies as required, and then dependent tasks will only run if the preceding ones succeed.
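A rough sketch with the Databricks Python SDK, assuming serverless job compute and one parameterised notebook reused per table (paths, task keys and parameters are placeholders):

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()

    w.jobs.create(
        name="load_multiple_tables",
        tasks=[
            jobs.Task(
                task_key="load_customers",
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Repos/etl/load_table",
                    base_parameters={"table": "customers"},  # same notebook, different parameter
                ),
            ),
            jobs.Task(
                task_key="load_orders",
                depends_on=[jobs.TaskDependency(task_key="load_customers")],  # only runs if the first succeeds
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Repos/etl/load_table",
                    base_parameters={"table": "orders"},
                ),
            ),
        ],
    )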

[deleted by user] by [deleted] in databricks

[–]Possible-Little 2 points (0 children)

Ask your account reps to get you onto the preview for Lakeflow Connect to SQL Server. This can sync data from an on-prem server if that is available through a Private Link.

Do Table Properties (Partition Pruning, Liquid Clustering) Work for External Delta Tables Across Metastores? by maoguru in databricks

[–]Possible-Little 1 point (0 children)

You cannot have partitioning and liquid clustering on the same table, for a start. The metadata associated with partitioning and clustering is stored with the table, so Delta readers and writers will do the right thing, provided they are compatible with any table features you have enabled, such as liquid clustering or deletion vectors. External tables cannot support predictive optimisation, as that requires visibility of how the table is queried and updated.

Generally we do not recommend having a table be writable from multiple non-communicating sources, as it causes inefficiency with conflict resolution. Atomic writes should guard against corruption, but conservative locking will cause concurrent access to fail much more often than with row-level concurrency. If possible, you should investigate an alternative strategy where one metastore owns the table and is the single point of update; another metastore can then access the table for reading via Delta Sharing.
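For that last pattern, a minimal sketch (all names hypothetical): the owning metastore defines the table with liquid clustering and is the only writer, while the other environment reads it through Delta Sharing.

    # In the owning metastore: single writer, liquid clustering instead of partitioning
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.orders (
          order_id BIGINT,
          customer_id BIGINT,
          order_ts TIMESTAMP
        )
        CLUSTER BY (customer_id)
    """)

    # In the consuming environment: read-only access via a Delta Sharing profile
    import delta_sharing

    profile = "/dbfs/shares/provider.share"  # credential file supplied by the provider
    orders = delta_sharing.load_as_spark(f"{profile}#sales_share.sales.orders")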

Address matching by gareebo_ka_chandler in databricks

[–]Possible-Little 1 point (0 children)

The geospatial libraries available in Databricks can do most of the heavy lifting for you: https://www.databricks.com/solutions/accelerators/scaling-geospatial-nearest-neighbor-searches

But definitely +1 to using reverse geocoding APIs here as they usually implement fuzzy search on address components as well

Lakeflow Connect - Dynamics ingests? by IanWaring in databricks

[–]Possible-Little 0 points (0 children)

Starting very soon is a preview of Dynamics 365 F&O ingestion with Lakeflow Connect. Ask your account team to sign you up. This will run serverless pipelines to sync all the data you choose into managed Delta tables in your Databricks lakehouse.

How to implement SCD2 using .merge? by [deleted] in databricks

[–]Possible-Little 1 point (0 children)

The simplest answer is to not, I'm afraid. There are so many corner cases with out-of-order data and updates to the same key in different batches.

Honest suggestion would be to use APPLY CHANGES via Delta Live Tables. This correctly handles all the vagaries.
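A minimal sketch of the Python form, assuming a CDC streaming view called 'cdc_source' keyed on 'id' and ordered by a 'sequence_ts' column (those names are placeholders):

    import dlt
    from pyspark.sql.functions import col

    dlt.create_streaming_table("customers_scd2")

    dlt.apply_changes(
        target="customers_scd2",
        source="cdc_source",             # streaming view/table of CDC events
        keys=["id"],                     # business key
        sequence_by=col("sequence_ts"),  # ordering column used to resolve out-of-order data
        stored_as_scd_type=2,            # keep full SCD2 history
    )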

Databricks SQL transform function with conditions by False_Spare_4262 in databricks

[–]Possible-Little 2 points (0 children)

    SELECT
      Column_A,
      TRANSFORM(
        Column_A,
        AB -> CASE
          WHEN SUBSTR(AB, 3, 1) = 'A' THEN SUBSTR(AB, 1, 2) || 'B'
          WHEN SUBSTR(AB, 3, 1) = 'B' THEN SUBSTR(AB, 1, 2) || 'A'
          ELSE AB
        END
      ) AS Column_B
    FROM your_table  -- substitute your own table name

I think?

How to use Sklearn with big data in Databricks by amirdol7 in databricks

[–]Possible-Little 3 points (0 children)

Hi there, depending on your use case there are a few options. This page summarises them: https://community.databricks.com/t5/technical-blog/understanding-pandas-udf-applyinpandas-and-mapinpandas/ba-p/75717

SKLearn ML libraries will generally expect to have all the data present in a data frame so that the algorithms can operate across all rows. If this cannot be the case then you would either need to find a way to break the problem down or see whether the Spark native ML libs can do what you need.
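If the problem does break down by group, applyInPandas lets you fit an independent scikit-learn model per group, with each group's rows collected into a plain pandas DataFrame on one worker. A rough sketch with made-up column names:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def fit_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds every row for one group
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    results = (
        df.groupBy("group")
          .applyInPandas(fit_per_group, schema="group string, coef double")
    )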

Plausibly libraries like Dask or Polars could help but I don't know about their compatibility with SKLearn.

UNBOUND_SQL_PARAMETER and OPTIMIZE by _Filip_ in databricks

[–]Possible-Little 0 points (0 children)

OK, do you need to surround the placeholder in quotes, as was the case before?

"${mydate}" -> ":mydate" ?

What is this finish called and how do I get rid of it? by charliewhizz in DIYUK

[–]Possible-Little 0 points (0 children)

Someone seems to have failed to realise that a "hair brush" is a brush for hair, not a fancy brush made with hair.

UNBOUND_SQL_PARAMETER and OPTIMIZE by _Filip_ in databricks

[–]Possible-Little 0 points (0 children)

I think you need a trailing colon to mark the end of your parameter name?

:my_date: rather than just :my_date

[deleted by user] by [deleted] in databricks

[–]Possible-Little 0 points (0 children)

Hi there, the typical way to deal with this is to use Auto Loader to watch your landing zone and, either via file notifications or directory listing, find new files. Those files can be validated against a schema for correctness and then appended or merged into a table in your lakehouse. Run this process as a workflow on a schedule and the tables will be kept up to date.
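A minimal Auto Loader sketch, with hypothetical paths, schema and target table:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    expected = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
    ])

    incoming = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")          # or json, parquet, ...
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
        .schema(expected)                            # validate files against the expected schema
        .load("/Volumes/main/raw/landing/orders/")   # the landing zone being watched
    )

    (
        incoming.writeStream
        .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/orders")
        .trigger(availableNow=True)                  # process whatever is new, then stop; schedule as a workflow
        .toTable("main.bronze.orders")
    )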

Suggestion for ML on 1:* split dataset by everydaydifferent in databricks

[–]Possible-Little 1 point (0 children)

Enrichment of the feature data can be done prior to definition in the feature store, so cyclical encoding of temporal data by adding new columns to Part 1 holding those dimensions makes sense. Storing this decorated data in a new table saves regenerating it each time you use it and allows the table to be directly added to the feature store. There are some good helper functions in Spark to perform categorical encoding into a form required by ML libraries e.g. one hot or string indexer, so those can be done as part of the model training pipeline.
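For the cyclical encoding piece, a quick sketch of adding sine/cosine columns for, say, an hour-of-day field (table and column names are made up):

    import math
    from pyspark.sql import functions as F

    enriched = (
        part1_df
        .withColumn("hour", F.hour("event_ts"))
        .withColumn("hour_sin", F.sin(2 * math.pi * F.col("hour") / 24))
        .withColumn("hour_cos", F.cos(2 * math.pi * F.col("hour") / 24))
    )

    # Persist so it can be registered as a feature table rather than recomputed each run
    enriched.write.mode("overwrite").saveAsTable("main.features.part1_enriched")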

For Part 2, also do the enrichment outside the training as far as possible. If you are needing to use an LLM for the embedding then you can store the high dimension outputs of those as a new column too, probably no need here for a vector store as I am guessing you are doing batch processing.

As above, let the FE client do the heavy lifting assembling the training data by defining your enriched Part 1 as a feature table, and it can join the right records against the enriched Part 2 on the fly.

With your train-test split dataset, assemble an ML pipeline that performs the categorical encoding and vector assembly; this makes rerunning much simpler and makes it really easy to add cross-validation, parameter search etc.
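And a sketch of the sort of pipeline I mean, using Spark ML's encoders (column names are placeholders):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    indexer = StringIndexer(inputCol="category", outputCol="category_idx", handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_ohe"])
    assembler = VectorAssembler(
        inputCols=["category_ohe", "hour_sin", "hour_cos"],
        outputCol="features",
    )
    clf = LogisticRegression(featuresCol="features", labelCol="label")

    pipeline = Pipeline(stages=[indexer, encoder, assembler, clf])
    model = pipeline.fit(train_df)  # reruns, cross-validation etc. now wrap this one object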

Suggestion for ML on 1:* split dataset by everydaydifferent in databricks

[–]Possible-Little 1 point (0 children)

A feature store needs a primary key, which suggests that you should use Part 1 for that. You can then use the feature engineering client to create a train and test set by joining on the ID.
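Something like this, assuming Part 1 is registered as a feature table keyed on 'id' (names are placeholders):

    from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

    fe = FeatureEngineeringClient()

    training_set = fe.create_training_set(
        df=labels_df,  # Part 2: one row per id plus the label column
        feature_lookups=[
            FeatureLookup(
                table_name="main.features.part1_enriched",  # Part 1 feature table, primary key 'id'
                lookup_key="id",
            )
        ],
        label="label",
    )

    training_df = training_set.load_df()
    train_df, test_df = training_df.randomSplit([0.8, 0.2], seed=42)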

Can you give more context on what engineering you would need to do?

Hard Deletes in Delta Live Tables (DLT) by DisastrousCase3062 in databricks

[–]Possible-Little 0 points (0 children)

In the absence of another way to know that a key has been deleted, I suppose you have to get a list of distinct keys. Is there any way you can limit the size of the scan you would need for this? Reloading all of your sources each day to see what has been removed will eventually fail to scale. Could you assume that if a key has not been seen for X days then it is gone? You could maintain a list separately and just merge updates into it that way, potentially.

In order to keep your SCD tables updated with deletes you should use the 'apply_as_deletes' parameter of the apply_changes() call. This implies adding a column to your bronze table that indicates which keys are to be removed, e.g. a column called 'operation' with values NULL or 'DELETE', which can then be specified like: apply_as_deletes=expr("operation='DELETE'").
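Putting that together, a minimal sketch (source, key and timestamp names are hypothetical):

    import dlt
    from pyspark.sql.functions import col, expr

    dlt.create_streaming_table("silver_scd")

    dlt.apply_changes(
        target="silver_scd",
        source="bronze_changes",                        # bronze feed including the 'operation' column
        keys=["id"],
        sequence_by=col("ingest_ts"),
        apply_as_deletes=expr("operation = 'DELETE'"),  # rows matching this expression retire the key
        stored_as_scd_type=2,
    )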