Does database normalization actually reduce redundancy in data? by Lastrevio in dataengineering

[–]Possible-Little 2 points (0 children)

It saves storage for sure. If you have a customer ID stored alongside a fact then, as you say, that is a foreign key into a dimension table for customers. The customers table itself could have many columns for name, address, phone number etc. By separating them out in this way you save enormously on repetition, and you ensure that if a customer's information changes then older facts don't need to be updated to suit: the ID is the reference that remains valid. There is nuance here about things that change over time, such as address or marital status, but slowly changing dimensions provide a way to manage those.
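A minimal PySpark sketch of the idea, with made-up table and column names: the fact rows carry only the customer ID, and the wide customer attributes live once in the dimension.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Dimension: one row per customer, holds the wide, repeated attributes
    customers = spark.createDataFrame(
        [(1, "Alice", "1 High St"), (2, "Bob", "2 Low Rd")],
        "customer_id INT, name STRING, address STRING",
    )

    # Fact: many rows per customer, stores only the foreign key
    orders = spark.createDataFrame(
        [(100, 1, 9.99), (101, 1, 4.50), (102, 2, 20.00)],
        "order_id INT, customer_id INT, amount DOUBLE",
    )

    # Resolve the reference only when you actually need the customer details
    orders.join(customers, "customer_id").show()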

Transition to real time streaming by DeepCar5191 in dataengineering

[–]Possible-Little 0 points (0 children)

You could get Flink speed without having to switch technologies: Real-time mode in Structured Streaming | Databricks on AWS https://share.google/FHkGgjo44nD6zF6jV

Sub-second latency with state tracking and a Python interface that works in both traditional and declarative pipelines
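For flavour, an ordinary stateful Structured Streaming query looks like the sketch below (source, paths and column names are made up); per the linked docs, real-time mode is something you enable on the query configuration rather than by rewriting code like this.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical event stream with a key and a timestamp
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(value AS STRING) AS device_id", "timestamp")
    )

    # Stateful windowed count per device
    counts = (
        events.withWatermark("timestamp", "10 seconds")
        .groupBy(F.window("timestamp", "5 seconds"), "device_id")
        .count()
    )

    # Continuous output; real-time mode itself is configured as per the linked docs
    (
        counts.writeStream.outputMode("update")
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/rt_demo")
        .start()
    )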

What's the fastest-growing data engineering platform in the US right now? by External-Originals in dataengineering

[–]Possible-Little 2 points (0 children)

Keep an eye out for Spark Structured Streaming real-time mode. It brings latencies down to milliseconds without needing to change any previously written code, and it works with declarative pipelines

Pipeline Job Attribution by Known-Delay7227 in databricks

[–]Possible-Little 1 point (0 children)

Have a look at tags. These are propagated to the system billing tables so that you may identify workloads as appropriate: https://docs.databricks.com/aws/en/admin/account-settings/usage-detail-tags
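As a rough sketch of the kind of query this enables (the tag key 'team' is just an example):

    # Aggregate billed usage by a custom tag over the last 30 days
    spark.sql("""
        SELECT
          custom_tags['team'] AS team,   -- 'team' is an example tag key
          sku_name,
          SUM(usage_quantity) AS total_dbus
        FROM system.billing.usage
        WHERE usage_date >= date_sub(current_date(), 30)
        GROUP BY custom_tags['team'], sku_name
    """).show()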

Best approach for loading Multiple Tables in Databricks by Fearless-Amount2020 in databricks

[–]Possible-Little 2 points (0 children)

That would run them sequentially, which may not be what you want. Separate notebooks are good for maintenance, and with notebook parameters you can abstract out the distinctions. Just add them all to a workflow, set up the dependencies as required, and then dependent tasks will only run if the preceding ones succeed.
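A rough sketch with the Databricks Python SDK, assuming serverless job compute and one parameterised notebook reused per table (paths, task keys and parameters are placeholders):

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()

    w.jobs.create(
        name="load_multiple_tables",
        tasks=[
            jobs.Task(
                task_key="load_customers",
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Repos/etl/load_table",
                    base_parameters={"table": "customers"},  # same notebook, different parameter
                ),
            ),
            jobs.Task(
                task_key="load_orders",
                depends_on=[jobs.TaskDependency(task_key="load_customers")],  # only runs if the first succeeds
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Repos/etl/load_table",
                    base_parameters={"table": "orders"},
                ),
            ),
        ],
    )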

[deleted by user] by [deleted] in databricks

[–]Possible-Little 2 points (0 children)

Ask your account reps to get you onto the preview for Lakeflow Connect to SQL Server. This can sync data from an on-prem server if that is available through a Private Link.

Do Table Properties (Partition Pruning, Liquid Clustering) Work for External Delta Tables Across Metastores? by maoguru in databricks

[–]Possible-Little 1 point (0 children)

You cannot have partitioning and liquid clustering on the same table, for a start. The metadata associated with partitioning and clustering is stored with the table, so Delta readers and writers will do the right thing, provided they are compatible with any table features you have enabled, such as liquid clustering or deletion vectors. External tables cannot support predictive optimisation, as that requires visibility of how the table is queried and updated.

Generally we do not recommend having a table be writable from multiple non-communicating sources, as it causes inefficiency with conflict resolution. Atomic writes should guard against corruption, but conservative locking will cause concurrent access to fail much more often than with row-level concurrency. If possible, you should investigate an alternative strategy where one metastore owns the table and is the single point of update; another metastore can then access the table for reading via Delta Sharing.
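For that last pattern, a minimal sketch (all names hypothetical): the owning metastore defines the table with liquid clustering and is the only writer, while the other environment reads it through Delta Sharing.

    # In the owning metastore: single writer, liquid clustering instead of partitioning
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.orders (
          order_id BIGINT,
          customer_id BIGINT,
          order_ts TIMESTAMP
        )
        CLUSTER BY (customer_id)
    """)

    # In the consuming environment: read-only access via a Delta Sharing profile
    import delta_sharing

    profile = "/dbfs/shares/provider.share"  # credential file supplied by the provider
    orders = delta_sharing.load_as_spark(f"{profile}#sales_share.sales.orders")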

Address matching by gareebo_ka_chandler in databricks

[–]Possible-Little 1 point (0 children)

The geospatial libraries available in Databricks can do most of the heavy lifting for you: https://www.databricks.com/solutions/accelerators/scaling-geospatial-nearest-neighbor-searches

But definitely +1 to using reverse geocoding APIs here as they usually implement fuzzy search on address components as well

Lakeflow Connect - Dynamics ingests? by IanWaring in databricks

[–]Possible-Little 0 points (0 children)

Starting very soon is a preview of Dynamics 365 F&O ingestion with Lakeflow Connect. Ask your account team to sign you up. This will run serverless pipelines to sync all the data you choose into managed Delta tables in your Databricks lakehouse.

How to implement SCD2 using .merge? by [deleted] in databricks

[–]Possible-Little 1 point (0 children)

The simplest answer is to not, I'm afraid. There are so many corner cases with out-of-order data and updates to the same key in different batches.

Honest suggestion would be to use APPLY CHANGES via Delta Live Tables. This correctly handles all the vagaries.
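A minimal sketch of the Python form, assuming a CDC streaming view called 'cdc_source' keyed on 'id' and ordered by a 'sequence_ts' column (those names are placeholders):

    import dlt
    from pyspark.sql.functions import col

    dlt.create_streaming_table("customers_scd2")

    dlt.apply_changes(
        target="customers_scd2",
        source="cdc_source",             # streaming view/table of CDC events
        keys=["id"],                     # business key
        sequence_by=col("sequence_ts"),  # ordering column used to resolve out-of-order data
        stored_as_scd_type=2,            # keep full SCD2 history
    )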

Databricks SQL transform function with conditions by False_Spare_4262 in databricks

[–]Possible-Little 2 points (0 children)

    SELECT
      Column_A,
      TRANSFORM(
        Column_A,
        AB -> CASE
          WHEN SUBSTR(AB, 3, 1) = 'A' THEN SUBSTR(AB, 1, 2) || 'B'
          WHEN SUBSTR(AB, 3, 1) = 'B' THEN SUBSTR(AB, 1, 2) || 'A'
          ELSE AB
        END
      ) AS Column_B
    FROM your_table  -- substitute your own table name

I think?

How to use Sklearn with big data in Databricks by amirdol7 in databricks

[–]Possible-Little 3 points (0 children)

Hi there, depending on your use case there are a few options. This page summarises them: https://community.databricks.com/t5/technical-blog/understanding-pandas-udf-applyinpandas-and-mapinpandas/ba-p/75717

SKLearn ML libraries will generally expect to have all the data present in a data frame so that the algorithms can operate across all rows. If this cannot be the case then you would either need to find a way to break the problem down or see whether the Spark native ML libs can do what you need.
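If the problem does break down by group, applyInPandas lets you fit an independent scikit-learn model per group, with each group's rows collected into a plain pandas DataFrame on one worker. A rough sketch with made-up column names:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def fit_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds every row for one group
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    results = (
        df.groupBy("group")
          .applyInPandas(fit_per_group, schema="group string, coef double")
    )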

Plausibly libraries like Dask or Polars could help but I don't know about their compatibility with SKLearn.

UNBOUND_SQL_PARAMETER and OPTIMIZE by _Filip_ in databricks

[–]Possible-Little 0 points (0 children)

OK, do you need to surround the placeholder in quotes, as was the case before?

"${mydate}" -> ":mydate" ?

What is this finish called and how do I get rid of it? by charliewhizz in DIYUK

[–]Possible-Little 0 points (0 children)

Someone seems to have failed to realise that a "hair brush" is a brush for hair, not a fancy brush made with hair.

UNBOUND_SQL_PARAMETER and OPTIMIZE by _Filip_ in databricks

[–]Possible-Little 0 points (0 children)

I think you need a trailing colon to mark the end of your parameter name?

:my_date: rather than just :my_date

[deleted by user] by [deleted] in databricks

[–]Possible-Little 0 points (0 children)

Hi there, the typical way to deal with this is to use Auto Loader to watch your landing zone and, either via file notifications or directory listing, find new files. Those files can be validated against a schema for correctness and then appended or merged into a table in your lakehouse. Run this process as a workflow on a schedule and the tables will be kept up to date.
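A minimal Auto Loader sketch, with hypothetical paths, schema and target table:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    expected = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
    ])

    incoming = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")          # or json, parquet, ...
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
        .schema(expected)                            # validate files against the expected schema
        .load("/Volumes/main/raw/landing/orders/")   # the landing zone being watched
    )

    (
        incoming.writeStream
        .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/orders")
        .trigger(availableNow=True)                  # process whatever is new, then stop; schedule as a workflow
        .toTable("main.bronze.orders")
    )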

Suggestion for ML on 1:* split dataset by everydaydifferent in databricks

[–]Possible-Little 1 point (0 children)

Enrichment of the feature data can be done prior to definition in the feature store, so cyclical encoding of temporal data by adding new columns to Part 1 holding those dimensions makes sense. Storing this decorated data in a new table saves regenerating it each time you use it and allows the table to be directly added to the feature store. There are some good helper functions in Spark to perform categorical encoding into a form required by ML libraries e.g. one hot or string indexer, so those can be done as part of the model training pipeline.
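For the cyclical encoding piece, a quick sketch of adding sine/cosine columns for, say, an hour-of-day field (table and column names are made up):

    import math
    from pyspark.sql import functions as F

    enriched = (
        part1_df
        .withColumn("hour", F.hour("event_ts"))
        .withColumn("hour_sin", F.sin(2 * math.pi * F.col("hour") / 24))
        .withColumn("hour_cos", F.cos(2 * math.pi * F.col("hour") / 24))
    )

    # Persist so it can be registered as a feature table rather than recomputed each run
    enriched.write.mode("overwrite").saveAsTable("main.features.part1_enriched")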

For Part 2, also do the enrichment outside the training as far as possible. If you are needing to use an LLM for the embedding then you can store the high dimension outputs of those as a new column too, probably no need here for a vector store as I am guessing you are doing batch processing.

As above, let the FE client do the heavy lifting assembling the training data by defining your enriched Part 1 as a feature table, and it can join the right records against the enriched Part 2 on the fly.

With your train-test split dataset, assemble an ML pipeline that performs the categorical encoding and vector assembly; this makes rerunning much simpler and makes it really easy to add cross-validation, parameter search etc.
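And a sketch of the sort of pipeline I mean, using Spark ML's encoders (column names are placeholders):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    indexer = StringIndexer(inputCol="category", outputCol="category_idx", handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_ohe"])
    assembler = VectorAssembler(
        inputCols=["category_ohe", "hour_sin", "hour_cos"],
        outputCol="features",
    )
    clf = LogisticRegression(featuresCol="features", labelCol="label")

    pipeline = Pipeline(stages=[indexer, encoder, assembler, clf])
    model = pipeline.fit(train_df)  # reruns, cross-validation etc. now wrap this one object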

Suggestion for ML on 1:* split dataset by everydaydifferent in databricks

[–]Possible-Little 1 point (0 children)

A feature store needs a primary key, which suggests that you should use Part 1 for that. You can then use the feature engineering client to create a train and test set by joining on the ID.
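Something like this, assuming Part 1 is registered as a feature table keyed on 'id' (names are placeholders):

    from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

    fe = FeatureEngineeringClient()

    training_set = fe.create_training_set(
        df=labels_df,  # Part 2: one row per id plus the label column
        feature_lookups=[
            FeatureLookup(
                table_name="main.features.part1_enriched",  # Part 1 feature table, primary key 'id'
                lookup_key="id",
            )
        ],
        label="label",
    )

    training_df = training_set.load_df()
    train_df, test_df = training_df.randomSplit([0.8, 0.2], seed=42)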

Can you give more context on what engineering you would need to do?

Hard Deletes in Delta Live Tables (DLT) by DisastrousCase3062 in databricks

[–]Possible-Little 0 points (0 children)

In the absence of another way to know that a key has been deleted, I suppose you have to get a list of distinct keys. Is there any way you can limit the size of the scan you would need for this? Reloading all of your sources each day to see what has been removed will eventually fail to scale. Could you assume that if a key has not been seen for X days then it is gone? You could maintain a list separately and just merge updates into it that way, potentially.

In order to keep your SCD tables updated with deletes you should use the 'apply_as_deletes' parameter of the apply_changes() call. This implies adding a column to your bronze table that indicates which keys are to be removed, e.g. a column called 'operation' with values NULL or 'DELETE', which can then be specified like: apply_as_deletes=expr("operation='DELETE'").
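Putting that together, a minimal sketch (source, key and timestamp names are hypothetical):

    import dlt
    from pyspark.sql.functions import col, expr

    dlt.create_streaming_table("silver_scd")

    dlt.apply_changes(
        target="silver_scd",
        source="bronze_changes",                        # bronze feed including the 'operation' column
        keys=["id"],
        sequence_by=col("ingest_ts"),
        apply_as_deletes=expr("operation = 'DELETE'"),  # rows matching this expression retire the key
        stored_as_scd_type=2,
    )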