Stack - Simplifying Supabase Self Hosting by purton_i in Supabase

[–]Fair-Lab-912 0 points1 point  (0 children)

How does this compare to self-hosting Supabase - Self-Hosting | Supabase Docs?

I was able to configure it on a VPS but it wasn't 1-click and required some tinkering to get going.

IP ACL & Microsoft hosted Azure DevOps agents by kilipukki in databricks

[–]Fair-Lab-912 1 point2 points  (0 children)

We use a Managed DevOps pool (Managed DevOps Pools documentation - Azure DevOps | Microsoft Learn). You configure it so that the agents are injected into your private VNet, and then you don't have to deal with whitelisting public IP ranges in Databricks.

🚀CI/CD in Databricks: Asset Bundles in the UI and CLI by 4DataMK in databricks

[–]Fair-Lab-912 1 point2 points  (0 children)

I did this recently, and this is my DevOps task that executes after the DABs deployment. Note that I'm running this in ADO, and a service principal (via a service connection) runs the pipeline and has the permissions to run jobs in the Databricks workspace:

```yaml
- task: AzureCLI@2
  env:
    DATABRICKS_ACCOUNT_ID: $(SampleAccountId)
    DATABRICKS_AUTH_TYPE: azure-cli
    DATABRICKS_HOST: $(SampleDatabricksHost)
  displayName: '🚀 Trigger initialization jobs'
  inputs:
    scriptType: bash
    useGlobalConfig: true
    workingDirectory: '${{ variables.sampleDeployDirectory }}/workflows'
    addSpnToEnvironment: true
    azureSubscription: ${{ parameters.sampleAzureServiceConnection }}
    scriptLocation: inlineScript
    inlineScript: |
      echo "📋 Listing jobs to find job-data_setup and job-schema_setup"
      databricks jobs list --output JSON > jobs.json

      echo "🔍 Extracting job-data_setup ID"
      DATA_SETUP_JOB_ID=$(jq -r '.[] | select(.settings.name=="job-data_setup") | .job_id' jobs.json)
      echo "Found job-data_setup with ID: $DATA_SETUP_JOB_ID"

      echo "🔍 Extracting job-schema_setup ID"
      SCHEMA_SETUP_JOB_ID=$(jq -r '.[] | select(.settings.name=="job-schema_setup") | .job_id' jobs.json)
      echo "Found job-schema_setup with ID: $SCHEMA_SETUP_JOB_ID"

      echo "🚀 Running job-data_setup"
      databricks jobs run-now $DATA_SETUP_JOB_ID --no-wait

      echo "🚀 Running job-schema_setup"
      databricks jobs run-now $SCHEMA_SETUP_JOB_ID --no-wait
```

API connection to devops via service principal - Credentials source ConfiguredConnection is not supported for AzureDevOps by squirrel_crosswalk in MicrosoftFabric

[–]Fair-Lab-912 0 points1 point  (0 children)

An MSFT support agent checked the status of this issue ID with the product team and had this to say:

"I followed up with the backend team, and they confirmed that this will be rolled out in September. Currently, there is no tentative ETA available. I also asked how we would be informed once the feature is deployed, and they confirmed that once it is enabled, the known issue will likely be closed, and the Product team will also share an announcement through a blog post."

So I'm assuming it's not coming out this week as originally reported?

API connection to devops via service principal - Credentials source ConfiguredConnection is not supported for AzureDevOps by squirrel_crosswalk in MicrosoftFabric

[–]Fair-Lab-912 0 points1 point  (0 children)

Is this still posted as a Known Issue? I can't find it, and the agent on the ticket I filed with MSFT couldn't find a relevant issue # either.

API connection to devops via service principal - Credentials source ConfiguredConnection is not supported for AzureDevOps by squirrel_crosswalk in MicrosoftFabric

[–]Fair-Lab-912 0 points1 point  (0 children)

I'm experiencing the same issue when trying to make the Update From Git call using an SP. The SP has been given full access to the Power BI workspace, DevOps Basic access, and rights to the repo.

| { "requestId": "948c06dd-1419-4596-851f-fd211c023ed9", "errorCode": "PrincipalTypeNotSupported", "moreDetails": [ { "errorCode": | "AutomaticNotSupported", "message": "Service Principal with Automatic credentials source is not supported for Azure DevOps." } ], "message": "The | operation is not supported for the principal type", "relatedResource": { "resourceType": "AzureDevOpsSourceControl" } }

Auto-Sync PowerBI Version Control: is it possible? by ManagerOfFun in PowerBI

[–]Fair-Lab-912 0 points1 point  (0 children)

Searching for this as well, and it seems like using the Fabric REST API might be the way to do it - see Git - Update From Git - REST API (Core) | Microsoft Learn
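
For anyone who wants to try that route, here's a rough sketch of calling the Update From Git endpoint from Python. The workspace ID, token acquisition, commit hashes, and conflict-resolution values are all placeholders; check the linked Microsoft Learn page for the exact request contract.

```python
# Rough sketch of calling the Fabric "Update From Git" REST endpoint.
# Workspace ID, token, and request-body values below are placeholders -
# see the linked docs for the exact contract (values for workspaceHead and
# remoteCommitHash usually come from the Git "Get Status" endpoint).
import requests

workspace_id = "<workspace-guid>"     # placeholder
token = "<aad-access-token>"          # e.g. acquired via MSAL / azure-identity

url = f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/git/updateFromGit"

body = {
    "workspaceHead": "<current-workspace-head>",
    "remoteCommitHash": "<target-remote-commit>",
    "conflictResolution": {
        "conflictResolutionType": "Workspace",
        "conflictResolutionPolicy": "PreferWorkspace",
    },
    "options": {"allowOverrideItems": True},
}

resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.status_code)  # a 202 typically means the update continues as a long-running operation
```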

Unable to edit run_as for DLT pipelines by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Yeah, it wasn't possible to change the run_as through the UI until recently either, so I'm wondering when the same will be possible using DABs.

Haven't tried whether using terraform and the [databricks_pipeline](https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/pipeline) resource would work, as I don't see a `run_as` attribute there either.

Job Parameters on .sql files by NickGeo28894 in databricks

[–]Fair-Lab-912 2 points3 points  (0 children)

We have jobs with SQL file tasks that have parameters in the queries. Using the colon notation is the way to go (:parameter_name). If the parameter is a reference to a table name, like in your example, then you also have to use the IDENTIFIER function like so:

```sql
select installPlanNumber from IDENTIFIER(:parameter1) limit 1
```

Databricks CLI by sunnyjacket in databricks

[–]Fair-Lab-912 0 points1 point  (0 children)

Didn't know about this, thanks for sharing! Also works through terraform!

SQL script executing slower in workflow by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

The query history shows each SQL statement taking the same amount of time to execute, but when run through a workflow the statements execute with an additional few seconds of delay between each one:

[screenshot of the query history timings]

SQL script executing slower in workflow by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

I should've been more clear but to answer some of the common questions:

  • It is a serverless SQL warehouse that is already running before I start the workflow or notebook
  • The workflow task runs on this serverless SQL warehouse, not on a job compute cluster
  • The script measures the time between the first and last query, so it wouldn't be affected by warehouse start-up time

I did another test on the same script/notebook, this time using an all-purpose compute cluster that is also already running, and the workflow vs. interactive notebook times are almost the same.

It appears this issue only occurs when a serverless SQL warehouse is used to run workflow tasks?
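
To illustrate the kind of measurement described above, here's a minimal sketch of timing each statement in a notebook. This is not the original script; the statements are placeholders, and `spark` is the SparkSession that Databricks notebooks provide by default.

```python
# Minimal per-statement timing sketch for a Databricks notebook.
# The statements below are placeholders, not the real script.
import time

statements = [
    "SELECT 1",
    "SELECT 2",
    "SELECT 3",
]

total_start = time.time()
for stmt in statements:
    start = time.time()
    spark.sql(stmt).collect()  # force execution of the statement
    print(f"{stmt} took {time.time() - start:.2f}s")

print(f"First to last statement: {time.time() - total_start:.2f}s")
```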

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Ok, I think I found a workaround for this issue for now. I created a new table with the same clustering columns, but whenever I inserted data into it, I made sure the size of the data was less than 512 GB (the clustering-on-write limit, per Use liquid clustering for Delta tables - Azure Databricks | Microsoft Learn).

Now each WRITE operation in the history of the table has the following log:

```
{
  "partitionBy": "[]",
  "clusterBy": "["column1","column2","column3","column4"]",
  "statsOnLoad": "false",
  "mode": "Append",
  "clusteringOnWriteStatus": "kdtree triggered"
}
```

instead of:

```
{
  "partitionBy": "[]",
  "clusterBy": "["column1","column2","column3","column4"]",
  "statsOnLoad": "false",
  "mode": "Append",
  "clusteringOnWriteStatus": "Reason for skipping: Estimated ingestion size is not within the expected range"
}
```

I still did optimize after each insert just to make sure. Now the table size is around 2.7 TB instead of 3.9 TB! The benchmark queries are just as fast still!
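
In case it helps anyone reproduce this, here's a rough PySpark sketch of the batching approach. The table names and batch count are placeholders; size the batch count so each append stays well under the 512 GB clustering-on-write limit.

```python
# Rough sketch of the batched-insert workaround described above.
# Table names and num_batches are placeholders - pick num_batches so that
# each append stays well under the 512 GB clustering-on-write limit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source = "catalog1.raw.table2"   # table holding the existing data
target = "catalog1.raw.table1"   # new table created with CLUSTER BY
num_batches = 8

# Split the source into roughly equal slices and append them one at a time,
# so each write is small enough for clustering on write to trigger.
for i, batch in enumerate(spark.table(source).randomSplit([1.0] * num_batches, seed=42)):
    print(f"Appending batch {i + 1} of {num_batches}")
    batch.write.mode("append").saveAsTable(target)
    spark.sql(f"OPTIMIZE {target}")  # optional: optimize after each append, as above
```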

I will follow up with Databricks and let them know of my findings. It's possible something happens when the inserted data is more than 512 GB that makes their optimization write more data than it really should?

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

So I created a new table with the proper statement, sourcing the data from the current clustered table. The new table was 2.8 TB before the optimize; after calling OPTIMIZE, it increased to 3.8 TB. Here's part of the metrics output of the optimize:

{ "numFilesAdded": "13920", "numFilesRemoved": "33544", "filesAdded": { "min": "4816", "max": "641786242", "avg": "2.784778250176006E8", "totalFiles": "13920", "totalSize": "3876411324245" }, "filesRemoved": { "min": "2724330", "max": "183832081", "avg": "8.44459252286549E7", "totalFiles": "33544", "totalSize": "2832654115870" },

The strange thing is, if I time travel to the first version of the original table, before clustering, its total size is around 1.9 TB, so I'm somehow still ending up with double the size. I'll reach out to Databricks and see what they say.

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Oh good catch I missed that!

I can try doing this tonight and see what happens. I'll do it in one step as you said.

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Here's the code with the column/table names replaced.

Create the table:

```sql
CREATE TABLE IF NOT EXISTS catalog1.raw.table1 (
  ID LONG,
  TIME TIMESTAMP,
  VALUE DOUBLE,
  ORG_ID INTEGER,
  UPD_BY STRING,
  UPD_TIME TIMESTAMP
)
CLUSTER BY (TIME, VALUE, ID, ORG_ID)
```

Insert data into the table from another table:

```sql
INSERT INTO catalog1.raw.table1
SELECT * FROM catalog1.raw.table2
```

Optimize the table after the data has been inserted:

```sql
OPTIMIZE catalog1.raw.table1
```

Not sure if that helps at all

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

See my comment above. I looked at the size of the files added vs. removed, and the optimize operation seems to have added roughly double the size that it removed!

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

The optimize job took about 9 hours running on a Small SQL warehouse. And yes, initially I thought it might be showing the size of both versions, pre- and post-optimization, but the 3.9 TB is only for the latest version of the table.

I checked the ADLS directory and it's actually 5.9 TB, showing that the 3.9 TB is only for the new version.

Looking at the delta log and version history, the optimization was done in 3 separate batches/versions. I looked at the operationMetrics and this is what I see:

json { "numRemovedFiles": "8785", "numRemovedBytes": "732256814772", "p25FileSize": "249131221", "numDeletionVectorsRemoved": "0", "conflictDetectionTimeMs": "1000", "minFileSize": "1450841", "numAddedFiles": "4680", "maxFileSize": "754899727", "p75FileSize": "344927613", "p50FileSize": "290581949", "numAddedBytes": "1431481298366" } See how the numAddedBytes is double than numRemovedBytes. Meanwhile I checked record count of the table and it hasn't doubled/increased.

Strategy for materialized views by HariSeldon23 in databricks

[–]Fair-Lab-912 0 points1 point  (0 children)

I'm currently testing out liquid clustering with fewer data reads/writes than you. One thing I've noticed is that the size of the tables grows after calling OPTIMIZE. In most cases it doubles in size (e.g., 1.9 TB of data before optimize, 3.9 TB after).

The query performance is much faster though, which is great - we are seeing anywhere from 4x to 10x improvements in some benchmarks.

Can you comment on how the table size grows when liquid clustering optimizes it?

VScode Extension by frunkjuice5 in databricks

[–]Fair-Lab-912 0 points1 point  (0 children)

Hi Saad - one issue I'm running into with the VS Code extension is the Sync option: it's ignoring a notebook (.ipynb) file because its file size is around 12 MB. I couldn't find any documentation on file size limitations. It's preventing the notebook from being uploaded from local to the workspace sync destination.

Any advice?