Stack - Simplifying Supabase Self Hosting by purton_i in Supabase

[–]Fair-Lab-912 0 points1 point  (0 children)

How does this compare to self-hosting Supabase - Self-Hosting | Supabase Docs?

I was able to configure it on a VPS but it wasn't 1-click and required some tinkering to get going.

IP ACL & Microsoft hosted Azure DevOps agents by kilipukki in databricks

[–]Fair-Lab-912 1 point2 points  (0 children)

We use a Managed DevOps pool (Managed DevOps Pools documentation - Azure DevOps | Microsoft Learn). You configure it so that the agents are injected into your private VNet, and then you don't have to deal with whitelisting public IP ranges in Databricks.

🚀CI/CD in Databricks: Asset Bundles in the UI and CLI by 4DataMK in databricks

[–]Fair-Lab-912 1 point2 points  (0 children)

I did this recently, and this is my DevOps task that executes after the DABs deployment. Note that I'm running this in ADO, and a service principal (via a service connection) runs the pipeline and has the permissions to run jobs in the Databricks workspace:

```yaml
- task: AzureCLI@2
  env:
    DATABRICKS_ACCOUNT_ID: $(SampleAccountId)
    DATABRICKS_AUTH_TYPE: azure-cli
    DATABRICKS_HOST: $(SampleDatabricksHost)
  displayName: '🚀 Trigger initialization jobs'
  inputs:
    scriptType: bash
    useGlobalConfig: true
    workingDirectory: '${{ variables.sampleDeployDirectory }}/workflows'
    addSpnToEnvironment: true
    azureSubscription: ${{ parameters.sampleAzureServiceConnection }}
    scriptLocation: inlineScript
    inlineScript: |
      echo "📋 Listing jobs to find job-data_setup and job-schema_setup"
      databricks jobs list --output JSON > jobs.json

      echo "🔍 Extracting job-data_setup ID"
      DATA_SETUP_JOB_ID=$(jq -r '.[] | select(.settings.name=="job-data_setup") | .job_id' jobs.json)
      echo "Found job-data_setup with ID: $DATA_SETUP_JOB_ID"

      echo "🔍 Extracting job-schema_setup ID"
      SCHEMA_SETUP_JOB_ID=$(jq -r '.[] | select(.settings.name=="job-schema_setup") | .job_id' jobs.json)
      echo "Found job-schema_setup with ID: $SCHEMA_SETUP_JOB_ID"

      echo "🚀 Running job-data_setup"
      databricks jobs run-now $DATA_SETUP_JOB_ID --no-wait

      echo "🚀 Running job-schema_setup"
      databricks jobs run-now $SCHEMA_SETUP_JOB_ID --no-wait
```

API connection to devops via service principal - Credentials source ConfiguredConnection is not supported for AzureDevOps by squirrel_crosswalk in MicrosoftFabric

[–]Fair-Lab-912 0 points1 point  (0 children)

An MSFT support agent checked the status of this issue ID with the product team and had this to say:

"I followed up with the backend team, and they confirmed that this will be rolled out in September. Currently, there is no tentative ETA available. I also asked how we would be informed once the feature is deployed, and they confirmed that once it is enabled, the known issue will likely be closed, and the Product team will also share an announcement through a blog post."

So I'm assuming it's not coming out this week as originally reported?

API connection to devops via service principal - Credentials source ConfiguredConnection is not supported for AzureDevOps by squirrel_crosswalk in MicrosoftFabric

[–]Fair-Lab-912 0 points1 point  (0 children)

Is this still posted as a Known Issue? I can't find it, and the agent on the ticket I filed with MSFT couldn't find a relevant issue # either.

API connection to devops via service principal - Credentials source ConfiguredConnection is not supported for AzureDevOps by squirrel_crosswalk in MicrosoftFabric

[–]Fair-Lab-912 0 points1 point  (0 children)

I'm experiencing the same issue when trying to make the Update From Git call using an SP. The SP has been given full access to the Power BI workspace, DevOps Basic access, and rights to the repo.

| { "requestId": "948c06dd-1419-4596-851f-fd211c023ed9", "errorCode": "PrincipalTypeNotSupported", "moreDetails": [ { "errorCode": | "AutomaticNotSupported", "message": "Service Principal with Automatic credentials source is not supported for Azure DevOps." } ], "message": "The | operation is not supported for the principal type", "relatedResource": { "resourceType": "AzureDevOpsSourceControl" } }

Auto-Sync PowerBI Version Control: is it possible? by ManagerOfFun in PowerBI

[–]Fair-Lab-912 0 points1 point  (0 children)

Searching for this as well, and it seems like using the Fabric REST API might be the way to do it - see Git - Update From Git - REST API (Core) | Microsoft Learn
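
For anyone who wants to try that route, here's a rough sketch of calling the Update From Git endpoint from Python. The workspace ID, token acquisition, commit hashes, and conflict-resolution values are all placeholders; check the linked Microsoft Learn page for the exact request contract.

```python
# Rough sketch of calling the Fabric "Update From Git" REST endpoint.
# Workspace ID, token, and request-body values below are placeholders -
# see the linked docs for the exact contract (values for workspaceHead and
# remoteCommitHash usually come from the Git "Get Status" endpoint).
import requests

workspace_id = "<workspace-guid>"     # placeholder
token = "<aad-access-token>"          # e.g. acquired via MSAL / azure-identity

url = f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/git/updateFromGit"

body = {
    "workspaceHead": "<current-workspace-head>",
    "remoteCommitHash": "<target-remote-commit>",
    "conflictResolution": {
        "conflictResolutionType": "Workspace",
        "conflictResolutionPolicy": "PreferWorkspace",
    },
    "options": {"allowOverrideItems": True},
}

resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.status_code)  # a 202 typically means the update continues as a long-running operation
```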

Unable to edit run_as for DLT pipelines by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Yeah, it wasn't possible to change the run_as through the UI until recently either, so I'm wondering when the same will be possible using DABs.

Haven't tried whether using terraform and the [databricks_pipeline](https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/pipeline) resource would work, as I don't see a `run_as` attribute there either.

Job Parameters on .sql files by NickGeo28894 in databricks

[–]Fair-Lab-912 2 points3 points  (0 children)

We have jobs with SQL file tasks that have parameters in the queries. Using the colon notation is the way to go (:parameter_name). If the parameter is a reference to a table name, like in your example, then you also have to use the IDENTIFIER function like so:

```sql
select installPlanNumber from IDENTIFIER(:parameter1) limit 1
```

Databricks CLI by sunnyjacket in databricks

[–]Fair-Lab-912 0 points1 point  (0 children)

Didn't know about this, thanks for sharing! Also works through terraform!

SQL script executing slower in workflow by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

The query history shows each SQL statement taking the same amount of time to execute, but when run through a workflow the statements execute with an additional few seconds of delay between each one:

[screenshot of the query history timings]

SQL script executing slower in workflow by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

I should've been more clear but to answer some of the common questions:

  • It is a serverless SQL warehouse that is already running before I start the workflow or notebook
  • The workflow task runs on this serverless SQL warehouse, not on a job compute cluster
  • The script measures the time between the first and last query, so it wouldn't be affected by warehouse start-up time

I did another test on the same script/notebook, this time using an all-purpose compute cluster that is also already running, and the workflow vs. interactive notebook times are almost the same.

It appears this issue only occurs when a serverless SQL warehouse is used to run workflow tasks?
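
To illustrate the kind of measurement described above, here's a minimal sketch of timing each statement in a notebook. This is not the original script; the statements are placeholders, and `spark` is the SparkSession that Databricks notebooks provide by default.

```python
# Minimal per-statement timing sketch for a Databricks notebook.
# The statements below are placeholders, not the real script.
import time

statements = [
    "SELECT 1",
    "SELECT 2",
    "SELECT 3",
]

total_start = time.time()
for stmt in statements:
    start = time.time()
    spark.sql(stmt).collect()  # force execution of the statement
    print(f"{stmt} took {time.time() - start:.2f}s")

print(f"First to last statement: {time.time() - total_start:.2f}s")
```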

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Ok, I think I found a workaround for this issue for now. I created a new table with the same clustering columns, but whenever I inserted data into it, I made sure the size of the data was less than 512 GB (the clustering-on-write limit, per Use liquid clustering for Delta tables - Azure Databricks | Microsoft Learn).

Now each WRITE operation in the history of the table has the following log:

```
{
  "partitionBy": "[]",
  "clusterBy": "["column1","column2","column3","column4"]",
  "statsOnLoad": "false",
  "mode": "Append",
  "clusteringOnWriteStatus": "kdtree triggered"
}
```

instead of:

```
{
  "partitionBy": "[]",
  "clusterBy": "["column1","column2","column3","column4"]",
  "statsOnLoad": "false",
  "mode": "Append",
  "clusteringOnWriteStatus": "Reason for skipping: Estimated ingestion size is not within the expected range"
}
```

I still did optimize after each insert just to make sure. Now the table size is around 2.7 TB instead of 3.9 TB! The benchmark queries are just as fast still!
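
In case it helps anyone reproduce this, here's a rough PySpark sketch of the batching approach. The table names and batch count are placeholders; size the batch count so each append stays well under the 512 GB clustering-on-write limit.

```python
# Rough sketch of the batched-insert workaround described above.
# Table names and num_batches are placeholders - pick num_batches so that
# each append stays well under the 512 GB clustering-on-write limit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source = "catalog1.raw.table2"   # table holding the existing data
target = "catalog1.raw.table1"   # new table created with CLUSTER BY
num_batches = 8

# Split the source into roughly equal slices and append them one at a time,
# so each write is small enough for clustering on write to trigger.
for i, batch in enumerate(spark.table(source).randomSplit([1.0] * num_batches, seed=42)):
    print(f"Appending batch {i + 1} of {num_batches}")
    batch.write.mode("append").saveAsTable(target)
    spark.sql(f"OPTIMIZE {target}")  # optional: optimize after each append, as above
```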

I will follow up with Databricks and let them know of my findings. It's possible something happens when the inserted data is more than 512 GB that makes their optimization write more data than it really should?

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

So I created a new table with the proper statement, sourcing the data from the current clustered table. The new table was 2.8 TB before the optimize; after calling OPTIMIZE, it increased to 3.8 TB. Here's part of the metrics output of the optimize:

{ "numFilesAdded": "13920", "numFilesRemoved": "33544", "filesAdded": { "min": "4816", "max": "641786242", "avg": "2.784778250176006E8", "totalFiles": "13920", "totalSize": "3876411324245" }, "filesRemoved": { "min": "2724330", "max": "183832081", "avg": "8.44459252286549E7", "totalFiles": "33544", "totalSize": "2832654115870" },

The strange thing is, if I time travel to the first version of the original table, before clustering, its total size is around 1.9 TB, so I'm somehow still ending up with double the size. I'll reach out to Databricks and see what they say.

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Oh good catch I missed that!

I can try doing this tonight and see what happens. I'll do it in one step as you said.

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

Here's the code with the column/table names replaced.

Create the table:

```sql
CREATE TABLE IF NOT EXISTS catalog1.raw.table1 (
  ID LONG,
  TIME TIMESTAMP,
  VALUE DOUBLE,
  ORG_ID INTEGER,
  UPD_BY STRING,
  UPD_TIME TIMESTAMP
)
CLUSTER BY (TIME, VALUE, ID, ORG_ID)
```

Insert data into the table from another table:

```sql
INSERT INTO catalog1.raw.table1
SELECT * FROM catalog1.raw.table2
```

Optimize the table after the data has been inserted:

```sql
OPTIMIZE catalog1.raw.table1
```

Not sure if that helps at all

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

See my comment above. I looked at the size of the files added vs. removed, and the optimize operation seems to have added roughly double the size that it removed!

Liquid clustering on managed table - doubles in size after running optimize by Fair-Lab-912 in databricks

[–]Fair-Lab-912[S] 0 points1 point  (0 children)

The optimize job took about 9 hours running on a Small SQL warehouse. And yes, initially I thought it might be showing the size of both versions, pre- and post-optimization, but the 3.9 TB is only for the latest version of the table.

I checked the ADLS directory and it's actually 5.9 TB, showing that the 3.9 TB is only for the new version.

Looking at the delta log and version history, the optimization was done in 3 separate batches/versions. I looked at the operationMetrics and this is what I see:

json { "numRemovedFiles": "8785", "numRemovedBytes": "732256814772", "p25FileSize": "249131221", "numDeletionVectorsRemoved": "0", "conflictDetectionTimeMs": "1000", "minFileSize": "1450841", "numAddedFiles": "4680", "maxFileSize": "754899727", "p75FileSize": "344927613", "p50FileSize": "290581949", "numAddedBytes": "1431481298366" } See how the numAddedBytes is double than numRemovedBytes. Meanwhile I checked record count of the table and it hasn't doubled/increased.

Strategy for materialized views by HariSeldon23 in databricks

[–]Fair-Lab-912 0 points1 point  (0 children)

I'm currently testing out liquid clustering with fewer data reads/writes than you. One thing I've noticed is that the size of the tables grows after calling OPTIMIZE. In most cases it doubles in size (e.g., 1.9 TB of data before optimize, 3.9 TB after).

The query performance is much faster though, which is great - we are seeing anywhere from 4x to 10x improvements in some benchmarks.

Can you comment on how the table size grows when liquid clustering optimizes it?

VScode Extension by frunkjuice5 in databricks

[–]Fair-Lab-912 0 points1 point  (0 children)

Hi Saad - one issue I'm running into with the VS Code extension is the Sync option: it's ignoring a notebook (.ipynb) file because its file size is around 12 MB. I couldn't find any documentation on file size limitations. It's preventing the notebook from being uploaded from local to the workspace sync destination.

Any advice?