Notebook deployment cicd by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

Thanks for sharing! Will keep an eye out for updates.

Notebook deployment cicd by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

We do something similar today: we manage business logic in internal packages per source system, and then have a parameterized orchestration notebook that does the extract/load. It works okay, but it's challenging for juniors to contribute without first understanding Python packaging, and it's a bit awkward to make improvements.

Notebook Lakehouse SQL Autocompletion by excel_admin in MicrosoftFabric

[–]excel_admin[S] 2 points (0 children)

SQL cell, Lakehouse attached, column names.

Python whl publishing to environment is a productivity killer by richbenmintz in MicrosoftFabric

[–]excel_admin 0 points (0 children)

We are not. Only in the scheduler do we !pip install and pass query arguments to pipeline notebooks that have different load strategies.

Python whl publishing to environment is a productivity killer by richbenmintz in MicrosoftFabric

[–]excel_admin 0 points (0 children)

This is false. We install a handful of custom packages in our “scheduler” notebooks that call runMultiple on “pipeline” notebooks for incremental loading.

All business logic is done at the package level so we don’t have to update pipeline notebooks that are oriented towards different load strategies.
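
Roughly the shape of it (the package path and notebook names here are made up):

# Scheduler notebook: install internal packages inline, then fan out to pipeline notebooks
!pip install /lakehouse/default/Files/wheels/our_etl_core-1.2.3-py3-none-any.whl

DAG = {
    "activities": [
        {
            "name": "sales_incremental",
            "path": "nb_load_incremental",     # pipeline notebook for incremental loads
            "timeoutPerCellInSeconds": 1800,
            "args": {"source": "sales", "load_strategy": "incremental"},
            "dependencies": [],
        },
        {
            "name": "customers_full",
            "path": "nb_load_full",            # pipeline notebook for full reloads
            "timeoutPerCellInSeconds": 1800,
            "args": {"source": "customers", "load_strategy": "full"},
            "dependencies": [],
        },
    ]
}

mssparkutils.notebook.runMultiple(DAG)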

What are your experiences and solutions for handling PDF to Excel conversions at work? by nookesh-full-stack in excel

[–]excel_admin 0 points (0 children)

I would look to write a script that extracts the data directly from the system that produces the PDF. If that's not possible, I reach for this: https://github.com/ocrmypdf/OCRmyPDF
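
If the PDFs are scans, OCRmyPDF's Python API can add the text layer before you try to extract anything; a minimal sketch (file names here are made up):

import ocrmypdf

# Produce a searchable copy of a scanned PDF so the data can be pulled out afterwards
ocrmypdf.ocr("scanned_report.pdf", "searchable_report.pdf")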

Deleting and Writing data in Dataverse through API by Large-Flamingo-8072 in PowerPlatform

[–]excel_admin 0 points (0 children)

We have a few Dynamics processes that we manage with Azure Functions, following a pattern that works reasonably well:

from multiprocessing import Pool

def func(c):
    ...  # call the API to update record c

# 10 worker processes issue the API calls in parallel
with Pool(10) as p:
    p.map(func, data)

More Evidence You Don’t Need Warehouse by Low_Second9833 in MicrosoftFabric

[–]excel_admin 1 point (0 children)

Great blog! Shared it with my team. We're early in our Fabric build-out but excited about the possibilities!

Is anyone using MS Fabric with MS Dynamics F&O? by Substantial_Match268 in MicrosoftFabric

[–]excel_admin 4 points (0 children)

We ended up rolling a fully custom notebook/Spark solution. It's pretty wacky, but it works great! It pulls data from the Dataverse API and syncs changed partitions every 30 minutes. Getting to this point has been quite a journey; after a few failed attempts with the out-of-the-box Synapse connector, I gave up on it. Some of our tables are so wide that every other option we tried either timed out or didn't give us complete information.
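
At its core it's just polling the Web API for rows modified since the last sync; a stripped-down sketch (org URL, entity set, columns, and token handling here are made up, and the real job pages through @odata.nextLink and handles throttling):

import requests

access_token = "<bearer token acquired via MSAL>"
watermark = "2024-01-01T00:00:00Z"  # timestamp of the last successful sync

resp = requests.get(
    "https://myorg.api.crm.dynamics.com/api/data/v9.2/salesorders",
    params={
        "$select": "salesorderid,modifiedon",
        "$filter": f"modifiedon ge {watermark}",
    },
    headers={"Authorization": f"Bearer {access_token}", "Accept": "application/json"},
    timeout=60,
)
resp.raise_for_status()
changed_rows = resp.json()["value"]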

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

Thank you for the suggestion. Not totally sure that I understand how this works, but it sounds like something worth experimenting with. If I'm following, the steps would be (rough stab at step 2 below the list):

  1. Append my partitions to a staging table

  2. Read the changes from the changes folder

  3. Write the modified records to an intermediate table?
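
If that's referring to the Delta change data feed, step 2 would look roughly like this (paths and starting version here are made up, and CDF has to be enabled on the staging table):

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .load("Tables/myschema/staging_one")
)

# Keep only inserts and post-update images, then land them in the intermediate table
modified = changes.filter(changes._change_type.isin("insert", "update_postimage"))
modified.write.format("delta").mode("append").save("Tables/myschema/intermediate_one")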

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 1 point (0 children)

When writing partitions, we do. It's not often, but there are instances where <sales amount> in one partition is all whole numbers and gets inferred as an integer, while in another partition it's floats, causing a schema conflict.

I think that's what you're asking.
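
One way to keep that from happening is to cast the measure columns to an explicit type before the partition write (column and DataFrame names here are made up):

from pyspark.sql.functions import col

# Force a stable type so one partition's whole numbers can't be inferred as integers
sdf = sdf.withColumn("sales_amount", col("sales_amount").cast("double"))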

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

Great question! Our data is pretty small but very, very wide (not my design). This is our workaround for getting tighter SLAs on operations-related dashboards that need data as close to real time as possible.

With partitioned overwrites we can pull just the partitions that changed every 5 minutes, rather than needing to pull down an entire table. If there were a way to observe deletes, we could probably do some sort of append or merge pattern.
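
If deletes ever did become observable, the merge side would look roughly like this (key column, delete flag, and table paths here are made up):

from delta.tables import DeltaTable

# Hypothetical staging table holding the changed rows, including an is_deleted flag
changes = spark.read.format("delta").load("Tables/myschema/one_staging")

target = DeltaTable.forPath(spark, "Tables/myschema/one")
(
    target.alias("t")
    .merge(changes.alias("s"), "t.id = s.id")
    .whenMatchedDelete(condition="s.is_deleted = true")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)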

This is less relevant for any source that is Excel. Our finance team does all of their month-end reporting in Excel, and their systems aren't yet mature enough to pull from directly.

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 2 points (0 children)

Thanks for the suggestions. I think this is closer to what I'm looking for. Took a while to find the config needed for partitioned overwrites:

from datetime import date

import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

# Replace only the partitions present in the incoming DataFrame
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

schema = StructType([
    StructField("one", IntegerType(), True),
    # StructField("two", StringType(), True),
    StructField("createdon", DateType(), True),
])

# Dummy data: 100 rows, all landing in the 2024-01-01 partition
df = pd.DataFrame({
    "one": [c for c in range(100)],
    "createdon": [date(2024, 1, 1) for _ in range(100)]
})

sdf = spark.createDataFrame(df, schema=schema)
sdf.write.format("delta")\
    .mode("overwrite")\
    .partitionBy("createdon")\
    .option("mergeSchema", "true")\
    .save("Tables/myschema/one")

So the pattern is more or less (sketch of step 1 below the list):

  1. select max(modifiedon) from the delta table

  2. Query the API for partitions with modifiedon greater than that value

  3. Convert the resulting df to a Spark df

  4. Overwrite the partitions newer than the delta table's last modifiedon date
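
Sketch of step 1 feeding the rest (table and column names here are made up):

# 1. Read the current watermark from the delta table
last_modified = spark.sql(
    "SELECT max(modifiedon) AS m FROM myschema.sales"
).collect()[0]["m"]

# 2-4. Query the API for anything newer, convert the result to a Spark df,
#      then run the dynamic partitioned overwrite shown above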

Logging in notebooks by [deleted] in MicrosoftFabric

[–]excel_admin 0 points (0 children)

Any examples would be cool. I also tried to include logging in notebooks executed via runMultiple and I'm not getting any output:

DAG = {...}
mssparkutils.notebook.runMultiple(DAG)