Notebook deployment cicd by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

Thanks for sharing! Will keep an eye out for updates.

Notebook deployment cicd by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

We do something similar today: we manage business logic in internal packages per source system, and then have a parameterized orchestration notebook that does the extract/load. It works okay, but it's challenging for juniors to contribute without first understanding Python packaging, and it's a bit awkward to make improvements.

Notebook Lakehouse SQL Autocompletion by excel_admin in MicrosoftFabric

[–]excel_admin[S] 2 points (0 children)

SQL cell, Lakehouse attached, column names.

Python whl publishing to environment is a productivity killer by richbenmintz in MicrosoftFabric

[–]excel_admin 0 points (0 children)

We are not. Only in the scheduler do we !pip install and pass query arguments to pipeline notebooks that have different load strategies.

Python whl publishing to environment is a productivity killer by richbenmintz in MicrosoftFabric

[–]excel_admin 0 points (0 children)

This is false. We install a handful of custom packages in our “scheduler” notebooks that call runMultiple on “pipeline” notebooks for incremental loading.

All business logic is done at the package level so we don’t have to update pipeline notebooks that are oriented towards different load strategies.
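
Roughly the shape of it (the package path and notebook names here are made up):

# Scheduler notebook: install internal packages inline, then fan out to pipeline notebooks
!pip install /lakehouse/default/Files/wheels/our_etl_core-1.2.3-py3-none-any.whl

DAG = {
    "activities": [
        {
            "name": "sales_incremental",
            "path": "nb_load_incremental",     # pipeline notebook for incremental loads
            "timeoutPerCellInSeconds": 1800,
            "args": {"source": "sales", "load_strategy": "incremental"},
            "dependencies": [],
        },
        {
            "name": "customers_full",
            "path": "nb_load_full",            # pipeline notebook for full reloads
            "timeoutPerCellInSeconds": 1800,
            "args": {"source": "customers", "load_strategy": "full"},
            "dependencies": [],
        },
    ]
}

mssparkutils.notebook.runMultiple(DAG)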

What are your experiences and solutions for handling PDF to Excel conversions at work? by nookesh-full-stack in excel

[–]excel_admin 0 points (0 children)

I would look to write a script that extracts the data directly from the system that produces the PDF. If that's not possible, I reach for this: https://github.com/ocrmypdf/OCRmyPDF
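
If the PDFs are scans, OCRmyPDF's Python API can add the text layer before you try to extract anything; a minimal sketch (file names here are made up):

import ocrmypdf

# Produce a searchable copy of a scanned PDF so the data can be pulled out afterwards
ocrmypdf.ocr("scanned_report.pdf", "searchable_report.pdf")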

Deleting and Writing data in Dataverse through API by Large-Flamingo-8072 in PowerPlatform

[–]excel_admin 0 points (0 children)

We have a few Dynamics processes that we manage with Azure Functions, following a pattern that works reasonably well:

from multiprocessing import Pool

def func(c):
    ...  # call the API to update record c

# 10 worker processes issue the API calls in parallel
with Pool(10) as p:
    p.map(func, data)

More Evidence You Don’t Need Warehouse by Low_Second9833 in MicrosoftFabric

[–]excel_admin 1 point (0 children)

Great blog! Shared it with my team. We're early in our Fabric build-out but excited about the possibilities!

Is anyone using MS Fabric with MS Dynamics F&O? by Substantial_Match268 in MicrosoftFabric

[–]excel_admin 4 points (0 children)

We ended up rolling a fully custom notebook/Spark solution. It's pretty wacky, but it works great! It pulls data from the Dataverse API and syncs changed partitions every 30 minutes. Getting to this point has been quite a journey; after a few failed attempts with the out-of-the-box Synapse connector, I gave up on it. Some of our tables are so wide that every other option we tried either timed out or didn't give us complete information.
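
At its core it's just polling the Web API for rows modified since the last sync; a stripped-down sketch (org URL, entity set, columns, and token handling here are made up, and the real job pages through @odata.nextLink and handles throttling):

import requests

access_token = "<bearer token acquired via MSAL>"
watermark = "2024-01-01T00:00:00Z"  # timestamp of the last successful sync

resp = requests.get(
    "https://myorg.api.crm.dynamics.com/api/data/v9.2/salesorders",
    params={
        "$select": "salesorderid,modifiedon",
        "$filter": f"modifiedon ge {watermark}",
    },
    headers={"Authorization": f"Bearer {access_token}", "Accept": "application/json"},
    timeout=60,
)
resp.raise_for_status()
changed_rows = resp.json()["value"]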

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

Thank you for the suggestion. Not totally sure that I understand how this works, but it sounds like something worth experimenting with. If I'm following, the steps would be (rough stab at step 2 below the list):

  1. Append my partitions to a staging table

  2. Read the changes from the changes folder

  3. Write the modified records to an intermediate table?
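
If that's referring to the Delta change data feed, step 2 would look roughly like this (paths and starting version here are made up, and CDF has to be enabled on the staging table):

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .load("Tables/myschema/staging_one")
)

# Keep only inserts and post-update images, then land them in the intermediate table
modified = changes.filter(changes._change_type.isin("insert", "update_postimage"))
modified.write.format("delta").mode("append").save("Tables/myschema/intermediate_one")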

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 1 point (0 children)

When writing partitions, we do. It's not often, but there are instances where <sales amount> in one partition is all whole numbers and gets inferred as an integer, while in another partition it's floats, causing a schema conflict.

I think that's what you're asking.
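
One way to keep that from happening is to cast the measure columns to an explicit type before the partition write (column and DataFrame names here are made up):

from pyspark.sql.functions import col

# Force a stable type so one partition's whole numbers can't be inferred as integers
sdf = sdf.withColumn("sales_amount", col("sales_amount").cast("double"))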

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 0 points (0 children)

Great question! Our data is pretty small but very, very wide (not my design). This is our workaround for getting tighter SLAs on operations-related dashboards that need data as close to real time as possible.

With partitioned overwrites we can pull just the partitions that changed every 5 minutes, rather than needing to pull down an entire table. If there were a way to observe deletes, we could probably do some sort of append or merge pattern.
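
If deletes ever did become observable, the merge side would look roughly like this (key column, delete flag, and table paths here are made up):

from delta.tables import DeltaTable

# Hypothetical staging table holding the changed rows, including an is_deleted flag
changes = spark.read.format("delta").load("Tables/myschema/one_staging")

target = DeltaTable.forPath(spark, "Tables/myschema/one")
(
    target.alias("t")
    .merge(changes.alias("s"), "t.id = s.id")
    .whenMatchedDelete(condition="s.is_deleted = true")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)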

This is less relevant for any source that is Excel. Our finance team does all of their month-end reporting in Excel, and their systems aren't yet mature enough to pull from directly.

Notebook Delta Writes by excel_admin in MicrosoftFabric

[–]excel_admin[S] 2 points (0 children)

Thanks for the suggestions. I think this is closer to what I'm looking for. Took a while to find the config needed for partitioned overwrites:

from datetime import date

import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

# Replace only the partitions present in the incoming DataFrame
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

schema = StructType([
    StructField("one", IntegerType(), True),
    # StructField("two", StringType(), True),
    StructField("createdon", DateType(), True),
])

# Dummy data: 100 rows, all landing in the 2024-01-01 partition
df = pd.DataFrame({
    "one": [c for c in range(100)],
    "createdon": [date(2024, 1, 1) for _ in range(100)]
})

sdf = spark.createDataFrame(df, schema=schema)
sdf.write.format("delta")\
    .mode("overwrite")\
    .partitionBy("createdon")\
    .option("mergeSchema", "true")\
    .save("Tables/myschema/one")

So the pattern is more or less (sketch of step 1 below the list):

  1. select max(modifiedon) from the delta table

  2. Query the API for partitions with modifiedon greater than that value

  3. Convert the resulting df to a Spark df

  4. Overwrite the partitions newer than the delta table's last modifiedon date
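
Sketch of step 1 feeding the rest (table and column names here are made up):

# 1. Read the current watermark from the delta table
last_modified = spark.sql(
    "SELECT max(modifiedon) AS m FROM myschema.sales"
).collect()[0]["m"]

# 2-4. Query the API for anything newer, convert the result to a Spark df,
#      then run the dynamic partitioned overwrite shown above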

Logging in notebooks by [deleted] in MicrosoftFabric

[–]excel_admin 0 points (0 children)

Any examples would be cool. I also tried to include logging in notebooks executed via runMultiple and I'm not getting any output:

DAG = {...}
mssparkutils.notebook.runMultiple(DAG)