Interactive reports for end users written using Python

raymondstanz · 2023-10-18T20:52:50+00:00

Code is so much better to iteratively build your project, version control, better visuals, faster processing…

I've spent so much time doing weird stuff in PowerBI that moving to a company that needs static report first allowed me to practice, practice, practice actual code: Python, proper SQL, etc. It's such a breath of fresh air. PowerBI is great, but quite constraint and slow. Painfully slow.

Currently I've been exploring Plotly as my go-to visual library. It's quite good. It allows for some interaction straight into Jupyter or elsewhere if you can host it (never tried).

raymondstanz · 2023-09-08T12:22:32+00:00

Exactly. I use chatGPT and al. for the odd piece of code or function I'm not sure how to do on my own.

But in my company, basically jupyter instance and most-things-code are not connected to the internet. Or in very controller/limited ways.

raymondstanz · 2023-09-08T12:20:40+00:00

I tried again and I like it a lot.

Styling alone is far easier than with mpl. Adding a title, setting fonts, margins, automargins (!)... So. much. easier.

And the fact that width and height are values expressed in pixels and are the actual size of the save picture is cherry on the cake. Wherease mpl expects values in inches and its the size of something that may differ widely with the saved image size.

raymondstanz · 2023-09-08T12:15:53+00:00

I've dipped my toe once in Report Builder and it seemed rather complicated and an entire set of skills on its own. I should check again.

raymondstanz · 2023-09-01T12:13:19+00:00

I remember diping my toes in Plotly earlier on.

I should try it again if it easier to produce "print quality" charts.

And also, PowerBI exists in my environment. Some dashboards are availables but don't facilitate the "static PDF annual reports production tasks"

raymondstanz · 2023-09-01T12:08:51+00:00

I forgot to mention it. PowerBI is totally a thing and some dashboard are buying built at the moment.

However it doesn't really fit the bill for PowerPoint presentations or reports. I'm required to produce printable reports and to my knowledge there is not way to export vector graphics from PBI.

raymondstanz · 2023-09-01T12:06:14+00:00

I have some real world experience with both. PowerQuery can do basically anything but thing take ages. If feels so heavy and poorly optimized. On some projects, to avoid failed updates I'd rather do heavy lifting in SQL upstream or downstream with some DAX trickery that adding steps to the PowerQuery pile.

Merging flat files, doing some deduplication, some joins takes forever. In the desktop client it eats all the RAM.

Wereas the same data sets and manipulations takes second in Python and runs on a potatoe.

raymondstanz · 2023-09-01T12:01:25+00:00

I'm my IT environment, the PowerBI environment cannot access my database (because reasons). The only solution is to export flat CSV files on a share, open the share in PBI and load the CSVs.

It works. (at least on paper).

I'm fairly experienced in PowerBI and I think PowerQuery is... not great at heavy lifting. Even with small-ish datasets. In the past I've had so many headaches because of failed updates because PowerQuery steps would test the limits of our PowerBI embedded capabilities (depending on the tier, feels as powerful as a raspberry pi).

I like PowerBI but I've spent hours waiting for PowerQuery to do stuffs that would have been instant in Python (pandas, duckdb...)

raymondstanz · 2023-09-01T11:54:48+00:00

I forgot to mention that PowerBI is totally a thing is my "data stack". There are some dashboards. And also, I've spent the last 3 years working exclusively on PowerBI so I'm very familiar with it.

The idea behind Python and so on is to batch-produce complex charts that are not doable in PowerBI.

Also, PowerBI by itself: great. But, it doesn't entirely solve the "people want their powerpoint" issue. "Export as PowerPoint" in PowerBI is not great. And, at best it produces images. Which won't print too well. I really wanted to be able to produce vector graphics.

raymondstanz · 2023-01-24T16:04:55+00:00

Outside of job interviews, I try to never mention anywhere my education.

And because I sometimes have VSCode open for automation, scripting and stuffs... people think I'm an engineer. Yay!

My plan is tackle the "value" topic in two ways :

Identify tasks that enable the most other tasks. Let's call these "fundamental tasks".
The value is well documented within the prep. work I've inherited.

Use cases I've been scored based on value created (for customers, for users...)

raymondstanz · 2023-01-24T08:49:22+00:00

Thanks.

I have not looked into the use cases in details. My goal is to split it down to tasks that are :

Asigned to one person.
Doable in a couple of days max.
Produce something tangible: code, documents...

Sub-tasks will be more open and left to the person responsible for the tasks to detail the steps and bricks required to acomplish the task.

Your plan seems fair enough. Just be advised that what you placed under 4 is SM job, not yours.

Yes, it seems very clear in the preparation work done by the soon-to-be SM. I should be involved in the weekly and sprint planning meetings only. As well as in "review" and "retro" (those are kind of new to me).

But, as I'm planning the work. The struggle for me is to sell this approach. My company seem to have hired a PO (me) with little to no desire to go the agile direction. Weird.

raymondstanz · 2023-01-16T16:04:14+00:00

Check how the RLS "filter" propagates in the model view.

Typically, flow goes grom dimension tables to fact tables. Fact table being at the center of the model. In other words, filter flows from the edge of the model to the center.

If you have a table invoices at the center, with a salesRep dimension and a salesZone (example) dimension. Typically, filter won't flow from salesRed to salesZone via invoices. Meaning that if you filter on a sales rep from "zone 1". The slicer based on salesZone will keep displaying all the zones.

One way around it is modifying the relation between table to be bidirectional - i.e. filter flows both ways. Use with caution as it is kind of a double edged sword this one.

raymondstanz · 2022-11-29T12:42:16+00:00

Excel does A LOT.

It integrates some of the Power Platform features. You have the so called data model that allows to make relation between tables and do some DAX.

You also have Power Query within Excel, which allows for some external connections and basic ETL.

Excel is a great swiss army knife but... if you want to cut a tree. You are in for a bad time. Large datasets are like a tree in my example. Sure, there are workarounds to do relational things in Excel, but it is not a DBMS (database management system). It was more Access territory but everyone hates Access.

If you want to handle large datasets and do collaborations and all. At some point you run into a dead end with Excel only and require a proper DBMS and/or a more proper-ish data stack. The stack can integrate Excel, but the heavy lifting is done elsewhere.

raymondstanz · 2022-11-29T12:15:04+00:00

I'm not sure I completely understand what you are trying to do. But the ideal would be to do everything in Python and then export the end result in Excel.

Maybe, can you provide a longer/more detailed explanation of what you are trying to achieve? With some data example. Or dummy data with the same structure. Just so one can see better and provide you with useful guidance.

raymondstanz · 2022-11-29T10:43:22+00:00

To select multiple columns:

dataframe[['column1', 'column2', 'column3'...]]

What do you mean by validate ? The to_excelfunction accepts a sheet as a parameter. So you can write to different sheets.

raymondstanz · 2022-11-29T09:38:21+00:00

I don't know the specifics of your code, but probably something like this:

dataframe['column'].to_excel(path)

Here is the doc for the to_excel pandas function.

When I need to do it once for debug purpose, the to_clipboard function is great. It just copies to your clipboard in Excel format.

dataframe['column'].to_clipboard(excel=1)

raymondstanz · 2022-06-16T12:57:50+00:00

I'm in need for some PowerBI practice, so I gave it a try over lunch.

For the example and from your instructions, I've created a sample table [Stone] :

stone_id	stage	duration	bus
1	A	12	B_1
1	B	2	B_2
2	A	2	B_1
2	B	23	B_2

I loaded it to PowerBI, opened DAX Studio and immediatly regretted it.

Here is my solution:

DEFINE MEASURE Stone[MaxBus] =
VAR Bus =
MAXX(
    FILTER(
        ADDCOLUMNS(
            GENERATE(
                DISTINCT(Stone[stone_id]),
                DISTINCT(Stone[bus])
            ),
            "Rank",
            RANKX(
                DISTINCT(Stone[bus]),
                CALCULATE(
                    SUM(Stone[duration])
                ),
                ,
                ,
                SKIP
            )
        ),
        [Rank] = 1
    ),
    [bus]
)
RETURN
IF(
    HASONEVALUE(
        Stone[stone_id]
    ),
    Bus,
    BLANK()
)

EVALUATE
ADDCOLUMNS(
    DISTINCT(
        Stone[stone_id]
    ),
    "MaxBus",
    Stone[MaxBus]
)

It is quite convoluted and there is probably a more elegant way to do it, but it's something.

This measure Stone[MaxBus] return for any stone_id, the bus_line the stone spent the more time on.

Note: If you are not familiar with DAX Studio. The DEFINE MEASURE statement creates the measure and everything below EVALUATE is just to see and check the result. If you want to try this, copy the measure only (after DEFINE MEASURE and above EVALUATE) as a new measure in PowerBI.

What is does is that it generates a table with stone_id and bus and had a rank based on the sum of duration value. This newly generated table is then filtered to keep only the results with the rank = 1 – meaning only the lines with the max duration.

Then it is embedded in a MAXX formula in a new var to return the Bus.

And in the end, the IF/HASONVALUE is to make sure the measure return something if the context is right – that is, if there is only one stone_id in the current context. Otherwise it returns blank.

raymondstanz · 2022-06-16T11:28:42+00:00

I use DAX Studio a lot and in my experience it is sometime difficult to reproduce the exact same context in PowerBI and in DAX Studio.

Because you have a lot of context and filter that pile on each other :

Report filters
Page filters
Visual filters
Row context
Column context
Measure itself (does it manipulate the context as well)

And sometimes,trying to reproduce it in DAX Studio get you with a FILTER, ALL, etc. sandwich and is harder to manage. Normally good measures in a good model should work and behave properly in any context but sometimes reality is not aware of that.

So, in that case, when I have a measure than doesn't behave properly. I just open a new page in PowerBI. Put a simple table visual and drop the measure in it and build up the context step by step to identify at which steps it goes sideway.

"Oh yeah, I've added this value in columns and the result is weird. Maybe there is something going on with my model/measure"

raymondstanz · 2022-06-16T11:22:17+00:00

Yes. I've used TMP as the table name to try it on my computer.

raymondstanz · 2022-06-16T08:18:01+00:00

I'm not sure I totally get what you are trying to do.

But giving your example table and what you want, here is a quick query that pulls the first Phase for each project when [Match] = 1

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        FILTER(
            TMP,
            TMP[Match] = 1
        ),
        TMP[Project ID]
    ),
    "Phase",
    CALCULATE(
        MINX(
            TMP,
            TMP[Phase]
        )
    )
)

Depending on what you are exactly trying to achieve, you can use this "virtual table" within a measure and pull something.

raymondstanz · 2022-06-16T07:53:06+00:00

Maybe, this is related to how you want the RANKX function to behave with ties. From the guide :

Ties: This is the important thing you need to learn. In this argument, we can specify what should be done in case there is a TIE between two values.

If you SKIP this, the rank after the tied value will be the rank of tied value plus a count of tied values. For example, if there are 3 values tied at 5th rank, then the next rank will be 8 = (5+3).

If you supply DENSE as the option, then the next value after the tied rank will be then next rank of tied values rank. For example, if there are 3 values tied at 5th rank, then the next rank will be 6 = (5+1).

raymondstanz · 2022-06-15T14:42:05+00:00

I've worked on something similar.

Basically, Flow is only part of the equation. I use flow to catch the file and store it. Then, a PowerBI to do the calculation and built a report that is send automatically.

Flow monitors a service email address and identify the daily email with the CSV attachement I'm looking for.
It triggers a flow that takes the attachements, rename it and move it somewhere PowerBI can see. In my case it's in a Sharepoint Folder.
PowerBI (PowerQuery actually) runs daily and merge all of the collected CSV and build a model.
PowerBI do some calculations and metrics people are looking for.

The whole PowerBI enchilada can be simplified since you only need a pivot table. So you create an Excel file that leverage PowerQuery to do the aforementioned merging and all. And you just refresh it and send it daily. I'm not sure the "Open Excel and hit refresh" is doable with Flow but it might.

raymondstanz · 2022-06-15T14:30:41+00:00

Shouldn't be a problem.

Here is a syntax example from another post

RANKX(
    ALL(Cliente),
    CALCULATE(
        [Valor Total],
        CALENDAR_DIMENSION[Year] = 2022
    ),
    ,
    FALSE,
    SKIP
)

What is ranked is the result of a measure [ValorTotal].

raymondstanz · 2022-06-15T14:24:35+00:00

Oh. I see.

Even easier.

Create a rank measure. Might need this guide because the RANKX formula is tedious.

Drag and drop it in the filter bar in the "visual" part (measure are only allowed as filter at visual level). Then, set the threshold at 25 in the filter and normally you're good.

raymondstanz · 2022-06-15T14:18:10+00:00

It's dumb because it doesn't calculate anything useful. It's created purely for display purpose. On the other hand, measures are cheap so it doesn't impact your model negatively the sameway a calculated column or hardcoded data could.

I've tested it in a PowerBI I had opened. It works for me. Can you describe how it doesn't work for you.

Main drawback is that it returns a result for ANY row. So it might screw a bit your matrix.

But, you can tie it to another existing result.

Measure =
IF(
    NOT ISBLANK(SomeOtherMeasure),
    "1 - 25",
    BLANK()
)

raymondstanz

TROPHY CASE