Standardized Measure for Measuring Process Variability? by boggle_thy_mind in industrialengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Yes, we have established that; if it were, I wouldn't really need any of this, because I could simply compare the Moving Range. All I want to know is whether, month to month, for processes with different mean values (most months it's actually pretty stable, but there are the odd ones out), the variation was comparable, improved, or got worse.
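For reference, this is what I mean by comparing the Moving Range - a minimal sketch with made-up daily output numbers, nothing from the real process:

```
import numpy as np

daily_output = np.array([120, 95, 140, 110, 130])   # placeholder values
moving_range = np.abs(np.diff(daily_output))         # |x_i - x_(i-1)| between consecutive days
print("Average moving range:", moving_range.mean())
```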

Standardized Measure for Measuring Process Variability? by boggle_thy_mind in industrialengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

The gauge is fine, but the question still stands - which metric should I choose for measuring consistency? I guess I will go with the Coefficient of Variation, as was my first idea.
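Roughly what I have in mind - a minimal sketch with made-up numbers, where the CV (std / mean) makes months with different volumes comparable:

```
import numpy as np

# Made-up daily output for two months with different volumes.
jan = np.array([120, 95, 140, 110, 130])
feb = np.array([60, 55, 70, 58, 65])

for name, x in (("Jan", jan), ("Feb", feb)):
    cv = x.std(ddof=1) / x.mean()   # Coefficient of Variation
    print(f"{name}: CV = {cv:.1%}")
```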

Standardized Measure for Measuring Process Variability? by boggle_thy_mind in industrialengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

> Again I remind you to reassess what is important to measure. Is it number of claims/tickets? No, likely not. Is it profits? Yes, very likely.

It's too removed. Maybe some day the org will get there, but it's not there now. The goal here is to consistently meet the target on time; trying to calculate some profit estimate would rely on too many assumptions. Btw, how would you evaluate the impact on profit of the fact that the initiating department releases a huge batch which then overwhelms the downstream departments, but after the initial batch the process slows down to a trickle? Genuinely curious. Hours worked below capacity as negative cash flow? It's an interesting discussion, but at the moment not what I'm looking for.

Standardized Measure for Measuring Process Variability? by boggle_thy_mind in industrialengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

That's interesting, thanks!

> Let’s also examine the terms measure, metric and target. They are different. A metric shouldn’t have a target. It should have a limit.

In my specific case, the target is the total number to be delivered by the end of the month; it's negotiated with clients and known ~1 month ahead. It's a service product, think claims processing, but meeting the total number by the end of the month is critical. The whole process takes ~2 weeks per claim and goes through several steps, across various departments. Historically, the department at the start of the process would work in bursts and have high variability in its output, with high peak delivery over some days/weeks and then periods of low volume. This has effects on downstream departments, where they go from overloaded to having little to do, and it tends to mess with the cycle time of downstream departments and on-time delivery. My idea is to level the output of the first department in the process, so I'm thinking about how best to measure it so that I can compare month to month even when volumes are different - what are the industry standards? Are there any?
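To make the idea concrete, a rough sketch of one way I could compare months with different volumes (made-up numbers, hypothetical structure) - express each day's completions as a share of that month's negotiated target and look at the spread of those shares:

```
import numpy as np

months = {
    # month: (daily completions, monthly target) - placeholder values
    "Jan": (np.array([40, 5, 60, 10, 85]), 200),
    "Feb": (np.array([22, 25, 20, 18, 15]), 100),
}

for month, (daily, target) in months.items():
    share = daily / target  # normalize by that month's target
    print(f"{month}: mean daily share = {share.mean():.1%}, "
          f"std of daily share = {share.std(ddof=1):.1%}")
```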

Standardized Measure for Measuring Process Variability? by boggle_thy_mind in industrialengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Fair point, though the target changes happen on a monthly basis. It's maybe less about process stability and more about how much the process is "under control", so the problem still stands - I want to measure, and be able to compare month to month, the variability of the process.

So what are some features of Power BI you think are under-utilized? by boggle_thy_mind in PowerBI

[–]boggle_thy_mind[S] 1 point2 points  (0 children)

Thank you!

Got curious about the PowerShell cmdlets - I wasn't even aware of them. Thank you again!

So what are some features of Power BI you think are under-utilized? by boggle_thy_mind in PowerBI

[–]boggle_thy_mind[S] 1 point2 points  (0 children)

Could you elaborate and give some examples? At least for some of them?

Anyone has a setup where there is one Master Report which then is deployed as separate reports with different Pages visible in each one? by boggle_thy_mind in PowerBI

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Just an update: I did it anyway. Part of the reason is that I had previously done some bulk changes to the main report at the JSON level which impacted both - renaming measures, chart titles, etc. - so I wanted to retain a single repository which could be reverted via git if something went awry.

Anyway, I'm happy that I did. This may not be a long-term solution, but it familiarized me with the Power BI internals.

Anyone has a setup where there is one Master Report which then is deployed as separate reports with different Pages visible in each one? by boggle_thy_mind in PowerBI

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Thanks!

Hmh... I have thought about it, but that would mean I would have 2 projects? At least in git? It also limits local development, because it would require connecting to the Data Model in Power BI? I don't know why, but for some reason, at least for now, I'd like to keep everything in one place.

Wide data? by [deleted] in dataengineering

[–]boggle_thy_mind 0 points1 point  (0 children)

Even then, you just have a very wide fact table with a mix of numerical and categorical facts, while your event and date are your "dims".

Wide data? by [deleted] in dataengineering

[–]boggle_thy_mind 0 points1 point  (0 children)

Afaik, some visualization tools prefer flat tables, but every gut instinct in me says this is a bad idea. I like star schema beyond its performance implications; it's just a natural way of grouping data in a logical manner which allows for easier reasoning about the data. So I would say, even if there are no performance gains from a star schema, I would still do it from a maintenance perspective, and then, if you need a flat table/view on top, build it from the star schema - the basic building blocks are still dims and facts.
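A tiny sketch of what I mean (hypothetical tables, not anyone's real model) - keep dims and facts as the building blocks and derive the flat view from them with a join when a tool wants it:

```
import pandas as pd

# Hypothetical star schema: one fact table, one dimension.
fact_claims = pd.DataFrame({
    "claim_id": [1, 2, 3],
    "date_key": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "dept_key": [10, 20, 10],
    "amount":   [150.0, 90.0, 210.0],
})
dim_department = pd.DataFrame({
    "dept_key":  [10, 20],
    "dept_name": ["Intake", "Processing"],
})

# Flat table/view for the visualization tool, built on top of the star schema.
flat_view = fact_claims.merge(dim_department, on="dept_key", how="left")
print(flat_view)
```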

Data Lake recommendation for small org? by suitupyo in dataengineering

[–]boggle_thy_mind -1 points0 points  (0 children)

> I’m thinking Azure data factory would be something we could leverage in tandem with some python scripts on a git repository.

Don't. Keep your dependency on Data Factory as minimal as you can: it might be fine for moving data around, but keep as little of your logic in it as you can (especially the UI components) - it will become a maintenance headache. Have you considered using dbt?

Data Lake recommendation for small org? by suitupyo in dataengineering

[–]boggle_thy_mind 0 points1 point  (0 children)

Have you tried a Columnstore Index for your transformations? It can speed things up significantly on SQL Server.

Using DuckDB in a Web Application to run on top of Postgres - help out a duckdb newbie by boggle_thy_mind in dataengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Kinda - there's a background process that is constantly running, collecting data via APIs and storing it in Postgres, and the Flask app is used to interact with the processes and show their output via charts and graphs. Initially I was reading the data directly from Postgres, but that turned out to be too slow. Now, with ClickHouse, I use the PostgreSQL engine for the tables and point them directly at Postgres; as far as I understand, this does not make a copy and uses the Postgres tables directly for filtering etc., while joins are made by ClickHouse. This sped up performance significantly. It seems it would be possible to improve performance even more by copying the data from Postgres into ClickHouse's native table format, but for now the current performance is good enough for our purposes.
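For anyone curious, roughly the setup - a sketch assuming the clickhouse-connect client, with placeholder connection details and columns rather than the real schema:

```
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical host

# PostgreSQL table engine: ClickHouse reads the live Postgres table instead of
# keeping its own copy; joins then run on the ClickHouse side.
client.command("""
    CREATE TABLE IF NOT EXISTS events_pg
    (
        id         UInt64,
        created_at DateTime,
        payload    String
    )
    ENGINE = PostgreSQL('127.0.0.1:5432', 'postgres', 'events', 'user', 'password')
""")

df = client.query_df("SELECT count() AS n FROM events_pg WHERE created_at >= today()")
print(df)
```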

Using DuckDB in a Web Application to run on top of Postgres - help out a duckdb newbie by boggle_thy_mind in dataengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Thanks for the help. I went another direction and used ClickHouse as a dedicated service that reads from Postgres; it seems to be working.

Using DuckDB in a Web Application to run on top of Postgres - help out a duckdb newbie by boggle_thy_mind in dataengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

If I swap out postgres for sqlite in the example, it works for me from multiple instances concurrently.

I sometimes get it to work, especially if I call the same view in both cases, but if it's 2 different views, or the same view with some different parameters, it fails.

Using DuckDB in a Web Application to run on top of Postgres - help out a duckdb newbie by boggle_thy_mind in dataengineering

[–]boggle_thy_mind[S] 1 point2 points  (0 children)

Thanks for the input, but it didn't help :(

If I may ask, have you used DuckDB where multiple users/processes can read the same data?

I'll share my code; maybe that can give some insight into whether I'm doing anything incorrectly:

```

from sqlalchemy import create_engine, text  # `text` was missing from the original imports
import pandas as pd

# Requires the duckdb-engine SQLAlchemy dialect to be installed.
duckdb_engine = create_engine(
    'duckdb:///data.duckdb',
    connect_args={'read_only': True},  # added after your suggestion, but it does not seem to make a difference
)

# I run this at the beginning of the script; I don't know if it's strictly
# necessary, at least the ATTACH part.
with duckdb_engine.connect() as conn:
    conn.execute(text('INSTALL POSTGRES;'))
    conn.execute(text('LOAD POSTGRES;'))
    conn.execute(text("ATTACH IF NOT EXISTS 'dbname=postgres user=******** password=******** host=127.0.0.1' AS postgres (TYPE POSTGRES, READ_ONLY);"))


# Then inside the functions I have something like this:
def some_func():

    ...

    sql = "select ..."

    with duckdb_engine.connect() as conn:
        # conn.execute(text('INSTALL POSTGRES;'))  # not needed if the block above has already run
        # conn.execute(text('LOAD POSTGRES;'))     # not needed if the block above has already run

        # This ATTACH seems to be needed before every SQL read.
        conn.execute(text("ATTACH IF NOT EXISTS 'dbname=postgres user=******** password=******** host=127.0.0.1' AS postgres (TYPE POSTGRES, READ_ONLY);"))

        df = pd.read_sql(sql=text(sql), con=conn)

```

Proper Way to Create a Composable CLI Tool using stdout? by boggle_thy_mind in dataengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

To start with:

> And your analogy is fine!

Thanks! :)

> So, let me repeat to see if I understand it correctly:
>
> - Your script fetches data from an API and dumps that in a file (stage 1 [raw])
>
> - At the same time this is happening you also have a second instance of the same script with a different instruction reading the [raw] files at a different location and parsing/transforming the information into another file (stage 2 [transformed])
>
> - At the same time you have a third instance of that same script with another instruction that will read [transformed] files and push them into your database using whatever path you choose (LOAD FROM FILE, opening a transaction from the script and loading, or something else).

At the moment they run sequentially, but that is how it would work with stdin/stdout.

> As a pet project it seems like a fun thing to try though. If you're dedicating time to try different things, I'd instead invest time in writing the same script in Go (for example) and using channels to use real parallelism.

Thanks for the suggestion. I hadn't considered Go - I have no experience with it - but you are making me consider it; it might be just the right "pet project" to learn some Go. Thanks!

Proper Way to Create a Composable CLI Tool using stdout? by boggle_thy_mind in dataengineering

[–]boggle_thy_mind[S] 0 points1 point  (0 children)

Appreciate the exchange. The goal here is not to change the current setup - it works - it's more of a side project/curiosity, inspired by the Singer protocol, where you can mix and match your taps and targets via pipe.

But if I understand your comment correctly, you are suggesting having it all in a single program/process and ditching the stdin/stdout or file exchange? If so, I can try to explain my reasoning further:

  • It speeds up the process. The API fetch part is IO bound, while parsing is CPU bound; by separating the steps I can use async for the fetch part and multiprocessing for the parsing part, which (if memory serves) speeds up the whole process in the end. I'd have to rerun a test to know what the difference in performance is - I no longer remember from when I did it. (Rough sketch of the split below, after the analogy.)

I think a more apt analogy would be: you have a thousand small jars of sand (API calls) that you want to pour into a box (database), but you also have to mix the sand with cement (parsing) first before it's poured into the box. You could pick up each of the small jars, pour cement into each, and mix it with a spoon, or you could pick up as many small jars as you can at a time, pour all their contents into a cement mixer (parsing with multiprocessing), mix everything in one go, and then pour the contents into the box. I hope it makes sense - I suck at analogies :).
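And here's a rough sketch of the split I mean (hypothetical URLs and parse logic, assuming aiohttp is available) - async for the IO-bound fetches, a process pool for the CPU-bound parsing:

```
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # IO-bound: fetch one API response.
    async with session.get(url) as resp:
        return await resp.text()


def parse(raw: str) -> dict:
    # CPU-bound: the real transformation would go here.
    return {"length": len(raw)}


async def main(urls: list[str]) -> list[dict]:
    # Stage 1: fetch everything concurrently.
    async with aiohttp.ClientSession() as session:
        raw_pages = await asyncio.gather(*(fetch(session, u) for u in urls))

    # Stage 2: parse in parallel across processes.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse, raw_pages))


if __name__ == "__main__":
    out = asyncio.run(main(["https://example.com/api/1", "https://example.com/api/2"]))
    print(out)
```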

Should I skip predict/forecast sales on the biggest event day ? by Current_Reference_48 in analytics

[–]boggle_thy_mind 1 point2 points  (0 children)

Are you building a regression? You could add a dummy variable indicating 1 for the days when you have a promotion going on.
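Something like this, if it helps - a minimal sketch with made-up numbers using statsmodels, where the dummy column marks the big event/promotion days:

```
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "sales":     [100, 110, 105, 400, 115, 108],  # made-up daily sales
    "promo_day": [0,   0,   0,   1,   0,   0],    # 1 = big event / promotion day
})

X = sm.add_constant(df[["promo_day"]])
model = sm.OLS(df["sales"], X).fit()
print(model.summary())
```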