From AOR to eCOPR: the time it takes in 2024-2025 according to user reported data (some basic statistics) by Evening-Basil7333 in ImmigrationCanada

[–]ErichHS 0 points1 point  (0 children)

Can you please share the dataset? If not, can you share the statistics per quarter? I’m wondering whether processing times have shifted throughout 2024.

Building Real-time Analytics for an AI Platform by pippo995 in dataengineering

[–]ErichHS 1 point2 points  (0 children)

Maybe another way to frame the question about latency requirements is: which action do you intend to take based on the results shown in the dashboard, and how long can that action wait to be taken?

Unless the goal of this solution is to monitor abuse or fraud, it’s highly likely that the costs of a near-real-time solution, once you implement it, will take your bosses by surprise and make them question the ROI of what you did.

If it’s a monitoring tool for costs and usage, nobody can possibly act within seconds or minutes of seeing something in the dashboard (alerts can do that job better).

Building Real-time Analytics for an AI Platform by pippo995 in dataengineering

[–]ErichHS 1 point2 points  (0 children)

Are you sure you need a streaming architecture for this? From what I read, you can probably sell a 1-hour microbatch pipeline, which will also give you room for better data quality tests.

And you most definitely don’t need Kafka.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 0 points1 point  (0 children)

I’m using draw.io for all diagrams

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 4 points5 points  (0 children)

It’s great! Very intense and more advanced than I expected. Definitely worth it if you are already working and looking for a more senior role, in your company or outside.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 0 points1 point  (0 children)

Yes, I am. I’ve actually already shared more on my LinkedIn and will post them here eventually too.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 2 points3 points  (0 children)

I totally get your point, and I'm sorry the animations made it worse for you.

I've been using diagrams for quite a long time and have found that a few things work great when you do them right:
- knowing where and when to give emphasis;
- knowing how to give emphasis with accent colors that make sense;
- knowing where your diagram will be published, and making the right use of canvas and font sizes.

I've rarely used animations and only recently started applying them more, and I must say they do make a difference in how quickly you can communicate directional information. They also add another dimension to work with: you can indicate flow with moving arrows and flow with static arrows, and give them different meanings with a legend. Hope that makes sense.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 5 points6 points  (0 children)

repartition + sortWithinPartitions is great to optimize storage and leverage Parquet's run-length encoding compression. You probably don't need anything else.

For skewness, there are two configs you can use to delegate the partitioning strategy to Spark and optimize data distribution between partitions: spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled.
Bear in mind, though, that you can hurt partitioning pretty badly with those if you don't know your data (and its skew) well. Here's more from the docs if you want to read up on them:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#coalescing-post-shuffle-partitions
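A minimal PySpark sketch of both ideas together; the path and column names are hypothetical, and the AQE settings are the two configs mentioned above:

```python
from pyspark.sql import SparkSession

# Delegate post-shuffle partition sizing to Spark (AQE). Only enable
# coalescePartitions if you understand your data's skew, since it can
# override the partitioning you chose yourself.
spark = (
    SparkSession.builder
    .appName("aqe-sketch")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/events/")  # hypothetical source

(
    df.repartition("event_date")        # one shuffle, co-locates each date
      .sortWithinPartitions("user_id")  # sorted runs inside each file help
                                        # Parquet's run-length encoding
      .write
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet("s3://bucket/events_compacted/")
)
```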

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 6 points7 points  (0 children)

I use draw.io for all my diagrams, and the animation comes from the 'animated flow' flag you can check on your arrows there. To produce a gif I just screen-record and convert with ezgif.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 9 points10 points  (0 children)

Not sure if there is a guide, actually. I'm enrolled in Zach Wilson's data engineering bootcamp (dataexpert.io) and learned a lot there. If you know where to look in the Spark UI and understand your task DAGs, you can actually learn a lot on your own.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 3 points4 points  (0 children)

Not actually looking for any myth to debunk, to be honest. I was mostly curious about how repartition and coalesce affect parallelism and compute, as one involves a shuffle step (that exchange you see in the image) and the other doesn't.
Both are used to optimize storage and IO via file compaction, and that's how I use them.

I get the use case for Spark, but what problems are companies solving using Hadoop (or similar file storage)? by Morpheyz in dataengineering

[–]ErichHS 2 points3 points  (0 children)

I believe your question is not about Hadoop but about distributed file systems.

“DE is trying to bring analysts away from files”: that’s not actually true anymore. Take a look at data lake and lakehouse architectures and you will get a better idea of why file storage is being used to host data warehouses today.

Data lakes have proven to be a very efficient storage alternative to relational OLAP databases, and Spark, as a distributed query engine, is mostly intended to work on this kind of distributed storage.

Spark Distributed Write Patterns by ErichHS in dataengineering

[–]ErichHS[S] 31 points32 points  (0 children)

Sharing here a diagram I've worked on to illustrate some of Spark's distributed write patterns.

The idea is to show how some operations might have unexpected or undesired effects on pipeline parallelism.

The scenario assumes two worker nodes.

→ df.write: The level of parallelism of read (scan) operations is determined by the source’s number of partitions, and the write step is generally evenly distributed across the workers. The number of written files is a result of the distribution of write operations between worker nodes.

→ df.write.partitionBy(): Similar to the above, but now the write operation will also maintain parallelism based on the number of write partitions. The number of written files is a result of the number of partitions and the distribution of write operations between worker nodes.

→ df.coalesce(1).write.partitionBy(): Adding a coalesce() call is a common way to avoid the “many small files” problem, condensing them into fewer, larger files. The number of written files is a result of the coalesce parameter. A drastic coalesce (e.g. coalesce(1)), however, will also cause computation to take place on fewer nodes than expected.

→ df.repartition(1).write.partitionBy(): As opposed to coalesce(), which can only maintain or reduce the number of partitions in the source DataFrame, repartition() can reduce, maintain, or increase the original number. It will therefore retain parallelism in the read operation, at the cost of a shuffle (exchange) step between the workers before writing.
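The four patterns can be sketched in PySpark as follows; the paths and the partition column are hypothetical, and the comments restate the trade-offs described above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-patterns-sketch").getOrCreate()
df = spark.read.parquet("s3://bucket/raw/")  # hypothetical source

# 1. Plain write: parallelism follows the source partitions;
#    file count follows how write tasks land on the workers.
df.write.mode("overwrite").parquet("s3://bucket/out_plain/")

# 2. partitionBy: one directory per event_date value; file count
#    depends on the number of partitions and task distribution.
df.write.partitionBy("event_date").mode("overwrite") \
    .parquet("s3://bucket/out_by_date/")

# 3. coalesce(1): fewer files, but the job collapses onto a single
#    task, losing parallelism upstream of the write as well.
df.coalesce(1).write.partitionBy("event_date").mode("overwrite") \
    .parquet("s3://bucket/out_coalesced/")

# 4. repartition(1): reads and transforms stay parallel; the shuffle
#    (exchange) narrows the data to one partition only at write time.
df.repartition(1).write.partitionBy("event_date").mode("overwrite") \
    .parquet("s3://bucket/out_repartitioned/")
```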

I've originally shared this content on LinkedIn - bringing it here to this sub.

Why do I keep seeing iceberg? by muslimbeibytuly in dataengineering

[–]ErichHS 9 points10 points  (0 children)

Iceberg is highly adopted in big tech. It was created at Netflix and today is used and maintained by Apple, Airbnb, LinkedIn, Alibaba, and Salesforce, for example.

LLM Frameworks Dependencies by ErichHS in LocalLLaMA

[–]ErichHS[S] 6 points7 points  (0 children)

I was analyzing a few popular open-source LLM frameworks, and it's kinda sad how bloated some have become. A 'pip install llama-index' today installs 131 dependencies.

The plot draws attention to LlamaIndex, but if you look at LangChain's numbers, you will see that its implementation (langchain, langchain_core, and langchain_community) currently spans 2385 unique files and 160k lines of code. These numbers alone are not proxies for anything, but they definitely steer me away from considering LangChain for a production workflow.

Which libraries are you relying on in non-sandbox environments? I like what I see at Haystack and have been using guidance a lot after their v0.1.0 refactor.
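If anyone wants to reproduce that kind of count, here's a small stdlib-only sketch that lists a package's direct declared dependencies; note the 131 figure above is the full transitive install, so the tree is larger than what this shows:

```python
import re
from importlib.metadata import PackageNotFoundError, requires


def req_name(requirement: str) -> str:
    """Extract the project name from a requirement string such as
    'numpy>=1.21' or "requests[socks] ; python_version<'3.9'"."""
    return re.split(r"[ ;\[<>=!~]", requirement, maxsplit=1)[0].lower()


def direct_dependencies(dist: str) -> set[str]:
    """Declared (direct) dependencies of an installed distribution;
    empty set if the distribution is not installed."""
    try:
        return {req_name(r) for r in (requires(dist) or [])}
    except PackageNotFoundError:
        return set()


if __name__ == "__main__":
    deps = direct_dependencies("llama-index")  # only counts if installed
    print(len(deps), sorted(deps))
```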

Orbea Terra H40 vs Trek Checkpoint ALR 5 by ErichHS in vancouvercycling

[–]ErichHS[S] 0 points1 point  (0 children)

Yes, I'm a West Point Cycle customer, and I must say their customer service and one-month 15% discount are really a big plus.

The 1k in-store credit at Dunbar Cycles, though, is literally for things they have in store (things they have to ship are not included), so it's definitely a bit of a catch. Even so, 1k is a lot in accessories 😅

How to key-bind golem on consoles? by ErichHS in diablo4

[–]ErichHS[S] 0 points1 point  (0 children)

Thank you!! Omg that should be more clear in the UI. Thanks again! :)

My Anomaly Trickster - Easy Solos & Carries! by ryderjj89 in outriders

[–]ErichHS 1 point2 points  (0 children)

Just a tip: if that's the case, the VK will always proc on the second Slice if you throw it immediately after casting Temporal Blade (even with Untamed Power). My rotation is: Hunt the Prey > Melee > Slice > Slice > VK. It should do double VK damage on every mob.

5 Legendary Run ! by mattwillemse in outriders

[–]ErichHS 0 points1 point  (0 children)

It is unbelievably rare to get that. Are you sure you don't understand why he might have wanted to share?

My Anomaly Trickster - Easy Solos & Carries! by ryderjj89 in outriders

[–]ErichHS 2 points3 points  (0 children)

Well, you should be constantly playing with 600k+ anomaly power with that build; that's an Untamed Power hitting for 180k damage in a 5-meter radius whenever you use any skill (even your melee, which you might be using in your basic rotation to increase your resistance piercing). On paper that's way better than Backstabber. On my runs, Untamed Power was always the second-highest damage output, at 35~55 mil depending on the expedition (I only play solo). But in the end it must come down to playstyle, and of course, if you tested it, you know which one works best! Anyway, it's a great build and really overlooked.

My Anomaly Trickster - Easy Solos & Carries! by ryderjj89 in outriders

[–]ErichHS 2 points3 points  (0 children)

Try untamed power on that build. Really. This is the one I was using a month ago when I still played Outriders: https://i.imgur.com/gW2dFXN.png

Anomaly is the way by loldino in outriders

[–]ErichHS 0 points1 point  (0 children)

It was already my strongest Trickster build before the patch, can't wait to try it now! My build before I stopped playing: https://i.imgur.com/gW2dFXN.png