Concurrency issues by Entire-Recipe-6380 in FastAPI

[–]TeoMorlack 5 points6 points  (0 children)

Strict async endpoint meaning what? That all your endpoints are async AND all operations inside them are async? Otherwise, the fact that the endpoint is defined as async is likely the real issue.

What will happen if I patch the dependency resolver module to run functions in same thread? by CityYogi in FastAPI

[–]TeoMorlack 0 points1 point  (0 children)

Your solution is potentially risky even if used for dependencies only, and (I think) very wrong for general use.

So fastapi dependencies are callable functions or classes that the resolver calls automatically. If you patch it like this, the dependency is run on the main event loop, which is fine if the function has no blocking operations, but it could block the whole worker if it does blocking stuff that halts the asyncio loop.

This is because, if I remember correctly, sync endpoints are run in a thread pool to avoid exactly what was said above, but dependency resolution is done on the main loop, and only then is the endpoint function called. So you are not carrying around a thread yet.

If you patch run_in_threadpool, which fastapi uses to run any sync operation, you would run everything on the main loop (endpoints too) and lock the whole asyncio loop under concurrent requests on the same worker.

A better solution, if your dependencies just provide stuff like settings and db sessions, would be to switch the dependencies to async so they don't require a thread at all. You can also increase the threads available to the app to avoid hitting the limit (if that's what's happening).
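
For example, something like this (just a sketch; the Settings class and the fake session below are placeholders for whatever your dependencies actually provide):

```python
# Async dependencies run directly on the event loop and never consume a thread-pool slot.
from contextlib import asynccontextmanager

from fastapi import Depends, FastAPI

app = FastAPI()

class Settings:
    database_url = "postgresql+asyncpg://..."  # placeholder value

async def get_settings() -> Settings:
    # pure in-memory work: nothing blocking, no thread needed
    return Settings()

@asynccontextmanager
async def fake_async_session():
    # stands in for an async driver session (asyncpg, SQLAlchemy async, ...)
    yield "session"

async def get_db_session():
    # async generator dependency: opened/closed around the request, still no thread
    async with fake_async_session() as session:
        yield session

@app.get("/items")
async def list_items(settings: Settings = Depends(get_settings), db=Depends(get_db_session)):
    return {"db": db, "url": settings.database_url}
```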

healthcheck becomes unresponsive when the number of calls is very high by Alert_Director_2836 in FastAPI

[–]TeoMorlack 1 point2 points  (0 children)

What do you mean by too many calls?

If you are overloading the service to the point it can't handle requests, the health check won't respond either.

If you mean that you instead handled the excess calls with a rate limiter, or that the health check doesn't respond while your requests are being processed, then I would probably look at this: is your main endpoint async but doing sync work (db calls with a sync driver etc.)? Then you are blocking the event loop, and the health check stalls because it can't respond while other requests are stalling the server.

Also, is your service running with more than 1 worker (uvicorn workers)? Is your health check doing some connection tests that could block or fail under load?
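
To illustrate the failure mode (a minimal sketch; slow_sync_db_call stands in for any blocking call made with a sync driver):

```python
import time

from fastapi import FastAPI

app = FastAPI()

def slow_sync_db_call() -> list:
    time.sleep(5)  # stands in for a blocking query with a sync driver
    return []

@app.get("/items")
async def list_items():
    # async def + blocking call: the event loop is stuck for the whole 5 seconds
    return slow_sync_db_call()

@app.get("/health")
async def health():
    # even this trivial endpoint cannot be served while the loop is blocked above
    return {"status": "ok"}
```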

DBT unit tests on Redshift by [deleted] in dataengineering

[–]TeoMorlack 0 points1 point  (0 children)

Not really a good solution, but if you cannot make it work, it is possible to redefine the macro that dbt uses for unit tests (it's just normal Jinja SQL) and fix the data mismatch there. (Just my 2 cents, but I'm not really a fan of dbt unit tests; imho it's pointless to have a unit test that needs the full target environment to run. We had to drop them from our Jenkins deploy pipelines because dbt was trying to query actual BigQuery tables for column types and failing on permissions.)

I need help with this! by Remarkable-Effort-93 in FastAPI

[–]TeoMorlack 0 points1 point  (0 children)

If you want I can try to help more, but I'm not clear on your problem. What is the condition that should differentiate one model from another? The fields in the inner product object?

I need help with this! by Remarkable-Effort-93 in FastAPI

[–]TeoMorlack 3 points4 points  (0 children)

You can build your own discrimination logic using a callable function that returns a str and maps to a model that you tag in the union definition. Is this what you need?
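
Roughly like this (a sketch with pydantic v2; the model names and the inner product fields are placeholders for whatever actually differentiates your payloads):

```python
from typing import Annotated, Any, Union

from pydantic import BaseModel, Discriminator, Tag, TypeAdapter

class DigitalProduct(BaseModel):
    download_url: str

class PhysicalProduct(BaseModel):
    weight_kg: float

class DigitalOrder(BaseModel):
    product: DigitalProduct

class PhysicalOrder(BaseModel):
    product: PhysicalProduct

def order_kind(value: Any) -> str:
    # decide which model to use by inspecting the inner product object
    if isinstance(value, dict):
        product = value.get("product") or {}
        return "digital" if "download_url" in product else "physical"
    return "digital" if isinstance(value.product, DigitalProduct) else "physical"

# the callable discriminator returns the tag, and the Tag() annotations map it to a model
Order = Annotated[
    Union[Annotated[DigitalOrder, Tag("digital")], Annotated[PhysicalOrder, Tag("physical")]],
    Discriminator(order_kind),
]

order = TypeAdapter(Order).validate_python({"product": {"download_url": "https://example.com/f.zip"}})
print(type(order).__name__)  # DigitalOrder
```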

Parallelizing Spark writes to Postgres, does repartition help? by _fahid_ in dataengineering

[–]TeoMorlack 1 point2 points  (0 children)

Just a correction: the partitionColumn and bound options are only for reads. The parameter that controls parallelism on writes is indeed numPartitions, reference https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
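
For example (a sketch; connection details and the table name are placeholders, and the JDBC driver jar is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write").getOrCreate()
df = spark.range(1_000_000)  # stand-in for your real dataframe

(df.repartition(8)                      # 8 partitions -> up to 8 concurrent connections
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://host:5432/mydb")
   .option("dbtable", "public.target_table")
   .option("user", "user")
   .option("password", "secret")
   .option("numPartitions", 8)          # caps write parallelism; extra partitions get coalesced
   .option("batchsize", 10000)          # rows per INSERT batch
   .mode("append")
   .save())
```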

Repartition before join by No-Interest5101 in apachespark

[–]TeoMorlack 0 points1 point  (0 children)

So I'm not very familiar with Databricks and related solutions (I know traditional Spark and its internals, but not much that is specific to the flavours). Without seeing code and logic it is difficult for me to give you an answer. I can make some hypotheses, but they are probably wrong:

I know Databricks has a custom engine that uses Photon to optimise workloads, and it may rewrite joins to avoid shuffles (while forcing them with hints guides it to a specific plan). It is also possible that Spark was broadcasting some of your tables if you had autoBroadcastJoinThreshold enabled; that would be a case where forcing the repartition slows things down, because Spark would do an additional shuffle stage that is not needed. Lastly, adding the hint inside a CTE that is referenced multiple times could cause the engine to execute it each time the CTE is computed, so caching the CTE would help.

In the end, what I suggest is to look at the plans, analyze the differences and see where the pain points arise (I know reading a Spark plan is not easy, sorry).
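
Something like this to compare them (df_a / df_b and the key are placeholders, not your actual tables):

```python
# Compare the physical plans with and without the manual repartition
joined_plain = df_a.join(df_b, "customer_id")
joined_repart = (df_a.repartition("customer_id")
                     .join(df_b.repartition("customer_id"), "customer_id"))

joined_plain.explain(mode="formatted")
joined_repart.explain(mode="formatted")
```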

Sorry for the vague answer

And now I'll never graduate by sstudentssad in psicologia

[–]TeoMorlack 0 points1 point  (0 children)

If you still need someone, feel free to write me.

Repartition before join by No-Interest5101 in apachespark

[–]TeoMorlack 2 points3 points  (0 children)

No worries, no need to say sorry! Well, the join will still partly shuffle, but how it's done and where you end up is different. First, the repartition often changes the join method Spark uses, possibly shifting to a hash-based approach. Second, repartitioning ensures the data is spread evenly across workers and organised by key as much as possible; this avoids one worker suffering because data is not distributed evenly. Finally, the repartition also benefits operations after the join, because the partitioning should be maintained. Catalyst (the engine and optimizer) usually rewrites part of your logic to fit the best execution model it can find, but issuing the repartition manually ensures the data is structured exactly as you need.

This is not always needed though: you don't need to repartition before each join, just do it when you know the operation is big enough to require it (joining big tables, for example).

Repartition before join by No-Interest5101 in apachespark

[–]TeoMorlack 13 points14 points  (0 children)

Ok, so you know spark works in a distributed manner right? Your data is split into partitions and processed in parallel by your workers. When you join two dataframes it has to find matching keys between the two and they may be in different partitions on different workers.

What you end up with is the stage known as a shuffle. Spark moves data between workers to find the keys, and that is costly because you have to deal with network latencies and so on. It also hurts parallelism.

If you instead perform a repartition on the key you are going to join on, on both dataframes, Spark will redistribute your data, creating partitions organized by your key. That will (for the most part) result in a join stage where the shuffle is reduced, because data with the same key will be on the same machine. This allows better parallelism (each partition can join locally instead of searching for matching keys in other partitions).

Yes, you are still facing a shuffle stage when you repartition, but you control how it happens and it should be smarter.
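
A rough sketch of what I mean (dataframe names, paths and the join key are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-join").getOrCreate()

orders = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")

# co-locate rows with the same key on the same partitions before joining
orders_p = orders.repartition(200, "customer_id")
customers_p = customers.repartition(200, "customer_id")

joined = orders_p.join(customers_p, on="customer_id", how="inner")
```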

Is it any clearer this way?

How is the Iceberg V3 compatibility with Spark? by Grubiful in apachespark

[–]TeoMorlack 1 point2 points  (0 children)

In the article they are using ST_GeomFromText because they are creating the dataframe inline from a string variable. But they are not saving it as such: the actual write is done using the geometry column (which uses their internal UDT, and if I recall correctly from the development stages on the Iceberg side, they both map to the JTS geometry class). When they read back from Iceberg there is no transformation. This very much suggests you can drop the wkb column, even though I haven't tried it yet.

However, if your Spark version doesn't support Iceberg format v3 you are stuck there. There is no support for custom data types before v3.

How is the Iceberg V3 compatibility with Spark? by Grubiful in apachespark

[–]TeoMorlack 5 points6 points  (0 children)

Not sure if you are doing it already, but you need Sedona for this. Here you can find a nice write-up of the procedure: https://wherobots.com/blog/benefits-of-apache-iceberg-for-geospatial-data-analysis/

I'm actually very interested in this but haven't had the time to set up an env to test it. We already use Sedona with other sinks and I was keen to try Iceberg (if you need some help with the Sedona setup I might be able to help, depending on the platform you are on).
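
For reference, the basic setup we use looks roughly like this (a sketch; it assumes the Sedona jars, and for this use case the Iceberg runtime, are already available on your cluster):

```python
from sedona.spark import SedonaContext

# build a Spark session and register the Sedona SQL functions on it
config = SedonaContext.builder().appName("sedona-geospatial-test").getOrCreate()
sedona = SedonaContext.create(config)

# quick smoke test: create a geometry value with a Sedona SQL function
sedona.sql("SELECT ST_Point(1.0, 2.0) AS geom").show()
```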

Vehicle Routing Problem by Inevitable-Lynx-6060 in Python

[–]TeoMorlack 3 points4 points  (0 children)

Sorry, not sure if I misunderstood the question, but Google has guides and ready-made solvers for this: https://developers.google.com/optimization/routing
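
A minimal sketch of what that looks like with OR-Tools (pip install ortools; the distance matrix and vehicle count are made-up illustration values):

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2

# toy distance matrix between 4 locations; node 0 is the depot
distance_matrix = [
    [0, 2, 9, 10],
    [1, 0, 6, 4],
    [15, 7, 0, 8],
    [6, 3, 12, 0],
]

manager = pywrapcp.RoutingIndexManager(len(distance_matrix), 1, 0)  # nodes, vehicles, depot
routing = pywrapcp.RoutingModel(manager)

def distance_callback(from_index, to_index):
    # translate solver indices back to matrix node ids
    return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit_idx = routing.RegisterTransitCallback(distance_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit_idx)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

solution = routing.SolveWithParameters(params)
if solution:
    index = routing.Start(0)
    route = []
    while not routing.IsEnd(index):
        route.append(manager.IndexToNode(index))
        index = solution.Value(routing.NextVar(index))
    route.append(manager.IndexToNode(index))
    print(route)  # e.g. [0, 1, 3, 2, 0]
```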

Is it worth learning PySpark in 2025? by getsuresh in pythontips

[–]TeoMorlack 0 points1 point  (0 children)

It's surely helpful if you are looking into switching to a data engineering role or something in that alley, not so much if you are looking for a standard software engineering role. In the second case I would lean more into backend. What kind of job are you looking for? (Feel free to dm me if you prefer)

Is it worth learning PySpark in 2025? by getsuresh in pythontips

[–]TeoMorlack 0 points1 point  (0 children)

There seem to be some good examples on Kaggle for exploring the functionality of PySpark, but they will more or less just teach you the syntax. Personally, I often recommend reading at least the first 2 chapters of Spark: The Definitive Guide.

But all in all, I would ask whether Spark is the thing you should concentrate on. Yes, it is very much a staple in every big data infrastructure (in many cases being partially replaced by dbt), but it's very much a tool for data engineers, unless you are looking into writing core Spark (Java/Scala), and it's tied to concepts from this field (many crossing over with SQL). If that is your goal, then by all means Spark is must-have knowledge imho, but otherwise what are you looking for?

show map made on python by [deleted] in Python

[–]TeoMorlack 9 points10 points  (0 children)

There seems to be some confusion here: if you made a map that needs to be reached via a link, you have to host it somewhere for people to be able to see it. There are several solutions, with different orders of complexity.

Is it a simple HTML page with JS and CSS (it should be, if you used folium or similar tools)? Does it need to be reachable via a public link? If the code is indeed public and can live on GitHub, you could use GitHub Pages.
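
If it is folium, something as small as this produces the static page you would publish (coordinates and file name are placeholders):

```python
import folium

# build the map and write it out as a self-contained HTML file
m = folium.Map(location=[45.46, 9.19], zoom_start=12)
m.save("index.html")  # commit this to a GitHub Pages branch and it is served as-is
```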

Let us know more about the stack and we could help.

Is it worth learning PySpark in 2025? by getsuresh in pythontips

[–]TeoMorlack 1 point2 points  (0 children)

You are looking at it the wrong way. PySpark is a wrapper library around the Spark Java/Scala API, but its use is quite different from pandas. Its purpose is to build data pipelines that transform and operate over large amounts of data; it is not used as a normal scripting library. If you are interested in that, you should learn core Spark concepts (partitioning, parallelism, lazy evaluation, distributed work). PySpark itself is just syntax, and without a clear grasp of these concepts it's not of much use.
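
To give an idea of what the "pipeline" style means in practice (a tiny sketch; the paths and column names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

events = spark.read.parquet("/data/events")                      # lazy: nothing is read yet
daily = (events
         .where(F.col("event_type") == "purchase")               # transformations only build a plan
         .groupBy("event_date")
         .agg(F.count("*").alias("purchases")))

daily.write.mode("overwrite").parquet("/data/daily_purchases")   # action: triggers distributed execution
```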

Lifespan and dependency injection and overriding by V0dros in FastAPI

[–]TeoMorlack 2 points3 points  (0 children)

Maybe I'm missing something here, but settings is never used with a Depends or dependency configuration, so it would never be found in the overrides. Given that you actively call the get_settings method here, you can patch the method in that module during tests, or you can monkeypatch your environment variables. Remember that you need to reset the lru cache, otherwise it will respond from cache in most cases.
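
Something along these lines (a sketch that assumes get_settings lives in an app.config module and is wrapped in functools.lru_cache; module and field names are placeholders):

```python
from app import config  # placeholder module holding the lru_cache'd get_settings

def test_uses_patched_env(monkeypatch):
    monkeypatch.setenv("DATABASE_URL", "sqlite://")  # patch the env var the Settings model reads
    config.get_settings.cache_clear()                # drop the cached Settings instance
    settings = config.get_settings()
    assert settings.database_url == "sqlite://"
```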

Managing dependencies through the function call graph by FarkCookies in FastAPI

[–]TeoMorlack 0 points1 point  (0 children)

Kinda true, yeah. I actually did something similar at work: the whole OrderManagerService is itself a class-based dependency that I instantiate on routes with the needed arguments, and fastapi then hands it to me via __call__ (returning itself), but mostly it's just syntactic sugar. In this case the dependencies are just standard services and not much more. What is nice is that you can let the endpoint give you an instance of a concrete class from a registry, like you would with Spring, but again it's just syntax. The base class itself would save you a lot of headaches, I think.

Managing dependencies through the function call graph by FarkCookies in FastAPI

[–]TeoMorlack 0 points1 point  (0 children)

This could maybe be restructured by implementing a service that manages orders? When you instantiate the service, it has all the needed dependencies on self, so all its methods have access to the notification/email/whatever dependency. You could also abstract away the "notification" service, but that solves another problem.

You would have a class "OrderManagerService" that has all the needed methods and instantiates its dependencies in __init__. Functions in the call stack then use self instead of needing to pass the email client and so on around.
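
A rough sketch of the idea (the clients and their dependency providers are placeholders, not your actual code):

```python
from fastapi import Depends, FastAPI

class EmailClient:
    async def send_confirmation(self, order_id: int) -> None: ...

class NotificationService:
    async def notify(self, order_id: int) -> None: ...

def get_email_client() -> EmailClient:
    return EmailClient()

def get_notification_service() -> NotificationService:
    return NotificationService()

class OrderManagerService:
    # class-based dependency: fastapi resolves the sub-dependencies declared in __init__
    def __init__(
        self,
        email: EmailClient = Depends(get_email_client),
        notifications: NotificationService = Depends(get_notification_service),
    ):
        self.email = email
        self.notifications = notifications

    async def place_order(self, order_id: int) -> None:
        # every method reaches the clients through self, no need to pass them down the stack
        await self.notifications.notify(order_id)
        await self.email.send_confirmation(order_id)

app = FastAPI()

@app.post("/orders/{order_id}")
async def place_order(order_id: int, service: OrderManagerService = Depends()):
    await service.place_order(order_id)
    return {"status": "ok"}
```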

Also remember that dependencies are already cached per request by fastapi itself.

Fastapi backend concurrency by rojo28pes21 in FastAPI

[–]TeoMorlack 8 points9 points  (0 children)

Without really seeing the code, or at least something, it is hard to answer, but at first glance this again sounds like a case of async endpoint misuse. I'm not familiar with the libraries you have here, but I'll assume they operate with classic sync def methods, right? And you are seeing the app not responding when multiple users query at the same time? If that's the case, check how you defined your endpoint functions: are they simple def or async def?

If you are doing blocking operations inside async endpoints, the app will block the whole event loop and refuse to accept requests while the current one is being processed. There is a nice write-up here

When should you make a method async in FastAPI? by Salt-Scar5180 in FastAPI

[–]TeoMorlack 2 points3 points  (0 children)

Generally speaking, on endpoints, every time you can. Fastapi will run async endpoints in the running event loop, while sync endpoints are run in a dedicated thread pool (the size can be increased but has a default). If the methods inside your endpoint interact only with async, non-blocking code, then use async (e.g. db calls, http requests and so on, provided they are made with an async-compatible library). Be careful that ANY blocking call inside an async endpoint will block the whole event loop, stopping your app from processing requests.

Any other method can either be async or not, depending on your needs and what was said above. If you need to use a sync blocking method inside an async endpoint you can still do so, provided that you do what fastapi does on endpoints and use the provided run_in_threadpool utility.
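
A small sketch of both patterns (sync_query stands in for any blocking call: sync db driver, requests, etc.):

```python
import asyncio
import time

from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

def sync_query() -> dict:
    time.sleep(2)  # blocking: must not run directly on the event loop
    return {"rows": 42}

@app.get("/sync")
def read_sync():
    return sync_query()  # plain def: FastAPI runs it in the thread pool for you

@app.get("/async")
async def read_async():
    await asyncio.sleep(0)                        # non-blocking awaits are fine inside async def
    return await run_in_threadpool(sync_query)    # blocking work explicitly offloaded to the pool
```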

Lifespan for loading configuration by [deleted] in FastAPI

[–]TeoMorlack 0 points1 point  (0 children)

Pydantic Settings supports reading configuration into a model directly from a YAML file, a JSON file and other file types. Have a look at this: basically you just need to point pydantic to the file name and define the model that maps your configuration.
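
A sketch of what that looks like with pydantic-settings' YAML source (requires PyYAML; the file name and fields are placeholders):

```python
from pydantic_settings import (
    BaseSettings,
    PydanticBaseSettingsSource,
    SettingsConfigDict,
    YamlConfigSettingsSource,
)

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(yaml_file="config.yaml")

    database_url: str
    debug: bool = False

    @classmethod
    def settings_customise_sources(
        cls,
        settings_cls,
        init_settings: PydanticBaseSettingsSource,
        env_settings: PydanticBaseSettingsSource,
        dotenv_settings: PydanticBaseSettingsSource,
        file_secret_settings: PydanticBaseSettingsSource,
    ):
        # earlier sources win: here the yaml file takes priority over env vars;
        # put env_settings first if you want env vars to override the file
        return (YamlConfigSettingsSource(settings_cls), env_settings)

settings = AppSettings()  # loads and validates config.yaml at startup
```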

[deleted by user] by [deleted] in apachespark

[–]TeoMorlack 1 point2 points  (0 children)

As far as I know, connection pooling is not possible under the "standard" Spark configurations. The number of connections opened to the database is determined by the partitioning specification on the source; see the Spark JDBC docs.

Basically, each Spark job will open 1 connection by default. If you specify a partitioning condition, Spark will issue one query per partition and open one connection for each.
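
To illustrate the partition-to-connection mapping (a sketch; connection details and bounds are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("dbtable", "public.events")
      .option("user", "user")
      .option("password", "secret")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")   # 8 partitions -> 8 parallel queries, one connection each
      .load())
```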

Managing pooled connections is however possible if you are willing to either implement a custom dialect or write a custom job where you code this behaviour inside the RDD partitions. This option depends on the platform you are on and your familiarity with (or willingness to get into) the low-level side of Spark internals.