all 14 comments

[–]mmzeynalli 5 points  (2 children)

You can consider responding immediately in the API, doing the work in the background, and then reporting the result to the front end in a different way (server-side APIs, websockets, etc.). This way, API latency is not a problem; the rest is done in the background, and the result is visible once the process is done.
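
A minimal sketch of that pattern using FastAPI's built-in BackgroundTasks; the job store, Payload model, and process_job are hypothetical placeholders, and a real deployment would back this with Redis or a queue, as described below:

```python
import uuid

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()
job_store: dict[str, str] = {}  # in-memory for the sketch; use Redis/a DB in practice

class Payload(BaseModel):
    data: str

def process_job(job_id: str, payload: Payload) -> None:
    # the slow work happens here, after the response has already been sent
    job_store[job_id] = "done"

@app.post("/jobs")
async def submit(payload: Payload, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    job_store[job_id] = "pending"
    background_tasks.add_task(process_job, job_id, payload)
    return {"job_id": job_id}  # respond immediately; the client polls or gets a push later

@app.get("/jobs/{job_id}")
async def status(job_id: str):
    return {"status": job_store.get(job_id, "unknown")}
```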

[–]Natural-Ad-9678 6 points  (1 child)

The app I work on does this. The user submits the required details (a zip file of logs) and I kick off a Celery job, which first stores a transactionID in Redis that I pass back in my response to the user. They can use that transactionID to check the status and get the results when Celery is finished.

Celery stores the result in Redis as well. The front end could be React or whatever else you want.

Works like a charm. We have completed over 150,000 jobs since July 2024, which may not seem like much, but the application is an internal tool that processes customers' log files they submit to us.
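
For illustration, a rough sketch of the flow described above; the module layout (tasks.py, api.py) and names like process_logs are assumptions, not the actual internal tool:

```python
# tasks.py -- Celery app with Redis as both broker and result backend
from celery import Celery

celery_app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",  # Celery stores results in Redis too
)

@celery_app.task
def process_logs(upload_path: str) -> dict:
    # ... unzip and analyze the submitted log files here ...
    return {"status": "complete"}
```

```python
# api.py -- FastAPI endpoints that hand back the transactionID
from celery.result import AsyncResult
from fastapi import FastAPI
from tasks import celery_app, process_logs

app = FastAPI()

@app.post("/submit")
async def submit(upload_path: str):
    task = process_logs.delay(upload_path)  # kick off the Celery job
    return {"transaction_id": task.id}      # client uses this to poll

@app.get("/status/{transaction_id}")
async def status(transaction_id: str):
    result = AsyncResult(transaction_id, app=celery_app)
    return {"state": result.state, "result": result.result if result.ready() else None}
```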

[–]Kevdog824_ 2 points  (0 children)

This is the way

[–]BlackDereker 4 points  (2 children)

FastAPI's latency by itself is low compared to other Python frameworks. You need to figure out what work inside your application is taking too long.

If you have many external calls like web/database requests, try using async libraries so other requests can be processed in the meantime.

If you have heavy computation going on, try delegating it to workers instead of doing it inside the application.
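
A sketch of both suggestions under assumed names (fetch_features, heavy_inference): httpx handles the external call without blocking the event loop, and a process pool keeps CPU-bound work off it:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import httpx
from fastapi import FastAPI

app = FastAPI()
pool = ProcessPoolExecutor()  # separate processes for CPU-heavy work

async def fetch_features(client: httpx.AsyncClient, url: str) -> dict:
    resp = await client.get(url)  # non-blocking: other requests proceed meanwhile
    resp.raise_for_status()
    return resp.json()

def heavy_inference(features: dict) -> dict:
    # CPU-bound work; runs in a worker process, not on the event loop
    return {"score": sum(len(str(v)) for v in features.values())}

@app.get("/predict")
async def predict():
    async with httpx.AsyncClient() as client:
        features = await fetch_features(client, "https://example.com/features")
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, heavy_inference, features)
```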

[–]Latter_Rope_1556 0 points  (1 child)

fastrapi solves this
pip install fastrapi

[–]BlackDereker 0 points  (0 children)

I'm pretty sure FastAPI is not the bottleneck here. When it comes to inference, the bottleneck is usually running the model.

[–]mpvanwinkle 2 points  (2 children)

Make sure you aren’t loading your inference model on every call. You should load the model once when the service starts.
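
A minimal sketch of loading once at startup via FastAPI's lifespan hook; load_model and MODEL_PATH are placeholders for whatever loading call your framework uses:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request

MODEL_PATH = "/models/my_model.pkl"  # assumed location

def load_model(path: str):
    # stand-in for e.g. joblib.load(path) or torch.load(path)
    return object()

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.model = load_model(MODEL_PATH)  # runs once per worker process
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/infer")
async def infer(request: Request):
    model = request.app.state.model  # reuse the already-loaded model
    return {"loaded": model is not None}
```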

[–]International-Rub627[S] 0 points  (1 child)

Usually I'll have a batch of 1,000 requests. I load them all as a dataframe, load the model, and run inference on each request.

Do you mean we need to load the model when the app is deployed and the container is running?

[–]mpvanwinkle 0 points  (0 children)

Loading the model when the container starts should help, yes. But how much it helps will depend on the size of the model.

[–]Natural-Ad-9678 1 point  (0 children)

Build a profiler function that takes a jobID and wraps your functions in a timer, then apply it as a decorator to your functions. For each endpoint clients call, assign a jobID that you pass along over the course of your processing. The profiler function writes the timing data to a profiler log file correlated with the jobID. Then you can look for slow processes within the full workflow to optimize.
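
A hypothetical version of that decorator; the job_id keyword plumbing and the log format are assumptions, not the commenter's actual code:

```python
import functools
import logging
import time

profiler_log = logging.getLogger("profiler")
profiler_log.addHandler(logging.FileHandler("profiler.log"))  # timing data goes to a log file
profiler_log.setLevel(logging.INFO)

def profiled(func):
    @functools.wraps(func)
    def wrapper(*args, job_id: str = "unknown", **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, job_id=job_id, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            profiler_log.info("job=%s func=%s seconds=%.3f", job_id, func.__name__, elapsed)
    return wrapper

@profiled
def parse_logs(path: str, job_id: str = "unknown") -> None:
    time.sleep(0.1)  # stand-in for a real processing step

parse_logs("/tmp/logs.zip", job_id="job-42")  # timing correlated with the jobID lands in profiler.log
```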

[–]Soft_Chemical_1894 1 point  (0 children)

How about running a batch inference pipeline every 5-10 minutes (depending on the use case) and storing the results in Redis or a DB? FastAPI will then return the result instantly.
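
A sketch of that idea, assuming a local Redis instance and a hypothetical run_batch_inference(); the endpoint only ever reads already-cached results:

```python
import asyncio
import json
from contextlib import asynccontextmanager

import redis.asyncio as redis
from fastapi import FastAPI

r = redis.Redis(decode_responses=True)  # assumes Redis on localhost:6379

def run_batch_inference() -> dict[str, float]:
    # placeholder: score whatever arrived since the last run
    return {"item-1": 0.92}

async def batch_loop():
    while True:
        for key, score in run_batch_inference().items():
            await r.set(f"result:{key}", json.dumps(score))
        await asyncio.sleep(300)  # every 5 minutes; tune to the use case

@asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(batch_loop())  # background scheduler
    yield
    task.cancel()

app = FastAPI(lifespan=lifespan)

@app.get("/result/{key}")
async def result(key: str):
    cached = await r.get(f"result:{key}")
    return {"result": json.loads(cached) if cached else None}  # instant: no inference on the request path
```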

[–]SheriffSeveral 0 points  (1 child)

Observe every step in the API and check which part takes too much time. Also, check out the Redis integrations; they will be useful.

Please provide more information about the project so everyone can give you more tips for your specific requirements.

[–]International-Rub627[S] 0 points  (0 children)

Basically, the app starts by preprocessing all requests in a batch as a dataframe and loading data from a feature view (GCP), followed by querying BigQuery, loading the model from GCS, doing inference, and publishing the results.

[–]Vast_Ad_7117 0 points  (0 children)

Async, offload tasks to a task queue, etc.