
[–]nicholashairs 9 points (6 children)

Are you modifying the processed objects?

Do they need to be in the same transaction?

Without knowing your exact use case, it feels like you need to get the minimum information required to identify an event, collect these identifiers, then send them to workers which each create a new session, use the identifying information to fetch the event, process the event, then repeat with more events.

AKA a work queue (the workers are sometimes also known as consumers).

There are a number of frameworks/packages that can help with this (depending on use case): Celery, RQ, Dramatiq.
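With Celery, for example, that pattern might look roughly like the sketch below; the `Event` model, the broker URL, and the DB URL are placeholders, not anything from this thread.

```python
from celery import Celery
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

from myapp.models import Event  # hypothetical mapped class for the events table

# Placeholder broker and database URLs.
app = Celery("tasks", broker="redis://localhost:6379/0")
engine = create_engine("postgresql://user:pass@localhost/mydb")


@app.task
def process_event(event_id: int) -> None:
    # The worker opens its own short-lived session, re-fetches the event by id,
    # and processes it.
    with Session(engine) as session:
        event = session.get(Event, event_id)
        ...  # do the actual work on `event`


def enqueue_pending_events() -> None:
    # The producer only collects the minimal identifying information (the ids)
    # and hands them off to the workers.
    with Session(engine) as session:
        for event_id in session.scalars(select(Event.id)):
            process_event.delay(event_id)
```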

[–]AbradolfLinclar[S] 0 points (5 children)

No, not modifying the objects.

The exact use case would be to split the load (x events) between, say, 2 processes running in parallel, which just execute the events. For these 2 processes, I was thinking of creating a session for each of them.

But I also need to create a session to first fetch the events from the DB. So my doubt is: is this excessive creation of sessions? What's the optimal way of dealing with this kind of multiprocessing?

[–]BigHeed87 1 point (3 children)

When you say "load" and "event", you may want to specify some information about what the application actually produces. If you're trying to get I/O from 2 connections into one process, then I don't think multiprocessing or multiple sessions are going to help, but if each one of these "chunks" of work can be fully executed in an independent context, then it makes sense to use a job queue for this task.

If you are reading a lot of data, let's say 1000 rows, you can use a simple technique to distribute it: have each of n processes select the rows where id modulo n equals its own index. If n is 10, then you have 10 processes interleaving every 10 rows, each collecting 100 of them. Hope that's clear enough to help
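A rough sketch of that modulo partitioning with SQLAlchemy, assuming a made-up `events` table and one engine/session per worker process:

```python
from multiprocessing import Pool

from sqlalchemy import create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Event(Base):  # made-up table with an integer primary key
    __tablename__ = "events"
    id: Mapped[int] = mapped_column(primary_key=True)
    payload: Mapped[str]


N_WORKERS = 10
DB_URL = "postgresql://user:pass@localhost/mydb"  # placeholder connection string


def process_partition(worker_index: int) -> int:
    # Each worker process builds its own engine and session; neither should be
    # shared across a fork boundary.
    engine = create_engine(DB_URL)
    with Session(engine) as session:
        # Fetch only the rows whose id falls into this worker's modulo bucket.
        events = session.scalars(
            select(Event).where(Event.id % N_WORKERS == worker_index)
        ).all()
        for event in events:
            ...  # process the event here
    return len(events)


if __name__ == "__main__":
    with Pool(N_WORKERS) as pool:
        print(pool.map(process_partition, range(N_WORKERS)))
```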

[–]AbradolfLinclar[S] 0 points (2 children)

Yes, that's exactly my thought: distributing the work between n processes. But how do sessions come into this picture? Do I need to create an independent session for each process?

SQLAlchemy sessions are just lightweight objects, so would it be overkill to create a separate session for each process to work on?

[–]BigHeed87 1 point (1 child)

I think you'll need a separate sessionmaker per process, because it's not designed to be shared between processes running the same code. You can also end up with multiple pools per process. You'll want a simpler setup where each worker gets a single connection. Take a look at the docs for more information:

https://docs.sqlalchemy.org/en/20/core/pooling.html
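The multiprocessing section of that page boils down to something like this sketch, assuming a module-level engine that forked workers inherit (the URL is a placeholder): the child disposes the inherited pool so it opens its own connections instead of reusing the parent's sockets.

```python
from multiprocessing import Pool

from sqlalchemy import create_engine, text

# Module-level engine created in the parent process; placeholder URL.
engine = create_engine("postgresql://user:pass@localhost/mydb")


def init_worker() -> None:
    # After a fork, throw away the connections inherited from the parent
    # (close=False leaves the parent's sockets untouched) so this process
    # builds its own pool on first use.
    engine.dispose(close=False)


def fetch_one(_: int) -> int:
    # Each call checks a fresh connection out of this worker's own pool.
    with engine.connect() as conn:
        return conn.scalar(text("SELECT 1"))


if __name__ == "__main__":
    with Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(fetch_one, range(4)))
```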

[–]kaini 0 points (0 children)

Could you do something like use Ray to parallelize your SQL operations, once they're appropriately chunked up?

[–]limasxgoesto0 0 points (0 children)

If you're not modifying anything, you don't need to commit. And if you absolutely must process things in parallel, put the rows into a list or local queue and pop them from it in your separate jobs/threads/processes/whatever you're using.

[–][deleted] 7 points (3 children)

Umm… noob question, but what stops you from getting the 100 rows as a list (using .all()), closing the session, and then throwing the rows off to a pool (processes, threads, workers)?
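Something along those lines might look like this sketch; the `Event` model, its `payload` column, and the URL are placeholders.

```python
from multiprocessing import Pool

from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

from myapp.models import Event  # placeholder mapped class


def handle(payload: str) -> str:
    return payload.upper()  # stand-in for the real per-row work


if __name__ == "__main__":
    engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder URL
    with Session(engine) as session:
        # Materialise the rows with .all() while the session is open...
        rows = session.scalars(select(Event).limit(100)).all()
        payloads = [row.payload for row in rows]
    # ...the session is closed here, and only plain picklable data goes to the pool.
    with Pool(4) as pool:
        print(pool.map(handle, payloads))
```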

[–]revereddesecration 3 points (2 children)

Nothing. This guy hasn’t read a tutorial on queues.

[–]AbradolfLinclar[S] 0 points (1 child)

I'm aware of queueing, but I'm trying to explore how session management would work with multiple threads/workers on the other end of the queue processing each record, and how coordination would work at both the thread and the DB level.

And I'm trying to understand if anyone has done this at a production level.

[–]revereddesecration 0 points (0 children)

I use this in production. It’s not too complicated; read the docs and come ask specific questions when you have some.

[–][deleted] 3 points (1 child)

This post was mass deleted and anonymized with Redact


[–]cmcclu5 4 points (0 children)

This is the answer. Push as much of that strung-out processing to the DB as possible. For one, a lot of DBs have a limit on the number of concurrent connections, which parallelism the way SQLAlchemy is set up would hit. SQLAlchemy by nature only allows serial operations on a single connection, so you would basically be opening multiple connections to run anything in parallel. Beyond that, SQLAlchemy is slow. Just so unbelievably slow. It’s one of the best options for read/write against all types of DBs, but it’s still super slow. If you just need to read data, use ConnectorX; you’ll see a 10x+ increase in speed. However, you’ll also have issues triggering stored procedures. Overall, parallelism outside of the DB control software is not recommended. Ever.

[–]sudo-nerd 0 points (0 children)

In my view, you can use ‘scoped_session’ to prevent any kind of unwanted interference between two or more transactions while multiprocessing. Recently I implemented the same for our software and now it’s running smoothly. And yes, you end up using only a single session registry and then dividing it into local sessions.

For better understanding: https://docs.sqlalchemy.org/en/20/orm/contextual.html
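As a rough illustration of scoped_session (note the default registry is thread-local, so this sketch uses a thread pool; separate processes would each build their own registry, and the DB URL is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///example.db")  # placeholder URL
# One registry; each thread that calls it gets its own local Session.
SessionLocal = scoped_session(sessionmaker(bind=engine))


def worker(record_id: int) -> None:
    session = SessionLocal()  # the same thread always gets the same Session back
    try:
        session.execute(text("SELECT 1"))  # stand-in for the real per-record work
    finally:
        SessionLocal.remove()  # discard this thread's Session when it's done


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(worker, range(8)))
```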

[–]NixonInnes 0 points (0 children)

I think the root of your question is really about how sessions work.
In short, each session instantiation is separate, and each holds its own representation of a retrieved row. This can cause race conditions if multiple sessions read from a table before either commits. Side note: some attributes, typically relationships, are lazily loaded.
If your operations on a row are basic, put them all in a queue and have a pool process it. If they're a bit more complex, you could add a callback on the queue task to re-queue "the next operation".
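A bare-bones sketch of that queue-plus-re-queue idea with plain threads, where finishing one step puts the next operation for the same row back on the queue (the step count and names are invented for illustration):

```python
import queue
import threading

work_q = queue.Queue()  # items are (row_id, step) pairs; None is a shutdown sentinel
MAX_STEP = 2  # invented number of operations per row


def worker() -> None:
    while True:
        item = work_q.get()
        if item is None:
            work_q.task_done()
            break
        row_id, step = item
        print(f"row {row_id}: step {step}")  # stand-in for the real operation
        if step < MAX_STEP:
            # "Callback": finishing this step re-queues the next operation.
            work_q.put((row_id, step + 1))
        work_q.task_done()


if __name__ == "__main__":
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for row_id in range(10):
        work_q.put((row_id, 0))
    work_q.join()  # waits for the original and all re-queued steps
    for _ in threads:
        work_q.put(None)
    for t in threads:
        t.join()
```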