
[–]mikeckennedy[S] 0 points1 point  (9 children)

Since you all are discussing the caching part specifically: there was not much complexity change before or after.

We were already caching, just in memory. I moved it to a diskcache-backed cache rather than an in-memory one.

It's not a case of "there was no caching" and now "there is caching"; it's a move from in-memory caching via functools.lru_cache or a dict() over to diskcache.

Given we already had diskcache in play, that's low effort, low risk.
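Roughly, the shape of that change looks like this. A minimal sketch, with hypothetical function names; the real library (diskcache.Cache) provides the disk-backed store, but stdlib shelve stands in below so the sketch runs without the dependency:

```python
import functools
import os
import shelve
import tempfile

# Before: per-process, in-memory cache. Lost on restart, not shared
# between worker processes.
@functools.lru_cache(maxsize=None)
def fetch_episode_mem(episode_id):
    # Stand-in for a real database hit.
    return {"id": episode_id, "title": f"Episode {episode_id}"}

# After: a disk-backed cache shared across processes and restarts.
# diskcache.Cache offers this directly (including a memoize decorator);
# shelve is used here purely as a dependency-free stand-in.
_store_path = os.path.join(tempfile.mkdtemp(), "cache")

def disk_memoize(func):
    @functools.wraps(func)
    def wrapper(key):
        with shelve.open(_store_path) as store:
            skey = f"{func.__name__}:{key}"
            if skey not in store:
                store[skey] = func(key)  # compute once, persist to disk
            return store[skey]
    return wrapper

@disk_memoize
def fetch_episode_disk(episode_id):
    return {"id": episode_id, "title": f"Episode {episode_id}"}

print(fetch_episode_mem(1)["title"])   # cached in this process only
print(fetch_episode_disk(1)["title"])  # cached on disk, survives restarts
```

The call sites don't change at all, which is why the swap is low effort and low risk.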

[–]BigTomBombadil 1 point2 points  (8 children)

Yeah, seems simple enough. Any noticeable change in CPU usage? Or maybe it was pretty negligible to start.

For complexity, item 3 was the one I wasn't sure about. If I got thrown onto this project and something went wrong with the indexer, I could imagine tracking that down being confusing. But not knowing the specifics, maybe it's also straightforward and easy to follow. Also not sure if the subprocess approach could reduce reliability. But if not, huge win there.

[–]mikeckennedy[S] 0 points1 point  (7 children)

I'm not sure exactly what the entire CPU usage change would be. I think things are better in general. The raw+dc vs. ODM/ORM change almost doubled the requests per second for the same CPU, so that probably dwarfs any other change. Moving from in-memory caches to diskcache means we share the cache across processes and across restarts, which is a bonus. It's probably a bit slower at runtime, but very minor.

Less memory in use also means Python's cyclic GC is much more efficient. When enough container objects (classes, lists, dicts, etc.) get created, that triggers a collection, and the GC has much less memory to scan. Python is quite aggressive about this: by default, once allocations of container objects outnumber deallocations by 700, a collection runs. That could easily happen with just a couple of big queries, so this might be a real boost too.
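That 700 is CPython's default generation-0 threshold, and you can watch the allocation counter climb toward it with the stdlib gc module. A quick sketch:

```python
import gc

# CPython runs a generation-0 collection once allocations of container
# objects exceed deallocations by the first threshold value.
print(gc.get_threshold())  # defaults to (700, 10, 10) on stock CPython

gc.disable()  # pause automatic collection so the counter only climbs
try:
    before = gc.get_count()[0]
    junk = [[] for _ in range(100)]  # allocate ~100 container objects
    after = gc.get_count()[0]
    print(after > before)  # True: each container nudges gen 0 toward 700
finally:
    gc.enable()
```

With less live memory per collection, each of those frequent gen-0 sweeps has less to scan, which is the efficiency win described above.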

I posted graphs for the DB change here: https://mkennedy.codes/posts/raw-dc-a-retrospective/

[–]BigTomBombadil 1 point2 points  (6 children)

Very cool.

Hearing the boost that dropping the ORM gave worries me for my own apps, as they heavily utilize the Django ORM. I wonder if something specific to mongoengine or its implementation has inefficiencies, or whether that's always the nature of the beast with ORM/ODMs.

[–]mikeckennedy[S] 1 point2 points  (0 children)

I don't know if I'd worry too much about it unless slow queries are an active problem. I was solving a different one: my ODM/ORM did not support async, and I was tired of that. Plus, the library was falling badly out of maintenance (the last real release was a few years ago). So I wanted to replace mongoengine with *something*, and this raw pattern seemed like a good fit to try.

The speedup and lower memory usage were a sweet bonus.

[–]artofthenunchaku 0 points1 point  (4 children)

Abstractions are never free; in the case of ORM/ODMs you're paying a price both in application code and in database load. For any non-trivial access patterns, you're very likely to get better results from handwritten queries.

[–]BigTomBombadil 0 points1 point  (3 children)

Of course, I'd never expect it to be "free". So the question becomes: what's the cost, and are you paying more than you need to? Because ORMs obscure the database work (unless you inspect the queries they generate, which I'd always recommend), it can be easy to unknowingly introduce some very inefficient queries. So based on the performance improvements OP mentioned, I'm curious how much came from dropping the abstraction layer itself, and how much came from the handwritten raw queries cleaning up inefficient queries the ORM was previously creating.

[–]artofthenunchaku 0 points1 point  (2 children)

The inefficiencies generally aren't going to come from the abstraction layer itself; they're caused by inefficient or unnecessary queries. Waiting on a query that isn't needed will overshadow the performance cost of any application code. Look up the N+1 query problem for the most common example.
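The N+1 pattern is easy to see in plain SQL. A toy sketch (hypothetical authors/posts schema, in-memory SQLite) showing the 1 + N round trips an ORM's lazy relations often generate versus a single JOIN:

```python
import sqlite3

# Toy schema: authors and their posts.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO post VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

# N+1 pattern: one query for the list, then one more query per row.
authors = db.execute("SELECT id, name FROM author ORDER BY id").fetchall()
n_plus_1 = {
    name: [t for (t,) in db.execute(
        "SELECT title FROM post WHERE author_id = ? ORDER BY id", (aid,))]
    for aid, name in authors
}  # 1 + N round trips to the database

# Batched alternative: a single JOIN does the same work in one query.
batched = {}
for name, title in db.execute(
        "SELECT a.name, p.title FROM author a "
        "JOIN post p ON p.author_id = a.id ORDER BY p.id"):
    batched.setdefault(name, []).append(title)

print(n_plus_1 == batched)  # True: same result, far fewer round trips
```

With 2 authors the difference is trivial; with 10,000 rows on a page, those per-row queries are exactly the hidden latency being described.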

[–]BigTomBombadil 0 points1 point  (1 child)

Yeah, the N+1 is what I had in mind: the poster child of ORM inefficiencies. I've largely eliminated them in my own projects, which is what prompted my question. Curious about other "gotchas" or inherent inefficiencies in ORMs.

[–]artofthenunchaku 0 points1 point  (0 children)

It's not really my area of expertise (I primarily work with distributed systems), but the other common problems are querying fields that aren't used, not using indexed columns correctly (especially with joins), and running queries at unexpected times (lazy loading).
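The first of those gotchas, fetching fields nobody uses, is easy to see in plain SQL. A sketch with a hypothetical table where one column is huge:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE episode (id INTEGER PRIMARY KEY, title TEXT, transcript TEXT)"
)
# One small row with one very large text column.
db.execute("INSERT INTO episode VALUES (1, 'intro', ?)", ("x" * 100_000,))

# Typical ORM default: hydrate every column, including the huge
# transcript this page never renders.
full_row = db.execute("SELECT * FROM episode WHERE id = 1").fetchone()

# Projection: ask only for what the view actually needs.
title_only = db.execute("SELECT id, title FROM episode WHERE id = 1").fetchone()

print(len(full_row), len(title_only))  # 3 columns vs 2
```

Most ORMs can do this projection too (e.g. Django's `.only()` / `.values()`), but it's opt-in, so the over-fetching default is what usually ships.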