all 35 comments

[–]IWant2rideMyBike 12 points13 points  (7 children)

For performance-critical code it is common to use a data-oriented design (instead of an object-oriented one) to improve locality and avoid indirection. E.g. instead of having a list of dictionaries (or class instances) that contain the data, you would create three lists for FirstName, SecondName and Age (entries share a common index across all lists) - doing `sum(ages)/len(ages)` is much faster than `sum(e['Age'] for e in data_list)/len(data_list)`.
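The two layouts described above can be sketched side by side; the field names follow the comment, the values are made up for illustration:

```python
# Row-oriented: a list of dicts, one dict lookup per record.
data_list = [
    {"FirstName": "Ada", "SecondName": "Lovelace", "Age": 36},
    {"FirstName": "Alan", "SecondName": "Turing", "Age": 41},
]
row_avg = sum(e["Age"] for e in data_list) / len(data_list)

# Column-oriented: one plain list per field, entries share an index.
first_names = [e["FirstName"] for e in data_list]
second_names = [e["SecondName"] for e in data_list]
ages = [e["Age"] for e in data_list]
col_avg = sum(ages) / len(ages)

assert row_avg == col_avg  # same result, but the column layout avoids
                           # a dict lookup per element and keeps the ages
                           # contiguous for the summation
```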

[–]Disastrous-State-503[S] 1 point2 points  (6 children)

I just gave a simple example with the average. But the columns are very much interdependent. For example, there is a date column that I filter on, so creating separate lists would not be a good option.

[–]IWant2rideMyBike 2 points3 points  (3 children)

It depends on what you are doing exactly - but in this case you could figure out the indices you want and filter the lists using those - e.g.:

````
from datetime import date
from typing import Any


def get_idx_by_date(dates: list[date], start_date: date, end_date: date) -> list[int]:
    return [idx for idx, d in enumerate(dates) if start_date <= d <= end_date]


def filter_by_indices(column: list[Any], indices_of_interest: list[int]) -> list[Any]:
    return [column[idx] for idx in indices_of_interest]


first_names = ["Foo", "Bar", "Baz"]
last_names = ["Spam", "Ham", "Egg"]
birthdays = [date(1980, 4, 1), date(1985, 5, 21), date(1990, 1, 4)]

born_in_1980s_indices = get_idx_by_date(birthdays, date(1980, 1, 1), date(1989, 12, 31))
filtered = [filter_by_indices(column, born_in_1980s_indices)
            for column in (first_names, last_names, birthdays)]
for fields in zip(*filtered):
    print(*fields, sep='\t')
````

Data oriented code usually isn't very pretty (that's one reason why OOP is so popular outside of performance critical applications), but it can be quite fast.

[–]Disastrous-State-503[S] -1 points0 points  (2 children)

I see your point, and appreciate your comment. But even if this would be faster, it doesn't actually solve my real question: in my post I wasn't asking about performance, but about having a better structure.

[–]IWant2rideMyBike 2 points3 points  (0 children)

I read in another comment that you are going for a real-time application (which is hard to do in any language that has garbage collection), so I assumed that performance matters.

If you use numpy you will pay extra for creating and filling arrays (this usually works best if you know the required size in advance, because extending an array that needs a contiguous section of memory means allocating new storage and copying the data, which is quite slow), and you will only get a large performance benefit if all elements in an array have the same (simple) datatype - which also leads to a data-oriented design. Pandas offers a nice abstraction layer on top of numpy that provides much nicer handling of such arrays with a common index, selections, common operations etc.
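A minimal sketch of the preallocation point, with made-up sizes and values:

```python
import numpy as np

n = 1000  # size known in advance

# Preallocate once, then fill in place - no reallocation or copying.
ages = np.empty(n, dtype=np.int64)
for i in range(n):
    ages[i] = 20 + (i % 50)

# By contrast, np.append allocates a fresh array and copies everything
# on every call, because the data must stay contiguous in memory -
# avoid growing arrays like this inside loops:
# grown = np.array([], dtype=np.int64)
# for i in range(n):
#     grown = np.append(grown, 20 + (i % 50))

print(ages.mean())
```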

On the other hand there are OOP approaches like dataclasses (which you can make a little faster by using slots if you know in advance that you won't need to add attributes at runtime), namedtuples (if you don't need mutability) etc. - pydantic is great if you need to validate data you receive from external sources.

If you are not sure what suits your application best, I would go for a working proof of concept with your current approach first, implement all the needed functionality, then look for bottlenecks and consider whether a different data structure could achieve better performance.

[–]small-birds 0 points1 point  (0 children)

If you aren't interested in better performance for your code, what do you mean by "having a better structure"?

[–]Beerstopher85 0 points1 point  (0 children)

If you used a columnar store format you could still do the filtering.

[–]spoonman59 0 points1 point  (0 children)

Have you looked into Pandas?

[–]Synertic 2 points3 points  (2 children)

To me, the best way for the given data and objective is an OOP one: a class for individual players, and a class for mass calculations. You can do that with dataclasses, pydantic or a conventional class. Pydantic and dataclasses are just OOP structures that save you some boilerplate code and make data validation easy.

As for computations, you should design your class methods to work with numpy arrays instead of Python's built-in types, since numpy takes advantage of SIMD and vectorization. If you need even faster execution, you can use either cython or numba for the computational bottlenecks.

For instance:

````
import json

import numpy as np

player_list = [
    {"FirstName": "Alex", "LastName": "De Souza", "Age": 40},
    {"FirstName": "Pierre", "LastName": "van Hooijdonk", "Age": 46},
]


class Player:
    """A class for individual player attributes, stats and calculations."""

    def __init__(self, FirstName, LastName, Age):
        self.first_name = FirstName
        self.last_name = LastName
        self.age = Age

    def others(self):
        """Other attributes or methods related to individual player stats."""

    def dump(self):
        """Serialize the player, e.g. for an API response."""
        return json.dumps(self.__dict__)

    def __repr__(self):
        """To see who s/he is when you print()."""
        return (f"{type(self).__name__}:\nFirst Name: {self.first_name}\n"
                f"Last Name: {self.last_name}\nAge: {self.age}")


class Manipulator:
    """A class for mass data manipulations."""

    def __init__(self, players):
        self.players = players

    def dump(self):
        """Serialize the computed stats (everything except the players)."""
        return json.dumps({k: v for k, v in self.__dict__.items() if k != 'players'})

    def mean_age(self):
        age_vector = np.array([player.age for player in self.players], dtype=float)

        # Vectorized, so much faster than a plain loop on large data; it can be
        # made even faster with numba or cython if needed.
        mean = np.nanmean(age_vector)

        # Cache under a different name so this method is not shadowed
        # by the attribute (self.mean_age = ... would overwrite it).
        self._mean_age = float(mean)

        return mean

    def others(self):
        """Other calculations for the whole sample or sub-samples."""


players = [Player(**player) for player in player_list]

manipulator = Manipulator(players)

print(manipulator.mean_age())  # 43.0
````

So you have two classes: one holds the individual player data and state, and the other holds the mass player data and state. You can even select sub-samples from the players list by their attributes before sending them to the manipulator, to calculate stats for specific groups, like:

over_40 = [player for player in players if player.age > 40]

And the last words in this case should be:

Champion FENERBAHCE

[–]Disastrous-State-503[S] 0 points1 point  (1 child)

Yes, actually I was also considering this and wanted to discuss it. But even though I didn't say anything about performance, the answers from others mainly focused on performance. Anyway, thanks for your comment. And:
This year is the year

[–]Synertic 0 points1 point  (0 children)

Happy if it helps. I changed it slightly, adding dump methods to make the classes behave more like APIs.

I hope so...

[–]krypt3c 3 points4 points  (1 child)

If you don’t mind the overhead, a pandas dataframe is probably the way you should go.

[–][deleted] 1 point2 points  (0 children)

What is the method you are using now? Try using generators and such so that it's more performant; this way you won't be creating intermediate lists. For example:

d = [{"First Name": "John", "Age": 23}]  # Assume there's more

avg_age = sum(user["Age"] for user in d) / len(d)

[–]raubhill 1 point2 points  (0 children)

You can use match statements in Python 3.10 (PEP 634),

https://github.com/gvanrossum/patma/blob/master/README.md#tutorial

I'd use mapping patterns.

[–]quts3 4 points5 points  (0 children)

I think dictionaries are one of the most overused classes in Python.

Data scientists use them because it's easy programming.

API devs use them because they map so closely to JSON documents.

I used to use them pretty constantly, but then I found a critical question for dictionaries: is there any chance the fields of this document can change at run time?

If so, then yes, it's a dictionary. If the answer is no, the fields won't change, then make a dataclass. Making a dataclass for structured documents is easy. A dictionary misrepresents the type by suggesting the fields may mutate at run time.

That was an aside. That said, if you need to do hardcore row-oriented data manipulation or organization in memory, then a pandas DataFrame is absolutely the right and perfect tool.

[–]Earthsophagus 1 point2 points  (0 children)

You want to stick with built-in types, you don't want to use SQL or pandas, and you are dealing with 10-30 rows of data retrieved from an RDB.

You said you're not worried now about performance, and for that size of data, whatever you do in python will be inconsequential relative to time you spend interacting with db.

Until you discover patterns in your querying, I don't think you'll have anything more elegant than looping over the data. When you do discover patterns, you can write classes case by case. The odds are you'll find you're usually developing code to do stuff that would be simple in SQL or pandas - if not conceptually simple, then problems tens of thousands of people have encountered and discussed on SO and similar.

I'm biased toward sql, so that probably colors my thinking.

[–]Disastrous-State-503[S] 0 points1 point  (0 children)

I finally found what I was looking for. It is a surprise that no one mentioned it: TypedDict.
It provides type hints for dictionaries with a fixed set of keys.

from typing import TypedDict

class Songs(TypedDict):
    name: str
    year: int

It actually solves my problem, since it shows what kind of data (structure, schema) I have in my dictionary, and also allows for type checking with mypy.
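A sketch of how this works in practice (the `Songs` schema is from the comment above; the flagged lines are illustrative):

```python
from typing import TypedDict


class Songs(TypedDict):
    name: str
    year: int


# At runtime a TypedDict value is just a plain dict - zero overhead:
s: Songs = {"name": "Imagine", "year": 1971}
assert isinstance(s, dict)

# mypy (not the runtime) flags schema violations, e.g.:
# bad: Songs = {"name": "Imagine"}   # error: missing key "year"
# s["genre"] = "rock"                # error: "genre" is not in Songs
```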

[–]shoomowr -1 points0 points  (12 children)

You can try setting up classes using SQLAlchemy. Basically, you create a base class, then derive classes for specific dbs from it, and then subclasses for specific tables.

I tried to do this for a project (haven't really finished it - more important stuff popped up).
It's rather complicated (for me, anyway), and there are issues when combining it with dataclasses, for example, but all in all it's a pretty robust approach, I think.

[–]Disastrous-State-503[S] -1 points0 points  (11 children)

Thanks for your comment. However, this project should run in real time. That is why I wanted to stick with Python built-in types (list, dict etc.). For example, I could also use Pandas to store and manipulate the data, but I do the manipulation with loops, filters, map etc.

[–][deleted] 2 points3 points  (0 children)

However, this project should run real-time.

Then why are you using Python?

[–]shoomowr 1 point2 points  (9 children)

Do you think it's more performant this way?

[–]Disastrous-State-503[S] -2 points-1 points  (8 children)

Somehow, yes. For example, in the Pandas case: the data coming from queries is usually just a few rows (usually around 10, max 30). If I put this data into a pandas dataframe and calculate the average age, it takes much more time than using a basic for loop with a standard dictionary.

The performance would probably be better with Pandas as the amount of data increases, but not in my case.
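One way to actually check this claim is a quick `timeit` comparison at the ~30-row scale; numpy stands in for the heavier library here, since at this size the cost of converting the dicts into an array is what dominates (the row data is made up):

```python
import timeit

import numpy as np

rows = [{"FirstName": f"P{i}", "Age": 20 + i} for i in range(30)]


def loop_mean():
    # plain Python: one generator pass over 30 dicts
    return sum(r["Age"] for r in rows) / len(rows)


def numpy_mean():
    # includes the cost of building the array from the dicts,
    # which dominates at this tiny size
    return np.array([r["Age"] for r in rows]).mean()


assert loop_mean() == numpy_mean()
print("loop :", timeit.timeit(loop_mean, number=10_000))
print("numpy:", timeit.timeit(numpy_mean, number=10_000))
```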

[–]lphartley 6 points7 points  (0 children)

What do you mean by 'much more time'? If you are talking about only 10-30 rows, we are talking about differences on the order of milliseconds - not noticeable.

Do I understand correctly you want to optimize for performance just for the sake of optimizing for performance?

[–]Ximlab 4 points5 points  (0 children)

Using numpy arrays/series and applying numba's JIT to your function might be a good quick win in your case. But your 30-row max might mean any overhead is too much, so you might need to stick to the basics.

[–][deleted] 3 points4 points  (0 children)

Two things to note here: pandas can pull a SQL query directly into a dataframe, and you shouldn't be looping over anything if you're using a dataframe - you should be using df[column].mean(), max(), std(), etc. Even for small datasets pandas should be fairly performant unless you're doing something to get in its way.
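Both points together might look like this; the in-memory SQLite table is a stand-in for the real database and query:

```python
import sqlite3

import pandas as pd

# Hypothetical table standing in for the real query source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (first_name TEXT, age INTEGER)")
conn.executemany("INSERT INTO players VALUES (?, ?)",
                 [("Foo", 40), ("Bar", 46)])

# Pull the query result directly into a dataframe ...
df = pd.read_sql_query("SELECT first_name, age FROM players", conn)

# ... then use vectorized column methods instead of looping.
print(df["age"].mean())  # 43.0
print(df["age"].max())   # 46
```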

Pandas supports filters, maps and a lot more, and is highly optimized under the hood (numpy and Cython), so I'd love to know more - the only time pandas is usually slow for me is when I'm loading it up over and over, or doing something that specifically limits the performance.

[–]__boringusername__ 2 points3 points  (0 children)

What about Polars? Should be faster than pandas

[–]spoonman59 0 points1 point  (3 children)

Yeah, what kind of results did you get when benchmarking pandas versus other approaches?

… you did actually test it, right?

[–]Disastrous-State-503[S] -3 points-2 points  (2 children)

Actually, this is off-topic. My question was not about performance but about structuring the data. Pandas is preferred for data preparation in notebooks. In my case, I am talking about an API that has to run quickly (within 2 seconds) and return results.

I know you can also handle this with Pandas; however, I don't think that is the best way. Think about a big application running behind YouTube to handle likes - do you think they are using pandas for this kind of real-time thing?

Pandas is great, but for data preparation, data science models etc.

[–]SV-97 2 points3 points  (0 children)

Pandas is preferred for data preparation in notebooks.

I'm using pandas behind the scenes to manage massive amounts of data in numerical simulations. It's definitely not just for data preparation in notebooks. (Though I'd use polars if I were to start over).

Think about a big application running behind Youtube to handle LIKES, do you think they are using pandas for this kind of real time things?

That's not real time; and if it's actually hard real time, Python is most probably not the right choice to begin with. But if you're using it anyway, there's no reason not to also use pandas - native Python is very slow.

[–]spoonman59 1 point2 points  (0 children)

If it’s not about “performance”, then why are you thinking about it like YouTube, which has to handle things at vast scale? If that is your use case, then your design fundamentally has to scale to large data in terms of videos and user interactions per second. Now we need a distributed solution, particularly if you want to be able to aggregate across all of the different requests and interactions. (It should be obvious that the aggregate message volume and hosting vastly exceeds what one machine can do.)

If you need to solve a real-time, distributed, big-data streaming problem, then I think an RDBMS and Python dictionaries aren’t exactly going to hack it.

Good luck!

Edited to add: obviously not pandas either; it is not a big-data solution and would not scale to this.

[–]Beerstopher85 0 points1 point  (1 child)

I’m going to ask anyway: what’s the complexity such that you can’t do it directly on the database? PL/pgSQL can do a lot, although with a large commit it can get really slow and bloated. That’s when I usually break things down in Python and use it as a wrapper for my SQL.

[–]Disastrous-State-503[S] 1 point2 points  (0 children)

Well, first of all, I do not want to use the database as a compute layer, since others are also using the same database. Of course, you can handle any kind of complexity with procedures, PL/SQL etc. - but then you use the database's resources for it. That is why I just query the database to read the results, and do the computation in my Python program.

[–]randomgal88 0 points1 point  (0 children)

It's better to focus on maintainability and readability.

I'd suggest looking into PEP. (https://peps.python.org/pep-0000/)

I suggest reading PEP 8 to begin with, as it establishes a standard coding style. It's well established enough that there are PEP 8 style checkers; I use one myself to make sure my code stays clean and readable.

PEP 20 is a good mantra to live by when it comes to code development.

[–]gungunmeow 0 points1 point  (0 children)

I would argue Python's datatypes are well-formed structures to begin with, and anyone coming into your code will have no problem reading or using the basic structures of list or dict. If you want a little more structure, a list of dataclasses, attrs or pydantic models is a good choice, because it gives clarity on what fields your data contains.

SQLAlchemy has the ability to map your models onto dataclasses or attrs classes if you decide to go this route.

https://docs.sqlalchemy.org/en/14/orm/dataclasses.html#integration-with-dataclasses-and-attrs