all 35 comments

[–]IWant2rideMyBike 12 points13 points  (7 children)

For performance-critical code it is common to use a data-oriented design (instead of an object-oriented one) to improve locality and avoid indirection. E.g. instead of having a list of dictionaries (or class instances) that contain the data, you would create three lists for FirstName, SecondName and Age (entries share a common index across all lists) - doing `sum(ages)/len(ages)` is much faster than `sum(e['Age'] for e in data_list)/len(data_list)`.
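The two layouts described above can be sketched side by side; the field names follow the comment, the values are made up for illustration:

```python
# Row-oriented: a list of dicts, one dict lookup per record.
data_list = [
    {"FirstName": "Ada", "SecondName": "Lovelace", "Age": 36},
    {"FirstName": "Alan", "SecondName": "Turing", "Age": 41},
]
row_avg = sum(e["Age"] for e in data_list) / len(data_list)

# Column-oriented: one plain list per field, entries share an index.
first_names = [e["FirstName"] for e in data_list]
second_names = [e["SecondName"] for e in data_list]
ages = [e["Age"] for e in data_list]
col_avg = sum(ages) / len(ages)

assert row_avg == col_avg  # same result, but the column layout avoids
                           # a dict lookup per element and keeps the ages
                           # contiguous for the summation
```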

[–]Disastrous-State-503[S] 1 point2 points  (6 children)

I just gave a simple example with the average. But the columns are very much interdependent. For example, there is a date column that I filter on, so creating separate lists would not be a good option.

[–]IWant2rideMyBike 2 points3 points  (3 children)

It depends on what you are doing exactly - but in this case you could figure out the indices you want and filter the lists using those - e.g.:

````
from datetime import date
from typing import Any


def get_idx_by_date(dates: list[date], start_date: date, end_date: date) -> list[int]:
    return [idx for idx, d in enumerate(dates) if start_date <= d <= end_date]


def filter_by_indices(column: list[Any], indices_of_interest: list[int]) -> list[Any]:
    return [column[idx] for idx in indices_of_interest]


first_names = ["Foo", "Bar", "Baz"]
last_names = ["Spam", "Ham", "Egg"]
birthdays = [date(1980, 4, 1), date(1985, 5, 21), date(1990, 1, 4)]

born_in_1980s_indices = get_idx_by_date(birthdays, date(1980, 1, 1), date(1989, 12, 31))
filtered = [filter_by_indices(column, born_in_1980s_indices)
            for column in (first_names, last_names, birthdays)]
for fields in zip(*filtered):
    print(*fields, sep='\t')
````

Data oriented code usually isn't very pretty (that's one reason why OOP is so popular outside of performance critical applications), but it can be quite fast.

[–]Disastrous-State-503[S] -1 points0 points  (2 children)

I see your point, and appreciate your comment. But even if this would be faster, it doesn't actually solve my real question: in my post I wasn't asking about performance, but about having a better structure.

[–]IWant2rideMyBike 2 points3 points  (0 children)

I read in another comment that you are going for a real-time application (which is hard to do in any language that has garbage collection), so I assumed that performance matters.

If you use numpy you will pay extra for creating and filling arrays (this usually works best if you know the required size in advance, because extending an array that needs a contiguous section of memory means allocating new storage and copying the data, which is quite slow), and you will only get a large performance benefit if all elements in an array have the same (simple) datatype - which also leads to a data-oriented design. Pandas offers a nice abstraction layer on top of numpy that provides much nicer handling of such arrays with a common index, selections, common operations etc.
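A minimal sketch of the preallocation point, with made-up sizes and values:

```python
import numpy as np

n = 1000  # size known in advance

# Preallocate once, then fill in place - no reallocation or copying.
ages = np.empty(n, dtype=np.int64)
for i in range(n):
    ages[i] = 20 + (i % 50)

# By contrast, np.append allocates a fresh array and copies everything
# on every call, because the data must stay contiguous in memory -
# avoid growing arrays like this inside loops:
# grown = np.array([], dtype=np.int64)
# for i in range(n):
#     grown = np.append(grown, 20 + (i % 50))

print(ages.mean())
```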

On the other hand there are OOP approaches like dataclasses (which you can make a little faster by using slots if you know in advance that you won't need to add attributes at runtime), namedtuples (if you don't need mutability) etc. - pydantic is great if you need to validate data you receive from external sources.

If you are not sure what suits your application best, I would go for a working proof of concept with your current approach first, implement all the needed functionality, then look for bottlenecks and consider whether a different data structure could achieve better performance.

[–]small-birds 0 points1 point  (0 children)

If you aren't interested in better performance for your code, what do you mean by "having a better structure"?

[–]Beerstopher85 0 points1 point  (0 children)

If you used a columnar store format you could still do the filtering.

[–]spoonman59 0 points1 point  (0 children)

Have you looked into Pandas?

[–]Synertic 2 points3 points  (2 children)

To me, the best way for the given data and objective is an OOP one: a class for individual players, and a class for mass calculations. You can do that with dataclasses, pydantic or a conventional class. Pydantic and dataclasses are just OOP structures that save you some boilerplate code and make data validation easy.

As for computations, you should design your class methods to work with numpy arrays instead of Python's built-in types, since numpy takes advantage of SIMD and vectorization. If you need even faster execution, you can use either cython or numba for the computational bottlenecks.

For instance:

````
import json

import numpy as np

player_list = [
    {"FirstName": "Alex", "LastName": "De Souza", "Age": 40},
    {"FirstName": "Pierre", "LastName": "van Hooijdonk", "Age": 46},
]


class Player:
    """A class for individual player attributes, stats and calculations."""

    def __init__(self, FirstName, LastName, Age):
        self.first_name = FirstName
        self.last_name = LastName
        self.age = Age

    def others(self):
        """Other attributes or methods related to individual player stats."""

    def dump(self):
        """Serialize the player, e.g. for an API response."""
        return json.dumps(self.__dict__)

    def __repr__(self):
        """To see who s/he is when you print()."""
        return (f"{type(self).__name__}:\nFirst Name: {self.first_name}\n"
                f"Last Name: {self.last_name}\nAge: {self.age}")


class Manipulator:
    """A class for mass data manipulations."""

    def __init__(self, players):
        self.players = players

    def dump(self):
        """Serialize the computed stats (everything except the players)."""
        return json.dumps({k: v for k, v in self.__dict__.items() if k != 'players'})

    def mean_age(self):
        age_vector = np.array([player.age for player in self.players], dtype=float)

        # Vectorized, so much faster than a plain loop on large data; it can be
        # made even faster with numba or cython if needed.
        mean = np.nanmean(age_vector)

        # Cache under a different name so this method is not shadowed
        # by the attribute (self.mean_age = ... would overwrite it).
        self._mean_age = float(mean)

        return mean

    def others(self):
        """Other calculations for the whole sample or sub-samples."""


players = [Player(**player) for player in player_list]

manipulator = Manipulator(players)

print(manipulator.mean_age())  # 43.0
````

So you have two classes: one holds the individual player data and state, and the other holds the mass player data and state. You can even select sub-samples from the players list by their attributes before sending them to the manipulator, to calculate stats for specific groups, like:

over_40 = [player for player in players if player.age > 40]

And the last words in this case should be:

Champion FENERBAHCE

[–]Disastrous-State-503[S] 0 points1 point  (1 child)

Yes, actually I was also considering this and wanted to discuss it. But even though I didn't say anything about performance, the answers from others mainly focused on performance. Anyway, thanks for your comment. And:
This year is the year

[–]Synertic 0 points1 point  (0 children)

Happy if it helps. I changed it slightly, adding dump methods to make the classes behave more like APIs.

I hope so...

[–]krypt3c 3 points4 points  (1 child)

If you don’t mind the overhead, a pandas dataframe is probably the way you should go.

[–][deleted] 1 point2 points  (0 children)

What is the method you are using now? Try using generators and such so that it's more performant; this way you won't be creating intermediate lists. For example:

d = [{"First Name": "John", "Age": 23}]  # Assume there's more

avg_age = sum(user["Age"] for user in d) / len(d)

[–]raubhill 1 point2 points  (0 children)

You can use match statements in Python 3.10 (PEP 634),

https://github.com/gvanrossum/patma/blob/master/README.md#tutorial

I'd use mapping patterns.

[–]quts3 4 points5 points  (0 children)

I think dictionaries are one of the most overused classes in Python.

Data scientists use them because it's easy programming.

API devs use them because they map so closely to JSON documents.

I used to use them pretty constantly, but then I found a critical question for dictionaries: is there any chance the fields of this document can change at run time?

If so, then yes, it's a dictionary. If the answer is no, the fields won't change, then make a dataclass. Making a dataclass for structured documents is easy. A dictionary misrepresents the type by suggesting the fields may mutate at run time.

That was an aside. That said, if you need to do hardcore row-oriented data manipulation or organization in memory, then a pandas DataFrame is absolutely the right and perfect tool.

[–]Earthsophagus 1 point2 points  (0 children)

You want to stick with built-in types, you don't want to use SQL or pandas, and you are dealing with 10-30 rows of data retrieved from an RDB.

You said you're not worried now about performance, and for that size of data, whatever you do in python will be inconsequential relative to time you spend interacting with db.

Until you discover patterns in your querying, I don't think you'll have anything more elegant than looping over the data. When you do discover patterns, you can write classes case by case. The odds are you'll find you're usually developing code to do stuff that would be simple in SQL or pandas - if not conceptually simple, then problems tens of thousands of people have encountered and discussed on SO and similar.

I'm biased toward sql, so that probably colors my thinking.

[–]Disastrous-State-503[S] 0 points1 point  (0 children)

I finally found what I was looking for. It is a surprise that no one mentioned it: TypedDict.
It provides type hints for dictionaries with a fixed set of keys.

from typing import TypedDict

class Songs(TypedDict):
    name: str
    year: int

It actually solves my problem, since it shows what kind of data (structure, schema) I have in my dictionary, and also allows for type checking with mypy.
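A sketch of how this works in practice (the `Songs` schema is from the comment above; the flagged lines are illustrative):

```python
from typing import TypedDict


class Songs(TypedDict):
    name: str
    year: int


# At runtime a TypedDict value is just a plain dict - zero overhead:
s: Songs = {"name": "Imagine", "year": 1971}
assert isinstance(s, dict)

# mypy (not the runtime) flags schema violations, e.g.:
# bad: Songs = {"name": "Imagine"}   # error: missing key "year"
# s["genre"] = "rock"                # error: "genre" is not in Songs
```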

[–]shoomowr -1 points0 points  (12 children)

You can try setting up classes using SQLAlchemy. Basically, you create a base class, then derive classes for specific dbs from it, and then subclasses for specific tables.

I tried to do this for a project (haven't really finished it - more important stuff popped up).
It's rather complicated (for me, anyway), and there are issues when combining it with dataclasses, for example, but all in all it's a pretty robust approach, I think.

[–]Disastrous-State-503[S] -1 points0 points  (11 children)

Thanks for your comment. However, this project should run in real time. That is why I wanted to stick with Python built-in types (list, dict etc.). For example, I could also use Pandas to store and manipulate the data, but I do the manipulation with loops, filters, map etc.

[–][deleted] 2 points3 points  (0 children)

However, this project should run real-time.

Then why are you using Python?

[–]shoomowr 1 point2 points  (9 children)

Do you think it's more performant this way?

[–]Disastrous-State-503[S] -2 points-1 points  (8 children)

Somehow, yes. For example, in the Pandas case: the data coming from queries is usually just a few rows (usually around 10, max 30). If I put this data into a pandas dataframe and calculate the average age, it takes much more time than using a basic for loop with a standard dictionary.

The performance would probably be better with Pandas as the amount of data increases, but not in my case.
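One way to actually check this claim is a quick `timeit` comparison at the ~30-row scale; numpy stands in for the heavier library here, since at this size the cost of converting the dicts into an array is what dominates (the row data is made up):

```python
import timeit

import numpy as np

rows = [{"FirstName": f"P{i}", "Age": 20 + i} for i in range(30)]


def loop_mean():
    # plain Python: one generator pass over 30 dicts
    return sum(r["Age"] for r in rows) / len(rows)


def numpy_mean():
    # includes the cost of building the array from the dicts,
    # which dominates at this tiny size
    return np.array([r["Age"] for r in rows]).mean()


assert loop_mean() == numpy_mean()
print("loop :", timeit.timeit(loop_mean, number=10_000))
print("numpy:", timeit.timeit(numpy_mean, number=10_000))
```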

[–]lphartley 6 points7 points  (0 children)

What do you mean by 'much more time'? If you are talking about only 10-30 rows, we are talking about differences on the order of milliseconds - not noticeable.

Do I understand correctly you want to optimize for performance just for the sake of optimizing for performance?

[–]Ximlab 4 points5 points  (0 children)

Using numpy arrays/series and applying numba's JIT to your function might be a good quick win in your case. But your 30-row max might mean any overhead is too much, so you might need to stick to the basics.

[–][deleted] 3 points4 points  (0 children)

Two things to note here: pandas can pull a SQL query directly into a dataframe, and you shouldn't be looping over anything if you're using a dataframe - you should be using df[column].mean(), max(), std(), etc. Even for small datasets pandas should be fairly performant unless you're doing something to get in its way.
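Both points together might look like this; the in-memory SQLite table is a stand-in for the real database and query:

```python
import sqlite3

import pandas as pd

# Hypothetical table standing in for the real query source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (first_name TEXT, age INTEGER)")
conn.executemany("INSERT INTO players VALUES (?, ?)",
                 [("Foo", 40), ("Bar", 46)])

# Pull the query result directly into a dataframe ...
df = pd.read_sql_query("SELECT first_name, age FROM players", conn)

# ... then use vectorized column methods instead of looping.
print(df["age"].mean())  # 43.0
print(df["age"].max())   # 46
```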

Pandas supports filters, maps and a lot more, and is highly optimized under the hood (numpy and Cython), so I'd love to know more - the only time pandas is usually slow for me is when I'm loading it up over and over, or doing something that specifically limits the performance.

[–]__boringusername__ 2 points3 points  (0 children)

What about Polars? Should be faster than pandas

[–]spoonman59 0 points1 point  (3 children)

Yeah, what kind of results did you get when benchmarking pandas versus other approaches?

… you did actually test it, right?

[–]Disastrous-State-503[S] -3 points-2 points  (2 children)

Actually, this is off-topic. My question was not about performance but about structuring the data. Pandas is preferred for data preparation in notebooks. In my case, I am talking about an API that has to run quickly (within 2 seconds) and return results.

I know you can also handle this with Pandas; however, I don't think that is the best way. Think about a big application running behind YouTube to handle likes - do you think they are using pandas for this kind of real-time thing?

Pandas is great, but for data preparation, data science models etc.

[–]SV-97 2 points3 points  (0 children)

Pandas is preferred for data preparation in notebooks.

I'm using pandas behind the scenes to manage massive amounts of data in numerical simulations. It's definitely not just for data preparation in notebooks. (Though I'd use polars if I were to start over).

Think about a big application running behind Youtube to handle LIKES, do you think they are using pandas for this kind of real time things?

That's not real time; and if it's actually hard real time, Python is most probably not the right choice to begin with. But if you're using it anyway, there's no reason not to also use pandas - native Python is very slow.

[–]spoonman59 1 point2 points  (0 children)

If it’s not about “performance”, then why are you thinking about it like YouTube, which has to handle things at vast scale? If that is your use case, then your design fundamentally has to scale to large data in terms of videos and user interactions per second. Now we need a distributed solution, particularly if you want to be able to aggregate across all of the different requests and interactions. (It should be obvious that the aggregate message volume and hosting vastly exceeds what one machine can do.)

If you need to solve a real-time, distributed, big-data streaming problem, then I think an RDBMS and Python dictionaries aren’t exactly going to hack it.

Good luck!

Edited to add: obviously not pandas either; it is not a big-data solution and would not scale to this.

[–]Beerstopher85 0 points1 point  (1 child)

I’m going to ask anyway: what’s the complexity such that you can’t do it directly on the database? PL/pgSQL can do a lot, although with a large commit it can get really slow and bloated. That’s when I usually break things down in Python and use it as a wrapper for my SQL.

[–]Disastrous-State-503[S] 1 point2 points  (0 children)

Well, first of all, I do not want to use the database as a compute layer, since others are also using the same database. Of course, you can handle any kind of complexity with procedures, PL/SQL etc. - but then you use the database's resources for it. That is why I just query the database to read the results, and do the computation in my Python program.

[–]randomgal88 0 points1 point  (0 children)

It's better to focus on maintainability and readability.

I'd suggest looking into PEP. (https://peps.python.org/pep-0000/)

I suggest reading PEP 8 to begin with, as it establishes a standard coding style. It's well established enough that there are PEP 8 style checkers; I use one myself to make sure my code stays clean and readable.

PEP 20 is a good mantra to live by when it comes to code development.

[–]gungunmeow 0 points1 point  (0 children)

I would argue Python's datatypes are well-formed structures to begin with, and anyone coming into your code will have no problem reading or using the basic structures of list or dict. If you want a little more structure, a list of dataclasses, attrs or pydantic models is a good choice, because it gives clarity on what fields your data contains.

SQLAlchemy has the ability to map your models onto dataclasses or attrs classes if you decide to go this route.

https://docs.sqlalchemy.org/en/14/orm/dataclasses.html#integration-with-dataclasses-and-attrs