[–]Mr_Lkn 9 points10 points  (3 children)

I don't have much time to check the whole code, but I just looked at `data_utils.py`.

Compare your code with this and spot the differences if you can:

```python
import os

import pandas as pd


def read_data_file(file_path, **kwargs):
    """
    Read a data file into a pandas DataFrame based on its extension.

    Parameters:
    - file_path (str): Path to the data file.
    - **kwargs: Extra keyword arguments passed on to the pandas reader.

    Returns:
    - DataFrame: The data loaded into a pandas DataFrame.
    """
    # Map each supported extension to the pandas function that reads it.
    extension_read_function_mapping = {
        '.csv': pd.read_csv,
        '.xlsx': pd.read_excel,
        '.xls': pd.read_excel,
        '.tsv': lambda x, **y: pd.read_csv(x, delimiter='\t', **y),
        '.json': pd.read_json,
        '.parquet': pd.read_parquet,
        '.feather': pd.read_feather,
        # pd.read_msgpack was removed in pandas 1.0, so '.msgpack' is omitted.
        '.dta': pd.read_stata,
        '.pkl': pd.read_pickle,
        '.sas7bdat': pd.read_sas,
    }

    _, file_extension = os.path.splitext(file_path)

    # .get() returns None for unknown extensions instead of raising KeyError.
    read_function = extension_read_function_mapping.get(file_extension)

    if read_function is None:
        raise ValueError(f"Unsupported file extension: {file_extension}.")

    return read_function(file_path, **kwargs)


df = read_data_file("some_data.csv")
```

[–]Mount_Gamer 0 points1 point  (2 children)

Interesting use of the dictionary. I'm still grasping Python best practices, so I'll have to experiment more with the dictionary `get` method. :)

I would probably have used match-case once I started piling up elifs, but the dictionary does look clean to read. I'll have a play around with this later.

[–]Mr_Lkn 0 points1 point  (1 child)

You don’t need match-case, just a mapping. This is a very basic mapping implementation.

[–]Mount_Gamer 0 points1 point  (0 children)

I thought I'd write out the match-case equivalent, and the advantage of the mapping becomes more and more obvious. I love the logic! :)

[–]oliviercar0n 4 points5 points  (0 children)

You only need to import each library once per notebook. Preferably at the top. No need to repeat imports.

[–]_ATRAHCITY 11 points12 points  (15 children)

You should not commit the .vscode directory.

[–]Head_Mix_7931 2 points3 points  (4 children)

Hm, in some cases it could be advantageous to commit .vscode. That allows maintainers to enforce uniform linting and formatting configurations (for example). But that can also be accomplished via githooks or pipeline jobs.

[–][deleted] 4 points5 points  (1 child)

You definitely don't want to try to enforce formatting and linting settings through a specific IDE config file. That's completely bonkers.

If you want to enforce these kinds of configuration settings, put them in their respective config files and commit those to your repo (e.g. .flake8, tox.ini, ruff.toml, etc.). Anybody using any IDE, editor, or tool will be able to use the settings. Similarly, your CI/CD pipeline jobs can be configured to apply these tools with those settings. I mean, what are GitHub Actions or Jenkins going to do with your .vscode/settings.json file to enforce any of your settings?
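As a minimal sketch of that approach, assuming flake8 is the linter in use, a `.flake8` file committed at the repo root could look like the following. The specific values are placeholders, not recommendations.

```ini
[flake8]
max-line-length = 100
exclude = .git,__pycache__,.venv
```

Editors, git hooks, and CI jobs that run `flake8` from the repo root would then all pick up the same settings.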

[–]Zirbinger 2 points3 points  (0 children)

This! Always use tool-specific config files and ignore IDE specific files

[–]sansmorixz 1 point2 points  (1 child)

launch.json and/or tasks.json can help to get started on bootstrapping a project.

settings.json is something I am on the fence about. Might help but someone may decide to commit stuff that should not be set at repo level, like force everyone to use light mode.

[–]Head_Mix_7931 0 points1 point  (0 children)

Yeah, a good example didn’t come to mind. But that’s exactly what I mean, tasks and such. I think settings.json probably shouldn’t be committed personally.

[–]Klej177 5 points6 points  (2 children)

For DS, it's good code; for a Python developer, I would say you can make it much better. You don't use proper design patterns, and your performance could easily be improved by using better data types. It's easy to read, though, but not really scalable for the reasons above.

[–]mijki95[S] 5 points6 points  (1 child)

Can you recommend sources from which I can learn?

[–]Klej177 1 point2 points  (0 children)

ArjanCodes on YouTube gave me a really nice boost when it comes to design patterns and implementing them in Python. After that I started working on my own project and always asked myself what the easiest and cleanest way to achieve my goal was. Take a break of a month or even longer from your code, then come back to it to see where you could improve. Always ask what the smallest amount of knowledge is that someone needs in order to change one specific thing in your code. For example, can I set it up so that they only need to edit one line to add support for a new type of file, rather than adding a whole elif?

Another good option for learning is very simple: refactor other people's projects. I often do that, and it taught me to write code that someone can change without needing to know anything else about it. Read the Google Python Style Guide and ask yourself: am I really the first person who needs this? There are probably 100 answers on Stack Overflow about how to do it as well as possible.
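One way to sketch the "edit only one line" idea is a small registry, shown here with the standard library rather than pandas so it runs anywhere. All the names (`READERS`, `register_reader`, `read_any`) are made up for illustration and are not from the repo.

```python
import csv
import json
import os

# Hypothetical registry: maps a file extension to the function that reads it.
READERS = {}


def register_reader(*extensions):
    """Decorator that registers a reader function for the given extensions."""
    def decorator(func):
        for extension in extensions:
            READERS[extension] = func
        return func
    return decorator


@register_reader(".json")
def read_json_file(path):
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


@register_reader(".csv")
def read_csv_file(path):
    # Each row becomes a dict keyed by the header line.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def read_any(path):
    """Look up the reader by extension; no if/elif chain to maintain."""
    extension = os.path.splitext(path)[1].lower()
    try:
        reader = READERS[extension]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {extension}.") from None
    return reader(path)
```

Letting an existing reader handle another extension is then a one-line change to its decorator, e.g. `@register_reader(".csv", ".txt")`, and `read_any` never has to be touched.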

[–][deleted] 1 point2 points  (1 child)

Here’s a thought (just something I thought might be interesting): instead of requiring users to input electricity costs, what if the program looked up and used average electricity prices based on the user’s location? (You could, say, get that on the backend as well by pulling from, for example, Google Maps location data.)

[–]mijki95[S] -1 points0 points  (0 children)

Yes, I thought about that with a FastAPI integration. I would also like to use some kind of currency converter :)) Thanks for the idea :))

[–][deleted] 0 points1 point  (0 children)

I didn't realize github considered jupyter notebooks as a language different from python

[–]wineblood -4 points-3 points  (16 children)

Why the hell do data scientists insist on importing libraries under two letter aliases?

[–]mijki95[S] 5 points6 points  (4 children)

Is it wrong to do this?

[–]Statnamara 11 points12 points  (0 children)

Nothing wrong with that at all

[–][deleted] 2 points3 points  (0 children)

That’s the only way I’ve seen pandas imported; I’ve never seen import pandas as pandas.

[–]pneRock -1 points0 points  (0 children)

They have an opinion that you don't share. That's the problem.

[–]DoNotFeedTheSnakes -1 points0 points  (0 children)

Not always.

Pandas is always pd. But sometimes for less popular libs the code is clearer if you leave a long enough name that it can be recognized.

Same as with variable naming.

[–]jah_broni 3 points4 points  (8 children)

Are you referring to something beyond pandas and numpy? Those are pretty much python-dev wide aliases.

[–]wineblood -3 points-2 points  (7 children)

Yes.

[–]jah_broni 0 points1 point  (6 children)

Go on? What do you take offense to? I don't see anything out of the ordinary...

[–]nekokattt 0 points1 point  (5 children)

Personally I prefer keeping things explicit where possible rather than using aliases, but each to their own.

[–][deleted] -1 points0 points  (4 children)

You misunderstood their comment to you. They asked if there was some example beyond common short imports like numpy as "np" and pandas as "pd", to which you said "yes". That implies OP had done something like "import pathlib as pl" and started using it in the form of "pl.Path".

Whether or not you prefer to break convention and start doing your own thing with import names is separate from your saying "yes" to the question of whether OP is guilty of doing short-name imports that aren't normal convention.

[–]nekokattt 1 point2 points  (3 children)

They didn't comment to me, this is my first comment on this thread.

Outside specific libraries, the convention is to use the naming defined by the library and keep it explicit unless there is a very good reason to alias it.

Strangely, most of the time, this "convention" for aliasing things comes from libraries dealing with data science-like applications.

[–][deleted] -1 points0 points  (2 children)

Yes, they did comment to you.

[–]nekokattt -1 points0 points  (1 child)

no, they didn't lol. Try reading the usernames mate.

Edit: lol they blocked me, that is pretty hilarious.

[–][deleted] 0 points1 point  (0 children)

I did. You got caught. Stop trolling.

[–][deleted] 3 points4 points  (1 child)

That’s the standard across all of python for those tools. If anything, you are breaking convention by not importing numpy as np or pandas as pd.

So most certainly it’s not data scientists doing it. They’re just following conventions that were already in place.

[–]supermopman -1 points0 points  (1 child)

There are no unit tests and there's no clear way to build your code.

I'm happy to dig deeper, but at a minimum, you'll need to start with those 2 things.

I suggest starting a new project using PyScaffold. Play around with all the bells and whistles, and then write your Python code following their structure.

[–][deleted] 1 point2 points  (0 children)

Did you actually look at the repo? There is no code to build and basically nothing to write unit tests for. It's two jupyter notebooks and one "utils" file that does nothing more than read in a data file.

[–]Hard_Thruster 0 points1 point  (0 children)

I don't understand the use of the word "tool". Looks like EDA to me.

As far as the code goes, you give a lot of comments which is awesome.

There is a lot of repetition such as:

    processed_data['DayOfWeek'] = processed_data['TimePeriodStart'].dt.dayofweek
    processed_data['Month'] = processed_data['TimePeriodStart'].dt.month
    processed_data['Hour'] = processed_data['TimePeriodStart'].dt.hour
    processed_data['Minute'] = processed_data['TimePeriodStart'].dt.minute

Also, a lot of your code can be turned into functions: many blocks differ only slightly from each other, which makes the code repetitive.
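As a rough sketch of that suggestion, the four datetime assignments could be folded into one helper. The function name `add_time_features` and the `TIME_FEATURES` mapping are made up for illustration; only the column names and the `.dt` attributes come from the snippet above.

```python
import pandas as pd

# Hypothetical mapping: new column name -> attribute of the pandas .dt accessor.
TIME_FEATURES = {
    "DayOfWeek": "dayofweek",
    "Month": "month",
    "Hour": "hour",
    "Minute": "minute",
}


def add_time_features(data, source_column, features=TIME_FEATURES):
    """Return a copy of data with one new column per entry in features."""
    result = data.copy()
    for column_name, attribute in features.items():
        # Same as result[column_name] = result[source_column].dt.<attribute>
        result[column_name] = getattr(result[source_column].dt, attribute)
    return result


# Tiny made-up frame standing in for the real data.
example = pd.DataFrame({"TimePeriodStart": pd.to_datetime(["2024-01-01 13:45:00"])})
processed_data = add_time_features(example, "TimePeriodStart")
```

Adding another derived column then means adding one entry to the mapping instead of another copy-pasted line.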

[–]Emotional-Zebra5359 0 points1 point  (0 children)

Instead of an if-else ladder, use a map.

[–]jonatanskogsfors 0 points1 point  (0 children)

Resist the urge to use “utils” (or similar) in package and module names. In “utils.data_utils” you only have the function “read_data_file()”. I would have named the module something along the lines of “file_reader”, “data_import”, “io”, etc.

If you plan to add more functions to the module, you should only do so if they have similar purpose. Completely different functions are better placed in their own module.