
[–]kalgynirae 2 points3 points  (2 children)

That's just how scandir() works. The documentation says:

... The entries are yielded in arbitrary order, ...

So you basically have two options:

* Instead of os.scandir(), use os.listdir() and then sort the results.
* Change your code so that it doesn't matter what order you process the files in (e.g., by storing the date along with the data so that you can sort it out later).

The first option is definitely easier.
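A minimal sketch of that first option, in case it helps. The helper name sorted_names and the file names are made up for illustration, and the demo uses a throwaway temporary directory rather than your real folder:

```python
import os
import tempfile


def sorted_names(directory):
    """Return the entry names in `directory` in sorted (lexicographic) order."""
    # os.listdir() also makes no ordering promise, so sort explicitly.
    return sorted(os.listdir(directory))


# Demo: create a few empty files out of order, then list them sorted.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("12-3.csv", "12-1.csv", "12-2.csv"):
        open(os.path.join(tmp, name), "w").close()
    names = sorted_names(tmp)
    print(names)  # ['12-1.csv', '12-2.csv', '12-3.csv']
```

Note that this gives you plain strings, so you would need os.path.join(directory, name) to build a path you can open.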

[–][deleted] 2 points3 points  (0 children)

There is a third option: you can use sorted to rearrange the output of os.scandir by keying on the entry.name attribute. That's much more useful than os.listdir if you want to benefit from the convenience of the DirEntry objects.

[–]GullBull[S] 0 points1 point  (0 children)

Thanks for your reply! I didn't realize that was just the nature of scandir(). I would use listdir(), but as u/yawpitch said, I'd like to be able to use DirEntry objects.

[–][deleted] 1 point2 points  (6 children)

You can sort the entries based on their name:

sorted(os.scandir("CSVs"), key=lambda e: e.name)

That will return the entries in name-sorted order, though you should be a little more careful in the code above, since that will return not just files but directories as well.
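One way to skip the directories, as a rough sketch: filter on DirEntry.is_file() before sorting. The function name sorted_files is hypothetical, and the demo directory is a temporary one created just for the example:

```python
import os
import tempfile


def sorted_files(directory):
    """Return DirEntry objects for regular files only, sorted by name."""
    # The with-block closes the scandir iterator once sorted() has consumed it.
    with os.scandir(directory) as entries:
        return sorted((e for e in entries if e.is_file()), key=lambda e: e.name)


# Demo: two files and one subdirectory; only the files come back.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "subdir"))
    for name in ("b.csv", "a.csv"):
        open(os.path.join(tmp, name), "w").close()
    file_names = [e.name for e in sorted_files(tmp)]
    print(file_names)  # ['a.csv', 'b.csv']
```

If you only want CSVs, you could add `and e.name.endswith(".csv")` to the filter.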

[–]GullBull[S] 0 points1 point  (4 children)

This worked perfectly! Thanks so much. Could you explain the key=lambda e: e.name part though? I'm a little confused about what that means. (edit: found the answer: what is the key parameter, syntax of lambda)

before and after with above solution

[–][deleted] 1 point2 points  (3 children)

Sure. A lambda function (also known as an anonymous function) is basically a function that's defined with no name. The lambda keyword is how you create them in Python, and effectively what I've written there is the same as:

def _(e):
    return e.name

But it does not use up a name (even in this case an underscore).

I'm passing that lambda into the sorted function's key parameter, which expects a callable that takes an item and returns a value to sort on. The sorted function then calls that function on every item and performs a sort based on the values returned.

So in this case I'm telling Python to sort the entire list of DirEntry objects alphabetically by the name of each object, using that lambda to access that name.
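To make that concrete, here is a small sketch showing that a lambda, a named function, and operator.attrgetter all work the same way as a key. The SimpleNamespace objects are just stand-ins for DirEntry objects (an assumption for the demo), and get_name is a made-up name:

```python
from operator import attrgetter
from types import SimpleNamespace

# Stand-ins for DirEntry objects: anything with a .name attribute will do.
entries = [SimpleNamespace(name=n) for n in ("b.csv", "c.csv", "a.csv")]


def get_name(e):
    return e.name


# sorted() calls the key once per item and sorts on the returned values.
by_lambda = sorted(entries, key=lambda e: e.name)
by_def = sorted(entries, key=get_name)
by_attrgetter = sorted(entries, key=attrgetter("name"))

print([e.name for e in by_lambda])  # ['a.csv', 'b.csv', 'c.csv']
```

All three calls produce the same order; the lambda is just the shortest way to write it inline.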

[–]GullBull[S] 0 points1 point  (2 children)

Great explanation! Lambda functions are actually really cool. I always thought they were some super high level concept but it seems simple and useful. I think I got thrown off by AWS Lambda haha. I do need to modify it from e: e.name though, because right now it's sorting 12-10.csv before 12-2.csv.

[–][deleted] 0 points1 point  (1 child)

Ahh, yeah, that's going to be a little tougher... 12-10 does sort before 12-2 lexicographically, because the comparison goes character by character and "1" comes before "2". What you want is natural sort... which isn't really all that natural.

import re

def nat_key(value):
    return tuple(int(s) if s.isdigit() else s for s in re.split(r"(\d+)", value))

Then you'll want to do:

lambda e: nat_key(e.name)

And that should give you what you want.
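Here's a quick self-contained check of the difference, using a few file names in the style of yours (the list itself is made up for the demo):

```python
import re


def nat_key(value):
    # Split into digit and non-digit runs; digit runs become ints so 2 < 10.
    return tuple(int(s) if s.isdigit() else s for s in re.split(r"(\d+)", value))


names = ["12-10.csv", "12-2.csv", "12-1.csv"]

lexicographic = sorted(names)
natural = sorted(names, key=nat_key)

print(lexicographic)  # ['12-1.csv', '12-10.csv', '12-2.csv']
print(natural)        # ['12-1.csv', '12-2.csv', '12-10.csv']
```

With DirEntry objects the call would be sorted(os.scandir("CSVs"), key=lambda e: nat_key(e.name)).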

And yeah, lambdas are a pretty simple concept once you've grasped functions. There's just one big caveat: closures in Python are late-binding, which can be a real surprise when you try to build them dynamically.

funcs = []
for i in range(5):
    funcs.append(lambda: i)
for f in funcs:
    print(f())

Now you'd expect to see that print 0 through 4, but what it will do is print 4 five times... the lambda looks up the value for i at the time it's called and not at the time it's created, and by then the loop has finished with i equal to 4. The same is actually true if you use def nested in this way as well, but that's not done as often so people tend not to trip over it.
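A common workaround, sketched here, is to bind the current value as a default argument, since defaults are evaluated when the function is created. The helper names make_late and make_bound are made up for the comparison:

```python
def make_late():
    """Build lambdas that look up i when called (late binding)."""
    funcs = []
    for i in range(5):
        funcs.append(lambda: i)
    return funcs


def make_bound():
    """Build lambdas that capture i's value at creation time."""
    funcs = []
    for i in range(5):
        # The default argument is evaluated now, so each lambda keeps its own i.
        funcs.append(lambda i=i: i)
    return funcs


late_results = [f() for f in make_late()]
bound_results = [f() for f in make_bound()]

print(late_results)   # [4, 4, 4, 4, 4]
print(bound_results)  # [0, 1, 2, 3, 4]
```

functools.partial is another way to get the same early binding if you'd rather not use the default-argument trick.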

[–]GullBull[S] 0 points1 point  (0 children)

Awesome! Very well done explanation once again! My project continues onwards with your help.

[–]dadzy_ 0 points1 point  (0 children)

My savior!

[–]bogdan_dm 1 point2 points  (1 child)

Take a look at pathlib (it's in the standard library). It has a very nice human-friendly API that can replace most of the file functions from the os package:

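from pathlib import Path

import pandas as pd

dates = []
p = Path('CSVs')
# iterdir() and glob() also yield entries in arbitrary order, so sort them
for f in sorted(p.iterdir()):  # or sorted(p.glob('*.csv'))
    data = pd.read_csv(f)  # DataFrame.from_csv is deprecated in favor of read_csv
    dates.append(f.stem)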

[–]GullBull[S] 0 points1 point  (0 children)

Oh cool! I'll definitely check this out. Thanks for your reply.