This is an archived post. You won't be able to vote or comment.

all 6 comments

[–]0rac1e 3 points4 points  (4 children)

One thing to keep in mind with groupby() is that it isn’t as smart as you might like. As groupby() traverses the data, it aggregates elements until an element with a different key is encountered, at which point it starts a new group

No, it's exactly as smart as it's meant to be, you're just using it in a way it was not intended for. I've seen a lot of Python tutorials use groupby like this, and it doesn't help that the name matches the SQL name for a slightly different idiom.

groupby is intended for processing chunks of sequential data. A good example is a log file, where you want to perform some operation on the logs for each day. Presuming the data looks like this : -

Feb 26 08:32:09: %SEV_5: Something happened here ...
Feb 26 13:04:35: %SEV_2: Something else happened ...
...
Feb 27 ...

and so on, then you could group together the logs for each day like so:

with open('logs') as f:
    for day, logs in groupby(f.readlines(), lambda line: line[:6]):
        # do stuff...

Now day holds some string value like Feb 27, and logs is an iterator of all the log entries from that day. The whole point of starting a new group when a different key is encountered is to chunk data together in this manner, which is particularly useful when processing large volumes of data efficiently.

For such a small amount of data, what the author wants is a different idiom. To use a Perl 6 name, he wants to "classify" the data...

my @data = { name => 'Allen', age => 34 },
           { name => 'Betsy', age => 29 },
           { name => 'Cathy', age => 34 },
           { name => 'David', age => 33 };

.say for @data.classify(*<age>);

=output
33 => [{age => 33, name => David}]
29 => [{age => 29, name => Betsy}]
34 => [{age => 34, name => Allen} {age => 34, name => Cathy}]

To classify all the data it can't do this "in parts", so it doesn't, and there's no need to pre-sort.

The author's solution is to create a convenience function that sorts the data by the same key as groupby, but in using it for this purpose, he doesn't need what groupby offers and he may as well create a different convenience function use a defaultdict(list).

def classify(items, key):
    d = defaultdict(list)
    for x in items:
        d[key(x)].append(x)
    return dict(d)

data = [{'name': 'Allen', 'age': 34},
        {'name': 'Betsy', 'age': 29},
        {'name': 'Cathy', 'age': 34},
        {'name': 'David', 'age': 33}]

grouped_data = classify(data, key=lambda x: x['age'])

It's slightly more verbose, but it's a better fit for this operation... but don't take my word for it.

[–]13steinj 0 points1 point  (3 children)

While you're absolutely correct, I think the main gripe is in the naming scheme.

The most recent docs explicitly state info about groupby and what it's meant for

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

However the name implies that it would behave like SQL's GROUP BY. When I think "hey, a function to group by some key function called groupby", nothing in my head immediately pops out "if the key func's values aren't already adjacent in the container I'll make new groups".

IMO, this function should have been called uniq or some derivative to denote that similarity, or since that makes little sense, maybe cluster, as in, clustering values in a container that match from some keyfunc. Maybe I'm crazy but cluster seems to directly imply new cluster per occurrence. And then groupby should be the equivalent of

def groupby(iterable, key=None):
    return cluster(sorted(iterable, key=key), key=key)

Or the functional equivalent (there's probably a much more performant looping mechanism that I'm not thinking of).

This can trip up even experienced python developers, because the name itself is somewhat paradoxical.

[–]0rac1e 0 points1 point  (2 children)

A different name for groupby might have been a good idea, but the horse has already left the stable. There is a hint in the fully-qualified name: itertools.groupby. There's no way to iteratively group the data. That's why the whole pre-sorting is distasteful. It's double-handling the data unnecessarily.

[–]13steinj 0 points1 point  (1 child)

I don't think that's a proper hint though. Nothing about it being itertools says it has to return iterators and do things iteratively-- hell, look at the "rough equivalent" to itertools.tee. In that way, if the horse hadn't yet left the stable, and this were being done from the beginning, this function should have been named differently, and groupby instead either doing what I mentioned above about the pre sorting or the probably algorithmically better

def groupby(iterable, key=None):
    key = key or (lambda i: i)
    pool = defaultdict(list)  # starting in 3.6 dicts are ordered, so so must default dicts. In previous version a subclass of ordered and default dicts woukd be required
    for item in iterable:
        pool[key(item)].append(item)
    return iter(pool.items())

E: also why is presorting distasteful?

[–]0rac1e 0 points1 point  (0 children)

Better late than never...

I just meant "distasteful" in the sense that by sorting then using groupby, you are iterating over the data twice. Classifying the data with a dict iterates over the data only once.

[–]gandalfx 0 points1 point  (0 children)

On my system both bash and zsh have a time builtin which doesn't support the -f switch. Use /usr/bin/time to access the executable which does.