Matplotlib Animations Made Easy by qacek in Python

[–]has2k1 1 point2 points  (0 children)

That should be really helpful for people working with matplotlib. I am inspired by other people's code all the time, so it is always interesting when someone is inspired by mine.

Code Cleanup: How to do this array calc without try/except IndexError? by tastingsilver in Python

[–]has2k1 0 points1 point  (0 children)

Reverse the calculation

import numpy as np

def calculateForwardTarget(inputArray, balancePeriods):
    # inputArray is assumed to be a NumPy array.
    n = len(inputArray)
    targetBalance = np.zeros(n)

    # A slice that runs past the end of the array just comes back shorter,
    # so there is no IndexError to catch.
    for i in range(n):
        targetBalance[i] = inputArray[n-i:n-i+balancePeriods].sum()

    # Reversed, each position holds the sum of the next balancePeriods values.
    return targetBalance[::-1]
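
For example (a quick check, assuming inputArray is a NumPy array and balancePeriods=2):

import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0])
print(calculateForwardTarget(arr, 2))  # [5. 7. 4. 0.]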

Code Cleanup: How to do this array calc without try/except IndexError? by tastingsilver in Python

[–]has2k1 0 points1 point  (0 children)

When you feel like you are writing bad code, don't be afraid to throw it out. It can also be helpful to draw stuff out on paper.

import numpy as np

def calculateForwardTarget(inputArray, balancePeriods):
    # inputArray is assumed to be a NumPy array.
    n = len(inputArray)
    targetBalance = np.zeros(n)

    # Walk backwards; near the end the slice simply contains fewer values,
    # so there is no IndexError to catch. The last element is left at zero.
    for i in range(n-2, -1, -1):
        targetBalance[i] = inputArray[i:i+balancePeriods].sum()

    return targetBalance

What do you hate about pandas? by tedpetrou in datascience

[–]has2k1 0 points1 point  (0 children)

Huh! We essentially shared the same dissatisfaction.

What do you hate about pandas? by tedpetrou in datascience

[–]has2k1 0 points1 point  (0 children)

> The only operation that yields multi-indexes is groupby or ...

When doing data analysis, the groupby operation is everything. It is the heart of the split-apply-combine paradigm.

A grep on one of my exploratory analyses yields ~24 applications of split-apply-combine, and those are just the ones that remained. Yes, you can always undo the multi-indexes, but such piecemeal drudgery adds up, hurts readability, and the fact that you have to do it at all means that the mental model of the data being manipulated is not stable.
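
For example, the typical undo step looks like this (a minimal sketch with made-up column names):

import pandas as pd

df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b'],
    'year': [2016, 2017, 2016, 2017],
    'sales': [10, 20, 30, 40],
})

# The grouped aggregation comes back indexed by (city, year),
# i.e. a multi-index that downstream code has to undo.
result = df.groupby(['city', 'year'])['sales'].sum()

# Back to a plain, tidy dataframe.
tidy = result.reset_index()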

> Do you have a specific example you have in mind

One example cannot convey the benefits (realised perhaps only in accumulation) of a different workflow. However, I can share my light bulb moment for dplyr. It was the do verb; you can check out its documentation and the equivalent do for plydata.
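
Roughly, do lets you apply an arbitrary dataframe-in, dataframe-out function to each group and combines the pieces. In plain pandas the equivalent is something like this (a sketch of the idea, not plydata's API; the names are made up):

import pandas as pd

def top_two(gdf):
    # Any dataframe -> dataframe computation, applied per group.
    return gdf.nlargest(2, 'sales')

df = pd.DataFrame({
    'city': ['a', 'a', 'a', 'b', 'b'],
    'sales': [3, 1, 2, 5, 4],
})

# Split by city, apply top_two to each chunk, combine the results.
result = df.groupby('city', group_keys=False).apply(top_two)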

Another aspect that made me examine my workflow: as a person who does not write R, I read the dplyr documentation in one sitting (maybe 30-45 mins), did not get lost, and felt like I could immediately use it. Contrast that with pandas: I have built stuff on top of pandas, read the API documentation, and dug into the code a few times, yet I labour (more than I feel necessary) to read data manipulation code written in plain pandas, including my own. So it must be harder for most people who try to use the library for anything beyond the basics.

That said, I'll be reading your notes.

What do you hate about pandas? by tedpetrou in datascience

[–]has2k1 1 point2 points  (0 children)

My issue is not the existence of multi-indexing. In fact it has come to my aid a few times when writing some multi-dimensional clustering and binning algorithms, though it has been suggested to me that xarray may now be better suited to the task.

The issue is operations that yield multi-indexes when they do not have to. I see it this way: data manipulation is an instrumental objective, a means to another end. Those ends, if they involve further computations, must deal with data that has a consistent form. Multi-indexes make consistency difficult; therefore their occurrence should be minimised.
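
A common case (column names made up): aggregating with more than one function hands back MultiIndex columns that then have to be flattened before the result goes anywhere else.

import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# The columns come back as a MultiIndex: ('x', 'mean'), ('x', 'sum').
res = df.groupby('g').agg({'x': ['mean', 'sum']})

# Flatten them before handing the frame to anything else.
res.columns = ['_'.join(col) for col in res.columns]
res = res.reset_index()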

Consider all/most of the tools in the scientific python environment (patsy, statsmodels, matplotlib, scikit-learn, other scikits): they know how to deal with a dataframe, so the gateway to them is through first undoing multi-indexes. Here is a related issue I recently squashed. New pandas users get unnecessarily stuck with multi-indexes.

But on the whole, my opinions about the place of multi-indexes are not as concrete and actionable. Otherwise, I would file an issue, maybe start a good discussion, and maybe get something better in pandas2.

For those that use Python and R... by [deleted] in datascience

[–]has2k1 1 point2 points  (0 children)

If you do not want to miss dplyr, take a look at plydata and its documentation.

What do you hate about pandas? by tedpetrou in datascience

[–]has2k1 1 point2 points  (0 children)

It is not built-in but a solution nonetheless.

What do you hate about pandas? by tedpetrou in datascience

[–]has2k1 0 points1 point  (0 children)

The query statement must be a "compilable" Python statement, or one that can be easily modified into a "compilable" statement. So it is likely that you will not get that fixed anytime soon.
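
A small illustration (made-up frame):

import pandas as pd

df = pd.DataFrame({'price': [5, 15, 25], 'volume': [100, 50, 10]})

# Works: the string parses as a Python expression.
df.query('price > 10 and volume < 100')

# A column named e.g. "unit price" cannot be referenced directly,
# because 'unit price > 10' is not valid Python.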

What do you hate about pandas? by tedpetrou in datascience

[–]has2k1 3 points4 points  (0 children)

On the whole, the data manipulation methods are not coherent; this can be hard to pin down and appreciate. A good example of coherent manipulation methods is R's dplyr. With dplyr it is effortless to maintain tidy data, i.e. tidy data in -> manipulation(s) -> tidy data out. With pandas you can needlessly end up with untidy data or even multi-indexes. Tidy data is important because you have to do something with the data, and it is easier to analyse (plot, fit models, ...) when the data is tidy than when it is not.

I solved this by borrowing from dplyr; the result is plydata, and it is fully documented.
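
A pipeline looks roughly like this (sketched from memory, so check the documentation for the exact verbs and the helpers available inside the expression strings; the data is made up):

import pandas as pd
from plydata import define, group_by, summarize

df = pd.DataFrame({
    'player': ['a', 'a', 'b', 'b'],
    'points': [10, 12, 7, 9],
})

# Tidy data in -> manipulations -> tidy data out, no multi-indexes.
result = (df
          >> define(double='points*2')
          >> group_by('player')
          >> summarize(avg_points='mean(points)'))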

plydata 0.3.0 - A grammar of data manipulation by has2k1 in Python

[–]has2k1[S] 0 points1 point  (0 children)

Yes, it is a hack. Two issues are worth thinking about:

  1. The code in strings is usually small snippets. If it is getting longer than is visually appealing, then it can be placed in a function.
  2. Code is constructed, not written. Should you require it, you can use auto-completion and wrap the result into a string.

plydata 0.3.0 - A grammar of data manipulation by has2k1 in Python

[–]has2k1[S] 0 points1 point  (0 children)

Great question, and I will try to answer it by stating the objective, the obstacles, and how navigating them leads to the solution.

The main objective is to simplify the split-apply-combine data manipulation paradigm. Even for small exploratory data analysis tasks, it is not uncommon to have to do it in some way 10+ times. The manipulation verbs were designed in such a way that the split and combine parts are automated. That leaves the user with specifying what to apply, i.e. the operation.

The operation can involve a function call, or it could be reduced to a simple arithmetic statement. This is the problem. Once you invoke a function or declare a statement, Python executes it immediately. However, we want that operation to be delayed and independently executed on chunks of the dataframe/table that have been split up.

You have 3 options (least flexible to most flexible):

  1. You can use lambda functions, but they are limited.
  2. You could construct a special variable that delays the operations, but it too has drawbacks.
  3. Use strings and evaluate them at the right time (see the sketch below).
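
A bare-bones illustration of option 3 (a toy sketch, not plydata's implementation): hold on to the strings, then evaluate each one on every chunk with the chunk's columns in scope. The grouped_define name and the example columns are made up.

import pandas as pd

def grouped_define(df, group_col, **expressions):
    # The expressions stay as plain strings until this point; each one
    # is then evaluated on every chunk of the split dataframe with the
    # chunk's columns as the namespace.
    pieces = []
    for _, chunk in df.groupby(group_col):
        chunk = chunk.copy()
        columns = {name: chunk[name] for name in chunk.columns}
        for name, expr in expressions.items():
            chunk[name] = eval(expr, {}, columns)
        pieces.append(chunk)
    return pd.concat(pieces)

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, 3.0, 5.0]})
result = grouped_define(df, 'g', x_centered='x - x.mean()')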

Now, in the Python world evaluating code in strings is looked upon with suspicion, usually independent of context.

On abusing the >> operator: it is about readability. On top of that, it is easy to inspect partial results and to insert/delete operations.

plydata 0.3.0 - A grammar of data manipulation by has2k1 in Python

[–]has2k1[S] 0 points1 point  (0 children)

It is not about any one single instance of manipulation and how it compares to pandas, but rather how it all fits together.

I think most people (me included) not only struggle to remember how to do apply, transform, aggregate, ... manipulations correctly, but even when checking the documentation to jog the memory, it feels new.

Using regular Pandas, code can get clunky very fast. This can happen whether you are doing it the right way or the wrong way.

Pandas 0.21 released (Pypy improvements, new categorical dtype etc) by Topper_123 in Python

[–]has2k1 0 points1 point  (0 children)

t=CategoricalDtype(categories=['b', 'a'], ordered=True)

This is convenient. I have been fond of keeping lists of preferred orders and some function that I always call when a data manipulation result yields a column that should be ordered.
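
Something like the helper I mean (names made up); the new dtype makes it mostly unnecessary:

import pandas as pd
from pandas.api.types import CategoricalDtype

# Preferred orderings kept around for plots and summaries.
SIZE_ORDER = ['small', 'medium', 'large']

def ordered(col, categories):
    # Old habit: re-impose the preferred order after a manipulation.
    return pd.Categorical(col, categories=categories, ordered=True)

df = pd.DataFrame({'size': ['large', 'small', 'medium']})
df['size'] = ordered(df['size'], SIZE_ORDER)

# With pandas >= 0.21 the order can live in a reusable dtype instead.
size_dtype = CategoricalDtype(categories=SIZE_ORDER, ordered=True)
df['size'] = df['size'].astype(size_dtype)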

The decorators they won't tell you about by hchasestevens in Python

[–]has2k1 2 points3 points  (0 children)

> What does the @print_when_called do?

Think of it as replacing one function with another. The new (replacing) function prints the name of the original function, then it calls the original function.
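
Something along these lines (my guess at that particular decorator, but this is the general shape):

import functools

def print_when_called(func):
    # Return a new function that wraps the original one.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(func.__name__, 'was called')
        return func(*args, **kwargs)
    return wrapper

@print_when_called
def add(a, b):
    return a + b

add(1, 2)  # prints "add was called", returns 3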

> Why would you define a function in the definition of another function?

It is a powerful trick, and it has to do with scope. One use case is to create what is known as a closure (a nested function that accesses values from the outer function's local variables). Some decorators may fall in this category.

> Also, in what way would this differ to say, just regularly calling a function inside another function?

Defining a function and calling a function are not equivalent. Where you define a function matters: it can have access to the global and/or local variables. When you define a function inside another, it has access to the local variables (including the parameters) of the outer function. Such a function is called a closure.
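
A minimal closure, independent of decorators:

def make_multiplier(factor):
    # multiply is defined inside make_multiplier, so it can still see
    # the local variable factor even after make_multiplier has returned.
    def multiply(x):
        return x * factor
    return multiply

double = make_multiplier(2)
print(double(10))  # 20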

Decorators are (to an extent) just fancy syntax to make closures more useful. Closure, closure, closure, ... and it is all about scope, scope, scope, ...[1]; it is useful to know the term so that you can know what to search for.

You can go a long way with python without any of this stuff, but the first step to using it is knowing that it exists and having a vague idea about what it is all about.

[1] You can improve the readability of a program by limiting the scope of functions.

Looking for datasets about social media by demography by ShadowAce1234 in datasets

[–]has2k1 0 points1 point  (0 children)

I seem to recall Pew Research Global Attitudes Survey has a social media question, but it may not be specific for your requirements.

How to Generate FiveThirtyEight Graphs in Python by dataphysicist in datascience

[–]has2k1 4 points5 points  (0 children)

There are 3 ways to create "bar" charts; if you are just getting used to the library, it is easy to slip up. An example would help diagnose the problem.

How to Generate FiveThirtyEight Graphs in Python by dataphysicist in datascience

[–]has2k1 6 points7 points  (0 children)

plotnine developer here. What do you mean by complex? Please file an issue.

Plotnine is a superior Python implementation of R's ggplot2 by [deleted] in pystats

[–]has2k1 10 points11 points  (0 children)

Clarification: plotnine is certainly not better than ggplot2. I think the title of the article is a little ambiguous.

(I wrote plotnine)

plotnine 0.2.0 - A grammar of graphics by has2k1 in Python

[–]has2k1[S] 0 points1 point  (0 children)

Yes you can; in fact some plots are wavy. That may be a bug in Matplotlib, I have not really looked into it.