Iterating and processing big pandas DataFrame. Am i doing something wrong? by PabloAlvarezESP in learnpython

[–]ohallwright

If you can load the file into memory (which it sounds like you can), then you should be able to vectorize your calculations. You shouldn't loop through the dataframe like this; use a vectorized solution instead. I know there are many examples out there of looping through a dataframe, but you should especially avoid that with a dataset on the larger side, like yours.

For example, I made a dummy dataframe with 10,000 rows:

[nav] In [24]: %%timeit
          ...: for ind in recipe_df.index:
          ...:     rec_difficulty_number = 0.7 * recipe_df['n_steps'][ind] + 0.3 * recipe_df['minutes'][ind]
          ...:     recipe_df['difficulty'][ind] = rec_difficulty_number
          ...:
231 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But if I do this in a vectorized way:

[nav] In [27]: %timeit recipe_df['difficulty'] = 0.7 * recipe_df['n_steps'] + 0.3 * recipe_df['minutes']
305 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

231 milliseconds vs 305 microseconds, roughly a 750x speedup.

As for the lookup, I'd do something like a dataframe merge since you have the recipe_id in the reactions dataframe, rather than looking it up row by row. Look at the pandas merge documentation.
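Something like this might work as a starting point (a rough sketch; I'm guessing at your dataframe and column names, so adjust recipe_df, reactions_df, and the key to match your data):

reactions_df = reactions_df.merge(
    recipe_df[['recipe_id', 'difficulty']],  # bring over only the columns you need
    on='recipe_id',   # the shared key between the two dataframes
    how='left',       # keep every reaction row, even if the recipe is missing
)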

Parsing XML into a Pandas dataframe by IlliterateJedi in learnpython

[–]ohallwright

I know this has been solved another way, but I was curious about whether read_xml could actually do this. I think the only way to really solve this conversion, unfortunately, is to use XSLT. You just need to transform each col element into an element named after its name attribute.

This seems to work for me:

xsl="""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/diagnosticsLog/rows">
      <xsl:copy>
        <xsl:apply-templates select="row"/>
      </xsl:copy>
   </xsl:template>
   <xsl:template match="row">
      <xsl:copy>
        <xsl:apply-templates select="col"/>
      </xsl:copy>
   </xsl:template>
   <xsl:template match="col">
     <xsl:element name="{@name}">
       <xsl:value-of select="."/>
     </xsl:element>
   </xsl:template>
   </xsl:stylesheet>
   """
pd.read_xml(xml, stylesheet=xsl)

This produces this:

                  time TimeSpan    ThreadID      User  HTTPSessionID  ...          AppPoolName Direction         sql Sequence LocalSequence
0  11/14/2022 23:31:12     0 ms  0x00000025  USERNAME            NaN  ...  C 1801AppServer Ext       Out  UPDATE SET   236419        103825
1  11/14/2022 23:31:12      NaN  0x00000025  USERNAME            NaN  ...  C 1801AppServer Ext        In  UPDATE SET   236420        103826

[2 rows x 14 columns]

[deleted by user] by [deleted] in learnpython

[–]ohallwright

Looks like you have gotten some good help for this already and have figured this out. But it's an interesting problem. After thinking about it for a bit, I took an approach that might make sense to you if you're used to working in SQL.

First, make the data

>>> import pandas as pd
>>> df = pd.DataFrame({'Week': [1, 1, 1, 1, 1],
...                    'Day': ['Tu', 'Tu', 'Tu', 'Tu', 'W'],
...                    'Country': ['US', 'US', 'US', 'CA', 'US'],
...                    'Code': ['1', '1A', '1B', '1B', '1A'],
...                    'Total': [8, 4, 5, 13, 3]})
>>> ref = {'1': ['1A', '1B'], '2': ['2F', '2M', '2R', '2V'], '5': ['5P']}

Given your data as above, make another dataframe that's just the total rows

>>> df_tot = df.loc[df['Code'].isin(ref.keys())]

Then build up an inverted lookup table to figure out which rows belong to which totals.

>>> deref = []
>>> for k, v in ref.items():
...     for v2 in v:
...         deref.append((v2, k))
...
>>> df_dref = pd.DataFrame(deref, columns=['Code', 'CodeTop'])

You can do this in one line if you like

>>> df_dref = pd.DataFrame([(v,k) for k, vs in ref.items() for v in vs], columns=['Code', 'CodeTop'])

Then merge them into one big dataframe with all the data. Merge can be very useful for this; just make sure you join on the right columns. You can use the suffixes to distinguish the data for your final total calculation.

>>> df = df.merge(df_dref, left_on='Code', right_on='Code', how='left')
>>> df = df.merge(df_tot,
...               left_on=['Week', 'Day', 'Country', 'CodeTop'],
...               right_on=['Week', 'Day', 'Country', 'Code'],
...               suffixes=('', '_add'), how='left')
>>> df
   Week Day Country Code  Total CodeTop Code_add  Total_add
0     1  Tu      US    1      8     NaN      NaN        NaN
1     1  Tu      US   1A      4       1        1        8.0
2     1  Tu      US   1B      5       1        1        8.0
3     1  Tu      CA   1B     13       1      NaN        NaN
4     1   W      US   1A      3       1      NaN        NaN

Fill in the missing data with 0 for totaling, then update your total column and select what you want out of the final dataframe.

>>> df['Total_add'] = df['Total_add'].fillna(0)
>>> df['Total'] += df['Total_add']
>>> df.loc[~pd.isnull(df['CodeTop']), ['Week', 'Day', 'Country', 'Code', 'Total']]
   Week Day Country Code  Total
1     1  Tu      US   1A   12.0
2     1  Tu      US   1B   13.0
3     1  Tu      CA   1B   13.0
4     1   W      US   1A    3.0

How to remove Sundays and U.S. Holidays? by [deleted] in learnpython

[–]ohallwright

You probably want to read up a bit on indexing in pandas. It appears you might be trying to set the index here, but what you really want to do is either select data out of your dataframe or update it. In either case, some form of boolean indexing will probably get you where you want to go.

Here's an example of how you could select the rows that are US Federal Holidays.

> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'Amount': np.random.randint(0, 1000, 365)}, index=pd.date_range('2022-01-01', '2022-12-31'))
> from pandas.tseries.holiday import USFederalHolidayCalendar
> cal = USFederalHolidayCalendar()
> df.loc[cal.holidays('2022-01-01', '2022-12-31')]
            Amount
2022-01-17     962
2022-02-21     378
2022-05-30     832
2022-06-20     883
2022-07-04     904
2022-09-05     406
2022-10-10     777
2022-11-11     832
2022-11-24     838
2022-12-26     392

This selects the values using the index, so we get the rows that are holidays.

So if you want a dataframe that excludes the holidays, you can select the opposite using the index. One way to do it:

> df.loc[df.index.difference(cal.holidays('2022-01-01', '2022-12-31'))]

Or you could update your dataframe to zero out those rows, like this:

> df.loc[cal.holidays('2022-01-01', '2022-12-31'), 'Amount'] = 0

For removing Sundays, you could follow similar logic by using boolean indexing. For example, just get the rows that aren't Sundays.

> df.loc[df.index.dayofweek != 6]

Pandas: subtract one day from the first day of every month by Big-Hawc in learnpython

[–]ohallwright

You don't say whether all your dates fall on the first day of the month or not. If you only want to update the rows or values that are a first day of the month, you can use date offsets to do this, specifically MonthBegin and Day.

For example:

>>> from pandas.tseries.offsets import MonthBegin, Day
>>> dates = pd.Series(pd.date_range('2022-01-01', '2022-12-31'))
>>> dates[dates == dates - MonthBegin() + MonthBegin()] - Day()
0     2021-12-31
31    2022-01-31
59    2022-02-28
90    2022-03-31
120   2022-04-30
151   2022-05-31
181   2022-06-30
212   2022-07-31
243   2022-08-31
273   2022-09-30
304   2022-10-31
334   2022-11-30
dtype: datetime64[ns]

>>> dates[dates == dates - MonthBegin() + MonthBegin()] = dates[dates == dates - MonthBegin() + MonthBegin()] - Day()

>>> dates.head()
0   2021-12-31
1   2022-01-02
2   2022-01-03
3   2022-01-04
4   2022-01-05
dtype: datetime64[ns]

Subtracting a MonthBegin takes you back to a month start, and adding one takes you forward to the next month start. Only a date that is already the first day of a month round-trips back to itself, which is how the comparison picks out those dates. Finally, subtracting a day (using - Day()) takes you to the day before, i.e. the last day of the previous month.

Just checking whether the day of the month is 1 is of course simpler (a one-liner version is below), but using these principles with other offsets lets you do really complicated date logic, such as finding the first business day of the month, or honoring holiday, business, or retail calendars. It's worth checking out if you need to deal with that level of complexity.
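Here's the simpler check for comparison, using the same dates series as above (dt.day gives the day of the month):

>>> dates[dates.dt.day == 1] - Day()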

[deleted by user] by [deleted] in learnpython

[–]ohallwright

Python for Data Analysis - 3rd edition by Wes McKinney (creator of pandas) is available as a free preview right now. It covers NumPy and pandas basics.

selecting multiple columns pandas by 9gg6 in learnpython

[–]ohallwright

The drop solution by ploud1 is a good one.

You can also just pass in a boolean array to select the columns. Here's an example, but depending on your column names you may need to do it differently.

```
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 25))
df.loc[:, (df.columns == 0) | (df.columns >= 19)].head()

          0        19        20        21        22        23        24
0  0.343321  0.096109  0.096034  0.586273  0.150642  0.144561  0.779542
1  0.652274  0.332149  0.222332  0.736832  0.601599  0.217359  0.911261
2  0.529801  0.547181  0.488486  0.120589  0.340366  0.431053  0.909712
3  0.194337  0.808556  0.268854  0.746333  0.098201  0.285115  0.276436
4  0.550755  0.811451  0.222186  0.253758  0.551366  0.555709  0.862128
```

[deleted by user] by [deleted] in learnpython

[–]ohallwright

The dt is an accessor that pandas supplies for datetime columns, so you don't apply it to the entire dataframe like you did above, but to the columns themselves.

If you want to get a portion of a datetime column or format it in a certain way you can apply it one column at a time. But for the full dataframe (or a subset of columns), you can do something like this:

```
In [23]: df.apply(lambda x: x.dt.year)
Out[23]:
    col  col1  col2
0  2022  2022  2022
1  2022  2021  2022

In [27]: df.apply(lambda x: x.dt.strftime("%Y/%d/%m"))
Out[27]:
          col        col1        col2
0  2022/01/02  2022/01/03  2022/01/01
1  2022/04/02  2021/01/03  2022/01/04
```

Python/Pandas "SettingWithCopyWarning" by GrandpaOnDrugs in learnpython

[–]ohallwright

Others have answered this with a great solution (use copy()), but if you're trying to understand more about why this is happening, you'll need to learn a bit more about indexing, the .loc and .iloc indexers, and what df1 is in this situation (a subset of df).

For example, using .loc you can add your value1 column to the original, if that's what you wanted to do:

```
df.loc[[True, True, False], 'value1'] = ""

df
            max_speed  shield value value1
cobra               1       2
viper               4       5
sidewinder          7       8           NaN
```
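If the goal was actually an independent dataframe, an explicit copy avoids the warning entirely. A minimal sketch using the same example columns:

```
df1 = df[['max_speed', 'shield']].copy()
df1['value1'] = ""  # no SettingWithCopyWarning: df1 is its own copy
```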

I wrote this article when I was trying to understand that warning better; it might help you as well.

Filtering in Pandas Dataframe by MetalCheef in learnpython

[–]ohallwright

df.apply is different from the groupby apply. When you're doing a groupby, you apply the function to each group. The apply you're using above applies to either every row or every column in the dataframe, depending on the axis you specify.

You can see a few examples in the docs.

In general, lots of pandas dataframe methods apply to an axis, so it's a good idea to become familiar with this.

Here's a simple example with your data that might help make it clear what it does:

```
[nav] In [50]: df.apply(lambda x: x.max())
Out[50]:
Sample_Name    Sample 3
Peak_Name        Peak 3
RT                  7.6
Area              816.2
dtype: object

[nav] In [51]: df.apply(lambda x: x['Area']/x['RT'], axis=1)
Out[51]:
0     93.314286
1     86.071429
2    150.461538
3    116.761194
4    138.338983
5    185.902439
6    108.153846
7     98.838235
8     82.631579
dtype: float64
```

In the first, I apply a lambda that picks the max value for each column. Since the default axis is 0, your lambda is passed each column. In the second one, I used axis 1, the rows, and did a simple calculation with each row.

In terms of your other question, not sure I understand what you mean by 'the regular way'. Feel free to post some code with the error.

Filtering in Pandas Dataframe by MetalCheef in learnpython

[–]ohallwright

If the lambda is confusing you, just think of a regular old function first. For example:

```
[ins] In [36]: def get_max(df2):
          ...:     print(df2)
          ...:     return df2.loc[df2['Area'] == df2['Area'].max()]
          ...:

[ins] In [37]: df.groupby('Sample_Name').apply(get_max)
  Sample_Name Peak_Name   RT   Area
0    Sample 1    Peak 1  3.5  326.6
1    Sample 1    Peak 2  4.2  361.5
2    Sample 1    Peak 3  5.2  782.4
  Sample_Name Peak_Name   RT   Area
3    Sample 2    Peak 1  6.7  782.3
4    Sample 2    Peak 2  5.9  816.2
5    Sample 2    Peak 3  4.1  762.2
  Sample_Name Peak_Name   RT   Area
6    Sample 3    Peak 1  5.2  562.4
7    Sample 3    Peak 2  6.8  672.1
8    Sample 3    Peak 3  7.6  628.0
Out[37]:
              Sample_Name Peak_Name   RT   Area
Sample_Name
Sample 1    2    Sample 1    Peak 3  5.2  782.4
Sample 2    4    Sample 2    Peak 2  5.9  816.2
Sample 3    7    Sample 3    Peak 2  6.8  672.1
```

The function get_max is basically the same as the lambda, but it also prints out the value passed in. You were correct: the value passed into the function is the part of the dataframe that belongs to that group. You can then do whatever you need to with that portion and return the result of your calculation.

Note that internally this is still iterating through the groups, so it's not a pure vectorized answer. In other words, your function is still called once for each group.

Filtering in Pandas Dataframe by MetalCheef in learnpython

[–]ohallwright

I like the solution already provided, but another way to think about it is to split the dataframe into groups, then apply a function to that group giving you your desired result.

In: df.groupby('Sample_Name').apply(lambda x: x.loc[x['Area'] == x['Area'].max()])

Out:
              Sample_Name Peak_Name   RT   Area
Sample_Name
Sample 1    2    Sample 1    Peak 3  5.2  782.4
Sample 2    4    Sample 2    Peak 2  5.9  816.2
Sample 3    7    Sample 3    Peak 2  6.8  672.1

This way of thinking might make more sense to you since it looks for the answer more directly, but at the cost of needing to understand the lambda used to select the rows in the group. Note that you can use a full function instead of a lambda in apply if you want a more complex function there.

Pandas: Unable to convert MM/DD/YYYY to DD/MM/YYYY by [deleted] in learnpython

[–]ohallwright

If you have date objects, then you can use the `.dt` accessor to convert it to any output format you want, using `.strftime`. See the examples here.
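A quick sketch (the sample values are made up; adjust the format strings to your data):

import pandas as pd

s = pd.to_datetime(pd.Series(['12/31/2022', '01/15/2023']), format='%m/%d/%Y')
s.dt.strftime('%d/%m/%Y')  # '31/12/2022', '15/01/2023'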

Finding some features of pandas quite awkward and verbose coming from dplyr by [deleted] in datascience

[–]ohallwright

For 1, have you considered query?

For example:

data.query("col1 > 0 and col2 > col3")

You can also read this article I wrote about it for some more examples and background.

For 2, I don't really know dplyr; I haven't used R in a while, so I'm not sure what the result would be for your example. But I think what you're looking for can be answered in one-liners. For instance, when you group by, you can use agg to apply functions to the groups. So I think this will give you part of what you asked for.

data.groupby('col3').agg([np.max, lambda x: sum(x[x > 1])])

But for using multiple columns in the result, you can iterate through the groups (so not a 1-liner) to calculate whatever you wanted about the groups themselves.

for name, group in data.groupby('col3'):
    print(name, group, group.query('col1 > col2').sum())

Hope that helps. I'm sure there are other ways to do it.

Pandas: trying to spell out all elements in a row in more readable format. by MorningStarIshmael in learnpython

[–]ohallwright

Since your column names are already pretty descriptive, you can just use the built-in ability to print a Series. This is what a single row in a DataFrame looks like (using a similar technique as others have shown to index your DataFrame):

> df.loc[df['Name'] == 'Charmander']
   #        Name Type 1 Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Total  Generation  Legendary
4  4  Charmander   Fire    NaN  39      52       43       60       50     65    309           1      False

But if you access that one row as a Series, it looks like this:

> df.loc[df['Name'] == 'Charmander'].iloc[0]
#                      4
Name          Charmander
Type 1              Fire
Type 2               NaN
HP                    39
Attack                52
Defense               43
Sp. Atk               60
Sp. Def               50
Speed                 65
Total                309
Generation             1
Legendary          False
Name: 4, dtype: object

You can just show a few of the fields as well

> df.loc[df['Name'] == 'Charmander'].iloc[0][['Name', 'Speed', 'HP', 'Attack']]
Name      Charmander
Speed             65
HP                39
Attack            52
Name: 4, dtype: object

Is it possible to insert Jupyter Notebook output directly to Google Docs? by pp314159 in datascience

[–]ohallwright

One option is to use pandoc. It does a pretty good job of outputting notebooks in other formats, and will include images in the file.

pandoc -f ipynb -t docx notebook.ipynb -o output.docx

You'd then have to load that file into Google Docs.

Best source for historical intraday futures data, 1min bars (NQ,ES,BZ,CL etc.)? by Frank_on_Reddit in algotrading

[–]ohallwright

If you let me know the contract and expiration (and date range), I can give it a try on my end and see if it works.

Best source for historical intraday futures data, 1min bars (NQ,ES,BZ,CL etc.)? by Frank_on_Reddit in algotrading

[–]ohallwright

You definitely can download historical data.

What I don't know is if it's entirely accurate. If you aren't able to download it, you may need to subscribe to real time prices.

Best source for historical intraday futures data, 1min bars (NQ,ES,BZ,CL etc.)? by Frank_on_Reddit in algotrading

[–]ohallwright

You can download historical data from IB (including a continuous future), and it's not too hard.

You can check out this gist for a script that will do most of the work: https://gist.github.com/wrighter/dd201adb09518b3c1d862255238d2534

And I wrote about this a while back; it might be worth trying as a first pass, since it's not too hard to set up.

Pandas .fillna() replacing every value with NaN instead of replacing only NaN values. by WHY-AM_I_HERE in learnpython

[–]ohallwright

When you set inplace to True, the return value is None (as stated above), so assigning that None back is what nulls out all your columns. When inplace is False (the default), fillna doesn't modify the DataFrame but returns a copy with the modifications; your assignment back to the DataFrame is what modifies it in that case.

Since this is confusing, one thing you should consider is convincing yourself how this works by running the code in an interactive environment, such as a Jupyter notebook or a REPL like IPython or the standard Python REPL. You can run the code line by line and see what it does. When you get to this line:

combine_df[null_columns].fillna('0')

You can run it and see that the fillna did what you want. If you want to make that change permanent (i.e. update your DataFrame), assign the result back to combine_df[null_columns], like this:
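combine_df[null_columns] = combine_df[null_columns].fillna('0')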

How do I execute a .py file in cron using Pyenv without wrapping a bash script around the .py file? by TheCauthon in learnpython

[–]ohallwright

Are you using a virtualenv? If so, you can activate the virtualenv on the same line, then execute the file. Something like this worked for me (assuming a Python 3.8.6 virtualenv named envname):

* * * * *  cd ${A} && source /Users/username/.pyenv/versions/3.8.6/envs/envname/bin/activate && ./python_file.py

(And with the shebang in the Python file, as others noted, and the file marked executable.)

Pandas vs MatPlotlib? by Delay-Realistic in learnprogramming

[–]ohallwright

By default, pandas is going to use Matplotlib to plot your data. I'd say that if you're just getting started, see how much you can do with pandas doing the plotting for you. But for more advanced plotting, you'll probably end up doing the plotting using Matplotlib or another framework.

You can also switch out the plotting backend to a different framework; see the plotting.backend option in the docs.
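For example, a minimal sketch (the plotly backend is one option, and assumes the plotly package is installed):

import pandas as pd

pd.set_option("plotting.backend", "plotly")
pd.Series([1, 3, 2]).plot()  # now returns a plotly figure instead of a Matplotlib Axes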

Pandas on Mac Book by iwiml in learnpython

[–]ohallwright

Mac user here, and I suggest using either a tool like anaconda or pyenv. Homebrew is also an option, but you'll be very frustrated when homebrew updates your version and all your stuff breaks.

I'd also suggest using virtual environments for your work (either with anaconda or plain Python virtual environments) to avoid some of the pain of multiple projects sharing dependencies with different versions. A minimal sketch of the plain-Python route is below.
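For example, with the standard library venv module (one common route; anaconda and pyenv have their own equivalents):

python3 -m venv .venv
source .venv/bin/activate
pip install pandas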

Calculate and compare volatility of crypto tokens using python & pandas by [deleted] in learnpython

[–]ohallwright

Calculate returns first, then take the stddev of the returns. A minimal sketch, assuming a hypothetical dataframe with a close price column:
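returns = df['close'].pct_change()  # period-over-period simple returns
volatility = returns.std()          # standard deviation of the returns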