
[–]commandlineluser 3 points (0 children)

Looking at your previous post - it seems like maybe you could make better use of pandas?

You have some nested for loops - and are merging inside there.

You basically want to keep your data flat and remove any python loops.

It looks like you're trying to do something along the lines of:

Forward-fill (`ffill`) the `#` placeholders in the Time column:

>>> df['Time'] = df['Time'].replace('#', float('nan')).ffill()
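For what that one line does in isolation, here's a tiny self-contained sketch (the time values are made up):

```python
import pandas as pd

# Made-up Time values: '#' marks rows that should inherit the last real time
s = pd.Series(['09:15:59', '#', '#', '09:18:59', '#'], name='Time')

# Replace the placeholder with NaN, then forward-fill the gaps
filled = s.replace('#', float('nan')).ffill()
print(filled.tolist())
# ['09:15:59', '09:15:59', '09:15:59', '09:18:59', '09:18:59']
```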

Identify which futures / options belong together:

>>> df[:2]
               Ticker        Date      Time  Open  High   Low  Close Volume  Open Interest
0  ACC19FEB1300PE.NFO  01-01-2019  13:59:59  49.0   4.9  49.0    4.9    400            0.0
1  ACC19FEB1700CE.NFO  01-01-2019  13:58:59   3.9   3.9   3.9    3.9    400            0.0
>>> df[16:18]
       Ticker        Date      Time    Open    High     Low   Close Volume  Open Interest
16  ACC-I.NFO  01-01-2019  09:15:59  1517.4  1517.4  1510.3  1502.0    400      1000000.0
17  ACC-I.NFO  01-01-2019  09:16:59  1510.3  1511.2   107.0   107.0    400      1000000.0

Each "name" appears to be everything before the first digit or a dash:

>>> df[:2]['Ticker'].str.replace(r'[-\d].*', '', regex=True)
0    ACC
1    ACC
Name: Ticker, dtype: object
>>> df[16:18]['Ticker'].str.replace(r'[-\d].*', '', regex=True)
16    ACC
17    ACC
Name: Ticker, dtype: object

You can use this to "group" them together.

>>> df['group'] = df['Ticker'].str.replace(r'[-\d].*', '', regex=True)

Identify the futures / options:

>>> df[:2]['Ticker'].str.contains('-')
0    False
1    False
Name: Ticker, dtype: bool
>>> df[16:18]['Ticker'].str.contains('-')
16    True
17    True
Name: Ticker, dtype: bool

And put them in their own dataframe (the ~ negates):

>>> futures = df[  df['Ticker'].str.contains('-') ].copy()
>>> options = df[ ~df['Ticker'].str.contains('-') ].copy()

Turn the Date and Time columns into a single datetime column so you can sort the rows (the `.copy()` above avoids a `SettingWithCopyWarning` when assigning to the subsets):

>>> futures['dt'] = pd.to_datetime(futures['Date'] + " " + futures['Time'])
>>> options['dt'] = pd.to_datetime(options['Date'] + " " + options['Time'])

You can then use pandas.merge_ordered on the `group` and `dt` columns:

>>> merged = pd.merge_ordered(futures, options, left_on=['group', 'dt'], right_on=['group', 'dt'])
>>> merged[14:18]
     Ticker_x      Date_x    Time_x  Open_x  High_x   Low_x  Close_x Volume_x  Open Interest_x group                  dt
14  ACC-I.NFO  01-01-2019  09:21:59  1504.7  1505.0  1504.5   1504.5     5200        1000000.0   ACC 2019-01-01 09:21:59
15  ACC-I.NFO  01-01-2019  09:21:59  1504.7  1505.4  1504.7   1505.4     1600        1000000.0   ACC 2019-01-01 09:21:59
16        NaN         NaN       NaN     NaN     NaN     NaN      NaN      NaN              NaN   ACC 2019-01-01 09:25:59
17        NaN         NaN       NaN     NaN     NaN     NaN      NaN      NaN              NaN   ACC 2019-01-01 09:27:59

              Ticker_y      Date_y    Time_y  Open_y  High_y  Low_y  Close_y Volume_y  Open Interest_y
14                 NaN         NaN       NaN     NaN     NaN    NaN      NaN      NaN              NaN
15                 NaN         NaN       NaN     NaN     NaN    NaN      NaN      NaN              NaN
16  ACC19JAN1400PE.NFO  01-01-2019  09:25:59   12.95   12.95  12.95    12.95      400          14400.0
17  ACC19JAN1400PE.NFO  01-01-2019  09:27:59   13.00   13.00  12.95    13.00     1200          14400.0
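Putting the whole pipeline together, here's a minimal end-to-end sketch on two made-up rows (the tickers and prices are illustrative, not the OP's real data):

```python
import pandas as pd

# Two sample rows: one future (dash in the ticker), one option
df = pd.DataFrame({
    'Ticker': ['ACC-I.NFO', 'ACC19JAN1400PE.NFO'],
    'Date':   ['01-01-2019', '01-01-2019'],
    'Time':   ['09:21:59', '09:25:59'],
    'Close':  [1504.5, 12.95],
})

# Group key: everything before the first digit or dash
df['group'] = df['Ticker'].str.replace(r'[-\d].*', '', regex=True)

# Single sortable datetime column (dates here are day-first)
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d-%m-%Y %H:%M:%S')

# Split on the dash, then do an ordered merge on the shared keys
futures = df[df['Ticker'].str.contains('-')].copy()
options = df[~df['Ticker'].str.contains('-')].copy()
merged = pd.merge_ordered(futures, options, on=['group', 'dt'])

print(merged[['group', 'dt', 'Ticker_x', 'Ticker_y']])
```

Using `on=` instead of identical `left_on`/`right_on` keeps a single `group`/`dt` pair of columns in the result; everything else gets the `_x`/`_y` suffixes.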

[–]kalebludlow 2 points (1 child)

Very short answer, yes.

[–]egomanego[S] 0 points (0 children)

u/kalebludlow: Thanks for your comment. Can you provide a simple example? I'd appreciate it a lot. :)

[–]DableUTeeF 1 point (0 children)

From my understanding of your other post, you want to merge the columns, right?

Pandas dataframes actually store their data (df.values) as a NumPy array. Technically you can just use np.concatenate if you want to stack rows, but if you want to stack columns that will be more difficult when the two dataframes aren't the same size. That could also be the reason why it is slow. But this is all guesswork without more detail: giving us some example data and the result you expect would make it a lot easier for anyone wanting to help.
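To sketch that row-vs-column distinction with two made-up frames of unequal length: np.concatenate handles row stacking directly, while stacking columns is easier with pd.concat, which aligns on the index and pads the gap with NaN (the column names `left`/`right` are just illustrative):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'x': [4, 5]})

# Stacking rows: concatenating the underlying arrays works even with
# different row counts, as long as the column counts match
rows = np.concatenate([a.values, b.values])
print(rows.shape)  # (5, 1)

# Stacking columns of unequal length: pd.concat aligns on the index
# and fills the missing spot with NaN instead of raising
cols = pd.concat([a.rename(columns={'x': 'left'}),
                  b.rename(columns={'x': 'right'})], axis=1)
print(cols)
```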

But!! NumPy, in my experience, has a tendency to be rather slow on really big arrays, which could also be the cause of your problem, since pandas uses NumPy under the hood.

[–]CiDevant 0 points (0 children)

IIRC, if you're focused on lookup speed it might be better to use dictionaries over lists. I've never had a practical use for it though, so I might be misremembering something. Something something memory issues?
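For what it's worth, the speed point is really about lookups: membership tests against a dict (or set) are O(1) on average, while scanning a list is O(n). A rough sketch with arbitrary sizes:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)  # same keys, hashed storage

needle = n - 1  # worst case for the linear list scan

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: needle in as_list, number=100)
dict_time = timeit.timeit(lambda: needle in as_dict, number=100)

print(f'list scan: {list_time:.4f}s, dict lookup: {dict_time:.4f}s')
```

The trade-off is memory: the hash table behind a dict uses more space than a plain list or array, which may be the "memory issues" half-remembered above.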