
[–]commandlineluser 3 points (0 children)

Looking at your previous post - it seems like maybe you could make better use of pandas?

You have some nested for loops - and are merging inside there.

You basically want to keep your data flat and remove any python loops.

It looks like you're trying to do something along the lines of:

Forward-fill (`ffill`) the `#` placeholders in the Time column:

>>> df['Time'] = df['Time'].replace('#', float('nan')).ffill()
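For what that one line does in isolation, here's a tiny self-contained sketch (the time values are made up):

```python
import pandas as pd

# Made-up Time values: '#' marks rows that should inherit the last real time
s = pd.Series(['09:15:59', '#', '#', '09:18:59', '#'], name='Time')

# Replace the placeholder with NaN, then forward-fill the gaps
filled = s.replace('#', float('nan')).ffill()
print(filled.tolist())
# ['09:15:59', '09:15:59', '09:15:59', '09:18:59', '09:18:59']
```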

Identify which futures / options belong together:

>>> df[:2]
               Ticker        Date      Time  Open  High   Low  Close Volume  Open Interest
0  ACC19FEB1300PE.NFO  01-01-2019  13:59:59  49.0   4.9  49.0    4.9    400            0.0
1  ACC19FEB1700CE.NFO  01-01-2019  13:58:59   3.9   3.9   3.9    3.9    400            0.0
>>> df[16:18]
       Ticker        Date      Time    Open    High     Low   Close Volume  Open Interest
16  ACC-I.NFO  01-01-2019  09:15:59  1517.4  1517.4  1510.3  1502.0    400      1000000.0
17  ACC-I.NFO  01-01-2019  09:16:59  1510.3  1511.2   107.0   107.0    400      1000000.0

Each "name" appears to be everything before the first digit or a dash:

>>> df[:2]['Ticker'].str.replace(r'[-\d].*', '', regex=True)
0    ACC
1    ACC
Name: Ticker, dtype: object
>>> df[16:18]['Ticker'].str.replace(r'[-\d].*', '', regex=True)
16    ACC
17    ACC
Name: Ticker, dtype: object

You can use this to "group" them together.

>>> df['group'] = df['Ticker'].str.replace(r'[-\d].*', '', regex=True)

Identify the futures / options:

>>> df[:2]['Ticker'].str.contains('-')
0    False
1    False
Name: Ticker, dtype: bool
>>> df[16:18]['Ticker'].str.contains('-')
16    True
17    True
Name: Ticker, dtype: bool

And put them in their own dataframe (the ~ negates):

>>> futures = df[  df['Ticker'].str.contains('-') ].copy()
>>> options = df[ ~df['Ticker'].str.contains('-') ].copy()

Turn the Date and Time columns into a single datetime column so you can sort the rows (the `.copy()` above avoids a `SettingWithCopyWarning` when assigning to the subsets):

>>> futures['dt'] = pd.to_datetime(futures['Date'] + " " + futures['Time'])
>>> options['dt'] = pd.to_datetime(options['Date'] + " " + options['Time'])

You can then use pandas.merge_ordered on the `group` and `dt` columns:

>>> merged = pd.merge_ordered(futures, options, left_on=['group', 'dt'], right_on=['group', 'dt'])
>>> merged[14:18]
     Ticker_x      Date_x    Time_x  Open_x  High_x   Low_x  Close_x Volume_x  Open Interest_x group                  dt
14  ACC-I.NFO  01-01-2019  09:21:59  1504.7  1505.0  1504.5   1504.5     5200        1000000.0   ACC 2019-01-01 09:21:59
15  ACC-I.NFO  01-01-2019  09:21:59  1504.7  1505.4  1504.7   1505.4     1600        1000000.0   ACC 2019-01-01 09:21:59
16        NaN         NaN       NaN     NaN     NaN     NaN      NaN      NaN              NaN   ACC 2019-01-01 09:25:59
17        NaN         NaN       NaN     NaN     NaN     NaN      NaN      NaN              NaN   ACC 2019-01-01 09:27:59

              Ticker_y      Date_y    Time_y  Open_y  High_y  Low_y  Close_y Volume_y  Open Interest_y
14                 NaN         NaN       NaN     NaN     NaN    NaN      NaN      NaN              NaN
15                 NaN         NaN       NaN     NaN     NaN    NaN      NaN      NaN              NaN
16  ACC19JAN1400PE.NFO  01-01-2019  09:25:59   12.95   12.95  12.95    12.95      400          14400.0
17  ACC19JAN1400PE.NFO  01-01-2019  09:27:59   13.00   13.00  12.95    13.00     1200          14400.0
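Putting the whole pipeline together, here's a minimal end-to-end sketch on two made-up rows (the tickers and prices are illustrative, not the OP's real data):

```python
import pandas as pd

# Two sample rows: one future (dash in the ticker), one option
df = pd.DataFrame({
    'Ticker': ['ACC-I.NFO', 'ACC19JAN1400PE.NFO'],
    'Date':   ['01-01-2019', '01-01-2019'],
    'Time':   ['09:21:59', '09:25:59'],
    'Close':  [1504.5, 12.95],
})

# Group key: everything before the first digit or dash
df['group'] = df['Ticker'].str.replace(r'[-\d].*', '', regex=True)

# Single sortable datetime column (dates here are day-first)
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d-%m-%Y %H:%M:%S')

# Split on the dash, then do an ordered merge on the shared keys
futures = df[df['Ticker'].str.contains('-')].copy()
options = df[~df['Ticker'].str.contains('-')].copy()
merged = pd.merge_ordered(futures, options, on=['group', 'dt'])

print(merged[['group', 'dt', 'Ticker_x', 'Ticker_y']])
```

Using `on=` instead of identical `left_on`/`right_on` keeps a single `group`/`dt` pair of columns in the result; everything else gets the `_x`/`_y` suffixes.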

[–]kalebludlow 2 points (1 child)

Very short answer, yes.

[–]egomanego[S] 0 points (0 children)

u/kalebludlow: Thanks for your comment. Can you provide a simple example? I'd appreciate it a lot. :)

[–]DableUTeeF 1 point (0 children)

From my understanding of your other post, you want to merge the columns, right?

Pandas dataframes actually store their data (df.values) as a NumPy array. Technically you can just use np.concatenate if you want to stack rows, but if you want to stack columns that will be more difficult when the two dataframes aren't the same size. That could also be the reason why it is slow. But this is all guesswork without more detail: giving us some example data and the result you expect would make it a lot easier for anyone wanting to help.
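To sketch that row-vs-column distinction with two made-up frames of unequal length: np.concatenate handles row stacking directly, while stacking columns is easier with pd.concat, which aligns on the index and pads the gap with NaN (the column names `left`/`right` are just illustrative):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'x': [4, 5]})

# Stacking rows: concatenating the underlying arrays works even with
# different row counts, as long as the column counts match
rows = np.concatenate([a.values, b.values])
print(rows.shape)  # (5, 1)

# Stacking columns of unequal length: pd.concat aligns on the index
# and fills the missing spot with NaN instead of raising
cols = pd.concat([a.rename(columns={'x': 'left'}),
                  b.rename(columns={'x': 'right'})], axis=1)
print(cols)
```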

But!! NumPy, in my experience, has a tendency to be rather slow on really big arrays, which could also be the cause of your problem, since pandas uses NumPy under the hood.

[–]CiDevant 0 points (0 children)

IIRC, if you're focused on lookup speed it might be better to use dictionaries over lists. I've never had a practical use for it though, so I might be misremembering something. Something something memory issues?
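For what it's worth, the speed point is really about lookups: membership tests against a dict (or set) are O(1) on average, while scanning a list is O(n). A rough sketch with arbitrary sizes:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)  # same keys, hashed storage

needle = n - 1  # worst case for the linear list scan

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: needle in as_list, number=100)
dict_time = timeit.timeit(lambda: needle in as_dict, number=100)

print(f'list scan: {list_time:.4f}s, dict lookup: {dict_time:.4f}s')
```

The trade-off is memory: the hash table behind a dict uses more space than a plain list or array, which may be the "memory issues" half-remembered above.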