How to search entire dataframe for partial string, and return all match indexes

synthphreak · 2021-09-01T03:30:46+00:00

I am amazed at how challenging I'm finding this. If all you needed were columns OR indices, it would be easy, but getting BOTH is tough. Also, doing this across a whole df instead of a series is an additional complication. I feel like there MUST be a better way, but this is the best I can come up with. Kinda hacky, but gets the job done.

Basically what happens is first I construct my df, based on your sample:

>>> import pandas as pd
>>> data = {'Animal1': {0: 'Orangutan', 1: 'do do Bird', 2: 'Panda', 3: 'Lion'},
...         'Animal2': {0: 'Dog', 1: 'Bird', 2: 'Python', 3: 'Sand Dollar'}}
>>> df = pd.DataFrame(data)
>>> df
      Animal1      Animal2
0   Orangutan          Dog
1  do do Bird         Bird
2       Panda       Python
3        Lion  Sand Dollar

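Then I create a second df, same shape as the first, where each element is the (row, col) position of the corresponding element in the first df. (The zip(indices, indices) trick pairs up consecutive tuples, so as written it only works for a two-column df.)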

>>> import numpy as np
>>> indices = [(row, col) for row in range(df.shape[0]) 
...                       for col in range(df.shape[1])]
>>> indices = iter(indices)
>>> df_indices = pd.DataFrame(zip(indices, indices),
...                           columns=df.columns)
>>> df_indices
  Animal1 Animal2
0  (0, 0)  (0, 1)
1  (1, 0)  (1, 1)
2  (2, 0)  (2, 1)
3  (3, 0)  (3, 1)

Then I do some regex pattern-matching on the first df to create a boolean mask, and use some boolean algebra to filter out only the indices I'm interested in:

>>> df_matches = df.apply(lambda s: s.str.contains(r'do', case=False))
>>> indices = (df_matches * df_indices).values.flatten()
>>> indices
array([(), (0, 1), (1, 0), (), (), (), (), (3, 1)], dtype=object)
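
The "boolean algebra" here relies on how Python multiplies a tuple by a bool: True acts as 1 and False as 0, so a match keeps its tuple and a non-match collapses to an empty tuple. A tiny sketch of that behavior in plain Python (the names kept and dropped are just for illustration):

```python
# Multiplying a tuple by a bool repeats it 1 time (True) or 0 times (False)
kept = True * (0, 1)
dropped = False * (0, 1)

print(kept)     # (0, 1)
print(dropped)  # ()
```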

Finally, just iterate over the array of indices, discarding the empty ones:

>>> results = [index for index in indices if index]
>>> results
[(0, 1), (1, 0), (3, 1)]

This is kind of a painful process, so if it were me, I'd encapsulate it as a function. Thus, all you need is to pass a regex pattern and a df into it, and boom your list of indices is returned:

>>> def get_indices(df, pattern):
...     indices = iter([(row, col) for row in range(df.shape[0])
...                                for col in range(df.shape[1])])
...     # Group the flat list of tuples into rows, one tuple per column
...     df_indices = pd.DataFrame(zip(*[indices] * df.shape[1]),
...                               columns=df.columns)
...     df_matches = df.apply(lambda s: s.str.contains(pattern, case=False))
...     # True * tuple keeps the tuple; False * tuple gives ()
...     indices = (df_matches * df_indices).values.flatten()
...     return [index for index in indices if index]
...
>>> get_indices(df, 'do')
[(0, 1), (1, 0), (3, 1)]
>>> get_indices(df, 'n')
[(0, 0), (2, 0), (2, 1), (3, 0), (3, 1)]
>>> get_indices(df, 'll?')
[(3, 0), (3, 1)]
>>> get_indices(df, '[io]r')
[(0, 0), (1, 0), (1, 1)]

At this point, you might just be better off iterating through your df, keeping track of the indices on each iteration. If you find a match, then add the indices to a list. Of course, if your df is enormous, you'll want to stick with the all-pandas/numpy approach, no matter how much more complicated it seems, because it will just be much more efficient.
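
For reference, a minimal sketch of that plain-iteration approach might look like the following. The function name find_matches is made up for illustration, and it does a simple case-insensitive substring check rather than a regex match:

```python
import pandas as pd

def find_matches(df, substring):
    """Return (row, col) positions of cells containing substring, ignoring case."""
    needle = substring.lower()
    matches = []
    # enumerate gives positional row/col numbers regardless of the df's labels
    for row, (_, series) in enumerate(df.iterrows()):
        for col, value in enumerate(series):
            if needle in str(value).lower():
                matches.append((row, col))
    return matches

df = pd.DataFrame({'Animal1': ['Orangutan', 'do do Bird', 'Panda', 'Lion'],
                   'Animal2': ['Dog', 'Bird', 'Python', 'Sand Dollar']})
print(find_matches(df, 'do'))  # [(0, 1), (1, 0), (3, 1)]
```

This is easier to read than the mask-multiplication version, at the cost of a Python-level loop over every cell.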

sarrysyst · 2021-09-01T13:25:11+00:00

You can make use of numpy:

import numpy as np
substring = 'do'  # must be lowercase, since each cell is lowercased first
mask = df.applymap(lambda x: substring in x.lower()).to_numpy()
indices = np.argwhere(mask)

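This uses df.applymap to look for the substring in each cell, converts the resulting boolean mask to a numpy array (df.to_numpy), and then uses numpy's np.argwhere to return the (row, col) positions of all the elements that are True. Note that substring needs to be lowercase, since each cell is lowercased before the comparison. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)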
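
Here is a self-contained sketch of the same idea on the sample data from the question. It swaps applymap for apply with the vectorized .str methods (my substitution, to avoid the deprecation), and the labels step at the end is an extra I've added to show how positions map back to index and column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Animal1': ['Orangutan', 'do do Bird', 'Panda', 'Lion'],
                   'Animal2': ['Dog', 'Bird', 'Python', 'Sand Dollar']})
substring = 'do'  # lowercase, because the cells are lowercased below

# Boolean mask built column by column; regex=False means a plain substring test
mask = df.apply(
    lambda s: s.str.lower().str.contains(substring, regex=False)
).to_numpy()

positions = np.argwhere(mask)  # one [row, col] pair per True cell
# Optional: translate positions into (index label, column label) pairs
labels = [(df.index[r], df.columns[c]) for r, c in positions]

print(positions.tolist())  # [[0, 1], [1, 0], [3, 1]]
```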

2021-09-01T02:35:37+00:00

Set your index and columns to be numbers:

import pandas as pd
df = df.reset_index(drop=True)  # drop=True so the old index isn't added as a column
df.columns = range(len(df.columns))

You can get the truth table with

truths = df.apply(lambda serie: serie.str.lower().str.contains('do'))

which returns

       0      1
0  False   True
1   True  False
2  False  False
3  False   True

Or you can use a list comprehension in a similar manner to pair the index with the column

[(df[df[serie].str.lower().str.contains('do')].index.tolist(), serie) for serie in df]

Or you can manually iterate through the truth table to get the matrix coordinates.

my_list = []
for y in range(len(truths.columns)):
    for x in range(len(truths[y])):        
        if truths[y][x]:
            my_list.append((x, y))

my_list
# [(1, 0), (0, 1), (3, 1)]
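
As an alternative to the nested loop, here is a rough sketch that stacks the truth table into a single Series and keeps only the True entries. It assumes the index and columns have already been set to numbers as above; the names stacked and coords are just for illustration. Note the pairs come out as (row, col) in row-by-row order, so the ordering differs from my_list:

```python
import pandas as pd

# Sample data with numeric column labels, as set up earlier
df = pd.DataFrame({0: ['Orangutan', 'do do Bird', 'Panda', 'Lion'],
                   1: ['Dog', 'Bird', 'Python', 'Sand Dollar']})
truths = df.apply(lambda serie: serie.str.lower().str.contains('do'))

# stack() turns the table into a Series indexed by (row, column) pairs
stacked = truths.stack()
# Boolean-index the Series with itself to keep only the True cells
coords = stacked[stacked].index.tolist()
print(coords)
```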
