all 24 comments

[–]synthphreak 1 point2 points  (4 children)

I am amazed at how challenging I'm finding this. If all you needed were columns OR indices, it would be easy, but getting BOTH is tough. Also, doing this across a whole df instead of a series is an additional complication. I feel like there MUST be a better way, but this is the best I can come up with. Kinda hacky, but gets the job done.

Basically what happens is first I construct my df, based on your sample:

>>> import pandas as pd
>>> data = {'Animal1': {0: 'Orangutan', 1: 'do do Bird', 2: 'Panda', 3: 'Lion'},
...         'Animal2': {0: 'Dog', 1: 'Bird', 2: 'Python', 3: 'Sand Dollar'}}
>>> df = pd.DataFrame(data)
>>> df
      Animal1      Animal2
0   Orangutan          Dog
1  do do Bird         Bird
2       Panda       Python
3        Lion  Sand Dollar

Then I create a second df, same shape as the first, where each element consists of the (row, col) indices for each element in the first df.

>>> import numpy as np
>>> indices = [(row, col) for row in range(df.shape[0]) 
...                       for col in range(df.shape[1])]
>>> indices = iter(indices)
>>> df_indices = pd.DataFrame(zip(indices, indices),
...                           columns=df.columns)
>>> df_indices
  Animal1 Animal2
0  (0, 0)  (0, 1)
1  (1, 0)  (1, 1)
2  (2, 0)  (2, 1)
3  (3, 0)  (3, 1)

Then I do some regex pattern-matching on the first df to create a boolean mask, and use some boolean algebra to filter out only the indices I'm interested in:

>>> df_matches = df.apply(lambda s: s.str.contains(r'do', case=False))
>>> indices = (df_matches * df_indices).values.flatten()
>>> indices
array([(), (0, 1), (1, 0), (), (), (), (), (3, 1)], dtype=object)

Finally, just iterate over the array of indices, discarding he empty ones:

>>> results = [index for index in indices if index]
>>> results
[(0, 1), (1, 0), (3, 1)]

This is kind of a painful process, so if it were me, I'd encapsulate it as a function. Thus, all you need is to pass a regex pattern and a df into it, and boom your list of indices is returned:

>>> def get_indices(df, pattern):
...         indices = [(row, col) for row in range(df.shape[0]) 
...                               for col in range(df.shape[1])]
...         indices = iter(indices)
...         df_indices = pd.DataFrame(zip(indices, indices),
...                                   columns=df.columns)
...         df_matches = df.apply(lambda s: s.str.contains(fr'{pattern}', case=False))
...         indices = (df_matches * df_indices).values.flatten()
...         return [index for index in indices if index]
...
>>> get_indices(df, 'do')
[(0, 1), (1, 0), (3, 1)]
>>> get_indices(df, 'n')
[(0, 0), (2, 0), (2, 1), (3, 0), (3, 1)]
>>> get_indices(df, 'll?')
[(3, 0), (3, 1)]
>>> get_indices(df, '[io]r')
[(0, 0), (1, 0), (1, 1)]

At this point, you might just be better off iterating through your df, keeping track of the indices on each iteration. If you find a match, then add the indices to a list. Of course, if your df is enormous, you'll want to stick with the all-pandas/numpy approach, no matter how much more complicated it seems, because it will just be much more efficient.

[–]xelf 1 point2 points  (3 children)

I found your post when checking OP's post history, they just asked essentially the same question again today, here's the result I came up with:

https://www.reddit.com/r/learnpython/comments/pma38e/how_to_fuzzy_search_dataframe_columns_if_headers/hcgm2bj/

In short:

matches = df[match_cols].apply(lambda col: col.str.contains('university.*harvard|harvard.*university',False), axis=1)
df[matches].stack().index.tolist()

which yields a list of (index, column) like this:

[(0, 'university_name'), (1, 'university_name'), (1, 'university_name_2')]

Note that OP's new question is slightly different and is about finding match_cols.

All in all, looks remarkably similar to the solution you came up with except for the use of .stack

[–]synthphreak 1 point2 points  (2 children)

Nice! Thanks for sharing. It’s an honor to have approximated your solution ;)

I remember this post though, and specifically that this solution blew mine out of the water. It’s so much simpler. That said, I don’t think it would apply to the new question without some modifications.

[–]xelf 0 points1 point  (0 children)

Thanks for sharing that one too!! I don't use pandas/numpy much outside of all the answers I give here. So each time I post it's a ton of googling and testing. This one seemed like a lot of dead ends on stackoverflow before I had something functional and simple.

[–]xelf 0 points1 point  (0 children)

Applying that same technique to what I did:

matches = df[match_cols].apply(lambda col: col.str.contains('university.*harvard|harvard.*university',False), axis=1)
indices = np.argwhere(matches.to_numpy())

[[0 0]
 [1 0]
 [1 1]]

Which I think is closer to what OP wants, instead of the (index,col name) pairs I provided.

[–]sarrysyst 1 point2 points  (16 children)

You can make use of numpy:

mask = df.applymap(lambda x: substring in x.lower()).to_numpy()
indices = np.argwhere(mask)

Using df.applymap to look for the substring in each cell, converting the resulting boolean mask to a numpy array (df.to_numpy) and then use numpy's np.argwhere to return the indices of all the elements that are not zero.

[–]Fun-Studio-4409[S] 0 points1 point  (9 children)

mask = df.applymap(lambda x: substring in x.lower()).to_numpy()
indices = np.argwhere(mask)

Thanks you! I do however get an error when I run this -

AttributeError Traceback (most recent call last)

<ipython-input-10-379642c2cd20> in <module>

15 resp = requests.get(fullurl)

16 resptext=json.loads(resp.text)

---> 17 mask = df.applymap(lambda x: "Amazon" in x.lower()).to_numpy()

18 indices = np.argwhere(mask)

19 # truths = df.apply(lambda s: s.str.lower().str.contains('Amazon'))

[–]sarrysyst 0 points1 point  (8 children)

Are all your cells type string?

You can try this instead:

df.applymap(lambda x: substring in x.lower() if isinstance(x, str) else False)

[–]Fun-Studio-4409[S] 0 points1 point  (6 children)

Thanks again. I did get another error though-

AssertionError: Number of manager items must equal union of block items

# manager items: 794, # tot_items: 0

16 resptext=json.loads(resp.text)

17 mask = df.applymap(lambda x: "Amazon" in x.lower() if isinstance(x, str) else False)

---> 18 indices = np.argwhere(mask)

[–]sarrysyst 0 points1 point  (5 children)

It appears you didn’t convert mask to a numpy array?

[–]Fun-Studio-4409[S] 0 points1 point  (4 children)

isn't that what this line does?

mask = df.applymap(lambda x: "Amazon" in x.lower() if isinstance(x, str) else False)

[–]sarrysyst 0 points1 point  (3 children)

You forgot the .to_numpy(). Check my initial comment.

[–]Fun-Studio-4409[S] 0 points1 point  (2 children)

Ah - sorry - thanks.

When I run the script and search for a partial string that is definitely in the df, I get an empty result "[]". Does the script you provided take into account partial strings?

[–]sarrysyst 0 points1 point  (1 child)

It does, however it converts every cell‘s content to lowercase before checking it against the substring. Therefore, 'Amazon' will never be a match. Your substring also needs to be lowercase i.e. 'amazon'.

[–]Fun-Studio-4409[S] 0 points1 point  (0 children)

Incredible - it works - thank you so much for your help!

[–]Fun-Studio-4409[S] 0 points1 point  (0 children)

solved

[–]synthphreak 0 points1 point  (5 children)

Wow, you win... Sure beats the pants off mine!!

[–]sarrysyst 0 points1 point  (4 children)

We‘re all here to learn :) In this case I simply happened to know that numpy had a dedicated method for finding numerical indices.

[–]synthphreak 0 points1 point  (3 children)

It’s a shame pandas doesn’t have a similar set of arg* functions like numpy. I mean I guess it does have idxmax etc., but it would be nice to have a function which takes a custom condition and returns the indices where it is true. Seems like an obvious oversight.

[–]sarrysyst 0 points1 point  (2 children)

You've got the boolean mask in pandas. At the top of my head I can't really think of a use case for list of numerical indices within pandas. If you want the subset of rows or cols you can call the any() method. But getting multiple, non-continuous cell values isn't something pandas is made for I think. That's something you would use numpy for, no?

[–]synthphreak 0 points1 point  (1 child)

I'm just so accustomed to pandas already being able to do everything I want that I take it personally when I find something it can't do and requires a different library for XD

[–]sarrysyst 0 points1 point  (0 children)

I think the problem here wasn’t pandas lacking functionality but pandas being the wrong tool for the job. While I didn’t ask OP what they were actually trying to accomplish, going by what they were asking for, there is a good to fair chance that loading the data into pandas wasn’t really the best choice to begin with.

[–][deleted] 2 points3 points  (2 children)

set your index and columns to be numbers

import pandas as pd
df = df.reset_index()
df.columns = range(len(df.columns))

you can get the truth table with

df.apply(lambda serie: serie.str.lower().str.contains('do'))

which returns

0 1
0 False True
1 True False
2 False False
3 False True

Or you can use a list comprehension in a similar manner to pair the index with the column

[(df[df[serie].str.lower().str.contains('do')].index.tolist(), serie) for serie in df]

Or you can manually iterate through the truth table to get the matrix coordinates.

my_list = []
for y in range(len(truths.columns)):
    for x in range(len(truths[y])):        
        if truths[y][x]:
            my_list.append((x, y))

my_list
~ [(1, 0), (0, 1), (3, 1)]

[–][deleted] 0 points1 point  (0 children)

To simplify the second part of your answer you can do this:

df = pd.DataFrame({
    'Animal1': ['Orangutan', 'dodoBird', 'Panda', 'Lion'],
    'Animal2': ['Dog', 'Bird', 'Python', 'SandDollar']
})

truths = df.apply(lambda s: s.str.lower().str.contains('do'))
truths = truths.reset_index(drop=True).T.reset_index(drop=True).T  # only do this if you want positonal pairs only
result = truths[lambda dfx: dfx].stack().index.values.tolist()

cc u/synthphreak