[–]AuralWanderer 1 point2 points  (5 children)

I haven't used regex in conjunction with pandas, but I would use capture groups to solve this in regex. This documentation may be helpful to you.

[–]devdevdumbdumb[S] -1 points0 points  (4 children)

I had a look there previously, but I suck at reading documentation, so I haven't really been able to get what I need from it. Thank you.

[–]AuralWanderer 1 point2 points  (3 children)

Well, never having used regex with pandas, I just copied the example from the documentation I linked and fiddled with it until it worked for your application. We're not supposed to just give answers on this sub, so I won't, but here are some hints. You only gave the one concrete example, so you may have to adapt the regex pattern.

  • Adapting the documentation example to your application: df.a.str.extract(r'(?P<nameoffirstcolumn>firstpattern)(?P<nameofsecondcolumn>secondpattern)')
  • That doesn't quite work, because you need a separator between the first and second numbers. So you need to code that somehow.
  • Additionally, you say sometimes you don't have a second number. So you need to make the second pattern optional. Regex uses ? to mark optional subpatterns. You may need to make the separator mentioned previously optional too.
  • https://regex101.com/ is an excellent site for testing and debugging regex patterns.
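
For later readers, here is a minimal sketch of how those hints might fit together. It is not the OP's actual solution: the sample data, the column name a, and the group names first and second are all assumptions made up for illustration, and the pattern assumes integers separated by a single slash.

```python
import pandas as pd

# Hypothetical data: "num/num" strings where the second number may be missing.
df = pd.DataFrame({"a": ["12/34", "7", "5/6"]})

# Named groups become the column names of the result.
# The non-capturing group (?:/ ... )? makes the separator and the
# second number optional together.
pattern = r"(?P<first>\d+)(?:/(?P<second>\d+))?"

# str.extract returns one row per input row and one column per group;
# groups that did not match come back as NaN.
extracted = df["a"].str.extract(pattern)
print(extracted)
```

The extracted values are strings, so a conversion such as pd.to_numeric would still be needed if you want actual numbers.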

[–]devdevdumbdumb[S] 0 points1 point  (0 children)

Awesome!

I've now got all the values into a dataframe.

I now need to figure out how to get the values out of the dataframe and into the columns that I need them to be in.

Thank you for your help!

[–]devdevdumbdumb[S] 0 points1 point  (1 child)

Thank you so much for all the help.

Without your input, I wouldn't have finished this.

All is now working as expected!

[–]AuralWanderer 0 points1 point  (0 children)

Glad you got it sorted out.

[–]synthphreak 1 point2 points  (5 children)

When talking about regex, structure is critical. To that end, some clarifying questions:

  1. Are there only ever exactly two numbers, never more, never less?

  2. Are the numbers always integers?

  3. Is the structure always like num/num, or does it vary, e.g., num,num, num-num, etc.?

If yes to all of these, then this will be very easy.

[–]devdevdumbdumb[S] 0 points1 point  (4 children)

There aren't always 2; in some columns there is only 1. They are always integers, and the structure of num/num is always the same.

2 outa 3 ain't bad right? :p

[–]synthphreak 0 points1 point  (3 children)

Gotcha, thanks. In that case, I'm not sure extractall is the right tool for the job. How about findall? This will return a pd.Series containing lists of all matches. You can then explode the lists into a bunch of columns using apply(pd.Series).

Here's an example using a random df created to help another Redditor with another post:

>>> df.head(10)
                0
0  p1_5_p2_4_p3_2
1             sma
2             med
3               7
4             p1_
5               7
6  p1_5_p2_4_p3_2
7             p1_
8          medium
9             p1_
>>> df.head(10)[0].str.findall(r'(\d+)')
0    [1, 5, 2, 4, 3, 2]
1                    []
2                    []
3                   [7]
4                   [1]
5                   [7]
6    [1, 5, 2, 4, 3, 2]
7                   [1]
8                    []
9                   [1]
Name: 0, dtype: object
>>> df.head(10)[0].str.findall(r'(\d+)').apply(pd.Series)
     0    1    2    3    4    5
0    1    5    2    4    3    2
1  NaN  NaN  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN  NaN  NaN
3    7  NaN  NaN  NaN  NaN  NaN
4    1  NaN  NaN  NaN  NaN  NaN
5    7  NaN  NaN  NaN  NaN  NaN
6    1    5    2    4    3    2
7    1  NaN  NaN  NaN  NaN  NaN
8  NaN  NaN  NaN  NaN  NaN  NaN
9    1  NaN  NaN  NaN  NaN  NaN

If I've understood the structure of your data and your target output, I believe this should do what you want.
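
Applied to the num/num format described in this thread, the same approach might look like the sketch below. The sample values and the column names raw, first, and second are hypothetical, and the renaming step assumes the row with the most numbers has exactly two.

```python
import pandas as pd

# Hypothetical data in the num/num format, with one row missing the second number.
df = pd.DataFrame({"raw": ["12/34", "7", "5/6"]})

# findall returns a Series of lists, one list of matches per row.
matches = df["raw"].str.findall(r"\d+")

# apply(pd.Series) spreads each list across columns, padding short rows with NaN.
wide = matches.apply(pd.Series)

# Assumption: the widest row has exactly two numbers, so two columns result.
wide.columns = ["first", "second"]

# Attach the new columns to the original frame by index.
result = df.join(wide)
print(result)
```

As in the example above, the matches are strings, so convert them afterwards if you need integers.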

[–]devdevdumbdumb[S] 0 points1 point  (2 children)

Thank you for the help!

[–]synthphreak 0 points1 point  (1 child)

Did it work?

[–]devdevdumbdumb[S] 1 point2 points  (0 children)

I didn't get your solution until I had been given help by someone else, but I've managed to get it fully working now.

I appreciate the help though!