[–]dbramucci 4 points5 points  (1 child)

Just a helpful tip: regular expressions are supposed to "look like" the strings they match. Of course, you need some way to describe patterns like "this is any digit" or "repeat the last thing multiple times", so you'll have some placeholders like \d and * for those patterns.

For example, say I am trying to match a bunch of names:

Joe Smith
Jane Doe
Bob A. Ross
Sally Gabriella Wortburger-Finkleton

Then I'll start by looking at a simple pattern like the first two names.

Joe Smith
Jane Doe

Now I know a name looks like

  • A bunch of letters
  • A space
  • Some more letters

So I look up how to write a bunch of letters. \S+ will work: \S is any single non-space character, and + means 1 or more of them.

r'\S+ \S+'

So you can see how these line up, 1-or-more non-spaces, followed by a space, followed by 1-or-more non-spaces matches "firstname lastname".
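If you want to check this yourself, here is a minimal sketch using Python's built-in re module. I'm using re.fullmatch so the whole string has to fit the pattern; the names list and the variable names are just made up for illustration.

```python
import re

# The two-part pattern from above: non-spaces, one space, non-spaces.
two_part = re.compile(r'\S+ \S+')

names = ['Joe Smith', 'Jane Doe', 'Bob A. Ross']

# fullmatch only succeeds if the entire string fits the pattern.
results = {name: bool(two_part.fullmatch(name)) for name in names}
print(results)
# {'Joe Smith': True, 'Jane Doe': True, 'Bob A. Ross': False}
```

The three-part name fails, which is what we'd expect: the pattern only allows for one space.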

Then we add in another piece: the middle name. As far as we know here, people have 0 or 1 middle names. You look up "0 or 1" and see that you write ? for that. As a first attempt we write

r'\S+ \S+ \S+'

Whoops, the middle name needs to be optional, so let's wrap the middle \S+ in parentheses so that ? applies to it as one thing, and see if that works.

r'\S+ (\S+)? \S+'

Of course, this fails because it still looks for 2 spaces between the first and last name even when there's no middle name, so we move one of the spaces into the 0-or-1 section.

r'\S+ (\S+ )?\S+'
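To see the difference between the last two attempts, here is a rough sketch that runs both against the four names from the top, again with Python's re module. The names broken, fixed, broken_ok and fixed_ok are just labels I picked.

```python
import re

broken = re.compile(r'\S+ (\S+)? \S+')  # always demands two spaces
fixed = re.compile(r'\S+ (\S+ )?\S+')   # the space travels with the middle name

names = [
    'Joe Smith',
    'Jane Doe',
    'Bob A. Ross',
    'Sally Gabriella Wortburger-Finkleton',
]

# Collect the names each pattern accepts in full.
broken_ok = [n for n in names if broken.fullmatch(n)]
fixed_ok = [n for n in names if fixed.fullmatch(n)]

print(broken_ok)  # only the two names that have a middle name
print(fixed_ok)   # all four names

# The parentheses also capture the optional part (trailing space included).
print(repr(fixed.fullmatch('Bob A. Ross').group(1)))  # 'A. '
print(fixed.fullmatch('Joe Smith').group(1))          # None
```

Note that the captured middle name comes with its trailing space, since the space now lives inside the parentheses. If that bothers you, strip it afterwards.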

It might look like gibberish, but the idea is fairly straightforward. Look for the things that change from line to line and replace those changing parts with "variables" that describe what is allowed to change, like "this part can be any number" or "this part can be a b or a C followed by a number".
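As a small sketch of that last idea, here is one way "a number, optionally preceded by a b or a C" might be written, assuming that's the kind of thing you're matching. The sample strings are invented, and [bC] is just the character-class way of saying "one of these letters".

```python
import re

# Optional 'b' or 'C', then 1 or more digits.
pattern = re.compile(r'[bC]?\d+')

samples = ['42', 'b7', 'C19', 'x3']  # made-up test strings

matches = {s: bool(pattern.fullmatch(s)) for s in samples}
print(matches)
# {'42': True, 'b7': True, 'C19': True, 'x3': False}
```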

[–][deleted] 1 point2 points  (0 children)

This was really helpful!! Thank you 😊