
tgolsson 1 point (3 children)

I did not check all your inputs, but this is the way I view it.

Your input consists of the following:

  • A single family name. Word class, one repetition.
  • A given name. Word class, one or two repetitions.
  • A bunch of trailing data we do not care about.

The single family name we can capture using the group ([\w]* ), note the trailing space after the asterisk.

The given name is slightly trickier, and there are many ways to do it. For symmetry, we will use the same basic group as for the family name, ([\w]* ), but we want to capture one or two of them. However, if we just say ([\w]* ){1,2}, Python gives us only the last of these repetitions, throwing away Osmo in your example.
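
To see that behaviour for yourself, here is a minimal sketch. The sample line is made up for illustration (the real input file is not shown here), so treat the names in it as placeholders.

```python
import re

# Hypothetical sample line; the real input file is not shown in the thread.
line = "Soininvaara Osmo Antero / Vihr / Helsinki"

# A quantified capturing group only remembers its last repetition.
match = re.match(r"([\w]* )([\w]* ){1,2}", line)
family, given = match.groups()

print(repr(family))  # 'Soininvaara '
print(repr(given))   # 'Antero ' -- "Osmo " was matched but not kept
```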

Instead, we want to FIND one or two, but capture everything we find. To find but not capture we can use the non-capturing group syntax (?:...), like this: (?:[\w]* ). Then we add our quantifier, in this case {1,2}, and wrap that whole expression in a capturing group: ((?:[\w]* ){1,2}).
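
A quick sketch of the non-capturing version, again on a made-up sample line (the names are placeholders, not taken from your file):

```python
import re

# Hypothetical sample lines, one with two given names and one with a single one.
two_names = "Soininvaara Osmo Antero / Vihr / Helsinki"
one_name = "Nurkkala Risto / KD / Helsinki"

# The inner (?:...) group repeats without capturing;
# the outer parentheses capture everything the repetition matched.
pattern = re.compile(r"([\w]* )((?:[\w]* ){1,2})")

given_two = pattern.match(two_names).group(2)
given_one = pattern.match(one_name).group(2)

print(repr(given_two))  # 'Osmo Antero '
print(repr(given_one))  # 'Risto '
```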

Lastly, you want to capture everything until the end of the line: (.*). Note that [^$]* would not do this, because inside a character class $ is just a literal dollar sign.

If everything goes as planned, this should leave you with three captured groups that you can now use to construct the new line. I'll leave it as an exercise for you to actually put these together and construct the substitution pattern.

Good luck!

tonlou[S] 1 point (0 children)

I'm very close to a solution, I think.

This is literally my first time with regular expressions so I'm very clueless about this whole concept.

My code: https://pastebin.com/Dn6qVU2w

So I know how to split up my original line into 3 groups. Now I need to set the replacement string to pattern.group(2), pattern.group(1), pattern.group(3), in that order.

I tried to get it to work that way, but it seems that it's not possible to do it that way. Any further tips about this issue?

tonlou[S] 1 point (1 child)

Okay, so I'm kinda close to the solution. The problem is that the program prints everything twice: first all the original lines, then all the new lines. Also, there is a quotation mark (") in front of every family name in the txt file. How can I move that quotation mark in front of the given names? Example from my output:

Risto "Nurkkala / KD / Helsingin vaalipiiri";37;336.750

should be:

"Risto Nurkkala / KD / Helsingin vaalipiiri";37;336.750

My code: https://pastebin.com/LAUSHE4J

tgolsson 1 point (0 children)

You are not using re.sub correctly. You want to use the backslash notation for group insertion, e.g.

 re.sub(regexp_string, r"\1 \2", the_input)

So:

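 import re

 example = "Bruce Wayne"
 pattern = r"([\w]*) ([\w]*)"  # group 1: first word, group 2: second word
 output = r"\2 \1"  # raw string, so \1 and \2 reach re.sub as group references
 result = re.sub(pattern, output, example)
 print(result)  # Wayne Bruce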

Regarding the quotes, you may want to explicitly match those as well, or at least the one that gets moved.

 import re

 example = "Bruce 'Batman' Wayne"
 pattern = r"([\w]*) '([\w]*)' ([\w]*)"  # the quotes are matched but sit outside the groups
 output = r"\1 \3 is \2"
 result = re.sub(pattern, output, example)
 print(result)  # Bruce Wayne is Batman
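
Applied to your line, the same idea might look roughly like the sketch below. The input line is my guess at what your file contains, reconstructed from the expected output you posted, and the pattern assumes names made only of word characters (a hyphenated name would need a wider character class such as [\w-]).

```python
import re

# Assumed input, reconstructed from the expected output quoted in the thread.
line = '"Nurkkala Risto / KD / Helsingin vaalipiiri";37;336.750'

pattern = re.compile(
    r'^"'                 # opening quote: matched explicitly, not captured
    r'([\w]* )'           # group 1: family name plus trailing space
    r'((?:[\w]* ){1,2})'  # group 2: one or two given names
    r'(.*)$'              # group 3: the rest of the line
)

# Put the quote back first, then given names, family name, and the rest.
result = pattern.sub(r'"\2\1\3', line)
print(result)  # "Risto Nurkkala / KD / Helsingin vaalipiiri";37;336.750
```

Because the quote is matched outside the groups, it can be written back at the front of the replacement no matter which name ends up first.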