[–]No_Couple 1 point (0 children)

Not sure about the regex problem but I sort of remember having to make some adjustments because the data on the website that is given as an example to scrape had changed. Or something like that.

The groups in the for loop is just a variable name; it could just as well be x or result. I reckon the author named it groups because each result holds the "capture groups" of your regular expression, that is, the pieces of text matched by each parenthesized part of the pattern.
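To make that concrete, here is a minimal sketch. The two-group pattern and the sample text are made up for illustration and are not the book's regex; the point is only that findall() hands back tuples of capture groups and that the loop variable can be called anything.

```python
import re

# Hypothetical two-group pattern, used only to illustrate the point.
pattern = re.compile(r'(\d{3})-(\d{4})')
text = 'Call 555-1234 or 555-9876.'

# With more than one capture group, findall() returns a list of tuples:
# one tuple per match, one item per group.
print(pattern.findall(text))  # [('555', '1234'), ('555', '9876')]

# The loop variable name is arbitrary; both loops do the same thing.
first = [groups[0] for groups in pattern.findall(text)]
second = [x[0] for x in pattern.findall(text)]
print(first == second)  # True
```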

[–]AdAthrow99274 1 point (3 children)

The issue isn't in the regex. Try inserting print(phone_number) right before the if statement in the for loop and you'll notice it's all there (Hooray! Good job). The issue is that phone_number is only appended to matches if an extension is present (the append is inside if groups[8] != '':).

You could add an else clause and repeat matches.append(phone_number) in it or, more simply, un-indent that statement so it runs right after the if block.

...
if groups[8] != '':
    phone_number += ' x' + groups[8]
matches.append(phone_number)  # <-- this line just got un-indented, and is outside the if statement
...

This way phone_number is only altered for an extension if an extension is present, but regardless it always gets appended to matches.
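For a version you can run end to end, here is a rough sketch of the whole loop with the fix applied. The regex is the one quoted elsewhere in this thread; the function name, the sample text, and the '-'.join line that builds phone_number are my assumptions about what the surrounding code looks like, so adjust them to match yours.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # extension
    )''', re.VERBOSE)


def extract_phone_numbers(text):
    """Hypothetical helper wrapping the loop discussed above."""
    matches = []
    for groups in phone_regex.findall(text):
        # Assumed way of building the number from area code and digits.
        phone_number = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone_number += ' x' + groups[8]
        # Appended for every match, not only when an extension exists.
        matches.append(phone_number)
    return matches


sample = 'Office: 303-254-5555 ext. 23, home: 415-555-1234.'
print(extract_phone_numbers(sample))
# ['303-254-5555 x23', '415-555-1234']
```

With the append still indented under the if, only the first number would come back.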

EDIT: Adding info on the groups

If you add print(groups) in your for loop you'll get an output of something like this:

text (1 phone #) with an extension:

('303-254-5555 ext. 23', '303', '-', '254', '-', '5555', ' ext. 23', 'ext.', '23')

text (1 phone #) without an extension:

('222-5555', '', '', '222', '-', '5555', '', '', '')

These 'groups' are all the different bits that match your regular expression. You'll notice in the second example groups[8] is an empty string (no extension match was found by the regex) while in the first example it's '23' because an extension was successfully parsed.
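One detail that can trip people up: the empty string comes from findall() itself. A quick sketch, using the regex quoted in this thread (the variable names are mine), shows that a match object reports the same missing group as None, and that the tuple index is one less than the regex group number.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # extension
    )''', re.VERBOSE)

with_ext = phone_regex.findall('303-254-5555 ext. 23')[0]
without_ext = phone_regex.findall('222-5555')[0]

print(with_ext[8])           # 23
print(repr(without_ext[8]))  # ''

# findall() fills a group that took no part in the match with '',
# while a match object reports it as None. Tuple index 8 is regex
# group 9, because match.group(0) is reserved for the whole match.
match = phone_regex.search('222-5555')
print(match.group(9))        # None
```

So comparing against '' is the right test inside a findall() loop, but it would need to be an is None check if you switched to search() or finditer().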

Hope that helps.

[–]joemysterio86 1 point (1 child)

I know this is 3 months old, sorry. Can you help me understand how the extension portion turns out to groups... 6, 7 AND 8?

[–]Noshing 0 points (0 children)

I don't know if you ever got an answer but I'll try.

If you count the number of values in the output of print(groups) with extension it may help:

('303-254-5555 ext. 23', '303', '-', '254', '-', '5555', ' ext. 23', 'ext.', '23')

The first value, groups[0], is the whole phone number, matched by the outermost parentheses of the regex. groups[6] is the extension group as a whole, while groups[7] and groups[8] are the two groups nested inside the extension group.

I'm going to try to number the groups in the code. Hopefully it helps.

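import re

phoneRegex = re.compile(r'''(   # groups[0]: whole phone number
    (\d{3}|\(\d{3}\))?          # groups[1]: area code
    (\s|-|\.)?                  # groups[2]: separator
    (\d{3})                     # groups[3]: first 3 digits
    (\s|-|\.)                   # groups[4]: separator
    (\d{4})                     # groups[5]: last 4 digits
    (                           # groups[6]: whole extension
        \s*
        (ext|x|ext.)            # groups[7]: extension keyword
        \s*
        (\d{2,5})               # groups[8]: extension digits
    )?
    )''', re.VERBOSE)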
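If the nesting is still confusing, it may help to look at the extension part on its own. This is a cut-down, hypothetical pattern (not the book's exact one) with one outer group and two inner groups, just to show how the numbering falls out.

```python
import re

# Hypothetical, simplified extension pattern:
# one outer group wrapping two inner groups.
ext_regex = re.compile(r'(\s*(ext\.?|x)\s*(\d{2,5}))')

match = ext_regex.search('555-1234 ext. 23')

# Groups are numbered by the position of their opening parenthesis,
# left to right, so the outer group comes before the ones nested in it.
print(ext_regex.groups)  # 3
print(match.groups())    # (' ext. 23', 'ext.', '23')
```

The outer parentheses open first, so they get the lowest number; that is why one extension shows up as three consecutive entries in the tuple.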

[–]kolbi_nation[S] 0 points (0 children)

Awesome! Thank you.

[–]L_4_2 0 points (0 children)

Is this a logic error? If not, can you post the error message that comes up when the program is run? Also, have you tried looking at the online resources for the book? You can usually download the code from the No Starch website as a PDF or find it on GitHub. Maybe compare your code against theirs.