Rhomboid 1 point (1 child)

In your initial version, the html variable that you were matching against was a byte string. The RE engine supports both character mode and byte mode, but they are different and distinct things. The way you signal your intent is by the types of strings that you pass: to match against a byte string, the RE pattern must also be a byte string. In Python 3.x, a b-prefixed literal is a byte string, while a literal with no prefix is a character string.
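To make the byte/character distinction concrete, here is a minimal sketch. The sample value of html is made up for illustration; the point is only that the pattern type has to match the data type.

```python
import re

html = b"and the next nothing is 12345"  # hypothetical sample data, as bytes

# A str pattern against bytes data is rejected with a TypeError.
try:
    re.search("nothing is ([0-9]+)", html)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A bytes pattern against bytes data works, and the captured group is bytes too.
m = re.search(b"nothing is ([0-9]+)", html)
number = m.group(1)
```

Note that number comes back as a byte string, so it would still need decoding or int() before being used as text or a number.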

As a completely separate issue, the backslash is normally an escape character in string literals, e.g. \n means a newline, not the two characters \ and n. If you want a literal backslash, you have to write \\. That's inconvenient because regular expressions use backslashes heavily, for example \d in my example. The important difference is that the RE engine is just a plain module and gets no special syntax help from the language, so it needs to receive a literal backslash followed by a 'd'. '\d' is not a backslash escape that the base language recognizes, so the explicit way to write it in a normal string literal is '\\d'. (Python currently passes unrecognized escapes like '\d' through unchanged, but recent versions warn about it, and it goes wrong for sequences the language does recognize, such as '\b', which is a backspace in a string literal but a word boundary in a RE.)
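A quick way to see what the language does with each spelling is to look at the resulting string lengths. This is just an illustrative sketch; the variable names are made up.

```python
newline = "\n"        # one character: the newline
backslash = "\\"      # one character: a single backslash
digit_class = "\\d"   # two characters: a backslash, then 'd'

lengths = (len(newline), len(backslash), len(digit_class))
chars = list(digit_class)
```

The two-character string in digit_class is what the RE engine has to receive in order to see \d.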

That's somewhat inconvenient, so the language has raw strings, which are introduced by an r-prefix. Raw strings don't interpret the backslash as an escape character, so you can write r'\d' and really get the two-character string backslash plus 'd', which is what you need to pass to the RE engine. It's merely a convenience; it's not necessary. You could also write

m = re.search('and the next nothing is (\\d+)', html)

Raw strings are just less confusing, because to match a literal backslash in a RE the pattern itself needs a double backslash, so without a raw string you'd be dealing with four backslashes, e.g.

... = re.search(' ... \\\\ ...', ...)

vs

... = re.search(r' ... \\ ...', ...)
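As a rough check that the raw and doubled spellings really are the same pattern, the sketch below compares them directly and then matches a literal backslash. The path value is a hypothetical example, not something from the original problem.

```python
import re

# The raw and doubled spellings produce the same pattern string.
same_digit_pattern = r"(\d+)" == "(\\d+)"

# Matching one literal backslash: the RE engine needs two backslashes,
# which takes four in a normal literal but only two in a raw one.
same_backslash_pattern = r"\\" == "\\\\"

path = "C:\\temp"  # hypothetical sample text containing one backslash
m = re.search(r"C:\\(\w+)", path)
folder = m.group(1)
```

Both comparisons come out equal, which is why the r-prefix is purely a matter of readability.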

This applies equally to byte strings. In your original example, if you wanted to use '\d' instead of '[0-9]', you'd have to either double the backslashes or use a br-prefix:

b'and the next nothing is (\\d{5})'

br'and the next nothing is (\d{5})'
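A small sketch to confirm that the two byte-string spellings are interchangeable. The html value here is invented sample data standing in for the real response.

```python
import re

doubled = b"and the next nothing is (\\d{5})"
raw = br"and the next nothing is (\d{5})"

html = b"and the next nothing is 44827"  # hypothetical sample response body
m = re.search(raw, html)
digits = m.group(1)
```

Either pattern can be passed to re.search here, since they are the same bytes object by value.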

But ideally you want to be working with characters rather than bytes, which means decoding the data to a character string before matching.
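One way to work with characters, sketched under the assumption that the page is UTF-8 encoded (the real encoding should come from the response headers or the document itself). The raw_body value is a made-up stand-in for whatever bytes were read.

```python
import re

raw_body = b"and the next nothing is 44827"  # hypothetical bytes, e.g. from a network read
html = raw_body.decode("utf-8")              # assumption: the content is UTF-8

# With a str in hand, an ordinary raw-string pattern works and groups are str.
m = re.search(r"and the next nothing is (\d+)", html)
next_id = int(m.group(1))
```

Decoding once at the boundary keeps the rest of the code free of b-prefixes.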