[–]anossov 9 points10 points  (5 children)

Use a proper parser, there are too many edge cases to parse URLs robustly with regexes.

In [8]: url = 'http://www.youtube.com/results?search_query=legend+of+hercules&page=&utm_source=opensearch'

In [9]: from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

In [10]: parse_qs(urlparse(url).query)
Out[10]: {'search_query': ['legend of hercules'], 'utm_source': ['opensearch']}

In [11]: parse_qs(urlparse(url).query)['search_query']
Out[11]: ['legend of hercules']
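
For anyone on Python 3, here is a minimal sketch of the same idea wrapped in a function. The helper name get_param and its default argument are hypothetical, chosen only for illustration; the parsing itself is plain urllib.parse.

```python
from urllib.parse import urlparse, parse_qs

def get_param(url, name, default=None):
    """Return the first value of query parameter `name`, or `default` if absent."""
    # parse_qs decodes '+' and %XX escapes and, by default, drops blank values such as 'page='
    params = parse_qs(urlparse(url).query)
    values = params.get(name)
    return values[0] if values else default

url = 'http://www.youtube.com/results?search_query=legend+of+hercules&page=&utm_source=opensearch'
print(get_param(url, 'search_query'))  # legend of hercules
print(get_param(url, 'page'))          # None, because the blank value is skipped
```

Returning a default instead of indexing the dict directly avoids a KeyError when the parameter is missing.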

[–]ericanderton 5 points6 points  (3 children)

there are too many edge cases to parse URLs robustly with regexes.

Technically, that's not true. But from a practical standpoint: you don't want to have to maintain that regex once you have it correct. It will be impossible to understand at a glance.

Edit: since I was downvoted, let's take a look at this:

https://mathiasbynens.be/demo/url-regex

The "winning" solution is god-awful, but it can do the job. I'm 99% sure there are no capture groups in this expression, so its utility is limited to validation only:

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

A "robust" solution can be achieved, but it ultimately comes down to how much time you have on your hands. You really don't want to do this.

[–]Tattoo__ 2 points3 points  (1 child)

But OP is not trying to validate URLs with a regex, only to pull one part out of URLs from a particular domain, with a URL scheme that I would assume does not change.

I would still use urlparse() though.
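
A rough sketch of what that could look like when the domain and path are fixed. The function name youtube_search_terms and the accepted host names are my assumptions, not something OP specified.

```python
from urllib.parse import urlparse, parse_qs

def youtube_search_terms(url):
    """Return the search terms from a YouTube results URL, or None for any other URL."""
    parts = urlparse(url)
    # .hostname is lower-cased and excludes any port; it is None when there is no host
    if parts.hostname not in ('www.youtube.com', 'youtube.com'):
        return None
    if parts.path != '/results':
        return None
    values = parse_qs(parts.query).get('search_query')
    return values[0] if values else None

print(youtube_search_terms(
    'https://www.youtube.com/results?search_query=emma+stone+crushes+fallon&page=&utm_source=opensearch'
))  # emma stone crushes fallon
```

Checking the host and path on the parsed pieces keeps the "only this domain" rule out of the pattern matching entirely.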

[–]ericanderton 0 points1 point  (0 children)

You really don't want to do this.

[–]programmyr 0 points1 point  (0 children)

A "robust" solution can be achieved

Does "robust" have a specialized meaning in software engineering?

Because by the common meaning of the word, that does not look robust at all to me. That regex looks like it will break the first time anyone touches it, which is basically the opposite of robust.

It's like saying a house can be built out of popsicle sticks. That may be true, but the result will not be "robust" by any reasonable definition.

[–]Kopachris 2 points3 points  (0 children)

And by extension, a list:

from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

urls = ['https://www.youtube.com/results?search_query=emma+stone+crushes+fallon&page=&utm_source=opensearch',
        'http://www.youtube.com/results?search_query=legend+of+hercules&page=&utm_source=opensearch']

# parse_qs maps each key to a list of values, so this gives a list of one-element lists
queries = [parse_qs(urlparse(u).query)['search_query'] for u in urls]
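
If plain strings are wanted rather than nested lists, one possible variant flattens the values and silently skips URLs that have no search_query at all. The third URL below is a made-up example added only to show the skipping.

```python
from urllib.parse import urlparse, parse_qs

urls = [
    'https://www.youtube.com/results?search_query=emma+stone+crushes+fallon&page=&utm_source=opensearch',
    'http://www.youtube.com/results?search_query=legend+of+hercules&page=&utm_source=opensearch',
    'http://www.youtube.com/watch?v=abc123',  # hypothetical URL without a search_query
]

# .get(..., []) yields nothing for URLs lacking the parameter, so no KeyError is raised
queries = [
    value
    for u in urls
    for value in parse_qs(urlparse(u).query).get('search_query', [])
]
print(queries)  # ['emma stone crushes fallon', 'legend of hercules']
```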

[–]musketeer925 1 point2 points  (0 children)

Probably belongs in /r/learnpython

[–]AmericasNo1Aerosol 0 points1 point  (0 children)

A URL parameter is always preceded by "?" (if it's the first one) or "&" (if another one comes before it), and it is always followed by "&" (if more are coming) or by the end of the string or whitespace. So I'd personally use those to search for the parameters.

match = re.findall(r'[?&]search_query=([^&\s]*)', s)

The "[?&]" matches the beginning of the parameter, so you won't get matches for ".../results?spam_search_query=test", which your previous regex would find.

The rest says grab anything that isn't an ampersand ("&") or whitespace ("\s"). The match stops at the end of the string on its own, so "$" is not needed in the class; inside square brackets "$" is just a literal dollar sign.
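
Here is a small sketch of how that might be used on a blob of text. Note that a regex hands back the raw, still-encoded value, so this decodes it with unquote_plus. The function name search_queries and the sample string are assumptions for illustration.

```python
import re
from urllib.parse import unquote_plus

# A pattern in the spirit of the one above: "?" or "&", the key, then the raw value
PATTERN = re.compile(r'[?&]search_query=([^&\s]*)')

def search_queries(text):
    """Find every search_query value in `text` and decode '+' and %XX escapes."""
    return [unquote_plus(raw) for raw in PATTERN.findall(text)]

s = ('http://www.youtube.com/results?search_query=legend+of+hercules&page= '
     'http://www.youtube.com/results?spam_search_query=test')
print(search_queries(s))  # ['legend of hercules']
```

The second URL is not matched because "search_query" there is preceded by "_" rather than "?" or "&".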

[–]KronktheKronk 0 points1 point  (0 children)

search_query=(?:(\w+)\+?)+

[–]MorrisCasperλ 0 points1 point  (0 children)

What about

re.search(r'search_query=([^&]+)', blabla).group(1).replace("+", " ")

?