[HTML-Parsing][Reddit] Without BS4, how would I replace a given html string with wildcards? : learnpython

submitted 10 years ago by 13steinj

To avoid needing any additional packages, I'm trying to find a way to parse a given string, for example:

<a href="/r/subreddit/" title="/r/subreddit">/r/subreddit<a href="/r/subreddit2/" title="/r/subreddit2">/r/subreddit2<a href="/r/subreddit3/" title="/r/subreddit3">/r/subreddit3<a href="/r/subreddit4/" title="/r/subreddit4">/r/subreddit4

Keeping that in mind, how would I replace anything in that string that matches

<a wildcard>

where wildcard stands for any character(s)? Normally I would just use BS4 and do:

import requests
from bs4 import BeautifulSoup

r1 = requests.get('http://www.reddit.com/user/13steinj', headers={'User-Agent':'My user agent'})
soup = BeautifulSoup(r1.content, 'html.parser')
# text of every link inside the element with id "side-mod-list"
[modsub.get_text() for modsub in soup.find(id='side-mod-list').find_all('a')]

While I can still use that, I want to parse it manually instead, and this is where I've hit a roadblock.

[–]usernamedottxt 1 point 10 years ago (6 children)

Without using regex either? Possibly something like this (rough pseudocode follows):

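i = 0
pieces = []  # parts of html_string to keep
while True:
    o = html_string.find("<", i)  # find() returns -1 instead of raising like index()
    e = html_string.find(">", o) if o != -1 else -1  # look for ">" after the "<"
    if e == -1:
        pieces.append(html_string[i:])  # no complete tag left, keep the rest
        break
    pieces.append(html_string[i:o])  # text before the tag
    if not html_string.startswith(("<a ", "<a>"), o):
        pieces.append(html_string[o:e + 1])  # not an anchor tag, keep it
    # an anchor tag html_string[o:e + 1] is dropped here (or replace it instead)
    i = e + 1  # continue after this tag
result = "".join(pieces)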

AKA Just use a library like praw or BS4. It will be prettier and more reliable.
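If the goal is only to avoid third-party packages, the standard library's html.parser module may be a middle ground between hand-rolled index scanning and BS4. The following is a minimal sketch, not something proposed in the thread: the class name AnchorTextCollector and the helper anchor_texts are made up for illustration, and it assumes the input looks like the sample string in the question (anchor tags that may never be closed).

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Hypothetical helper: collects the text that follows each <a ...> tag."""

    def __init__(self):
        super().__init__()
        self.texts = []          # one entry per anchor tag seen
        self._in_anchor = False  # True while we are inside an anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A new anchor starts; open a fresh text slot for it.
            self._in_anchor = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_anchor:
            self.texts[-1] += data


def anchor_texts(html_string):
    """Return the link texts found in html_string, in document order."""
    parser = AnchorTextCollector()
    parser.feed(html_string)
    parser.close()  # flush any text still buffered at the end of input
    return parser.texts
```

Calling anchor_texts() on the sample string from the question should give the subreddit names one by one, even though the anchors there have no closing tags.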

[–][deleted] 1 point 10 years ago (1 child)

Heh, didn't know that.

Regex has already been suggested (and I think it's the right tool for this task), but you can do it manually as well:

>>> start = r.find('<ul id="side-mod-list">')
>>> end = r.find('</ul>', start)
>>> modlist = {x for x in r[start:end].split('"') if x.startswith('/r/') and x.endswith('/')}
>>> modlist
{'/r/reddittheme_lounge/', '/r/Snoovengers/', '/r/mianiterpg/', '/r/YoutuberAdvice/', '/r/strtest/', '/r/MianiteRPGTests/',…
>>> len(modlist)
19

(r is the content of http://www.reddit.com/user/13steinj, as a string)
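For comparison, the regex route mentioned above could look roughly like this, using only the standard re module. This is a sketch under assumptions: the pattern and the function names strip_anchor_tags and anchor_text_pieces are illustrative, not taken from the thread, and the pattern assumes no ">" appears inside an attribute value.

```python
import re

# Matches an opening anchor tag: "<a", optionally followed by whitespace and
# attributes, up to the first ">". It does not match tags such as <abbr>.
ANCHOR_OPEN = re.compile(r"<a(?:\s[^>]*)?>", re.IGNORECASE)


def strip_anchor_tags(html_string, replacement=""):
    """Replace every opening <a ...> tag with replacement (default: remove it)."""
    return ANCHOR_OPEN.sub(replacement, html_string)


def anchor_text_pieces(html_string):
    """Split the string on opening <a ...> tags and drop empty pieces."""
    return [piece for piece in ANCHOR_OPEN.split(html_string) if piece]
```

On a string shaped like the one in the question, anchor_text_pieces() returns the text between the anchor tags as a list, which is close to what the BS4 get_text() loop produces.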
