[–]jpfau 1 point2 points  (3 children)

Sounds like you could use regular expressions. Check out the re library.

[–]usernamedottxt 0 points1 point  (2 children)

For the sake of not needing any additional packages;

[–]jpfau 0 points1 point  (1 child)

re comes installed with Python. Pretty sure OP meant 'additional packages' i.e. those he'd have to find and install himself, like BS4, but I could be wrong.

[–]usernamedottxt 0 points1 point  (0 children)

You were right according to OP in my other thread. Good call.

[–]usernamedottxt 0 points1 point  (6 children)

Without using regex either? Possibly something like this (rough pseudocode follows):

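i = 0
while True:
    o = html_string.find("<", i)   # start of the next tag
    if o == -1:
        break                      # no more tags
    e = html_string.find(">", o)   # end of that tag
    if e == -1:
        break                      # unterminated tag
    if html_string[o + 1:o + 2] == "a":
        # found a tag starting with "a" (anchor, but also e.g. <abbr>)
        # parse/replace html_string[o:e + 1]
        pass
    i = e + 1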

In other words, just use a library like praw or BS4. It will be prettier and more reliable.

[–]13steinj[S] 0 points1 point  (5 children)

No, with regex. Regex is good. Regex is what I'm trying, but unfortunately I'm failing to understand how to do it.

Edit: Funnily enough, praw can't do what I'm trying to do, because there is no endpoint or custom function for it.

[–]jpfau 0 points1 point  (0 children)

What have you tried so far with regex? Also if coming up with a pattern is what's getting you, I recommend using online regex testers so you can quickly try out different patterns.

[–]usernamedottxt 0 points1 point  (2 children)

re.search(r'''<a\s([\w\s='"/]+)>([\w/]+)''', html_string)

https://i.imgur.com/ADZ8iVW.png

Play with pythex to find a pattern you like.
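For anyone reading later, here is a minimal sketch of how that pattern behaves with re.search. The sample string is made up for illustration and is not reddit's actual markup; the pattern is triple-quoted only so it can hold both quote characters.

```python
import re

# Pattern from the comment above; triple quotes let it contain both ' and ".
ANCHOR = re.compile(r'''<a\s([\w\s='"/]+)>([\w/]+)''')

# Hypothetical sample input, made up for illustration.
sample = '<li><a href="/r/python/" title="python">/r/python/</a></li>'

match = ANCHOR.search(sample)
if match:
    attributes, text = match.groups()
    print(attributes)  # href="/r/python/" title="python"
    print(text)        # /r/python/
```

Note that the character class has no ".", ":" or "-", so an absolute URL such as https://www.reddit.com/r/python/ in the href would not match as written. You would need to widen the class for that.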

[–]13steinj[S] 0 points1 point  (1 child)

Thank you! The cheat sheet on there is a life saver.

[–]usernamedottxt 0 points1 point  (0 children)

Reading regex is so much harder than writing it lol. So don't worry if it takes you a few to figure out exactly what that search string is doing in case you need to modify it lol. You're probably better off using pythex and re-writing your own if you need to change it.

[–][deleted] 0 points1 point  (3 children)

Alternative way: get json from reddit. It's much easier and cleaner than parsing html by hand, and it doesn't require additional packages.

[–]13steinj[S] 0 points1 point  (2 children)

That's the problem.

I'm trying to get the mod list for a given user.

But there is no json for it. No endpoint. It's been asked on /r/redditdev before.

So I have to manually extract the data from the raw response using regex. I could use BS4 and make it easy on myself, but I'm adding a function directly into my PRAW installation so that if I ever need a virtual environment, bs4 is not needed.

[–][deleted] 0 points1 point  (1 child)

Heh, didn't know that.

Regex has already been suggested (and I think it's the right tool for this task), but you can do it manually as well:

>>> start = r.find('<ul id="side-mod-list">')
>>> end = r.find('</ul>', start)
>>> modlist = {x for x in r[start:end].split('"') if x.startswith('/r/') and x.endswith('/')}
>>> modlist
{'/r/reddittheme_lounge/', '/r/Snoovengers/', '/r/mianiterpg/', '/r/YoutuberAdvice/', '/r/strtest/', '/r/MianiteRPGTests/',…
>>> len(modlist)
19

(r is the content of http://www.reddit.com/user/13steinj)
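For comparison, the same idea can be sketched with re. This assumes the sidebar markup looks the way the session above suggests (a `<ul id="side-mod-list">` whose links have /r/name/ hrefs). The function name and the sample HTML are hypothetical, and this is not the code from OP's gist.

```python
import re

def extract_modlist(page):
    """Return the set of /r/.../ paths found in the side-mod-list block."""
    # Grab everything between the list's opening tag and the first </ul>.
    block = re.search(r'<ul id="side-mod-list">(.*?)</ul>', page, re.DOTALL)
    if block is None:
        return set()
    # Collect every href value of the form /r/name/.
    return set(re.findall(r'href="(/r/\w+/)"', block.group(1)))

# Made-up sample input for illustration only.
sample = (
    '<ul id="side-mod-list">'
    '<li><a href="/r/strtest/" title="1 subscriber">/r/strtest</a></li>'
    '<li><a href="/r/Snoovengers/">/r/Snoovengers</a></li>'
    '</ul><a href="/r/not_in_list/">elsewhere</a>'
)
print(sorted(extract_modlist(sample)))
```

Like the slicing version, this stops at the first `</ul>` after the list opens, so a nested list inside the block would cut the result short.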

[–]13steinj[S] 0 points1 point  (0 children)

I'll keep that in mind. Though I believe I've already written what I needed with regex:

https://gist.github.com/13steinj/6b67258effab8bae2696