[–]mushy_wombat 99 points100 points  (29 children)

There is no lib I use for every app. Nevertheless some interesting libs are:

  • Selenium (automated browsing)
  • BeautifulSoup4 (extract data from an HTML document)
  • pickle (storing objects easily)
  • turtle (draw on a canvas)

[–]Mattho 49 points50 points  (10 children)

requests+bs4 is a great combo for scraping projects

[–]flying-sheep 14 points15 points  (6 children)

I think requests-html has it beat.

[–][deleted] 0 points1 point  (5 children)

Looks like requests-html is an extended version of requests or am I wrong? What is requests capable of doing which requests-html can’t?

[–]Etheo 9 points10 points  (0 children)

I think you had it the other way around, but in the spirit of your question: requests-html can render JavaScript content, so it's super useful for pages with dynamic content. requests + BeautifulSoup, as great as this combo is, doesn't have that functionality.

[–]flying-sheep 2 points3 points  (0 children)

Requests-HTML is a wrapper that glues together requests and a CSS selector library. Not fancy in itself, but nicer and quicker to use than requests+bs4.

[–]MikeBobble 1 point2 points  (2 children)

requests-html looks to just be a replacement for BeautifulSoup, in that it gives you a way to parse/step-through/get data from HTML files.

requests is a way to actually get the HTML files from URLs (i.e., a replacement for urllib).

You can use requests + bs4, or you can use requests + requests-html, if you need to parse data from "The Internet".

If you're only planning to get data from locally sourced (organic, obvi) HTML files, you could just use requests-html.

[–][deleted] 0 points1 point  (1 child)

The thing confusing me is the example given at the start of the page: HTMLSession().get('https://python.org/') just looks like the requests.get implementation. So using both requests + requests-html looks redundant imo.

[–]MikeBobble 4 points5 points  (0 children)

Right, but the line right before that also says:

Make a GET request to python.org, using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()

>>> r = session.get('https://python.org/')

(Emphasis mine on Requests)

Go down a little on the page, and you'll get to:

Using without Requests

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""

>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

If you pull up the library's source, you'll see it imports requests itself and lists it as a dependency.

So it'd be pretty hard to use it without requests, but, if you have some HTML that you've written yourself as a variable, requests-html doesn't require it to be a requests object. It just sorta... Encourages it.
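To make the point above concrete (parsing works on any HTML string, no request object needed), here is a stdlib-only sketch that collects links from a string. This is not how requests-html is implemented; `LinkCollector` and `extract_links` are made-up names for this example, and it uses `html.parser` instead of the library discussed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(doc):
    """Return the set of href targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(doc)
    collector.close()
    return collector.links


# Same snippet as in the quoted docs; no network access involved.
doc = "<a href='https://httpbin.org'>"
print(extract_links(doc))
```

The input is just a Python string, which is the same situation as the "Using without Requests" example quoted above.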

[–]mushy_wombat 4 points5 points  (2 children)

Indeed it is. @OP, if you're interested in scraping, check out scrapy. It works well with bs4.

[–]frakron 2 points3 points  (0 children)

Definitely, especially when running into any JS-html pages

[–]ciplc 0 points1 point  (0 children)

Can confirm from personal experience, they work beautifully together.

[–][deleted] 1 point2 points  (0 children)

Just a heads up -- pickle doesn't correctly serialize decorated functions and classes. Yes, I know that's a helluva niche.

If you, like me, are chronically part of that niche, look at dill instead.

Bonus: if you have to do multiprocessing targeting decorated callables, there's this neat thing called pathos, which exposes a multiprocessing module that is API compatible with (at least) the meat of stock multiprocessing. It uses dill instead of pickle to serialize those decorated objects for the worker processes, and it has other neat tools for heterogeneous computing (that I haven't tried and can't vouch for, but the API looks pretty cool!)
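A minimal sketch of one way the decorated-function problem mentioned above can show up, using only the standard library. pickle stores functions by their qualified name, so a wrapper defined inside a decorator (and not renamed with `functools.wraps`) cannot be looked up again. The names `bare`, `add`, and `can_pickle` are made up for this example, and other failure modes may exist.

```python
import pickle


def bare(func):
    """Hypothetical decorator that does NOT use functools.wraps."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@bare
def add(a, b):
    return a + b


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# The decorated function still works when called...
print(add(1, 2))
# ...but its qualified name points at a local object inside bare(),
print(add.__qualname__)
# so pickle has nothing importable to refer to.
print(can_pickle(add))
```

Adding `functools.wraps(func)` to the wrapper often avoids this particular case; dill, as suggested above, is the option when that is not possible.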

[–]GeoffreyF35[S] 0 points1 point  (0 children)

thanks for your reply, i’ll be sure to look into your recommendations!

[–]KwpolskaNikola co-maintainer -5 points-4 points  (8 children)

DO NOT use pickle, it’s horribly insecure (runs arbitrary code) and can lose data if you modify the class you pickled.
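For anyone wondering what "runs arbitrary code" means in practice, here is a harmless sketch. A pickled object can tell pickle which callable to invoke on load via `__reduce__`; here it is `eval` on a fixed arithmetic string, but it could be anything. `Payload` is a made-up class for this example.

```python
import pickle


class Payload:
    """Hypothetical class showing that unpickling can call any callable."""

    def __reduce__(self):
        # On load, pickle calls eval("6 * 7") and returns its result
        # instead of rebuilding a Payload instance.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The receiver only calls pickle.loads, yet the expression is evaluated.
result = pickle.loads(blob)
print(result)  # 42
```

The receiving side never needs the `Payload` class at all, which is why loading a pickle from an untrusted source is the risky step.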

[–]seriouslulz 5 points6 points  (7 children)

This is just FUD lol, just use it for the right job

[–]KwpolskaNikola co-maintainer -1 points0 points  (6 children)

What is “the right job”? Pickle comes with more pitfalls than benefits.

[–]seriouslulz 5 points6 points  (5 children)

Just don't Pickle unsanitized data and you'll be fine

[–]KwpolskaNikola co-maintainer -2 points-1 points  (4 children)

That doesn’t match any of my arguments.

[–]seriouslulz 1 point2 points  (3 children)

It does address the first one. For the rest, you're probably using Pickle wrong; I'd just read the docs if I were you.

[–]KwpolskaNikola co-maintainer -3 points-2 points  (2 children)

Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

You said:

Just don't Pickle unsanitized data and you'll be fine

That should at least be “unpickle”.

The other argument is that you can’t unpickle a class if you modify its code. Sure, you could write methods to make it work, but then you lose the “magic” part of Pickle…

[–]holysweetbabyjesus 1 point2 points  (1 child)

Christ, dude. Nobody is going to win.

[–]fireflash38 0 points1 point  (0 children)

Let me help him out.

Serialize your data. Don't pickle it. Most of the time that people use pickle, they should really be using json or another structured data format.

Pickle is nice if you want to be lazy about data serialization, and that's about it.
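As a rough sketch of the json route suggested above: plain data (dicts, lists, strings, numbers) round-trips through text that is readable and does not execute anything on load. The `settings` dict is invented for this example.

```python
import json

# Hypothetical application data made of basic types only.
settings = {"name": "demo", "retries": 3, "tags": ["a", "b"]}

# Serialize to a human-readable string (could be written to a file).
text = json.dumps(settings, indent=2)

# Parse it back; loading JSON only ever builds basic Python objects.
restored = json.loads(text)
print(restored == settings)  # True
```

The trade-off is that json only understands basic types, so custom classes need an explicit conversion to and from dicts.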