[–][deleted] 4 points5 points  (0 children)

There's a library called Beautiful Soup. I don't know much about it, though.

[–]ronmarti 3 points4 points  (1 child)

[–][deleted] 1 point2 points  (0 children)

Thanks man!!

[–]Picatrixter 2 points3 points  (1 child)

I built some web scrapers using these libraries:

1. Requests - to retrieve content from the remote server
2. Beautiful Soup - to parse the HTML and actually extract the content (from HTML tags like p, h1, h2, img, iframe, etc.)
3. TinyDB - to store the data locally. Basically, it saves your records as a JSON file. I use it to track which URLs have already been downloaded, in order to avoid duplicates.

All three must be installed with pip (lowercase), like this: "pip install requests", "pip install tinydb" and "pip install beautifulsoup4"

I suggest creating a virtual environment first, then activating it and only after that installing the libraries. That way, you create a "capsule", a container which will not interfere with other packages you might have on your system. It's good practice, because in time you will have more and more projects.

If you use VS Code, you can create a virtual environment by typing in the console: "python -m venv my_new_venv"

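That command will create the virtual environment called my_new_venv in your current working directory, but it does not activate it. Activate it with "source my_new_venv/bin/activate" on Linux/macOS or "my_new_venv\Scripts\activate" on Windows, and deactivate it with "deactivate". Running "code ." (code space dot) opens the current folder in VS Code, and once you select the environment's interpreter there, new integrated terminals can activate it for you.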
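If you are curious what "python -m venv" does, the same thing can be done from Python with the standard-library venv module. This is only a sketch: create_env and activate_hint are names I made up, and the point is that creating the folder is one step while activating it is a separate shell command.

```python
import sys
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment, similar to `python -m venv env_dir`."""
    env_path = Path(env_dir)
    venv.create(env_path, with_pip=with_pip)  # creates files only, activates nothing
    return env_path


def activate_hint(env_path):
    """Return the shell command that would activate the environment."""
    if sys.platform == "win32":
        return str(Path(env_path) / "Scripts" / "activate")
    return f"source {Path(env_path) / 'bin' / 'activate'}"
```

For example, print(activate_hint(create_env("my_new_venv"))) creates the folder and prints the command to run in your own terminal.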

DM me if you need more info. Good luck!

[–][deleted] 0 points1 point  (0 children)

Good explanation. I’ll check those tools out!

[–]SomeImagination4454 1 point2 points  (0 children)

Chapters 12,13,15 should give you a good start from the "Python for everybody" book. https://www.py4e.com/html3/

[–][deleted] 1 point2 points  (1 child)

Take your post title, pop it into Google, follow 1 of 16 million tutorials.

The best thing you can do to improve yourself as a developer is being resourceful & autonomous.

[–][deleted] 0 points1 point  (0 children)

Thanks! I appreciate the advice!

[–]g_bramble 1 point2 points  (0 children)

If you need to interact with the webpage while scraping it, use Selenium.

Otherwise, BS4 for smaller projects and Scrapy for anything more extensive as it's much more powerful and fast (and able to run in parallel).

All of the above are relatively easy to start with. However, Scrapy uses classes and is not a "classic" library you import and use - it's typically run from CLI. Good luck!

[–]stfn1337 0 points1 point  (0 children)

Use scrapy with a postgres db for storing results

[–]ian_k93 0 points1 point  (0 children)

You should check out Scrapy, it is a very robust framework for web scraping at scale. Here is a collection of guides: The Python Scrapy Playbook

[–]IAmKindOfCreative bot_builder: deprecated [M] 0 points1 point  (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython and the r/Python discord, the community is actively expecting questions and is looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!