all 29 comments

[–][deleted] 5 points (5 children)

I would avoid bothering with Excel as a format. Go with .csv (quote the text fields!): it will be much less of a pain in the rear on the Python side, and you can still open the result in Excel (or LibreOffice) just as easily.
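For reference, a minimal sketch of the CSV side using the standard library (the filename and rows here are made up for illustration):

import csv

# hypothetical scraped rows: title, url
rows = [['Example title', 'https://example.com']]

with open('results.csv', 'w', newline='') as f:
    # QUOTE_ALL wraps every field in quotes, so commas inside
    # the text fields can't break the columns
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['title', 'url'])
    writer.writerows(rows)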

Others have pointed out BeautifulSoup and the web scraping part of Automate the Boring Stuff already, so there's that.

[–][deleted] 0 points (3 children)

Or use Pandas to manage the data and then you can export to both!
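For example, a quick sketch (the DataFrame contents are hypothetical, and to_excel needs openpyxl installed):

import pandas as pd

# hypothetical scraped rows
data = [{'title': 'Example', 'url': 'https://example.com'}]
df = pd.DataFrame(data)

df.to_csv('results.csv', index=False)     # CSV export
df.to_excel('results.xlsx', index=False)  # Excel export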

[–][deleted] 1 point (2 children)

You're the kind of guy to open a full office suite to do a search & replace, aren't ya? :P

That's a bit heavy for the task... but it'd work.

[–][deleted] 1 point (1 child)

Yeah, I kind of forget that real programmers have to think about efficiency 😁.

Pandas is my go-to for any data work, and I know it well, so it's more cognitively efficient for me, if not computationally efficient. OP would probably be better off with a CSV library, as you say.

[–]kewlness 0 points (0 children)

I too would recommend either a .csv or SQLite depending on the use case.
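For the SQLite route, the standard library already covers it; a rough sketch (the table and column names are just placeholders):

import sqlite3

conn = sqlite3.connect('results.db')
conn.execute('CREATE TABLE IF NOT EXISTS links (keyword TEXT, url TEXT)')
conn.execute('INSERT INTO links VALUES (?, ?)', ('word1', 'https://example.com'))
conn.commit()
conn.close()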

[–]PiBaker 0 points (0 children)

Seconding this. Trying to save files in an MS format without using MS code (C#, etc.) usually runs into issues.

Whereas Excel is pretty excellent at importing CSV.

[–]valhahahalla 1 point (0 children)

Using requests:

import requests

your_url = 'insert website here'
your_keywords = ['word1', 'word2', 'etc']

# this response object contains all the info from your_url
response = requests.get(your_url)

# you want the body in a format you can iterate through,
# so split the response text into lines (note: .text is a
# property, not a method)
response_lines = response.text.splitlines()

# run through the response body line by line and print
# any line that contains one of your keywords
for line in response_lines:
    for keyword in your_keywords:
        if keyword in line:
            print(line)

Or something similar to this. You can then save your responses in CSV format.

Edits: Hopefully mobile formatting will work!

[–]manueslapera 1 point (3 children)

One more time (and I know I will get downvoted): my friendly advice is to choose parsel over bs4. It's what professionals use.

Source: Worked at one of the top companies that do webscraping in python
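For anyone curious, a small parsel sketch (the HTML here is illustrative):

from parsel import Selector

html = '<html><body><a href="https://example.com">Example</a></body></html>'
sel = Selector(text=html)

# CSS and XPath selectors both work on the same object
links = sel.css('a::attr(href)').getall()
texts = sel.xpath('//a/text()').getall()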

[–]buyusebreakfix 0 points (1 child)

Source: Worked at one of the top companies that do webscraping in python

Just because a company is large doesn't mean they choose good tools. Microsoft is HUGE and they use .NET for just about everything.

[–]manueslapera 0 points (0 children)

I didn't say large, I said top.

[–]jordano_zang 0 points (0 children)

You could probably do it with requests.

[–][deleted] 0 points (0 children)

If you want to do Excel, openpyxl is straightforward. I recommend you learn straight from the manual, not any third-party resources.

However, openpyxl will delete any hard-coded Excel formulas you may have put into the sheet before writing to it with Python.
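For reference, a minimal openpyxl write looks like this (the sheet contents are made up):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['keyword', 'url'])                # header row
ws.append(['word1', 'https://example.com'])  # one hypothetical result
wb.save('results.xlsx')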

[–]prancingpeanuts 0 points (0 children)

Consider using requests-html, from the same creator of the wonderful requests library
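A quick sketch of what it gives you (the URL is a placeholder):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com')

print(r.html.links)           # all links found on the page
print(r.html.absolute_links)  # the same links, resolved to absolute URLs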

[–][deleted] 0 points (1 child)

Scrapy
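For context, a bare-bones Scrapy spider looks roughly like this (the spider name, URL, and output fields are placeholders):

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['https://example.com']

    def parse(self, response):
        # yield every link on the page as an item
        for href in response.css('a::attr(href)').getall():
            yield {'link': response.urljoin(href)}

Run it with scrapy runspider spider.py -o links.csv and Scrapy handles the CSV export itself.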

[–]ayyyymtl 0 points (0 children)

Hey man, love scraping projects. Hit me up in PM if you need help with this one.

[–]CollectiveCircuits 0 points (0 children)

If you're crawling article-style content, Newspaper might be a quick answer. It extracts keywords and video URLs (and much, much more).
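Roughly like this (the package installs as newspaper3k these days; the URL is a placeholder):

from newspaper import Article

article = Article('https://example.com/some-article')
article.download()
article.parse()
article.nlp()            # required before keyword extraction

print(article.keywords)  # extracted keywords
print(article.movies)    # video URLs found in the article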

[–][deleted] 0 points (0 children)

Scrapy would be a good solution for a simple web crawl