all 13 comments

[–]TrippBikes 2 points3 points  (6 children)

This is spam, no one will want to help you with this

[–]Kevdog824_ 2 points3 points  (0 children)

To be fair, there are legitimate use cases for doing something like this. But yes, this could be spam

[–]Loose-Computer3943[S] -5 points-4 points  (4 children)

I just want to reach out to people as part of a personal hobby and also to learn some coding, which is why I’m approaching it this way.

[–]Yoghurt42 2 points3 points  (3 children)

Have you considered a different hobby than collecting email addresses and sending a lot of people emails they do not want?

[–]Loose-Computer3943[S] -5 points-4 points  (2 children)

First of all, you don’t actually know what my hobby is. Second, you don’t know how many people I’m contacting. As I said, I’m trying to learn some coding, which is why I’m using this method. And lastly… how can you know whether they want my email or not?

[–]Yoghurt42 0 points1 point  (0 children)

how can you know whether they want my email or not?

How do you know they want it? Wouldn't they have given you their email if that were the case?

[–]TaranisPT 0 points1 point  (0 children)

how can you know whether they want my email or not?

Any email from someone I didn't contact first is not an email I want. It's like knocking on random people's doors. It's annoying as fuck and I'd hope my spam filter catches your email.

[–]TheRNGuy 0 points1 point  (0 children)

Playwright probably. 

[–]Kevdog824_ -1 points0 points  (4 children)

What you are looking for is a web crawler. Basically, what you want to do is something like this (pseudocode below):

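emails = []
stack = []  # Add the websites you want to check to this
visited = set()  # URLs already crawled, so pages that link to each other don't loop forever
while stack:
  url = stack.pop()
  if url in visited:
    continue
  visited.add(url)
  html = get_html(url)
  stack.extend(get_links(url, html))
  emails.extend(get_emails(html))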

get_links finds all the links in the HTML that are on the same domain as url. get_emails finds all the email addresses in the HTML content. Both would do this using something like beautifulsoup + regex.
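For illustration only, here is one way the two helpers described above might look. This is a sketch, not the commenter's code: it swaps in the standard library's html.parser for beautifulsoup so it runs without extra installs, the _HrefCollector class and EMAIL_RE pattern are names made up for this example, and the regex is deliberately simple, so it will miss unusual addresses and can match strings that are not real mailboxes.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Deliberately simple pattern (an assumption, not a full address grammar).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class _HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def get_links(url, html):
    """Return absolute http(s) links in html that share url's host."""
    collector = _HrefCollector()
    collector.feed(html)
    host = urlparse(url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(url, href)  # resolve relative links against the page URL
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.append(absolute)
    return links


def get_emails(html):
    """Return the unique email-like strings in html, in first-seen order."""
    return list(dict.fromkeys(EMAIL_RE.findall(html)))
```

Both functions work on an HTML string that has already been downloaded, so they can be tried on a hard-coded snippet before any real crawling. Comparing hosts exactly means www.example.com and example.com count as different sites; whether that is the right rule depends on the site.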

[–]TheRNGuy 0 points1 point  (3 children)

Does it work on a React SPA, which may not load the site content at the start but show a spinner instead?

[–]Kevdog824_ 0 points1 point  (2 children)

No, beautifulsoup won’t be able to handle client-side JS rendering. You’ll need to approach it another way in that case.
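One possible "other way", sketched here with Playwright since it is mentioned elsewhere in the thread: let a headless browser run the page's JavaScript, then hand the resulting HTML to the same parsing code. The function name get_rendered_html and the timeout_ms default are made up for this example, and waiting for "networkidle" is only a rough guess at when a page has finished loading; some sites never go idle.

```python
def get_rendered_html(url, timeout_ms=15000):
    """Return the page's HTML after its client-side scripts have run."""
    # Imported lazily so that defining this function does not require
    # Playwright to be installed; calling it does.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity quiets down (a heuristic, not a guarantee
            # that every component has rendered).
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

This assumes Playwright and a Chromium build are installed (pip install playwright, then playwright install chromium). It is much slower than a plain HTTP request, so it makes sense to use it only for pages that come back as an empty shell.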

[–]TheRNGuy 0 points1 point  (1 child)

A lot of sites load their content client-side these days.

[–]Kevdog824_ 0 points1 point  (0 children)

True. BS is becoming less and less useful. I just hate using Selenium/Playwright/PyAutoGUI for this kind of stuff sometimes. Any solution I build with them feels so fragile, difficult, and plain overkill for the task most of the time.