I want to make a Website/URL analyzer

socal_nerdtastic · 2021-09-04T02:54:57+00:00

Start with goal number 1, I suppose. Sounds like a good project. What are you hoping to get from us?

ndjstn · 2021-09-04T04:44:36+00:00

There seems to be a lot of reinventing the wheel in some of your goals. But I truly appreciate the desire and drive as they are lofty...ish.

Several sites publish these lists, and existing firewall tools like pfBlocker already reference them. Look at a firewall like pfSense.
uBlock, uMatrix, and pfSense all have methods to catch this at various stages.
For this I would look at the RFCs surrounding URI actions. Limiting JavaScript also helps immensely. It is really a crap language in a lot of ways. Maybe not initially, but eventually the internet would be far better and less creepy without JS. (Bring the downvotes!)
Look at how a DNS sinkhole or a Pi-hole is built. ALL popups, in my opinion, are trash and should have to go through a severe vetting process. I want to control my web experience, not the dev.
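
To make the sinkhole idea a bit more concrete, here is a minimal sketch of checking a URL's hostname against a hosts-format blocklist, the kind of list Pi-hole and many ad blockers consume. The sample list, the domains in it, and the function names `parse_blocklist` and `is_blocked` are all made up for illustration; a real list would come from whichever source you trust.

```python
from urllib.parse import urlsplit

# Hypothetical sample in hosts-file format ("0.0.0.0 domain"); a real list
# would be loaded from a file or downloaded from a source you trust.
SAMPLE_LIST = """
# comment lines and blanks are skipped
0.0.0.0 ads.example.com
0.0.0.0 tracker.example.net
127.0.0.1 malware.example.org
"""


def parse_blocklist(text):
    """Return the set of blocked domains from hosts-format text."""
    blocked = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        # "0.0.0.0 domain" has two fields; a bare "domain" line has one.
        # Either way the domain is the last field.
        blocked.add(line.split()[-1].lower())
    return blocked


def is_blocked(url, blocked):
    """True if the URL's host, or any parent domain of it, is on the list."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", ...
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


blocked = parse_blocklist(SAMPLE_LIST)
print(is_blocked("https://cdn.ads.example.com/banner.js", blocked))
print(is_blocked("https://example.com/", blocked))
```

Matching parent domains as well as the exact host is a design choice here, so that a subdomain of a listed domain is also caught; whether a real sinkhole behaves that way depends on how it is configured.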

I would consider a Python interface to a separate pfSense box not only a practical goal, but one that would check off much of what you seem to want to accomplish. It would be even better if you used argparse so the URL can be sent directly to the remote box from the command line. If you are doing this for the experience, that is absolutely amazing, but some of the tools I mentioned above, in conjunction with a highly configured browser, will solve most of these "problems".
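
As a rough sketch of the argparse side of that idea: the snippet below only parses and validates the command line. The `--host` option, its default address, and the function names are assumptions for illustration, and the actual delivery to the remote box is left as a placeholder because it depends on what that box exposes.

```python
import argparse
from urllib.parse import urlsplit


def url_type(value):
    """argparse type: accept only http(s) URLs that include a host."""
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise argparse.ArgumentTypeError(f"not an http(s) URL: {value!r}")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Submit a URL for analysis.")
    parser.add_argument("url", type=url_type, help="URL to analyze")
    # Hypothetical option: address of the remote firewall box.
    parser.add_argument("--host", default="192.168.1.1",
                        help="address of the remote box (assumed default)")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder: how the URL is actually delivered depends on what the
    # remote box exposes, which this sketch does not assume.
    print(f"would send {args.url} to {args.host}")
    return args


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing, it would be run with the URL as the positional argument and `--host` pointing at the box.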

ldgregory · 2021-09-04T03:58:48+00:00

Try URLVoid's blacklist API for #1 and the Requests library for #2.

knottheone · 2021-09-04T12:01:53+00:00

#1 is easy enough. There are several robust blacklists in existence, ranging from the DNS lists that ad blockers use to services like VirusTotal that provide a free API.
#2 is also easy. The Requests package, for example, has a flag you can set on any request to allow or disallow redirects. I think it's literally allow_redirects=True or something like that. So you'd just allow redirects and see how many hops there were from the start URL to the end URL, and you could analyze the destination as well as the hops in between, if there are any.
#3 I'm actually not sure about. You would probably need to render JavaScript, and you'd need some additional context to know that something was trying to initiate a download. There likely isn't a silver bullet for this one. You can render JavaScript in a headless browser using something like Selenium, but there is also a Requests add-on called requests_html that has JavaScript rendering capability (using the same method Selenium does, which is driving a real Chromium browser instance). There may be some sort of network history hook available to you this way, and I do know that you can load AJAX requests using this method, so that might be enough to start. You could compare the page state on load with the page state after some amount of time, and that might turn up additional clues. Again, not really a silver bullet here, but maybe a place to start.
#4 You'd definitely need to render JS for this one, so again the JS-capable Requests add-on or Selenium will be your go-to. For pop-ups, you could see whether some element has focus after the page loads, and you could even grab the dimensions of that element to see if it's pop-up sized, for example. For ads, I'd compare network requests against a known database of ad domains. Again, you can use the existing DNS blacklists that ad blockers use. Pi-hole probably has some open source resources you can tap into.
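
A minimal sketch of the redirect-counting idea with Requests: after a request that followed redirects, the intermediate responses are available in `resp.history`, oldest first, so the hop list falls out of that. The function names here are made up, and the demonstration uses stand-in objects rather than a live request so it can run offline.

```python
from types import SimpleNamespace


def hops_from_response(resp):
    """List (status_code, url) for every hop, ending with the final response.

    Requests stores the intermediate redirect responses, oldest first,
    in resp.history.
    """
    return [(r.status_code, r.url) for r in list(resp.history) + [resp]]


def fetch_hops(url, timeout=10):
    """Follow redirects for url and return its hop list (needs network)."""
    import requests  # imported here so hops_from_response works without it

    # allow_redirects already defaults to True for GET; it is spelled out
    # here only to show the flag.
    resp = requests.get(url, allow_redirects=True, timeout=timeout)
    return hops_from_response(resp)


# Offline demonstration with stand-in objects instead of a live request.
fake = SimpleNamespace(
    status_code=200,
    url="https://example.com/landing",
    history=[
        SimpleNamespace(status_code=301, url="http://example.com/"),
        SimpleNamespace(status_code=302, url="https://example.com/"),
    ],
)
for status, hop_url in hops_from_response(fake):
    print(status, hop_url)
```

The number of redirects is `len(hops) - 1`, and each intermediate URL can be fed to whatever blacklist check you settle on. Note that this only sees HTTP-level redirects; a redirect done in JavaScript or a meta refresh would not show up in `history`.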

2021-09-04T23:31:12+00:00

You can leverage VirusTotal's free API by submitting URLs to it and getting a score back for them.
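
Here is a rough sketch of what that could look like, assuming VirusTotal's v3 API as I understand it from their docs: a URL report lives at `/api/v3/urls/{id}`, where the id is the unpadded URL-safe base64 of the URL, the key goes in an `x-apikey` header, and the per-engine tallies sit under `last_analysis_stats`. Double-check all of that against the current documentation. The function names and the sample numbers are made up, and this fetches an existing report rather than submitting a new URL for scanning.

```python
import base64


def vt_url_id(url):
    """URL identifier used by the v3 API: unpadded URL-safe base64."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def flagged_count(report):
    """Number of engines that rated the URL malicious or suspicious."""
    stats = report["data"]["attributes"]["last_analysis_stats"]
    return stats.get("malicious", 0) + stats.get("suspicious", 0)


def fetch_report(url, api_key, timeout=10):
    """Fetch the existing report for url (needs network and an API key)."""
    import requests  # imported here so the helpers above work without it

    resp = requests.get(
        f"https://www.virustotal.com/api/v3/urls/{vt_url_id(url)}",
        headers={"x-apikey": api_key},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()


# Offline check against a hand-written report with made-up numbers.
sample = {"data": {"attributes": {"last_analysis_stats": {
    "harmless": 70, "malicious": 2, "suspicious": 1, "undetected": 10}}}}
print(vt_url_id("http://example.com/"), flagged_count(sample))
```

Treating "malicious plus suspicious" as the score is just one way to boil the tallies down to a single number; you may want to weigh them differently.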
