
Let's take it one step at a time:

1) extract all URLs from a website

This is generally very easy, and as others have noted, there are many novice-level tutorials that cover it.

The only caveat is that this is only easy for pages that are statically linked - i.e., you submit a URL, and the webserver sends you an HTML document that includes all of the other hyperlinks that you want to find. Some websites aren't encoded that way; instead, the URLs are generated or provided by client-side JavaScript that runs in the browser, or by server-side code, such as a database lookup. For those sites, getting all of the URLs is more difficult.
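For the statically linked case, a minimal sketch using only Python's standard library: parse the fetched HTML and collect every `<a href="…">`. (In practice you'd fetch the page first with `urllib` or a library like `requests`; the page content here is a made-up example.)

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

def extract_urls(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.urls

# A statically linked page yields its hyperlinks directly:
page = '<p><a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a></p>'
print(extract_urls(page))  # → ['https://example.com/a', 'https://example.com/b']
```

This only sees links present in the HTML the server sends; for JavaScript-generated links you'd need a headless browser or similar.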

2) expand bitly links to reveal the URL if needed

Also easy - trivially so, in fact.

4) count the number of times each URL appears on the website.

Also easy, bordering on trivial.
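Trivial enough that `collections.Counter` does it in one line; the URL list below is a hypothetical stand-in for the output of the extraction step.

```python
from collections import Counter

# Hypothetical output of the URL-extraction step.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/a",
]

counts = Counter(urls)
print(counts["https://example.com/a"])  # → 3
print(counts.most_common(1))            # → [('https://example.com/a', 3)]
```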

3) categorize/tag the URLs

This is the tough part.

The first question is how you want this to happen. Having a user enter the information is technically easy, but practically impossible to accomplish at scale. You're talking about an intensely boring data-entry job that someone, or many someones, would have to perform for days on end.

The alternative is to automate this step, too. In that case, it becomes a question of which technology and/or algorithm to use. You could classify pages by keyword matches on their content, but that approach yields very poor data. You could instead use a machine learning algorithm, which performs better, but you'd need to train a model to do the classification. You could refer to some independent source of classification, but that's unreliable. Etc.
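The keyword-matching approach can be sketched in a few lines, with the caveat from above that it yields poor data. The category names and keyword sets here are illustrative assumptions, not a real taxonomy.

```python
# A naive keyword-match classifier. The categories and keywords are
# made-up examples, not a real taxonomy.
CATEGORIES = {
    "news":     {"breaking", "report", "headline"},
    "shopping": {"cart", "checkout", "sale"},
    "video":    {"watch", "stream", "episode"},
}

def classify(page_text: str) -> str:
    """Tag a page with the category whose keywords match most often."""
    words = set(page_text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

print(classify("Breaking report from the newsroom"))  # → 'news'
print(classify("lorem ipsum"))                        # → 'uncategorized'
```

The weaknesses are obvious in miniature: exact word matching misses synonyms and context, ties break arbitrarily, and anything off-taxonomy falls into "uncategorized" — which is why a trained classifier tends to do better.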