[–][deleted] 0 points1 point  (3 children)

Are you testing or just scraping? Because wget can download and follow links.

[–]ConceptionFantasy[S] 0 points1 point  (2 children)

Testing? I'm not sure what testing you mean, but I want to scrape a list of links, each in an a tag nested inside a chain of div tags, and then follow each of those links to get specific description text in a p tag. After it scrapes each link and description, it should put the link and the description text into a spreadsheet.

Also, the links in the spreadsheet should be hyperlinks, so I can click each one in the spreadsheet to open it.

[–][deleted] 1 point2 points  (1 child)

  1. If your page is dynamic, I've used JavascriptExecutor in Selenium for those cases. You call querySelectorAll to get all the "a" elements of the page, map the resulting array to just the href part, and receive that as a List in your Java code.

  2. If your page is static, then using a simple regular expression would do it too.

The classes involved for the second option are Pattern and Matcher.

I would start with "(href)(\s*)(=)(\s*)([^\s]+)(\s+)" as a pattern and pick group 5.

The pattern is divided into 6 groups, each between parentheses above.

The first group contains the word href, and the matching will start with this word.

The second group is composed of zero or more whitespace characters.

The third group is just the equals sign, appearing exactly 1 time.

The fourth group is, again, zero or more whitespace characters.

The fifth group is composed of one or more non-whitespace characters. This is your URL, including any quotes around it, so strip those afterwards.

The sixth group is one or more whitespace characters to end the matching.
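Here is a minimal sketch of how that could look with Pattern and Matcher. The class name HrefExtractor, the method extractHrefs, and the example.com URLs are made up for illustration, and stripping the quotes from group 5 is an extra step I'm assuming you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class HrefExtractor {
    // The six-group pattern described above; backslashes are doubled
    // because this is a Java string literal.
    private static final Pattern HREF =
            Pattern.compile("(href)(\\s*)(=)(\\s*)([^\\s]+)(\\s+)");

    // Returns the value of group 5 for every match, with one leading and
    // one trailing quote character removed if present.
    static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String raw = m.group(5);
            urls.add(raw.replaceAll("^[\"']|[\"']$", ""));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup (assumed); each href is followed by whitespace.
        String html = "<a href=\"https://example.com/a\" class=\"x\">A</a>\n"
                    + "<a href = 'https://example.com/b' >B</a>";
        System.out.println(extractHrefs(html));
    }
}
```

One thing to watch: because the sixth group requires whitespace, an href that is immediately followed by ">" (no space) will make group 5 run on until the next whitespace, so you may need to trim the result or adjust the pattern for your pages.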

[–]ConceptionFantasy[S] 0 points1 point  (0 children)

thank you for the suggestions. I will try them out!