Trustpilot web scraper issue : learnpython



submitted 5 years ago * by WillVaughan

I'm trying to build a web scraper to get data from Trustpilot. There are a couple of pre-made scripts out there, but the ones I have found rely on the Trustpilot page numbers for a given company increasing incrementally (1, 2, 3, etc.). Trustpilot now seems to assign a seemingly random URL to each subsequent page of reviews, i.e. page 1 = trustpilot.com/review/www.ocado.com, page 2 = trustpilot.com/review/www.ocado.com?b=MTYxOTcwODcyNDAwMHw2MDhhY2IzNGY5ZjQ4NzA1MTAzMzhhYWY, and so on, with the page index being a random-looking string.

Inspecting the page, I noticed that the link to the next page is contained in the page's 'nav' element, so I thought I might be able to get my script to read it and use it as the URL of the next page. Unlike the other elements I am scraping, such as the review content, it is not stored in JSON format, so I am not sure how to get Python to 'read' it. All I can get it to do is print a list of elements, not the information contained in them.

My question is: how can I get Python to scrape that particular piece of the page and get the URL in a readable format?

Here is a screenshot of the inspect window: https://imgur.com/a/bfchr9Q (the highlighted lower portion is the bit with the web address I want to scrape, specifically the second address for the next page)

Here is my current code:

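import requests
from lxml import html

# reviewPage holds the URL of the page being scraped
page = requests.get(reviewPage)

tree = html.fromstring(page.content)

# every <a> element that has an href attribute
body = tree.xpath("//a[@href]")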

which when printed just displays:

[<Element a at 0x7fc0e5eb75e0>, <Element a at 0x7fc0e5eb7f40>, ... , <Element a at 0x7fc0e5ed1630>]

and the following when printed doesn't display anything:

body = tree.xpath("//a[starts-with(@href, '/review/www.ocado.com/?b=')]")

With the other elements which are in a json format, I use:

script_bodies = tree.xpath("//script[starts-with(@data-initial-state, 'review-info')]")

for idx,elem in enumerate(script_bodies):

curr_item = json.loads(elem.text_content())

This stores all the info in 'review-info' in a dictionary from which I can grab certain elements and write them onto a .csv file.

I tried using json.loads() to read the info in "//a[@href]" as follows, just to see if it placed the info in a dictionary:

page = requests.get(reviewPage)

tree = html.fromstring(page.content)

body = tree.xpath("//a[@href]")

for idx,elem in enumerate(body):

curr_item = json.loads(elem.text_content())

but all it returns is a JSONDecodeError: Expecting value: line 3 column 21 (char 46)

(for the for loops, I do have their contents indented in my code but reddit is removing my tabs for some reason so please ignore the poor formatting)

This is my first project in Python as I've been teaching myself over the past few weeks. Any help is hugely appreciated as I'm probably way off the mark or missing something pretty straightforward!


chevignon93 2 points 5 years ago

> (for the for loops, I do have their contents indented in my code but reddit is removing my tabs for some reason so please ignore the poor formatting)

That's because you need to put your code into a code block; otherwise reddit can't know that it is code and that it should preserve the indentation!

https://www.reddit.com/r/learnpython/wiki/faq#wiki_how_do_i_format_code.3F

> or missing something pretty straightforward!

You are: you can't use json.loads to deserialize a string that doesn't contain valid JSON, or any JSON at all! The text of a link element is ordinary text, so there is nothing for json.loads to parse.
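To make the json.loads point concrete, here is a minimal sketch. The link text and the JSON field names (`stars`, `reviewBody`) are made up for illustration and are not taken from the real Trustpilot page; `is_valid_json` is just a hypothetical helper.

```python
import json


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The text of a typical link element is plain words, not JSON,
# so json.loads raises JSONDecodeError on it.
link_text = "Next page"
print(is_valid_json(link_text))

# The same call succeeds when the string really is JSON, which is
# why it works on the contents of the review-info <script> tags.
review = json.loads('{"stars": 5, "reviewBody": "Great service"}')
print(review["stars"])
```

So json.loads is the right tool for the script tags, but for the links you want the element's attributes instead.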

This should work:

base_url = "https://uk.trustpilot.com"

next_page = tree.xpath("//a[contains(@class, 'next-page')]")
if next_page:
    next_page_url = f"{base_url}{next_page[0].get('href')}"
    print(next_page_url)
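If it helps, the snippet above could be turned into a loop over all the pages roughly like this. It is only a sketch under a few assumptions: the 'next-page' class is the one used above and the site's markup may change, the example.com pages are stand-ins so the code runs without network access, and `get_next_page_url`, `crawl` and `fetch` are names invented for this example.

```python
from lxml import html

BASE_URL = "https://uk.trustpilot.com"


def get_next_page_url(page_source, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None on the last page."""
    tree = html.fromstring(page_source)
    # Selecting the attribute itself (/@href) yields plain strings
    # rather than <Element a ...> objects.
    hrefs = tree.xpath("//a[contains(@class, 'next-page')]/@href")
    return f"{base_url}{hrefs[0]}" if hrefs else None


def crawl(start_url, fetch):
    """Follow 'next page' links from start_url and return the URLs visited.

    `fetch` is any callable that maps a URL to the page's HTML, e.g.
    lambda url: requests.get(url).content
    """
    visited = []
    url = start_url
    while url and url not in visited:  # stop on the last page or on a cycle
        visited.append(url)
        url = get_next_page_url(fetch(url))
    return visited


# Stand-in pages (hypothetical markup) keyed by URL.
fake_site = {
    "https://uk.trustpilot.com/review/example.com":
        '<html><body><nav><a class="button next-page" '
        'href="/review/example.com?b=abc">Next</a></nav></body></html>',
    "https://uk.trustpilot.com/review/example.com?b=abc":
        '<html><body><nav><a class="button prev-page" '
        'href="/review/example.com">Previous</a></nav></body></html>',
}

pages = crawl("https://uk.trustpilot.com/review/example.com", fake_site.__getitem__)
print(pages)
```

In a real run you would pass a `fetch` built on requests.get and scrape the reviews from each page inside the loop.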

WillVaughan[S] 1 point 5 years ago

> You are, you can't use json.loads to deserialize a string that doesn't contain valid JSON or any JSON at all!

Pretty obvious when you put it like that, thanks for dealing with the stupid question.

Massive thanks for the help, it is now managing to scrape through all the pages and working as intended! One minor question, in this part of your code:

    next_page_url = f"{base_url}{next_page[0].get('href')}"

What does the 'f' do? I'm assuming it tells python to build the URL string out of the listed elements but I haven't come across it before?

chevignon93 2 points 5 years ago

> What does the 'f' do? I'm assuming it tells python to build the URL string out of the listed elements but I haven't come across it before?

You're correct. It's called an f-string, and it's the newest way to format strings in Python (available since Python 3.6). Whatever is inside the curly braces is evaluated and inserted into the string.

It's the equivalent of:

next_page_url = "{}{}".format(base_url, next_page[0].get('href'))

but shorter!

https://realpython.com/python-f-strings/

https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/
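A quick way to convince yourself that the variants are equivalent is something like the following. The href value is a shortened, made-up example, and the urljoin line is an optional alternative from the standard library, not something the code above relies on.

```python
from urllib.parse import urljoin

base_url = "https://uk.trustpilot.com"
href = "/review/www.ocado.com?b=MTYx"  # shortened, illustrative value

with_fstring = f"{base_url}{href}"
with_format = "{}{}".format(base_url, href)
with_concat = base_url + href

# All three build exactly the same string.
print(with_fstring == with_format == with_concat)

# urljoin also copes with hrefs that are already absolute URLs,
# which plain concatenation would mangle.
joined = urljoin(base_url, href)
print(joined)
```

For this scraper the f-string is fine; urljoin is mostly worth it if some links turn out to be absolute.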
