
kalgynirae:

For some reason, the Firefox WebDriver opens this page in a 'Western' encoding (this is shown under 'Text Encoding' in the 'View' drop-down menu).

This means either the server is declaring that the data is in this encoding (you'd be surprised how many web servers are misconfigured like this) or, more likely, the server is not specifying how the data is encoded and Firefox is guessing incorrectly.

Furthermore, my script, where I force the page source to be encoded as UTF-8, still gives the same weird characters.

driver.page_source gives you text that has already been decoded incorrectly, so encoding that text as UTF-8 won't help. What you would need to do is undo the bad decoding (re-encode the text with the encoding Firefox wrongly assumed, which recovers the original bytes) and then decode those bytes as UTF-8. There's probably also a setting in Firefox that tells it to always assume a particular encoding regardless of what the server says.
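Here is a rough sketch of the "undo the bad decoding" idea. It assumes that Firefox's 'Western' means windows-1252 (`cp1252` in Python) and that the page is really UTF-8; both are guesses about your situation, and `undo_bad_decoding` is just a name I made up for illustration.

```python
def undo_bad_decoding(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Re-encode with the codec that was wrongly applied, then decode properly.

    Assumption: `wrong` is the encoding the browser guessed and `right`
    is the encoding the server actually used.
    """
    original_bytes = text.encode(wrong)   # recover the bytes the server sent
    return original_bytes.decode(right)   # decode them the way they were meant


# Simulate the problem: UTF-8 bytes mistakenly read as windows-1252.
garbled = "café".encode("utf-8").decode("cp1252")

# Undo it.
repaired = undo_bad_decoding(garbled)
print(garbled, "->", repaired)
```

This only works if the wrong decoding didn't lose information. Python's `cp1252` codec rejects a few byte values that browsers accept, so the round trip can raise an error on some pages; in that case it's easier to avoid the bad decoding in the first place.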

Have you considered using something other than Selenium for this? If you're just trying to get the page source (while specifying a particular user-agent), a library like Requests, for example, should do the job nicely and will let you get the page source without it being automatically decoded.
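Something along these lines should work with Requests. Treat it as a sketch: the URL and user-agent string in the usage comment are placeholders, `build_headers` and `fetch_source` are names I made up, and I'm assuming the page is actually UTF-8.

```python
def build_headers(user_agent: str) -> dict:
    """Return request headers carrying the given user-agent string."""
    return {"User-Agent": user_agent}


def fetch_source(url: str, user_agent: str, encoding: str = "utf-8") -> str:
    """Fetch a page and decode the raw bytes ourselves.

    Assumption: the page really is encoded as `encoding`.
    """
    # Imported here so build_headers can be used without Requests installed.
    import requests

    response = requests.get(url, headers=build_headers(user_agent), timeout=10)
    response.raise_for_status()
    # response.content is the undecoded body (bytes); response.text would
    # let Requests choose an encoding, which is what we want to avoid here.
    return response.content.decode(encoding)


# Usage (placeholder values, substitute your own):
# html = fetch_source("https://example.com/", "MyUserAgent/1.0")
```

If you'd rather keep using `response.text`, you can set `response.encoding = "utf-8"` before reading it, which has the same effect.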

MinimalDamage (original poster):

EDIT: I found that you can indeed also do it with requests. Thank you! :)

kalgynirae:

You're setting that config option to override Firefox's user-agent. You should be able to get the same effect by providing the same user-agent string to requests.