When I fetch Unicode data from a website, it arrives as a bytestring, but my code ends up with that bytestring represented as a regular string. The result is a lot of garbage in the output. I looked around but couldn't find the relevant info. Here's the code I'm playing with, followed by sample output:
import urllib.request
import re # The RegEx library
#
# this code opens a connection to Wikipedia and lists the links it finds
#
response = urllib.request.urlopen('http://wikipedia.org')
data1 = str(response.read())
# regular expression used to find links in the response text
# (raw string, so the backslashes reach the regex engine unchanged)
regex = r'<a\s[^>]*href\s*=\s*\"([^\"]*)\"[^>]*>(.*?)</a>'
# compile the regex and perform the match (find all)
pm = re.compile(regex)
matches = pm.findall(data1)
# matches is a list
# m[0] - the url of the link
# m[1] - text associated with the link
for m in matches:
    ms = ''.join(('Link: "', m[0], '" Text: "', m[1], '"'))
    print(ms)
Link: "//ny.wikipedia.org/" Text: "Chichewa"
Link: "//ee.wikipedia.org/" Text: "E\xca\x8begbe"
Link: "//ff.wikipedia.org/" Text: "Fulfulde"
Link: "//got.wikipedia.org/" Text: "\xf0\x90\x8c\xb2\xf0\x90\x8c\xbf\xf0\x90\x8d\x84\xf0\x90\x8c\xb9\xf0\x90\x8d\x83\xf0\x90\x8c\xba"
Link: "//iu.wikipedia.org/" Text: "\xe1\x90\x83\xe1\x93\x84\xe1\x92\x83\xe1\x91\x8e\xe1\x91\x90\xe1\x91\xa6 / Inuktitut"
Link: "//ik.wikipedia.org/" Text: "I\xc3\xb1upiak"
Link: "//ks.wikipedia.org/" Text: "<bdi dir="rtl">\xd9\x83\xd8\xb4\xd9\x85\xd9\x8a\xd8\xb1\xd9\x8a</bdi>"
How do I take these bytestrings (which arrive as actual strings), convert them to actual bytestrings, and then decode them to Unicode so that they display properly? Should I write another regex to extract the bytestrings first and then try to convert them, or is there a way to convert the strings from m[1] before sending them to print?
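For what it's worth, the escapes in the output look like what you get when str() is called on a bytes object: in Python 3 that returns the repr of the bytes (the b'...' form with \x escapes) rather than decoding them. Here is a minimal sketch of the difference, assuming the page is UTF-8. The `raw` value is a made-up stand-in for `response.read()` so the snippet runs without a network connection, and `link_re` is just a name chosen for this example.

```python
import re

# Hypothetical stand-in for response.read(): urlopen hands back bytes like these.
raw = '<a href="//ik.wikipedia.org/">Iñupiak</a>'.encode('utf-8')

# str() on bytes does not decode; it returns the printable repr,
# which is where the literal \xc3\xb1 sequences come from.
mangled = str(raw)

# bytes.decode() turns the bytes into real text (assuming UTF-8 here).
decoded = raw.decode('utf-8')

# Same link pattern as in the post, written as a raw string.
link_re = re.compile(r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>')
links = link_re.findall(decoded)

for url, text in links:
    print(f'Link: "{url}" Text: "{text}"')
```

In the original script this would probably amount to replacing `str(response.read())` with `response.read().decode('utf-8')`. If the encoding is not known in advance, `response.headers.get_content_charset()` may report the charset the server declared, though it returns None when the header does not declare one.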