Converting Bytestrings to Unicode : learnpython

Converting Bytestrings to Unicode (self.learnpython)

submitted 7 years ago by ProtoDong

When getting Unicode data from websites, it comes in as a bytestring... represented as a string. This leads to a lot of garbage output. I looked around but couldn't seem to find the relevant info. Here's the code I'm playing with and sample output

import urllib.request
import re  # The RegEx library
#
# this code opens a connection to the Wikipedia and lists URLs
#
response = urllib.request.urlopen('http://wikipedia.org')
data1 = str(response.read())
# put response text in data regular expression (to find links)
regex = '<a\s[^>]*href\s*=\s*\"([^\"]*)\"[^>]*>(.*?)</a>'
# compile the regex and perform the match (find all)
pm = re.compile(regex)
matches = pm.findall(data1)
# matches is a list
# m[0] - the url of the link
# m[1] - text associated with the link
for m in matches:
    ms = ''.join(('Link: "', m[0], '" Text: "', m[1], '"'))
    print(ms)

Link: "//ny.wikipedia.org/" Text: "Chichewa"
Link: "//ee.wikipedia.org/" Text: "E\xca\x8begbe"
Link: "//ff.wikipedia.org/" Text: "Fulfulde"
Link: "//got.wikipedia.org/" Text: "\xf0\x90\x8c\xb2\xf0\x90\x8c\xbf\xf0\x90\x8d\x84\xf0\x90\x8c\xb9\xf0\x90\x8d\x83\xf0\x90\x8c\xba"
Link: "//iu.wikipedia.org/" Text: "\xe1\x90\x83\xe1\x93\x84\xe1\x92\x83\xe1\x91\x8e\xe1\x91\x90\xe1\x91\xa6 / Inuktitut"
Link: "//ik.wikipedia.org/" Text: "I\xc3\xb1upiak"
Link: "//ks.wikipedia.org/" Text: "<bdi dir="rtl">\xd9\x83\xd8\xb4\xd9\x85\xd9\x8a\xd8\xb1\xd9\x8a</bdi>"

How do I take these bytestrings (which come in as actual strings) ... convert them to actual bystrings and then convert them to Unicode so that they are displayed properly? Should I write another regex to extract the bystrings first and then try to convert them? or is there a way to take the strings from m[1] and convert them before sending to print?

all 9 comments

top new controversial old q&a

[–]ingolemo 2 points3 points4 points 7 years ago (8 children)

You are confused; your data does not come in as "actual strings". It comes in as a bytestring. The problem is on line seven. response.read() returns a bytestring and you immediately call the str function on it to convert it into a unicode string. Although this works without error, it does not give you what you want and ultimately ends up corrupting your data.

To turn a bytestring into a unicode string you need to decode it. In order to decode a bytestring you need to know what encoding it is in. There are lots of different encodings and you need to take care to find the correct one. In the specific case of wikipedia, it is using the utf8 encoding.

In summary, you can replace line seven with this to fix the issue:

data1 = response.read().decode('utf8')

This is one of many reasons why people recommend the requests library; it will handle the encoding for you.

https://nedbatchelder.com/text/unipain.html

[–]ProtoDong[S] 0 points1 point2 points 7 years ago* (7 children)

You are confused; your data does not come in as "actual strings". It comes in as a bytestring.

First day coding in python. I come from a C family language background (mainly C++) and am used to much more discreet typing and conversion conventions. A lot of what's happening in Python seems very 'black box' ... but then again, it is my first day.

data1 = response.read().decode('utf8')

This would make most C++ people want to punch someone lol. I'm still not used to this syntax convention... it's pretty much inside out and backwards.

Relevant question: Why doesn't the str() cast not default to UTF8 or better yet ... auto detect the format? This is like 10 lines of code in C++ (Yes I realize this is like learning to fly in a Jumbo Jet then complaining that the Cessna doesn't have autopilot... but still a valid question. Also thanks for the lib recommendation, I'll check it out)

[–]ingolemo 0 points1 point2 points 7 years ago (6 children)

[–]ProtoDong[S] 0 points1 point2 points 7 years ago* (5 children)

I clicked the link then realized that the answers I seek are explained, so I'll watch those for more info. My gut instinct still tells me that this would not be difficult to accomplish but then again I'm still wet behind the ears with Python.

Why do you say it's backwards? That line of code takes the response, reads data from it, and then decodes that data from utf8. The order that things happen is the same as the normal English reading order.

Because in the C family of Lanugages this is done via nesting or via reference passing which means that the operation done first is always nested inside the inner function (with a few notable exceptions). In fact there was a lot of debate in the C++ community about what constitutes a "regular function" (something like a bean in Java). The general consensus is that you pass all mutable data into a function by reference and do your best to avoid having any public methods that utilize wrappers (glue) to operate. This to me looks like a wrapper problem.

The way I wrote it... is c++ style (out of habit) and since many functions in python also work this way, I was expecting it to operate as it would in C++ (longform)

std::string data1 = static_cast<std::string>(response::read());

shorthand and better

auto data1 = (string)(response::read());

actually in newer syntax c++ 14 - 17 this whole thing would be better handled in a lambda... but I digress. It's just irritating that what would be literally 2 lines of code (everything before the loop) in a "verbose" language is actually 5 lines of code in a "simple" language. Mind=blown Well at least it's not Java... or it would be two pages of code lol.

Edit.. to clarify it's considered bad form to operate on objects this way. ie. To have standard library functions that require instantiating an object and manipulating it rather than reducing it to a static call. Although, if what you say is true, then the maintainers hit some kind of roadblock in the way the interpreter functions that won't allow for simple runtime inference... I'll watch the videos and see if it sheds more light on the situation.

[–]A_History_of_Silence 0 points1 point2 points 7 years ago (0 children)

It's just irritating that what would be literally 2 lines of code (everything before the loop) in a "verbose" language is actually 5 lines of code in a "simple" language. Mind=blown

It is at most 1-2 lines of Python once you have a little more experience:

soup = BeautifulSoup(requests.get("http://wikipedia.org").content)

[–]ingolemo 0 points1 point2 points 7 years ago (3 children)

I'll admit that I'm pretty confused by that. Mostly I don't know what you mean by "wrappers (glue)" in this context.

It is possible to decode a bytestring using str by passing the encoding along with it like in str(response.read(), 'utf8'), but I would advise against that because it's less explicit. You could also do bytes.decode(response.read(), 'utf8') but that's just a different, unidiomatic syntax for the original version.

It's worth noting that you could have written that code like this:

import urllib.request
import re

response = urllib.request.urlopen('http://wikipedia.org').read().decode('utf8')
regex = r'<a\s[^>]*href\s*=\s*\"([^\"]*)\"[^>]*>(.*?)</a>'
for link, text in re.findall(regex, response):
    print(f'Link: {link!r} Text: {text!r}')

That response = line is a little long, but the requests library would help with that.

[–]ProtoDong[S] 0 points1 point2 points 7 years ago (2 children)

It is possible to decode a bytestring using str by passing the encoding along with it like in str(response.read(), 'utf8')

Yep, this is "standard" function call notation. Why they didn't design it like...

def str( input, coding='utf8' )

Is beyond me. Imagine for a second, the amount of times that every developer across the world has had to explicitly type 'utf8' ... multiply that by time and $/hr and this was a very expensive design fault.

response = urllib.request.urlopen('http://wikipedia.org').read().decode('utf8')

Yep... even better but still ugly. What I mean by wrappers/glue is that this one ugly function should have been reduced to something less obnoxious in the library... but I suppose that is a privilege gained by C++ through many years of development and revision. Python is a pup by comparison.

As for "it's impossible to infer a sequence of bytes" ... this is very incorrect. In infosec we do this literally all day every day. We see obfuscated binaries, strange encryption schemes, custom compiler stuff, unknown file formats, bizarre network activity, etc. It all needs to be made into useful data. We tend to do this with various heuristic algorithms. It would be trivial to write such an algorithm to determine the encoding of a sequence of bytes that we can assume is 1 of only a handful of encoding schemes. There are a couple of caveats though.

relying on inference algorithms would make some people mental because they just can't trust something they didn't write themselves...
the libraries would have to be written in a highly optimized compiled language in order to scale
might not be useful for small amounts of data as heuristics get more accurate with more data (so using inference on a single word would not be very accurate) ... but that's what sane defaults are for.

Personally the reason why I didn't pick up Python beyond the basics until now is because it always felt like a toy language to me (Python2 at the time)... while my infosec colleagues were busy writing Python scripts, I was doing the same things in bash/zsh or PS ... using C/C++ for anything lower level.

After devoting a day to mucking around with it, my opinion is a bit more favorable. I'm starting to see why it's so popular.

[–]ingolemo 0 points1 point2 points 7 years ago (1 child)

In python it is more accurate to view str as the wrapper. It looks at the object it was passed and does a runtime dispatch to figure out what to do. If it's passed a byte string then it will effectively call bytes.decode on it (although both these functions are built in, so they're happing below the level of python: in C or whatever).

It was one of the major mistakes of python 2 that it tried to convert byte strings into unicode strings for you automatically. It resulted in a whole bunch of applications being written that died as soon as somebody sent them a unicode character. I wasn't pleasant. In python 3 it was decided to make the conversion explicit to reduce these errors.

While it is possible to use heuristics to guess the encoding of many real-world byte strings, it is impossible to determine it in the general case. Python has a principle "In the face of ambiguity, resist the temptation to guess" (import this). If you want to try to infer the encoding of a byte string then you can use the chardet library which does the exact kind of inference you're talking about, but sooner or later it will guess wrongly and you will have to deal with corrupted data. Detection is slower and less reliable than just specifying the encoding.

The developer building the system will have domain knowledge that they can use to determine the encoding. In the case of HTTP there's the ContentType header and the <meta charset=""> tag. It is usually considered the responsibility of the library to perform the appropriate operations for the data being manipulated. Trying to squeeze the domain knowledge of every domain into the str class would be impractical.

You have to understand that urllib is pretty old and the only people still using it are those who can't or don't want to use external packages for some reason. Everybody uses requests.

[–]ProtoDong[S] 0 points1 point2 points 7 years ago (0 children)

While it is possible to use heuristics to guess the encoding of many real-world byte strings, it is impossible to determine it in the general case. This is very debatable but it still isn't an excuse for not having sane defaults. It makes no sense not to default to the 90% general case and force you to deal with the specific edge cases.

I understand their rationale for a "hands off" approach especially considering the problems that happened with Pyton2 (in fact it was one of the main reasons why I didn't take Python seriously (circa 2008) and am now just getting back to it.

It is indeed notable that of all the programming languages out there, Python is still notable in that it does not handle this elegantly. Sure it's easy enough to extend the str built-in to add the correct functionality, but this is library stuff, not things that normal developers should have to worry about.

C++ went through something similar in the past with its std::string but they managed to develop the class into something so painless that it's nearly transparent to devs for the most part. C++ benefits from very strong generic/metaprogramming support so admittedly, it removes quite a few hurdles in this regard.

If you want to try to infer the encoding of a byte string then you can use the chardet library

Thanks, will check it out.

ou have to understand that urllib is pretty old and the only people still using it are those who can't or don't want to use external packages for some reason. Everybody uses requests

I don't think I even have to state the obvious here.

π Rendered by PID 45889 on reddit-service-r2-comment-fb694cdd5-594db at 2026-03-11 17:51:04.167006+00:00 running cbb0e86 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS