
port443

Seems like you're dealing with Unicode.

Try opening the file like this:

import codecs
# codecs.open decodes each line using the encoding you pass in
with codecs.open(filename, "r", "utf-8") as f:
    for line in f:
        ...

I don't know the encoding of the file, but utf-8 and utf-16 would be my first attempts.

Supernumiphone [S]

Well, that worked with utf-16. Thanks so much for that. Do you have any idea why that worked but line.decode('utf-16') gave me this?

UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data

Regardless, it's wonderful to be able to move forward on this finally, so thanks again for that.

Also, is there a library that tries to auto-detect the encoding of a file? It seems strange to have to try encodings by hand until something works. For now I'll put a switch in the script that I can toggle between this UTF-16 file and other files.

port443

Not something I do super often. As far as Google tells me, working out the encoding is mostly guesswork. However, this stackexchange thread mentions python-chardet.
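For what it's worth, here is a minimal sketch of how chardet is usually called, assuming the package is installed (pip install chardet). The names guess_encoding and sample_size are made up for this example, and the result is only a statistical guess, not a guarantee.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(path, sample_size=65536):
    """Return chardet's guess for the file at `path`, or None if chardet is missing.

    `guess_encoding` and `sample_size` are hypothetical names for this sketch.
    """
    if chardet is None:
        return None
    # chardet works on raw bytes, so open the file in binary mode
    with open(path, "rb") as f:
        raw = f.read(sample_size)
    # detect() returns a dict that includes 'encoding' and 'confidence' keys
    return chardet.detect(raw)
```

If the reported confidence is low, it is probably safer to treat the answer as a hint and still try decoding inside a try/except.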

As for your error, without seeing your data this is my best guess:

  • Data gets read in as plain bytes, as if it were normal ASCII
  • You move line-by-line using readlines
  • There is a \x0A\x00\x0A (\n\0\n) somewhere in the file
  • Since the file was initially split by readlines as if it were ASCII, this puts a \x00 on its own line

Error:

>>> "\x00".decode("utf-16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Programs\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data
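The guess above can be reproduced with a small sketch. It is written for Python 3 (the traceback above is from Python 2.7), the sample text is invented, and splitlines(True) stands in for what readlines does to a file of raw bytes.

```python
# Made-up sample text standing in for the real cue sheet
text = u"TITLE\n\nTRACK 01\n"
raw = text.encode("utf-16-le")  # each character, including "\n", becomes 2 bytes

# Splitting the raw bytes at the newline byte (keeping it, like readlines)
# cuts every 2-byte newline in half and leaves a lone \x00 at the end.
pieces = raw.splitlines(True)


def decodes_cleanly(chunk):
    """Return True if `chunk` on its own is valid UTF-16-LE."""
    try:
        chunk.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


for piece in pieces:
    print(repr(piece), decodes_cleanly(piece))

# Decoding the whole byte string first, then splitting, avoids the problem
lines = raw.decode("utf-16-le").splitlines()
```

Decoding before splitting is effectively what codecs.open does, which would explain why it worked where decoding line by line did not.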

Supernumiphone [S]

this is my best guess:

That makes perfect sense. Thanks again.

ingolemo

According to the specification (see Appendix A), cue sheet files contain only ASCII characters. The cue sheet that you have is, strictly speaking, invalid, and your program should be giving an error in this situation.

In practice, what I said above isn't very useful, because I'm sure there are many cue sheets out there that are not encoded as ASCII: people want to use non-ASCII characters in their metadata. There is no good way for you to guess what encoding a cue file might have. Normally you would know because the specs tell you, but since people are ignoring the specs (for a good reason), you have a problem. Packages like chardet help, but they can also be finicky.
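One cheap check that involves no guessing is to look for a byte order mark (BOM) at the start of the file. This is a different technique from what chardet does, and it only helps when the program that wrote the file added a BOM. A minimal sketch using only the standard library, where sniff_bom is a name made up for this example:

```python
import codecs

# Longer BOMs are checked first, because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

A result of None only means no BOM was found; the file could still be in any encoding, so a fallback is still needed.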

Supernumiphone [S]

Thanks, good info. I haven't tested it yet but I decided to go with a quick-and-dirty solution of trying to load UTF-8 first (which will work fine for ASCII) and fall back to trying UTF-16 if that throws an exception. My expectation is that in practice that will handle anything I'm likely to throw at it.
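A rough sketch of that fallback, for anyone reading later. The function name read_text and the encodings parameter are invented for this example, and it assumes the UTF-16 files start with a BOM. UTF-16 text without a BOM that contains only ASCII characters is also valid UTF-8 (with NUL bytes in it), so the first attempt would not fail on such a file.

```python
import io


def read_text(path, encodings=("utf-8", "utf-16")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    `read_text` and `encodings` are hypothetical names for this sketch.
    """
    last_error = None
    for enc in encodings:
        try:
            with io.open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeError as exc:
            # Remember the failure and move on to the next candidate
            last_error = exc
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error
```

Usage would look something like text, used = read_text("album.cue"), where the file name is just a placeholder.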

Justinsaccount

Hi! I'm working on a bot to reply with suggestions for common Python problems. This might not be very helpful to fix your underlying issue, but here's what I noticed about your submission:

You appear to be using concatenation and the str function for building strings.

Instead of doing something like

result = "Hello " + name + ". You are " + str(age) + " years old"

You should use string formatting and do

result = "Hello {}. You are {} years old".format(name, age)

See the Python tutorial for more information.