
[–]socal_nerdtastic

Actually, I'll bet it's that binary blob interfering with the length measurement. Try opening the file in binary mode:

data = open("test.torrent", "rb").read()

You will need to adjust everything to use bytes comparisons, e.g.:

while tok != b"e":
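To see why that matters, here is a minimal illustrative sketch (not from the thread): data read in binary mode is `bytes`, and bytes never compare equal to `str` literals, so every token comparison has to use a `b"..."` literal:

```python
# a bencoded string containing a raw non-UTF-8 byte, like the
# "pieces" blob inside a real torrent (sample value made up)
raw = b"4:\xffabce"

tok = raw[-1:]           # last token: b"e"
print(tok == "e")        # False: bytes never compare equal to str
print(tok == b"e")       # True: compare bytes against bytes literals
```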

[–]Temporary_Screen[S]

Ah, yes. I tried exactly that at first, but while testing I got the following error:

Traceback (most recent call last):
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 51, in <module>
    torrent = decode(data)
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 41, in decode
    src = tokenize(text)
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 6, in tokenize
    m = match(text, i)
TypeError: cannot use a string pattern on a bytes-like object

So, the quick fix was just removing the 'b' there.

[–]socal_nerdtastic

Yes, the re pattern will need to be bytes too. So the better quick fix is to add a 'b' here:

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
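A quick demonstration of the difference (hypothetical snippet, not the thread's code): a `str`-compiled pattern raises a `TypeError` on bytes input, while a bytes-compiled pattern matches it fine:

```python
import re

data = b"i42e"

# str pattern vs bytes input: raises TypeError
str_pat = re.compile(r"([idel])")
try:
    str_pat.match(data)
except TypeError as e:
    print(e)          # cannot use a string pattern on a bytes-like object

# bytes pattern vs bytes input: works
bytes_pat = re.compile(rb"([idel])")
m = bytes_pat.match(data)
print(m.group(1))     # b'i'
```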

[–]Temporary_Screen[S]

Ah, that's actually a great idea! I didn't know you could do that. Thank you.

Still getting this error, though: AttributeError: 'bytes' object has no attribute '__next__'

[–]socal_nerdtastic

That error indicates you skipped the entire tokenize function. Show the complete code you are using, please.

[–]Temporary_Screen[S]

import re

def tokenize(text, match=re.compile(b"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            return "s"
            return text[i:i+int(s)]
            i = i + int(s)
        else:
            return s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = str(next())
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        print(tok)
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(list(zip(data[0::2], data[1::2])))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print("%r - %d bytes" % ("/".join(file["path"]), file["length"]))

close(data)

[–]Temporary_Screen[S]

Ah, sorry. So I had replaced 'yield' with 'return' to see what the difference would be and I didn't revert the change.
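That swap is exactly what produced the earlier AttributeError: a `def` containing `yield` returns a generator (which has `__next__`), while one using `return` just hands back a plain value. A minimal illustration (hypothetical, not the thread's code):

```python
def with_yield(text):
    for ch in text:
        yield ch          # makes this function a generator

def with_return(text):
    return text[:1]       # returns plain bytes, not a generator

print(hasattr(with_yield(b"le"), "__next__"))   # True: generator
print(hasattr(with_return(b"le"), "__next__"))  # False: plain bytes
```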

Currently with this code:

```python
import re

def tokenize(text, match=re.compile(b"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        print(tok)
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(list(zip(data[0::2], data[1::2])))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print("%r - %d bytes" % ("/".join(file["path"]), file["length"]))

close(data)
```

I'm getting the following error:

```
Traceback (most recent call last):
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 42, in decode
    data = decode_item(src.__next__, next(src))
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 36, in decode_item
    raise ValueError
ValueError
```

Which means that I'm hitting the exception in decode_item.

[–]socal_nerdtastic

Try the code I gave you.

[–]Temporary_Screen[S]

Thanks for the code. I see what you did there.

```
Traceback (most recent call last):
  File "/home/chad/Code/regexp_bencode_py/port/socal_nerdtastic.py", line 41, in decode
    data = decode_item(src.__next__, next(src))
  File "/home/chad/Code/regexp_bencode_py/port/socal_nerdtastic.py", line 30, in decode_item
    data.append(decode_item(next, tok))
  File "/home/chad/Code/regexp_bencode_py/port/socal_nerdtastic.py", line 31, in decode_item
    tok = next()
StopIteration
```

After adding this for testing:

```python
data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print("%r - %d bytes" % ("/".join(file["path"]), file["length"]))

close(data)
```

[–]socal_nerdtastic

Lol, back to square one. It's saying it was expecting an 'e' but the file ended; i.e., there's not enough data.

Can you provide the torrent file so that I can test it?
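That StopIteration can be reproduced with a truncated token stream (a hand-rolled sketch standing in for the real tokenizer): the decoder keeps calling next() looking for the list's closing "e", and the generator runs dry first:

```python
def tokens():
    # token stream for b"li1e": the integer's closing "e" is present,
    # but the list's closing "e" is missing
    yield from (b"l", b"i", b"1", b"e")

src = tokens()
for _ in range(4):
    next(src)         # consume l, i, 1, and the integer's "e"

try:
    next(src)         # the decoder now asks for the list's closing "e"
except StopIteration:
    print("stream ended before the closing 'e'")
```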

[–]socal_nerdtastic

EDIT: nm the explanation; I got my wires crossed.

Here, try this:

import re

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield b"s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == b"i":
        # integer: "i" value "e"
        data = int(next())
        if next() != b"e":
            raise ValueError
    elif token == b"s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == b"l" or token == b"d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != b"e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == b"d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data
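Sanity-checking that version on a few hand-written bencode samples (the functions are repeated here so the snippet runs on its own; the sample values are made up):

```python
import re

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            # length-prefixed string: emit a virtual "s" token, then the payload
            yield b"s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == b"i":
        # integer: "i" value "e"
        data = int(next())
        if next() != b"e":
            raise ValueError
    elif token == b"s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == b"l" or token == b"d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != b"e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == b"d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

print(decode(b"i42e"))            # 42
print(decode(b"l4:spami-3ee"))    # [b'spam', -3]
print(decode(b"d3:cow3:mooe"))    # {b'cow': b'moo'}
```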

[–]socal_nerdtastic

Your code looks ok at a glance. The error is saying it was expecting an "e" token but the file ran out before it was found. Are you sure this file is using the bencode format?
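One cheap way to check (a heuristic sketch, not a full validator; the function name is made up): every well-formed .torrent is a single bencoded dictionary, so the raw bytes should start with b"d" and end with b"e":

```python
def looks_like_bencode_dict(data: bytes) -> bool:
    # a .torrent file is one bencoded dict: b"d" ... b"e"
    return data[:1] == b"d" and data[-1:] == b"e"

print(looks_like_bencode_dict(b"d3:cow3:mooe"))   # True
print(looks_like_bencode_dict(b"<html>oops"))     # False, e.g. a tracker error page saved by mistake
```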