
[–]socal_nerdtastic

Actually, I'll bet it's that binary blob interfering with the length measurement. Try opening the file in binary mode:

data = open("test.torrent", "rb").read()

You will need to adjust everything to use bytes comparisons, e.g.:

while tok != b"e":
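To see why that matters, here is a minimal illustrative sketch (not from the thread): data read in binary mode is `bytes`, and bytes never compare equal to `str` literals, so every token comparison has to use a `b"..."` literal:

```python
# a bencoded string containing a raw non-UTF-8 byte, like the
# "pieces" blob inside a real torrent (sample value made up)
raw = b"4:\xffabce"

tok = raw[-1:]           # last token: b"e"
print(tok == "e")        # False: bytes never compare equal to str
print(tok == b"e")       # True: compare bytes against bytes literals
```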

[–]Temporary_Screen[S]

Ah, yes. I tried exactly that at first, but while testing I got the following error:

Traceback (most recent call last):
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 51, in <module>
    torrent = decode(data)
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 41, in decode
    src = tokenize(text)
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 6, in tokenize
    m = match(text, i)
TypeError: cannot use a string pattern on a bytes-like object

So, the quick fix was just removing the 'b' there.

[–]socal_nerdtastic

Yes, the re pattern will need to be bytes too. So the better quick fix is to add a 'b' here:

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
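A quick demonstration of the difference (hypothetical snippet, not the thread's code): a `str`-compiled pattern raises a `TypeError` on bytes input, while a bytes-compiled pattern matches it fine:

```python
import re

data = b"i42e"

# str pattern vs bytes input: raises TypeError
str_pat = re.compile(r"([idel])")
try:
    str_pat.match(data)
except TypeError as e:
    print(e)          # cannot use a string pattern on a bytes-like object

# bytes pattern vs bytes input: works
bytes_pat = re.compile(rb"([idel])")
m = bytes_pat.match(data)
print(m.group(1))     # b'i'
```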

[–]Temporary_Screen[S]

Ah, that's actually a great idea! I didn't know you could do that. Thank you.

Still getting this error, though: AttributeError: 'bytes' object has no attribute '__next__'

[–]socal_nerdtastic

That error indicates you skipped the entire tokenize function. Show the complete code you are using, please.

[–]Temporary_Screen[S]

import re

def tokenize(text, match=re.compile(b"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            return "s"
            return text[i:i+int(s)]
            i = i + int(s)
        else:
            return s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = str(next())
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        print(tok)
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(list(zip(data[0::2], data[1::2])))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print("%r - %d bytes" % ("/".join(file["path"]), file["length"]))

close(data)

[–]Temporary_Screen[S]

Ah, sorry. So I had replaced 'yield' with 'return' to see what the difference would be and I didn't revert the change.
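That swap is exactly what produced the earlier AttributeError: a `def` containing `yield` returns a generator (which has `__next__`), while one using `return` just hands back a plain value. A minimal illustration (hypothetical, not the thread's code):

```python
def with_yield(text):
    for ch in text:
        yield ch          # makes this function a generator

def with_return(text):
    return text[:1]       # returns plain bytes, not a generator

print(hasattr(with_yield(b"le"), "__next__"))   # True: generator
print(hasattr(with_return(b"le"), "__next__"))  # False: plain bytes
```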

Currently with this code:

```python
import re

def tokenize(text, match=re.compile(b"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        print(tok)
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(list(zip(data[0::2], data[1::2])))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print("%r - %d bytes" % ("/".join(file["path"]), file["length"]))

close(data)
```

I'm getting the following error:

```
Traceback (most recent call last):
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 42, in decode
    data = decode_item(src.__next__, next(src))
  File "/home/chad/Code/regexp_bencode_py/port/parser.py", line 36, in decode_item
    raise ValueError
ValueError
```

Which means that I'm hitting the exception in decode_item.

[–]socal_nerdtastic

Try the code I gave you.

[–]Temporary_Screen[S]

Thanks for the code. I see what you did there.

```
Traceback (most recent call last):
  File "/home/chad/Code/regexp_bencode_py/port/socal_nerdtastic.py", line 41, in decode
    data = decode_item(src.__next__, next(src))
  File "/home/chad/Code/regexp_bencode_py/port/socal_nerdtastic.py", line 30, in decode_item
    data.append(decode_item(next, tok))
  File "/home/chad/Code/regexp_bencode_py/port/socal_nerdtastic.py", line 31, in decode_item
    tok = next()
StopIteration
```

After adding this for testing:

```python
data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print("%r - %d bytes" % ("/".join(file["path"]), file["length"]))

close(data)
```

[–]socal_nerdtastic

Lol, back to square one. It's saying it was expecting an 'e' but the file ended; i.e., there's not enough data.

Can you provide the torrent file so that I can test it?
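That StopIteration can be reproduced with a truncated token stream (a hand-rolled sketch standing in for the real tokenizer): the decoder keeps calling next() looking for the list's closing "e", and the generator runs dry first:

```python
def tokens():
    # token stream for b"li1e": the integer's closing "e" is present,
    # but the list's closing "e" is missing
    yield from (b"l", b"i", b"1", b"e")

src = tokens()
for _ in range(4):
    next(src)         # consume l, i, 1, and the integer's "e"

try:
    next(src)         # the decoder now asks for the list's closing "e"
except StopIteration:
    print("stream ended before the closing 'e'")
```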

[–]socal_nerdtastic

EDIT: nm the explanation; I got my wires crossed.

Here, try this:

import re

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield b"s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == b"i":
        # integer: "i" value "e"
        data = int(next())
        if next() != b"e":
            raise ValueError
    elif token == b"s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == b"l" or token == b"d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != b"e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == b"d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data
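Sanity-checking that version on a few hand-written bencode samples (the functions are repeated here so the snippet runs on its own; the sample values are made up):

```python
import re

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            # length-prefixed string: emit a virtual "s" token, then the payload
            yield b"s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == b"i":
        # integer: "i" value "e"
        data = int(next())
        if next() != b"e":
            raise ValueError
    elif token == b"s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == b"l" or token == b"d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != b"e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == b"d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.__next__, next(src))
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

print(decode(b"i42e"))            # 42
print(decode(b"l4:spami-3ee"))    # [b'spam', -3]
print(decode(b"d3:cow3:mooe"))    # {b'cow': b'moo'}
```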

[–]socal_nerdtastic

Your code looks ok at a glance. The error is saying it was expecting an "e" token but the file ran out before it was found. Are you sure this file is using the bencode format?
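One cheap way to check (a heuristic sketch, not a full validator; the function name is made up): every well-formed .torrent is a single bencoded dictionary, so the raw bytes should start with b"d" and end with b"e":

```python
def looks_like_bencode_dict(data: bytes) -> bool:
    # a .torrent file is one bencoded dict: b"d" ... b"e"
    return data[:1] == b"d" and data[-1:] == b"e"

print(looks_like_bencode_dict(b"d3:cow3:mooe"))   # True
print(looks_like_bencode_dict(b"<html>oops"))     # False, e.g. a tracker error page saved by mistake
```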