Reading the fileset from a torrent

Asked 2/1, 2009 at 12:42 Answered 9/6, 2024 at 16:37

I want to (quickly) put a program/script together to read the fileset from a .torrent file. I want to then use that set to delete any files from a specific directory that do not belong to the torrent.

Any recommendations on a handy library for reading this index from the .torrent file? Whilst I don't object to it, I don't want to be digging deep into the bittorrent spec and rolling a load of code from scratch for this simple purpose.

I have no preference on language.

Umbel answered 2/1, 2009 at 12:42 Comment(0)

Effbot has your question answered. Here is the complete code to read the list of files from a .torrent file (Python 2.4+):

import re

def tokenize(text, match=re.compile("([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.next, src.next())
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

if __name__ == "__main__":
    data = open("test.torrent", "rb").read()
    torrent = decode(data)
    for file in torrent["info"]["files"]:
        print "%r - %d bytes" % ("/".join(file["path"]), file["length"])
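
For the cleanup step in the question, the decoded file list can be compared with what is actually on disk. The following is a minimal Python 3 sketch and not part of the original answer: `find_extra_files` and `delete_extra_files` are hypothetical names, and it assumes `root` is the directory that the torrent's relative paths are resolved against. Deletion is off by default so the list can be checked first.

```python
import os

def find_extra_files(root, torrent_paths):
    """Return sorted relative paths under root that are not in torrent_paths.

    torrent_paths: iterable of "/"-separated relative paths, for example
    built with "/".join(file["path"]) from the decoded torrent.
    """
    wanted = {os.path.normpath(p) for p in torrent_paths}
    extras = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if os.path.normpath(rel) not in wanted:
                extras.append(rel)
    return sorted(extras)

def delete_extra_files(root, torrent_paths, dry_run=True):
    """List files not in the torrent; remove them only if dry_run is False."""
    extras = find_extra_files(root, torrent_paths)
    if not dry_run:
        for rel in extras:
            os.remove(os.path.join(root, rel))
    return extras
```

Calling `delete_extra_files(root, paths)` only reports the candidates; passing `dry_run=False` removes them.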

Herren answered 14/2, 2009 at 14:9 Comment(0)

I would use Rasterbar's libtorrent, which is a small and fast C++ library.
To iterate over the files you could use the torrent_info class (begin_files(), end_files()).

There's also a Python interface for libtorrent:

import libtorrent
info = libtorrent.torrent_info('test.torrent')
for f in info.files():
    print("%s - %s" % (f.path, f.size))

Postpositive answered 2/1, 2009 at 13:42 Comment(1)

This is much simpler code and nice output thanks to this library! For anyone looking for a more human readable list or maybe for a cleaner file list for handling in other scripts, try replacing the torrent file name and trailing slash. For example print(f.path.replace("test/", "")) – Najera 26/10, 2020 at 8:32

bencode.py from the original Mainline BitTorrent 5.x client (http://download.bittorrent.com/dl/BitTorrent-5.2.2.tar.gz) would give you pretty much the reference implementation in Python.

It has an import dependency on the BTL package but that's trivially easy to remove. You'd then look at bencode.bdecode(filecontent)['info']['files'].

Dg answered 2/1, 2009 at 13:0 Comment(3)

This only gives the ability to bencode and bdecode strings though, right? It has no knowledge of where the bencoded fileset strings actually start and end, i.e. after the bencoded metadata and before the binary block. – Umbel 2/1, 2009 at 13:11

The root and info objects are both dictionaries (mappings). There's no inherent ordering of the file metadata and the binary checksum strings, except that by convention dictionaries are output in key name order. You need not concern yourself with storage order, just suck the whole dictionary in. – Dg 2/1, 2009 at 14:32

libtorrent is written in C++ with Python bindings. At the time of writing this message, libtorrent does not build for Python 3.11, so it is not available for Python 3.11 from PyPI. bencode.py, on the other hand, is pure Python; it installed with pip just fine and did the job. – Acquah 6/8, 2024 at 18:21
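
To illustrate the point made in the comments above, that the file list and the binary checksum string are simply entries in the `info` dictionary, here is a minimal Python 3 sketch of a bencode decoder that works directly on bytes. It is not the Mainline `bencode.py` code; the function name `bdecode` is reused only for familiarity, the `(value, next_pos)` return shape is this sketch's own choice, and the sample data is made up. Strings are kept as raw bytes so the `pieces` value is not damaged by text decoding.

```python
def bdecode(data, pos=0):
    """Decode one bencoded value from bytes at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                       # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l" or lead == b"d":       # list or dictionary, closed by "e"
        items = []
        pos += 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        if lead == b"d":
            # Dictionary items alternate key, value, key, value, ...
            return dict(zip(items[0::2], items[1::2])), pos + 1
        return items, pos + 1
    if lead.isdigit():                     # string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError("invalid bencode at offset %d" % pos)

# Made-up sample: one file "sub/a.txt" of 12 bytes plus a 4-byte "pieces" blob.
sample = (b"d4:infod5:filesld6:lengthi12e4:pathl3:sub5:a.txteee"
          b"4:name4:test6:pieces4:\x00\x01\x02\x03ee")
root, end = bdecode(sample)
info = root[b"info"]
```

After decoding, `info[b"files"]` and `info[b"pieces"]` are reached by key, so where they sit in the file does not matter.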

Here's the code from Constantine's answer above, slightly modified to handle Unicode characters in torrent filenames and fileset filenames in torrent info:

import re

def tokenize(text, match=re.compile("([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.next, src.next())
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

if __name__ == "__main__":
    data = open("C:\\Torrents\\test.torrent", "rb").read()
    torrent = decode(data)
    n = 0
    for file in torrent["info"]["files"]:
        n = n + 1
        # "path" is a list of path components; join them into one relative path
        fname = "/".join(map(str, file["path"]))
        print str(n) + " -- " + fname
        print fname + " -- " + str(file["length"])

Necessitous answered 4/3, 2017 at 17:37 Comment(0)

Expanding on the ideas above, I did the following:

~> cd ~/bin

~/bin> ls torrent*
torrent-parse.py  torrent-parse.sh

~/bin> cat torrent-parse.py
# torrent-parse.py
import sys
import libtorrent

# get the input torrent file
if len(sys.argv) > 1:
    torrent = sys.argv[1]
else:
    print("Missing param: torrent filename")
    sys.exit(1)
# get names of files in the torrent file
info = libtorrent.torrent_info(torrent)
for f in info.files():
    print("%s - %s" % (f.path, f.size))

~/bin> cat torrent-parse.sh
#!/bin/bash
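if [ $# -lt 1 ]; then
  echo "Missing param: torrent filename"
  exit 1
fi

# run the parser that sits next to this script, passing the filename through intact
python "$(dirname "$0")/torrent-parse.py" "$1"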

You'll want to set permissions appropriately to make the shell script executable:

~/bin> chmod a+x torrent-parse.sh

Hope this helps someone :)

Rectory answered 2/6, 2012 at 22:14 Comment(0)

Here is the code from Alix above, modified to run under Python 3.x:

import re

def tokenize(text, match=re.compile(rb"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)].decode("utf-8", "ignore")
            i = i + int(s)
        else:
            yield s.decode("utf-8", "ignore")

def decode_item(torrent, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next(torrent))
        if next(torrent) != "e":
            raise ValueError
    elif token == "s":
        data = next(torrent)
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next(torrent)
        while tok != "e":
            data.append(decode_item(torrent, tok))
            tok = next(torrent)
        if token == "d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        token=next(src)
        data = decode_item(src, token)
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

if __name__ == "__main__":
    data = open("test.torrent", "rb").read()
    torrent = decode(data)
    n = 0
    for file in torrent["info"]["files"]:
        n = n + 1
        # "path" is a list of path components; join them into one relative path
        fname = "/".join(file["path"])
        print(str(n) + " -- " + fname)
        print(fname + " -- " + str(file["length"]))
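
One caveat with looping over `torrent["info"]["files"]`: a single-file torrent has no `files` key, and instead stores `name` and `length` directly in `info`, so the loop would raise a `KeyError`. A minimal sketch that covers both layouts follows. `list_files` is a hypothetical helper, not part of the answer above, and it assumes string keys as produced by the Python 3 decoder in this answer.

```python
def list_files(info):
    """Return (relative_path, length) pairs for a decoded "info" dictionary."""
    if "files" in info:
        # Multi-file layout: one entry per file, path given as a list of parts.
        return [("/".join(f["path"]), f["length"]) for f in info["files"]]
    # Single-file layout: the file name and size sit directly in "info".
    return [(info["name"], info["length"])]

# Hand-written dictionaries standing in for decode(data)["info"].
multi = {"name": "test", "files": [{"path": ["sub", "a.txt"], "length": 12}]}
single = {"name": "a.iso", "length": 2048}
```

In the multi-file case the paths are relative to a directory named by `info["name"]`, which is why libtorrent's output includes that prefix.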

Theocracy answered 9/6, 2024 at 16:37 Comment(0)
