Torrent Knowledge

A torrent name parser that you can customize and train.

Main ideas:

use enumerations of file name features (e.g. known video codecs, release groups, video sources) in the parser's regular expressions (see settings/video_*.json)
collect statistics (train mode) and use them to speed up future parsing
use torrent name masks to quickly estimate whether a name can be parsed
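
As a rough illustration of the first idea, an enumeration of known feature values can be turned into a named regex group. This is only a sketch: the codec list, the `enumeration_to_regex` helper and the group name are assumptions made for the example, not the contents of settings/video_*.json or the project's actual code.

```python
import re

# Hypothetical enumeration; the real lists live in settings/video_*.json.
KNOWN_VIDEO_CODECS = ["x264", "x265", "h264", "xvid"]


def enumeration_to_regex(name, values):
    """Build a named group that matches any of the enumerated values."""
    # Longest values first, so a longer value wins over its own prefix.
    alternatives = sorted(values, key=len, reverse=True)
    return "(?P<{}>{})".format(name, "|".join(re.escape(v) for v in alternatives))


codec_re = re.compile(enumeration_to_regex("video_codec", KNOWN_VIDEO_CODECS))
match = codec_re.search("aquarius.s02e11.480p.x264-msd")
print(match.group("video_codec"))  # x264
```

Keeping the values in a settings file means a new codec or release group can be supported by editing data rather than the regular expressions themselves.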

Command line arguments

usage: main.py [-h] -i INPUT_FILE -o OUTPUT_FILE [-v] [-t] [-l LOG_DIR]

Run torrent-knowledge cli

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        CSV with torrents in format 'info_hash|torrent_title'
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
  -v, --verbose         increase log verbosity via -vvv
  -t, --train-mode      update features frequency from dataset
  -l LOG_DIR, --log-dir LOG_DIR
                        directory for log files (only for -vvv mode)
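
The input format 'info_hash|torrent_title' can be read with a few lines of Python. The sketch below is not taken from main.py: `read_torrents` is a hypothetical helper, the info hashes are placeholders, and splitting on the first `|` only (so that titles containing `|` survive) is an assumption.

```python
import io


def read_torrents(stream):
    """Yield (info_hash, torrent_title) pairs from 'info_hash|torrent_title' lines."""
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        info_hash, sep, title = line.partition("|")
        if not sep:
            continue  # skip lines without the delimiter
        yield info_hash, title


# Placeholder hashes, for illustration only.
sample = io.StringIO(
    "0000000000000000000000000000000000000000|aquarius.s02e11.480p.x264-msd\n"
    "1111111111111111111111111111111111111111|amanchu.s01e03.480p.x264-msd\n"
)
rows = list(read_torrents(sample))
print(len(rows))  # 2
```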

Train

You should train the parser for the feature extraction and name detection specific to your data. In train mode the parser collects statistics for your torrent dataset and saves them into the trained config file.

user@mbpro$ wc -l settings/torrents_masks.json
     121 settings/torrents_masks.json

user@mbpro$ wc -l datasets/torrents.csv
 6808916 datasets/torrents.csv

user@mbpro$ ./main.py -vvv -t -i datasets/torrents.csv -o /tmp/torrents.json -l /tmp/log
log - root - DEBUG - Complete:  5294016, lines/sec: 11089, tv200: 02881, tv404: 04932, ep200: 38679, ep404: 05564
log - root - DEBUG - Complete in 477.60
user@mbpro$ wc -l /tmp/torrents.json
  199731 /tmp/torrents.json

user@mbpro$ ./main.py -vvv -i datasets/torrents.csv -o /tmp/torrents.json -l /tmp/log
log - root - DEBUG - Complete:  5294016, lines/sec: 14514, tv200: 02620, tv404: 03385, ep200: 38142, ep404: 05494
log - root - DEBUG - Complete in 364.95

user@mbpro$ wc -l /tmp/torrents.json
  192036 /tmp/torrents.json

user@mbpro$ wc -l settings/torrents_masks.json
   14946 settings/torrents_masks.json

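Untrained mode cons:

less accurate results (~4%; 192036 output lines versus 199731 in the runs above)

Untrained mode pros:

faster batch processing (~30% speedup; 14514 versus 11089 lines/sec in the runs above)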

Masks file

If you open the trained settings/torrents_masks.json, you will see something like this:

"_a99a99_999a_a999_aaa": {
    "freq": 22408,
    "pattern": "{series_name}{space}s{season_no}e{episode_no}{space}{video_resolution}{space}{video_codec}{space}{release_group}",
    "samples": [
        "aquarius.s02e11.480p.x264-msd",
        "amanchu.s01e03.480p.x264-msd",
        "airmageddon.s01e03.480p.x264-msd"
    ],
    "masks": {
        "aaaaaaaa_a99a99_999a_a999_aaa": 559,
        "aaaaaaa_a99a99_999a_a999_aaa": 373,
        "aaaaaa_a99a99_999a_a999_aaa": 278,
        "aaaaa_a99a99_999a_a999_aaa": 274,
        "aaaaaaaaa_a99a99_999a_a999_aaa": 264,
        "aaaa_a99a99_999a_a999_aaa": 236,
        ...
    }
}

key - must be unique; there are no other requirements
samples - this list is used to test the pattern before parsing and for documentation purposes
freq - the number of torrents matched by this pattern in train mode
masks - this dict is also generated by training. Its keys are the masks of the matched torrent names; its values are the number of matched torrent names for each mask.
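
To make the role of pattern and samples more concrete, here is a minimal sketch of how a pattern could be expanded into a regular expression and checked against its samples. The `PLACEHOLDERS` table, `compile_pattern` and `failing_samples` are hypothetical names with simplified definitions chosen for this example; the real placeholder expressions are built by the parser from the settings files and will differ.

```python
import re

# Hypothetical, simplified placeholder definitions (assumptions for this sketch).
PLACEHOLDERS = {
    "series_name": r"(?P<series_name>[a-z0-9]+(?:[._ -][a-z0-9]+)*?)",
    "space": r"[._ -]",
    "season_no": r"(?P<season_no>\d{2})",
    "episode_no": r"(?P<episode_no>\d{2})",
    "video_resolution": r"(?P<video_resolution>\d{3,4}p)",
    "video_codec": r"(?P<video_codec>x264|x265|xvid)",
    "release_group": r"(?P<release_group>[a-z0-9]+)",
}


def compile_pattern(pattern):
    """Expand {placeholder} fields and compile a whole-string regex."""
    return re.compile("^" + pattern.format(**PLACEHOLDERS) + "$")


def failing_samples(pattern, samples):
    """Return the samples that the pattern does not match."""
    regex = compile_pattern(pattern)
    return [s for s in samples if regex.match(s) is None]


pattern = ("{series_name}{space}s{season_no}e{episode_no}{space}"
           "{video_resolution}{space}{video_codec}{space}{release_group}")
samples = [
    "aquarius.s02e11.480p.x264-msd",
    "amanchu.s01e03.480p.x264-msd",
    "airmageddon.s01e03.480p.x264-msd",
]
print(failing_samples(pattern, samples))  # []
fields = compile_pattern(pattern).match(samples[0]).groupdict()
print(fields["series_name"], fields["season_no"], fields["episode_no"])  # aquarius 02 11
```

Running such a check before parsing catches a pattern that no longer matches its own documented samples.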

How do you transform a torrent name into a mask? It is not too hard:

    # See MaskParser::clean_title for implementation details
    ...
    self._trans_table = str.maketrans(
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789" +
        self._ch_punkt +
        self._ch_spaces,

        "aaaaaaaaaaaaaaaaaaaaaaaaaa"
        "9999999999" +
        "p" * len(self._ch_punkt) +
        "_" * len(self._ch_spaces)
    )

    mask = self.clean_title(title).translate(self._trans_table)
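
The excerpt above depends on class attributes that are not shown, so here is a self-contained sketch of the same translation idea. Several things are assumptions: `CH_SPACES` and `CH_PUNKT` are guesses at the separator and punctuation sets (chosen so that the sample names reproduce the masks listed earlier), lowercasing stands in for `clean_title`, and `is_known_mask` is a hypothetical helper showing one way masks could be used for a quick pre-check.

```python
import string

# Assumed character classes; the real values are defined in MaskParser.
CH_SPACES = " ._-"
CH_PUNKT = "".join(c for c in string.punctuation if c not in CH_SPACES)

# Letters -> "a", digits -> "9", punctuation -> "p", separators -> "_".
TRANS_TABLE = str.maketrans(
    string.ascii_lowercase + string.digits + CH_PUNKT + CH_SPACES,
    "a" * 26 + "9" * 10 + "p" * len(CH_PUNKT) + "_" * len(CH_SPACES),
)


def title_to_mask(title):
    """Map a title to its mask; lowercasing is a stand-in for clean_title."""
    return title.lower().translate(TRANS_TABLE)


# A small subset of the "masks" dict shown earlier.
KNOWN_MASKS = {
    "aaaaaaaa_a99a99_999a_a999_aaa": 559,
    "aaaaaaa_a99a99_999a_a999_aaa": 373,
}


def is_known_mask(title):
    """Cheap pre-check: has a name of this shape been seen during training?"""
    return title_to_mask(title) in KNOWN_MASKS


print(title_to_mask("aquarius.s02e11.480p.x264-msd"))  # aaaaaaaa_a99a99_999a_a999_aaa
print(is_known_mask("amanchu.s01e03.480p.x264-msd"))   # True
```

Characters outside the translation table, such as non-ASCII letters, pass through unchanged in this sketch.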
