NodesCrew/torrent-knowledge

Torrent Knowledge

A torrent name parser that you can customize and train.

Main ideas:

  • use enumerations of file name features (e.g. known video codecs, release groups, video sources) in the parser's regular expressions (see settings/video_*.json)
  • collect statistics (train mode) and use them to speed up future parsing
  • use torrent name masks to quickly estimate whether a name can be parsed
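
The first idea can be illustrated with a minimal sketch. It assumes an enumeration is a plain JSON list of strings; the real layout of settings/video_*.json is not shown in this README and may differ, and the helper name `enum_to_regex` is hypothetical.

```python
import json
import re

# Hypothetical enumeration content; the real settings/video_*.json
# files may use a different structure.
VIDEO_CODECS_JSON = '["x264", "x265", "h.264", "xvid", "divx"]'


def enum_to_regex(values, group_name):
    """Build a named alternation group from a list of known feature values."""
    # Longer values go first so a long value wins over a shorter one
    # that shares its prefix.
    ordered = sorted(set(values), key=lambda v: (-len(v), v))
    alternation = "|".join(re.escape(v) for v in ordered)
    return f"(?P<{group_name}>{alternation})"


codecs = json.loads(VIDEO_CODECS_JSON)
codec_re = re.compile(enum_to_regex(codecs, "video_codec"), re.IGNORECASE)

match = codec_re.search("aquarius.s02e11.480p.x264-msd")
print(match.group("video_codec") if match else None)
```

Escaping each value with `re.escape` keeps characters such as the dot in `h.264` literal, so an enumeration entry can never change the meaning of the surrounding expression.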

Command line arguments

usage: main.py [-h] -i INPUT_FILE -o OUTPUT_FILE [-v] [-t] [-l LOG_DIR]

Run torrent-knowledge cli

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        CSV with torrents in format 'info_hash|torrent_title'
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
  -v, --verbose         increase log verbosity via -vvv
  -t, --train-mode      update features frequency from dataset
  -l LOG_DIR, --log-dir LOG_DIR
                        directory for log files (only for -vvv mode)
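
The input format 'info_hash|torrent_title' can be read with a few lines of Python. This is only a sketch of the format described above, not the parser's own reader: the function name `read_torrents` is hypothetical, and splitting on the first `|` only is an assumption made so that titles containing `|` survive.

```python
import io


def read_torrents(lines):
    """Yield (info_hash, torrent_title) pairs from 'info_hash|torrent_title' lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Split on the first '|' only, so a title containing '|' stays intact.
        info_hash, sep, title = line.partition("|")
        if not sep:
            continue  # malformed line without a separator
        yield info_hash, title


# Made-up sample rows; the hash is a placeholder, not a real torrent.
sample = io.StringIO(
    "0123456789abcdef0123456789abcdef01234567|aquarius.s02e11.480p.x264-msd\n"
    "\n"
    "not-a-valid-row\n"
)
rows = list(read_torrents(sample))
print(rows)
```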

 

Train

You should train the parser so that it can extract specific features and detect names. In train mode the parser collects statistics for your torrent dataset and saves them into the trained config file.

user@mbpro$ wc -l settings/torrents_masks.json
     121 settings/torrents_masks.json

user@mbpro$ wc -l datasets/torrents.csv
 6808916 datasets/torrents.csv

user@mbpro$ ./main.py -vvv -t -i datasets/torrents.csv -o /tmp/torrents.json -l /tmp/log
log - root - DEBUG - Complete:  5294016, lines/sec: 11089, tv200: 02881, tv404: 04932, ep200: 38679, ep404: 05564
log - root - DEBUG - Complete in 477.60
user@mbpro$ wc -l /tmp/torrents.json
  199731 /tmp/torrents.json

user@mbpro$ ./main.py -vvv -i datasets/torrents.csv -o /tmp/torrents.json -l /tmp/log
log - root - DEBUG - Complete:  5294016, lines/sec: 14514, tv200: 02620, tv404: 03385, ep200: 38142, ep404: 05494
log - root - DEBUG - Complete in 364.95

user@mbpro$ wc -l /tmp/torrents.json
  192036 /tmp/torrents.json

user@mbpro$ wc -l settings/torrents_masks.json
   14946 settings/torrents_masks.json

Untrained mode cons:

  • less accurate results (~4%; compare the 192036 and 199731 output lines above)

Untrained mode pros:

  • faster batch processing (~30% speedup; compare 14514 and 11089 lines/sec above)

Masks file

If you open trained settings/torrents_masks.json, you can see something like this:

"_a99a99_999a_a999_aaa": {
    "freq": 22408,
    "pattern": "{series_name}{space}s{season_no}e{episode_no}{space}{video_resolution}{space}{video_codec}{space}{release_group}",
    "samples": [
        "aquarius.s02e11.480p.x264-msd",
        "amanchu.s01e03.480p.x264-msd",
        "airmageddon.s01e03.480p.x264-msd"
    ],
    "masks": {
        "aaaaaaaa_a99a99_999a_a999_aaa": 559,
        "aaaaaaa_a99a99_999a_a999_aaa": 373,
        "aaaaaa_a99a99_999a_a999_aaa": 278,
        "aaaaa_a99a99_999a_a999_aaa": 274,
        "aaaaaaaaa_a99a99_999a_a999_aaa": 264,
        "aaaa_a99a99_999a_a999_aaa": 236,
        ...
    }
}

  • key: must be unique; there are no other requirements
  • samples: names used to test the pattern before parsing; they also serve as documentation
  • freq: the number of torrents matched by this pattern in train mode
  • masks: also generated by training; each key is the mask of a matched torrent name, and each value is the number of matched names with that mask
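
To make the role of `pattern` and `samples` concrete, here is a rough sketch of how a placeholder pattern could be expanded into a regular expression and checked against its samples. The regex fragments in `FRAGMENTS` and the helper names are hypothetical; in the project the fragments come from the settings files, which are not reproduced here.

```python
import re

# Hypothetical regex fragments, one per placeholder. {space} is left
# unnamed because it occurs several times in one pattern.
FRAGMENTS = {
    "series_name": r"(?P<series_name>.+?)",
    "space": r"[ ._-]",
    "season_no": r"(?P<season_no>\d{2})",
    "episode_no": r"(?P<episode_no>\d{2})",
    "video_resolution": r"(?P<video_resolution>\d{3,4}p)",
    "video_codec": r"(?P<video_codec>x264|x265|xvid)",
    "release_group": r"(?P<release_group>[a-z0-9]+)",
}

ENTRY = {
    "pattern": "{series_name}{space}s{season_no}e{episode_no}{space}"
               "{video_resolution}{space}{video_codec}{space}{release_group}",
    "samples": [
        "aquarius.s02e11.480p.x264-msd",
        "amanchu.s01e03.480p.x264-msd",
        "airmageddon.s01e03.480p.x264-msd",
    ],
}


def compile_pattern(pattern):
    """Replace each {placeholder} with its regex fragment and compile the result."""
    return re.compile(pattern.format(**FRAGMENTS))


def failing_samples(entry):
    """Return the samples that the entry's pattern does not fully match."""
    regex = compile_pattern(entry["pattern"])
    return [s for s in entry["samples"] if regex.fullmatch(s) is None]


print(failing_samples(ENTRY))
parsed = compile_pattern(ENTRY["pattern"]).fullmatch(ENTRY["samples"][0])
print(parsed.groupdict())
```

An empty list from `failing_samples` means every sample is matched, which is the kind of check the samples list makes possible before a pattern is used on real data.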

How is a torrent name transformed into a mask? It is not too hard:

    # See MaskParser::clean_title for implementation details
    ...
    self._trans_table = str.maketrans(
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789" +
        self._ch_punkt +
        self._ch_spaces,

        "aaaaaaaaaaaaaaaaaaaaaaaaaa"
        "9999999999" +
        "p" * len(self._ch_punkt) +
        "_" * len(self._ch_spaces)
    )

    mask = self.clean_title(title).translate(self._trans_table)
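
A self-contained version of the same idea, for experimenting outside the parser, might look like the sketch below. `PUNCT_CHARS` and `SPACE_CHARS` are hypothetical stand-ins for `self._ch_punkt` and `self._ch_spaces`, and `clean_title` here only strips and lowercases the title, which is an assumption; the real `MaskParser.clean_title` may do more. The lookup index shows one way trained masks could be used to decide quickly whether a known pattern is worth trying.

```python
import string

PUNCT_CHARS = "!,'()[]"  # hypothetical stand-in for self._ch_punkt
SPACE_CHARS = " ._-"     # hypothetical stand-in for self._ch_spaces

TRANS_TABLE = str.maketrans(
    string.ascii_lowercase + string.digits + PUNCT_CHARS + SPACE_CHARS,
    "a" * 26 + "9" * 10 + "p" * len(PUNCT_CHARS) + "_" * len(SPACE_CHARS),
)


def clean_title(title):
    # Assumed normalisation only: trim whitespace and lowercase.
    return title.strip().lower()


def make_mask(title):
    """Map letters to 'a', digits to '9', punctuation to 'p', separators to '_'."""
    return clean_title(title).translate(TRANS_TABLE)


def build_mask_index(trained):
    """Map every trained mask to the pattern keys that matched it."""
    index = {}
    for key, entry in trained.items():
        for mask in entry["masks"]:
            index.setdefault(mask, []).append(key)
    return index


# Shortened copy of the trained entry shown in the Masks file section.
TRAINED = {
    "_a99a99_999a_a999_aaa": {
        "masks": {
            "aaaaaaaa_a99a99_999a_a999_aaa": 559,
            "aaaaaaa_a99a99_999a_a999_aaa": 373,
        }
    }
}

index = build_mask_index(TRAINED)
print(make_mask("aquarius.s02e11.480p.x264-msd"))
print(index.get(make_mask("amanchu.s01e03.480p.x264-msd"), []))
print(index.get(make_mask("Some Movie (2016)"), []))
```

A title whose mask is missing from the index has no trained pattern in this sketch, so a caller could skip the regular expressions for it entirely or fall back to a slower full scan.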

About

configurable torrent name parsers prototype
