A torrent name parser that you can customize and train.
Main ideas:
- use enumerations of file name features (e.g. known video codecs, release groups, video sources) in the parser's regular expressions (see settings/video_*.json)
- collect statistics in train mode and use them to speed up future parsing
- use torrent name masks to quickly estimate whether a name can be parsed
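As a rough illustration of the first idea, a list of known values can be turned into a single regular-expression alternation. The codec list and the helper name `enum_to_regex` are made up for this sketch; the real enumerations live in settings/video_*.json, whose layout is not shown here.

```python
import re

# Hypothetical enumeration; the real lists live in settings/video_*.json.
KNOWN_VIDEO_CODECS = ["x264", "x265", "h264", "xvid", "hevc"]


def enum_to_regex(name, values):
    """Build a named group that matches any one of the known values."""
    # Longest values first, so a longer value wins over its own prefix.
    ordered = sorted(values, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    return "(?P<{}>{})".format(name, alternatives)


codec_re = re.compile(enum_to_regex("video_codec", KNOWN_VIDEO_CODECS))
match = codec_re.search("aquarius.s02e11.480p.x264-msd")
codec = match.group("video_codec") if match else None
```

Keeping the values in a plain list means a new codec or release group can be supported by editing data rather than the expression itself.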
usage: main.py [-h] -i INPUT_FILE -o OUTPUT_FILE [-v] [-t] [-l LOG_DIR]
Run torrent-knowledge cli
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        CSV with torrents in format 'info_hash|torrent_title'
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
  -v, --verbose         increase log verbosity via -vvv
  -t, --train-mode      update features frequency from dataset
  -l LOG_DIR, --log-dir LOG_DIR
                        directory for log files (only for -vvv mode)
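The input format 'info_hash|torrent_title' can be read with the standard csv module. The sketch below is only an illustration of that format: the function name `read_torrents` and the two info hashes are invented, and re-joining extra '|' characters into the title is an assumption, not documented behavior of main.py.

```python
import csv
import io


def read_torrents(stream):
    """Yield (info_hash, torrent_title) pairs from 'info_hash|torrent_title' lines."""
    # QUOTE_NONE: assume titles are stored raw, so quotes are plain characters.
    reader = csv.reader(stream, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Assumption: a title may itself contain '|', so re-join the rest.
        yield row[0], "|".join(row[1:])


# Made-up info hashes, titles taken from the samples in this README.
sample = io.StringIO(
    "0123abcd|aquarius.s02e11.480p.x264-msd\n"
    "4567ef01|amanchu.s01e03.480p.x264-msd\n"
)
rows = list(read_torrents(sample))
```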
You should train the parser on your own dataset to improve feature extraction and name detection. The parser collects statistics for your torrent dataset and saves them into the trained config file.
user@mbpro$ wc -l settings/torrents_masks.json
121 settings/torrents_masks.json
user@mbpro$ wc -l datasets/torrents.csv
6808916 datasets/torrents.csv
user@mbpro$ ./main.py -vvv -t -i datasets/torrents.csv -o /tmp/torrents.json -l /tmp/log
log - root - DEBUG - Complete: 5294016, lines/sec: 11089, tv200: 02881, tv404: 04932, ep200: 38679, ep404: 05564
log - root - DEBUG - Complete in 477.60
user@mbpro$ wc -l /tmp/torrents.json
199731 /tmp/torrents.json
user@mbpro$ ./main.py -vvv -i datasets/torrents.csv -o /tmp/torrents.json -l /tmp/log
log - root - DEBUG - Complete: 5294016, lines/sec: 14514, tv200: 02620, tv404: 03385, ep200: 38142, ep404: 05494
log - root - DEBUG - Complete in 364.95
user@mbpro$ wc -l /tmp/torrents.json
192036 /tmp/torrents.json
user@mbpro$ wc -l settings/torrents_masks.json
14946 settings/torrents_masks.json
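Untrained mode cons:
- less accurate results (~4% fewer output lines in the runs above: 192036 vs. 199731)
Untrained mode pros:
- faster batch processing (~30% higher throughput in the runs above: 14514 vs. 11089 lines/sec)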
If you open the trained settings/torrents_masks.json, you will see something like this:
"_a99a99_999a_a999_aaa": {
    "freq": 22408,
    "pattern": "{series_name}{space}s{season_no}e{episode_no}{space}{video_resolution}{space}{video_codec}{space}{release_group}",
    "samples": [
        "aquarius.s02e11.480p.x264-msd",
        "amanchu.s01e03.480p.x264-msd",
        "airmageddon.s01e03.480p.x264-msd"
    ],
    "masks": {
        "aaaaaaaa_a99a99_999a_a999_aaa": 559,
        "aaaaaaa_a99a99_999a_a999_aaa": 373,
        "aaaaaa_a99a99_999a_a999_aaa": 278,
        "aaaaa_a99a99_999a_a999_aaa": 274,
        "aaaaaaaaa_a99a99_999a_a999_aaa": 264,
        "aaaa_a99a99_999a_a999_aaa": 236,
        ...
    }
}
- key must be unique - no extra requirements
- samples - this list is used to test the pattern before parsing and for documentation purposes
- freq - this field counts the torrents matched by this pattern in train mode
- masks - this dict is also generated by training. Its keys are the masks of the matched torrent names, and its values are the number of matched torrent names for each mask.
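One possible reading of how pattern and samples fit together: each {placeholder} is replaced by a regular-expression fragment, and the resulting expression is checked against every sample. The fragments in `FEATURES` and both function names are simplified guesses made up for this sketch, not the expressions the parser actually uses.

```python
import re

# Hypothetical fragments; the real ones are built from settings/video_*.json.
FEATURES = {
    "series_name": r"[a-z0-9.]+?",
    "space": r"[ ._-]",
    "season_no": r"\d{2}",
    "episode_no": r"\d{2}",
    "video_resolution": r"\d{3,4}p",
    "video_codec": r"x264|x265|xvid",
    "release_group": r"[a-z0-9]+",
}


def compile_pattern(pattern):
    """Replace every {name} placeholder with a regex group."""
    def to_group(m):
        name = m.group(1)
        if name == "space":
            # Separators repeat, and group names must be unique,
            # so they become non-capturing groups.
            return "(?:{})".format(FEATURES[name])
        return "(?P<{}>{})".format(name, FEATURES[name])
    return re.compile(re.sub(r"\{(\w+)\}", to_group, pattern))


def failing_samples(pattern, samples):
    """Return the samples that the compiled pattern does not fully match."""
    regex = compile_pattern(pattern)
    return [s for s in samples if regex.fullmatch(s) is None]


PATTERN = ("{series_name}{space}s{season_no}e{episode_no}{space}"
           "{video_resolution}{space}{video_codec}{space}{release_group}")
SAMPLES = [
    "aquarius.s02e11.480p.x264-msd",
    "amanchu.s01e03.480p.x264-msd",
    "airmageddon.s01e03.480p.x264-msd",
]
fields = compile_pattern(PATTERN).fullmatch(SAMPLES[0]).groupdict()
```

An empty result from `failing_samples(PATTERN, SAMPLES)` would indicate that an edited pattern still covers its documented examples.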
How is a torrent name transformed into a mask? It is not hard:
# See MaskParser::clean_title for implementation details
...
# Map every letter to "a", every digit to "9", every character in
# self._ch_punkt to "p" and every character in self._ch_spaces to "_".
self._trans_table = str.maketrans(
    "abcdefghijklmnopqrstuvwxyz" +
    "0123456789" +
    self._ch_punkt +
    self._ch_spaces,
    "aaaaaaaaaaaaaaaaaaaaaaaaaa" +
    "9999999999" +
    "p" * len(self._ch_punkt) +
    "_" * len(self._ch_spaces)
)
mask = self.clean_title(title).translate(self._trans_table)
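A self-contained sketch of the same translation, for experimenting outside the parser. The contents of `CH_PUNKT` and `CH_SPACES` and the simplified `clean_title` are assumptions made for this example; the real values are defined in MaskParser. The final lookup shows one way a mask could serve as a cheap estimate of whether a name is parseable, which is an interpretation and not confirmed behavior.

```python
import string

# Assumed character sets; the real ones are MaskParser._ch_punkt / _ch_spaces.
CH_PUNKT = "!?,'"
CH_SPACES = " ._-"

TRANS_TABLE = str.maketrans(
    string.ascii_lowercase + string.digits + CH_PUNKT + CH_SPACES,
    "a" * 26 + "9" * 10 + "p" * len(CH_PUNKT) + "_" * len(CH_SPACES),
)


def clean_title(title):
    """Stand-in for MaskParser.clean_title: only strips and lowercases."""
    return title.strip().lower()


def title_to_mask(title):
    """Letters become 'a', digits '9', punctuation 'p', separators '_'."""
    return clean_title(title).translate(TRANS_TABLE)


# Two entries copied from the "masks" dict shown earlier in this README.
KNOWN_MASKS = {
    "aaaaaaaa_a99a99_999a_a999_aaa": 559,
    "aaaaaaa_a99a99_999a_a999_aaa": 373,
}

mask = title_to_mask("Aquarius.S02E11.480p.x264-MSD")
is_known = mask in KNOWN_MASKS  # a dict lookup is much cheaper than a regex
```

Here `mask` equals "aaaaaaaa_a99a99_999a_a999_aaa", the first entry of the masks dict above.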