obj | website |
---|---|
application | https://bitmagnet.io |
bitmagnet
A self-hosted BitTorrent indexer, DHT crawler, content classifier and torrent search engine with web UI, GraphQL API and Servarr stack integration.
Docker Compose
services:
  bitmagnet:
    image: ghcr.io/bitmagnet-io/bitmagnet:latest
    container_name: bitmagnet
    ports:
      # API and WebUI port:
      - "3333:3333"
      # BitTorrent ports:
      - "3334:3334/tcp"
      - "3334:3334/udp"
    restart: unless-stopped
    environment:
      - POSTGRES_HOST=postgres
      - POSTGRES_PASSWORD=postgres
      # - TMDB_API_KEY=your_api_key
    command:
      - worker
      - run
      - --keys=http_server
      - --keys=queue_server
      # disable the next line to run without DHT crawler
      - --keys=dht_crawler
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16-alpine
    container_name: bitmagnet-postgres
    volumes:
      - ./data/postgres:/var/lib/postgresql/data
    # ports:
    #   - "5432:5432" Expose this port if you'd like to dig around in the database
    restart: unless-stopped
    environment:
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=bitmagnet
      - PGUSER=postgres
    shm_size: 1g
    healthcheck:
      test:
        - CMD-SHELL
        - pg_isready
      start_period: 20s
      interval: 10s
After running docker compose up -d, you should be able to access the web interface at http://localhost:3333. The DHT crawler should have started and you should see items appear in the web UI within around a minute.
To run the bitmagnet CLI, use docker compose run bitmagnet bitmagnet command...
Configuration
- postgres.host, postgres.name, postgres.user, postgres.password (default: localhost, bitmagnet, postgres, empty): Set these values to configure the connection to your Postgres database.
- tmdb.api_key: TMDB API key.
- tmdb.enabled (default: true): Specify false to disable the TMDB API integration.
- dht_crawler.save_files_threshold (default: 100): Some torrents contain many thousands of files, which impacts performance and uses a lot of database disk space. This parameter sets a maximum limit for the number of files saved by the crawler with each torrent.
- dht_crawler.save_pieces (default: false): If true, the DHT crawler will save the pieces bytes from the torrent metadata. The pieces take up quite a lot of space, and aren’t currently very useful, but they may be used by future features.
- log.level (default: info): Logging level.
- log.json (default: false): By default logs are output in a pretty format with colors; enable this flag if you’d prefer plain JSON.
To see a full list of available configuration options using the CLI, run:
bitmagnet config show
Specifying configuration values
Configuration paths are delimited by dots. If you’re specifying configuration in a YAML file then each dot represents a nesting level, for example to configure log.json, tmdb.api_key and http_server.cors.allowed_origins:
log:
  json: true
tmdb:
  api_key: my-api-key
http_server:
  cors:
    allowed_origins:
      - https://example1.com
      - https://example2.com
This is not a suggested configuration file, it’s just an example of how to specify configuration values.
To configure these same values with environment variables, upper-case the path and replace all dots with underscores, for example:
LOG_JSON=true \
TMDB_API_KEY=my-api-key \
HTTP_SERVER_CORS_ALLOWED_ORIGINS=https://example1.com,https://example2.com \
bitmagnet config show
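The naming rule above is mechanical enough to sketch in a few lines of Python. This is only an illustration of the stated rule (upper-case the path, replace dots with underscores); the helper name env_var_name is hypothetical and is not part of bitmagnet.

```python
def env_var_name(config_path: str) -> str:
    """Upper-case a dotted configuration path and replace dots with underscores."""
    return config_path.upper().replace(".", "_")


# The three paths used in the example above:
for path in ("log.json", "tmdb.api_key", "http_server.cors.allowed_origins"):
    print(path, "->", env_var_name(path))
```

Underscores that are already part of a path segment (as in api_key) are left untouched, so only the dots change.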
Configuration precedence
In order of precedence, configuration values will be read from:
- Environment variables
- config.yml in the current working directory
- config.yml in the XDG-compliant config location for the current user (for example, on macOS this is ~/Library/Application Support/bitmagnet/config.yml)
- Default values
Environment variables can be used to configure simple scalar types (strings, numbers, booleans) and slice types (arrays). For more complex configuration types such as maps you’ll have to use YAML configuration. bitmagnet will exit with an error if it’s unable to parse a provided configuration value.
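The precedence order can be pictured as a layered lookup where the first source that defines a value wins. The sketch below uses Python's collections.ChainMap with flattened dotted keys; the function name, the flattened-key representation and the sample values are assumptions made for illustration, not how bitmagnet loads its configuration internally.

```python
from collections import ChainMap


def resolve(env, cwd_file, user_file, defaults):
    """Return a read view in which earlier mappings shadow later ones."""
    return ChainMap(env, cwd_file, user_file, defaults)


settings = resolve(
    env={"log.level": "debug"},                         # environment variables
    cwd_file={"log.level": "warn", "log.json": True},   # ./config.yml
    user_file={"tmdb.enabled": False},                  # user-level config.yml
    defaults={"log.level": "info", "log.json": False, "tmdb.enabled": True},
)

print(settings["log.level"], settings["log.json"], settings["tmdb.enabled"])
```

Here log.level comes from the environment, log.json from the working-directory file and tmdb.enabled from the user-level file, with the defaults only used for anything not set elsewhere.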
VPN configuration
It’s recommended that you run bitmagnet behind a VPN. If you’re using Docker then gluetun is a good solution for this, although the networking settings can be tricky.
Classifier
The classifier can be configured and customized to do things like:
- automatically delete torrents you don’t want in your index
- add custom tags to torrents you’re interested in
- customize the keywords and file extensions used for determining a torrent’s content type
- specify completely custom logic to classify and perform other actions on torrents
Background
After a torrent is crawled or imported, some further processing must be done to gather metadata, have a guess at the torrent’s contents and finally index it in the database, allowing it to be searched and displayed in the UI/API.
bitmagnet’s classifier is powered by a Domain Specific Language. The aim of this is to provide a high level of customisability, along with transparency into the classification process which will hopefully aid collaboration on improvements to the core classifier logic.
The classifier is declared in YAML format. The application includes a core classifier that can be configured, extended or completely replaced with a custom classifier. This page documents the required format.
Source precedence
bitmagnet will attempt to load classifier source code from all the following locations. Any discovered classifier source will be merged with other sources in the following order of precedence:
- the core classifier
- classifier.yml in the XDG-compliant config location for the current user (for example, on macOS this is ~/Library/Application Support/bitmagnet/classifier.yml)
- classifier.yml in the current working directory
- Classifier configuration
Note that multiple sources will be merged, not replaced. For example, keywords added to the classifier configuration will be merged with the core keywords.
The merged classifier source can be viewed with the CLI command bitmagnet classifier show.
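To make "merged, not replaced" concrete for keyword lists, here is a minimal Python sketch. It assumes that lists under the same name are concatenated in source order and that duplicates are dropped; the de-duplication and the function name are assumptions for illustration, and the real merge rules for other parts of the classifier are not covered.

```python
def merge_keyword_lists(*sources):
    """Merge named keyword lists, lowest-precedence source first."""
    merged = {}
    for source in sources:
        for name, words in source.items():
            bucket = merged.setdefault(name, [])
            for word in words:
                if word not in bucket:  # assumed: duplicates are dropped
                    bucket.append(word)
    return merged


core = {"music": ["music", "discography", "album"]}
custom = {"music": ["my-custom-music-keyword"], "banned": ["my-hated-keyword"]}

merged = merge_keyword_lists(core, custom)
print(merged["music"])
```

The custom keyword ends up alongside the core music keywords rather than replacing them, which is the behaviour the note above describes.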
Schema
A JSON schema for the classifier is available; some editors and IDEs will be able to validate the structure of your classifier document by specifying the $schema attribute:
$schema: bitmagnet.io/schemas/classifier-0.1.json
The classifier schema can also be viewed by running the CLI command bitmagnet classifier schema.
The classifier declaration comprises the following components:
Workflows
A workflow is a list of actions that will be executed on all torrents when they are classified. When no custom configuration is provided, the default workflow will be run. To use a different workflow instead, specify the classifier.workflow configuration option with the name of your custom workflow.
Actions
An action is a piece of workflow to be executed. All actions either return an updated classification result or an error.
For example, the following action will set the content type of the current torrent to audiobook:
set_content_type: audiobook
The following action will return an unmatched error:
unmatched
And the following action will delete the current torrent being classified (returning a delete error):
delete
These actions aren’t much use on their own - we’d want to check some conditions are satisfied before setting a content type or deleting a torrent, and for this we’d use the if_else action. For example, the following action will set the content type to audiobook if the torrent name contains audiobook-related keywords, and will otherwise return an unmatched error:
if_else:
  condition: "torrent.baseName.matches(keywords.audiobook)"
  if_action:
    set_content_type: audiobook
  else_action: unmatched
The following action will delete a torrent if its name matches the list of banned keywords:
if_else:
  condition: "torrent.baseName.matches(keywords.banned)"
  if_action: delete
Actions may return the following types of error:
- An unmatched error indicates that the current action did not match for the current torrent
- A delete error indicates that the torrent should be deleted
- An unhandled error may occur, for example if the TMDB API was unreachable
Whenever an error is returned, the current classification will be terminated.
Note that a workflow should never return an unmatched error. We expect to iterate through a series of checks corresponding to each content type. If the current torrent does not match the content type being checked, we’ll proceed to the next check until we find a match; if no match can be found, the content type will be unknown. To facilitate this, we can use the find_match action.
The find_match action is a bit like a try/catch block in some programming languages; it will try to match a particular content type, and if an unmatched error is returned, it will catch the unmatched error and proceed to the next check. For example, the following action will attempt to classify a torrent as an audiobook, and then as an ebook. If both checks fail, the content type will be unknown:
find_match:
  # match audiobooks:
  - if_else:
      condition: "torrent.baseName.matches(keywords.audiobook)"
      if_action:
        set_content_type: audiobook
      else_action: unmatched
  # match ebooks:
  - if_else:
      condition: "torrent.files.map(f, f.extension in extensions.ebook ? f.size : - f.size).sum() > 0"
      if_action:
        set_content_type: ebook
      else_action: unmatched
For a full list of available actions, please refer to the JSON schema.
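The try/catch analogy for find_match can be sketched in Python. This is a simplified model, not bitmagnet's implementation: the Unmatched exception, the two check functions and their naive string tests are hypothetical stand-ins for the if_else actions shown above.

```python
class Unmatched(Exception):
    """Raised by a check that does not match the current torrent."""


def audiobook(name):
    # Hypothetical stand-in for the audiobook keyword condition.
    if "audiobook" in name.lower():
        return "audiobook"
    raise Unmatched


def ebook(name):
    # Hypothetical stand-in for the ebook extension condition.
    if name.lower().endswith(".epub"):
        return "ebook"
    raise Unmatched


def find_match(checks, name):
    """Try each check in order; an Unmatched error moves on to the next one."""
    for check in checks:
        try:
            return check(name)
        except Unmatched:
            continue
    return "unknown"  # no check matched


print(find_match([audiobook, ebook], "Some.Audiobook.2020"))
print(find_match([audiobook, ebook], "holiday photos"))
```

Only the unmatched error is caught. Any other exception would still propagate and end the classification, mirroring how delete and unhandled errors terminate it.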
Conditions
Conditions are used in conjunction with the if_else action, in order to execute an action if a particular condition is satisfied.
The conditions in the examples above use CEL (Common Expression Language) expressions.
The CEL environment
CEL is already a well-documented language, so this page won’t go into detail about the CEL syntax. In the context of the bitmagnet classifier, the CEL environment exposes a number of variables:
- torrent: The current torrent being classified (protobuf type: bitmagnet.Torrent)
- result: The current classification result (protobuf type: bitmagnet.Classification)
- keywords: A map of strings to regular expressions, representing named lists of keywords
- extensions: A map of strings to string lists, representing named lists of extensions
- contentType: A map of strings to enum values representing content types (e.g. contentType.movie, contentType.music)
- fileType: A map of strings to enum values representing file types (e.g. fileType.video, fileType.audio)
- flags: A map of strings to the configured values of flags
- kb, mb, gb: Variables defined for convenience, equal to the number of bytes in a kilobyte, megabyte and gigabyte respectively
For more details on the protocol buffer types, please refer to the protobuf schema.
Boolean logic (or, and & not)
In addition to CEL expressions, conditions may be declared using the boolean logic operators or, and and not. For example, the following condition evaluates to true if either the torrent consists mostly of file extensions very commonly used for music (e.g. flac), or if the torrent both has a name that includes music-related keywords and consists mostly of audio files:
or:
  - "torrent.files.map(f, f.extension in extensions.music ? f.size : - f.size).sum() > 0"
  - and:
      - "torrent.baseName.matches(keywords.music)"
      - "torrent.files.map(f, f.fileType == fileType.audio ? f.size : - f.size).sum() > 0"
Note that we could also have specified the above condition using just one CEL expression, but breaking up complex conditions like this is more readable.
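One way to picture how a nested or/and/not condition is evaluated is as a small recursive tree walk. In the Python sketch below, leaf conditions are plain callables standing in for CEL expressions, and not is assumed to wrap a single condition; both choices, and the dictionary-based torrent, are assumptions for illustration only.

```python
def evaluate(condition, torrent):
    """Recursively evaluate a nested or/and/not condition tree."""
    if callable(condition):  # leaf: stand-in for a CEL expression
        return bool(condition(torrent))
    if "or" in condition:
        return any(evaluate(c, torrent) for c in condition["or"])
    if "and" in condition:
        return all(evaluate(c, torrent) for c in condition["and"])
    if "not" in condition:
        return not evaluate(condition["not"], torrent)
    raise ValueError(f"unsupported condition: {condition!r}")


# Same shape as the music condition above, with simplified leaves.
music_condition = {
    "or": [
        lambda t: t["mostly_music_extensions"],
        {"and": [
            lambda t: t["name_has_music_keyword"],
            lambda t: t["mostly_audio_files"],
        ]},
    ]
}

torrent = {
    "mostly_music_extensions": False,
    "name_has_music_keyword": True,
    "mostly_audio_files": True,
}
print(evaluate(music_condition, torrent))
```

The first branch fails for this sample torrent, so the result depends on both leaves of the and branch being true.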
Keywords
The classifier includes lists of keywords associated with different types of torrents. These aim to provide a simpler alternative to regular expressions, and the classifier will compile all keyword lists to regular expressions that can be used within CEL expressions. In order for a keyword to match, it must appear as an isolated token in the test string - that is, it must be either at the beginning or preceded by a non-word character, and either at the end or followed by a non-word character.
Reserved characters in the syntax are:
- parentheses ( and ) enclose a group
- | is an OR operator
- * is a wildcard operator
- ? makes the previous character or group optional
- + specifies one or more of the previous character
- # specifies any number
- a space specifies any non-word or non-number character
For example, to define some music- and audiobook-related keywords:
keywords:
  music: # define music-related keywords
    - music # all letters are case-insensitive, and must be defined in lowercase unless escaped
    - discography
    - album
    - \V.?\A # escaped letters are case-sensitive; matches "VA", "V.A" and "V.A.", but not "va"
    - various artists # matches "various artists" and "Various.Artists"
  audiobook: # define audiobook-related keywords
    - (audio)?books?
    - (un)?abridged
    - narrated
    - novels?
    - (auto)?biograph(y|ies) # matches "biography", "autobiographies" etc.
If you’d rather use plain old regular expressions, the CEL syntax supports that too, for example torrent.baseName.matches("^myregex$").
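The "isolated token" rule can be approximated with an ordinary regular expression. The Python sketch below handles only a subset of the keyword syntax: groups, |, ? and + are passed through because they mean the same thing in a regex, and a space is treated as a run of non-alphanumeric characters. The * and # operators, literal dots and escaped case-sensitive letters are deliberately not handled. The function name and the exact patterns are assumptions, not bitmagnet's actual compiler.

```python
import re


def compile_keyword(keyword: str) -> re.Pattern:
    """Compile a simple keyword into a case-insensitive, token-isolated regex."""
    # Assumed: a space in the keyword matches any run of separator characters.
    body = keyword.replace(" ", r"[\W_]+")
    # The keyword must sit at the start or after a non-word character,
    # and at the end or before a non-word character.
    return re.compile(rf"(?:^|\W)(?:{body})(?:$|\W)", re.IGNORECASE)


audiobook = compile_keyword("(audio)?books?")
print(bool(audiobook.search("My.Audiobook.2020")))
print(bool(audiobook.search("facebook")))
```

The second name contains "book" only as part of a longer word, so it is not an isolated token and does not match.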
Extensions
The classifier includes lists of file extensions associated with different types of content. For example, to identify torrents of type comic by their file extensions, the extensions are first declared:
extensions:
  comic:
    - cb7
    - cba
    - cbr
    - cbt
    - cbz
The extensions can now be used as part of a condition within an if_else action:
if_else:
  condition: "torrent.files.map(f, f.extension in extensions.comic ? f.size : - f.size).sum() > 0"
  if_action:
    set_content_type: comic
  else_action: unmatched
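The condition above adds the size of every file with a matching extension and subtracts the size of every other file, so it is true only when matching files make up more than half of the torrent by size. A Python equivalent of that arithmetic, using a hypothetical File type and function name, might look like this:

```python
from dataclasses import dataclass


@dataclass
class File:
    extension: str
    size: int  # bytes


def mostly(files, extensions):
    """True if files with one of the given extensions outweigh all others by size."""
    return sum(f.size if f.extension in extensions else -f.size for f in files) > 0


comic = {"cb7", "cba", "cbr", "cbt", "cbz"}
files = [File("cbz", 700), File("jpg", 200), File("txt", 10)]
print(mostly(files, comic))  # 700 - 200 - 10 = 490, which is greater than 0
```

Weighting by size rather than by file count means a torrent with one large comic archive and many small extras still counts as a comic.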
Flags
Flags can be used to configure workflows. In order to use a flag in a workflow, it must first be defined. For example, the core classifier defines the following flags that are used in the default workflow:
flag_definitions:
  tmdb_enabled: bool
  delete_content_types: content_type_list
  delete_xxx: bool
These flags can be referenced within CEL expressions, for example to delete adult content if the delete_xxx flag is set to true:
if_else:
  condition: "flags.delete_xxx && result.contentType == contentType.xxx"
  if_action: delete
Configuration
The classifier can be customized by providing a classifier.yml file in a supported location as described above. If you only want to make some minor modifications, it may be convenient to specify these using the main application configuration instead, by providing values either in config.yml or as environment variables. The application configuration exposes some but not all properties of the classifier.
For example, in your config.yml you could specify:
classifier:
  # specify a custom workflow to be used:
  workflow: custom
  # add to the core list of music keywords:
  keywords:
    music:
      - my-custom-music-keyword
  # add a file extension to the list of audiobook-related extensions:
  extensions:
    audiobook:
      - abc
  # auto-delete all comics
  flags:
    delete_content_types:
      - comic
Or as environment variables you could specify:
# disable the TMDB API integration, use the custom workflow and auto-delete all adult content:
TMDB_ENABLED=false \
CLASSIFIER_WORKFLOW=custom \
CLASSIFIER_DELETE_XXX=true \
bitmagnet worker run --all
Validation
The classifier source is compiled on initial load, and all structural and syntax errors should be caught at compile time. If there are errors in your classifier source, bitmagnet should exit with an error message indicating the location of the problem.
Testing on individual torrents
You can test the classifier on an individual torrent or torrents using the bitmagnet process CLI command:
bitmagnet process --infoHash=aaaaaaaaaaaaaaaaaaaa --infoHash=bbbbbbbbbbbbbbbbbbbb
Reclassify all torrents
The classifier is being updated regularly, and to reclassify already-crawled torrents you’ll need to run the CLI and queue them for reprocessing.
For context: after torrents are crawled or imported, they won’t show up in the UI straight away. They must first be “processed” by the job queue. This involves a few steps:
- The classifier attempts to classify the torrent (determine its content type, and match it to a known piece of content)
- The search index for the torrent is built
- The torrent content record is saved to the database
The reprocess command will re-queue torrents to allow the latest updates to be applied to their content records.
To reprocess all torrents in your index, simply run bitmagnet reprocess. If you’ve indexed a lot of torrents, this will take a while, so there are a few options available to control exactly what gets reprocessed:
- apisDisabled: Disable API calls during classification. This makes the classifier run a lot faster, but disables identification with external services such as TMDB (metadata already gathered from external APIs is not lost).
- contentType: Only reprocess torrents of a certain content type. For example, bitmagnet reprocess --contentType movie will only reprocess movies. Multiple content types can be comma separated, and null refers to torrents of unknown content type.
- orphans: Only reprocess torrents that have no content record.
- classifyMode: This controls how already matched torrents are handled.
  - default: Only attempt to match previously unmatched torrents
  - rematch: Ignore any pre-existing match and always classify from scratch (a torrent is “matched” if it’s associated with a specific piece of content from one of the API integrations, currently only TMDB)
Practical use cases and examples
Auto-delete specific content types
The default workflow provides a flag that allows for automatically deleting specific content types. For example, to delete all comic, software and xxx torrents:
flags:
  delete_content_types:
    - comic
    - software
    - xxx
Auto-deleting adult content has been one of the most requested features. For convenience, this is exposed as the configuration option classifier.delete_xxx, and can be specified with the environment variable CLASSIFIER_DELETE_XXX=true.
Auto-delete torrents containing specific keywords
Any torrents containing keywords in the banned list will be automatically deleted. This is primarily used for deleting CSAM content, but the list can be extended to auto-delete any other keywords:
keywords:
  banned:
    - my-hated-keyword
Disable the TMDB API integration
The tmdb_enabled flag can be used to disable the TMDB API integration:
flags:
  tmdb_enabled: false
For convenience, this is also exposed as the configuration option tmdb.enabled, and can be specified with the environment variable TMDB_ENABLED=false.
The apis_enabled flag has the same effect, disabling TMDB and any future API integrations:
flags:
  apis_enabled: false
API integrations can also be disabled for individual classifier runs, without disabling them globally, by passing the --apisDisabled flag to the reprocess command.
Extend the default workflow with custom logic
Custom workflows can be added in the workflows section of the classifier document. It is possible to extend the default workflow by using the run_workflow action within your custom workflow, for example:
workflows:
  custom:
    - <my custom action to be executed before the default workflow>
    - run_workflow: default
    - <my custom action to be executed after the default workflow>
A concrete example of this is adding tags to torrents based on custom criteria.
Use tags to create custom torrent categories
Is there a category of torrent you’re interested in that isn’t captured by one of the core content types? Torrent tags are intended to capture custom categories and content types.
Let’s imagine you’d like to surface torrents containing interesting documents. The interesting documents have specific file extensions, and their filenames contain specific keywords. Let’s create a custom action to tag torrents containing interesting documents:
# define file extensions for the documents we're interested in:
extensions:
  interesting_documents:
    - doc
    - docx
    - pdf
# define keywords that must be present in the filenames of the interesting documents:
keywords:
  interesting_documents:
    - interesting
    - fascinating
# extend the default workflow with a custom workflow to tag torrents containing interesting documents:
workflows:
  custom:
    # first run the default workflow:
    - run_workflow: default
    # then add the tag to any torrents containing interesting documents:
    - if_else:
        condition: "torrent.files.filter(f, f.extension in extensions.interesting_documents && f.basePath.matches(keywords.interesting_documents)).size() > 0"
        if_action:
          add_tag: interesting-documents
To specify that the custom workflow should be used, remember to specify the classifier.workflow configuration option, e.g. CLASSIFIER_WORKFLOW=custom bitmagnet worker run --all.
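For readers less familiar with CEL, the tagging condition in the custom workflow reads roughly as "at least one file has an interesting extension and an interesting keyword in its path". The Python sketch below mirrors that logic; the File type, the base_path field standing in for basePath, and the hand-written keyword regex are all assumptions for illustration, not bitmagnet's own types.

```python
import re
from dataclasses import dataclass


@dataclass
class File:
    base_path: str  # stand-in for f.basePath
    extension: str


EXTENSIONS = {"doc", "docx", "pdf"}
# Assumed regex form of the two keywords, matched as isolated tokens.
KEYWORDS = re.compile(r"(?:^|\W)(?:interesting|fascinating)(?:$|\W)", re.IGNORECASE)


def tags_for(files):
    """Return the tag set a torrent would receive from the custom rule."""
    hits = [
        f for f in files
        if f.extension in EXTENSIONS and KEYWORDS.search(f.base_path)
    ]
    return {"interesting-documents"} if hits else set()


print(tags_for([File("notes/A.Fascinating.Report", "pdf")]))
print(tags_for([File("interesting", "mkv")]))
```

Both criteria must hold for the same file: a video named "interesting" or a PDF without either keyword leaves the torrent untagged.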