update
Some checks failed
ci/woodpecker/push/build Pipeline failed

JMARyA 2025-01-02 19:00:47 +01:00
parent 0f6e5f5b10
commit 8df8edeeca
Signed by: jmarya
GPG key ID: 901B2ADDF27C2263
15 changed files with 591 additions and 124 deletions

@@ -1,14 +1,47 @@
# WebArc
`webarc` is a local website archive based on [monolith](https://github.com/Y2Z/monolith).
## Configuration
You can configure the application using environment variables:
- `$ROUTE_INTERNAL` : Rewrite links to point back to the archive itself
- `$DOWNLOAD_ON_DEMAND` : Download missing routes with monolith on demand
- `$BLACKLIST_DOMAINS` : Blacklisted domains (comma-separated regex, example: `google.com,.*.youtube.com`)
## Archive Format
A web archive is defined as a directory containing domains in this structure:
```
web_archive/
├─ domain.com/
│ ├─ sub/
│ │ ├─ path/
│ │ │ ├─ index_YYYY_MM_DD.html
├─ sub.domain.com/
```
Every document of this web archive can then be found at `archive/domain/paths.../index_YYYY_MM_DD.html`.
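
As a rough illustration of the layout above, the following sketch builds the on-disk location of one archived document. The function name `archive_doc_path`, its argument order, and the `YYYY-MM-DD` input date are assumptions made for this example and are not part of webarc. It also assumes a non-empty path.

```sh
# Hypothetical helper, not part of webarc: print the expected on-disk
# location of an archived document, following the layout shown above.
# Usage: archive_doc_path ROOT DOMAIN PATH DATE   (DATE as YYYY-MM-DD)
archive_doc_path() (
    root=$1
    domain=$2
    path=$3
    date=$4
    path=${path#/}    # drop one leading slash, if present
    path=${path%/}    # drop one trailing slash, if present
    # Turn 2025-01-02 into 2025_01_02 to match index_YYYY_MM_DD.html
    version=$(printf '%s' "$date" | sed 's/-/_/g')
    printf '%s/%s/%s/index_%s.html\n' "$root" "$domain" "$path" "$version"
)
```

For instance, `archive_doc_path web_archive domain.com /sub/path 2025-01-02` prints `web_archive/domain.com/sub/path/index_2025_01_02.html`.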
## Usage
`webarc` provides a CLI tool to work with the archive structure.
```sh
# List domains in archive
webarc [--dir ARCHIVE] archive list [-j, --json]
# List all paths on a domain
webarc [--dir ARCHIVE] archive list [-j, --json] [DOMAIN]
# List all versions of a document
webarc [--dir ARCHIVE] archive versions [-j, --json] [DOMAIN] [PATH]
# Get a document
# `--md` will return a markdown version
webarc [--dir ARCHIVE] archive get [--md] [DOMAIN] [PATH] [VERSION]
# Archive a website
webarc [--dir ARCHIVE] archive download [URL]
```
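
As a rough sketch of how these subcommands might be combined, the wrapper below archives a URL and then lists the stored paths for its domain as JSON. The function name `archive_and_list`, the `WEBARC` and `ARCHIVE` variables, and the default `./web_archive` directory are assumptions for this example. Only the `--dir`, `archive download` and `archive list --json` forms are taken from the synopsis above.

```sh
# Assumed defaults; override them in the environment if needed.
WEBARC=${WEBARC:-webarc}            # command to run (set to "echo" for a dry run)
ARCHIVE=${ARCHIVE:-./web_archive}   # hypothetical archive directory

# Hypothetical wrapper: archive URL, then list the paths stored for DOMAIN.
# Usage: archive_and_list URL DOMAIN
archive_and_list() (
    url=$1
    domain=$2
    # Only list the domain if the download step succeeded.
    "$WEBARC" --dir "$ARCHIVE" archive download "$url" &&
        "$WEBARC" --dir "$ARCHIVE" archive list --json "$domain"
)
```

A call could look like `archive_and_list https://en.wikipedia.org/wiki/Website en.wikipedia.org`. Setting `WEBARC=echo` beforehand prints the two commands instead of running them.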
## Configuration
You can configure the application using a config file. Look at the [config.toml](config.toml) file for more information.
## Web Server
You can start a web server that serves an archive with `webarc serve`.
Archived pages can be viewed at `/s/<domain>/<path..>`.
For example, `/s/en.wikipedia.org/wiki/Website` serves the archived copy of `/wiki/Website` from `en.wikipedia.org`.
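
The mapping from an original URL to its route on the web server can be sketched as follows. The helper name `archive_route` is hypothetical and not part of webarc. The sketch assumes a URL with a scheme, and it leaves query strings and fragments untouched, since this README does not say how the server treats them.

```sh
# Hypothetical helper, not part of webarc: turn an original URL into
# the /s/<domain>/<path..> route described above.
# Usage: archive_route URL
archive_route() (
    url=$1
    rest=${url#*://}       # drop the scheme, e.g. "https://"
    domain=${rest%%/*}     # everything before the first slash
    case $rest in
        */*) path=${rest#*/} ;;   # everything after the first slash
        *)   path= ;;             # URL has no path component
    esac
    printf '/s/%s/%s\n' "$domain" "$path"
)
```

With the example above, `archive_route https://en.wikipedia.org/wiki/Website` prints `/s/en.wikipedia.org/wiki/Website`.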