📜 Website Archive
WebArc
webarc is a local website archive based on monolith.
Archive Format
A web archive is defined as a directory containing domains in this structure:
web_archive/
├─ domain.com/
│ ├─ sub/
│ │ ├─ path/
│ │ │ ├─ index_YYYY_MM_DD.html
├─ sub.domain.com/
Every document of this web archive can then be found at archive/domain/paths.../index_YYYY_MM_DD.html.
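The layout above can be sketched as a pair of small helpers, which may be handy when scripting against an archive directory. This is a minimal illustration, not part of webarc: the function names `document_path` and `version_from_filename` are hypothetical, and it assumes that the month and day in the file name are zero-padded.

```python
from datetime import date, datetime
from pathlib import Path


def document_path(archive_root: str, domain: str, url_path: str, version: date) -> Path:
    """Build the on-disk location of one archived document (hypothetical helper)."""
    # Split the URL path into directory segments, dropping empty parts
    # such as the one produced by a leading slash.
    segments = [part for part in url_path.split("/") if part]
    # Assumption: dates in file names are zero-padded, e.g. index_2024_01_31.html.
    filename = f"index_{version:%Y_%m_%d}.html"
    return Path(archive_root, domain, *segments, filename)


def version_from_filename(filename: str) -> date:
    """Recover the version date from a name like index_2024_01_31.html."""
    return datetime.strptime(filename, "index_%Y_%m_%d.html").date()


print(document_path("web_archive", "domain.com", "/sub/path", date(2024, 1, 31)).as_posix())
```

`version_from_filename` raises `ValueError` for names that do not match the pattern, which makes it easy to skip unrelated files when walking a directory.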
Usage
webarc provides a CLI tool to work with the archive structure.
# List domains in archive
webarc [--dir ARCHIVE] archive list [-j, --json]
# List all paths on a domain
webarc [--dir ARCHIVE] archive list [-j, --json] [DOMAIN]
# List all versions of a document
webarc [--dir ARCHIVE] archive versions [-j, --json] [DOMAIN] [PATH]
# Get a document
# `--md` will return a markdown version
webarc [--dir ARCHIVE] archive get [--md] [DOMAIN] [PATH] [VERSION]
# Archive a website
webarc [--dir ARCHIVE] archive download [URL]
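When driving the CLI from a script, the argument order shown above can be assembled programmatically. The following is only a sketch: `webarc_command` is a hypothetical helper, and it assumes the `webarc` binary is available on the PATH. It builds the argument list without running anything.

```python
from typing import List, Optional


def webarc_command(subcommand: str, *args: str,
                   archive_dir: Optional[str] = None,
                   json_output: bool = False) -> List[str]:
    """Assemble an argv list for `webarc archive <subcommand>` (hypothetical helper)."""
    argv = ["webarc"]
    if archive_dir is not None:
        # --dir is a global option and precedes the `archive` subcommand.
        argv += ["--dir", archive_dir]
    argv += ["archive", subcommand]
    if json_output:
        # Per the synopsis above, only `list` and `versions` accept --json.
        argv.append("--json")
    argv += list(args)
    return argv


print(webarc_command("versions", "domain.com", "sub/path",
                     archive_dir="web_archive", json_output=True))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True, check=True)`. The shape of the JSON output is not documented here, so the sketch does not attempt to parse it.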
Configuration
You can configure the application using a config file. Look at the config.toml file for more information.
Web Server
You can start a web server serving an archive with webarc serve.
Archived pages can be viewed at /s/<domain>/<path..>.
For example, /s/en.wikipedia.org/wiki/Website will serve en.wikipedia.org at /wiki/Website.
To select an archive from a certain time, add the ?time=YYYY-MM-DD query parameter to the URL.
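The URL scheme can be illustrated with a small helper that builds a link to an archived page. This is a sketch under stated assumptions: `archive_url` is a hypothetical name, and the base address `http://localhost:8080` is a placeholder, since the actual host and port depend on your configuration.

```python
from datetime import date
from typing import Optional
from urllib.parse import quote, urlencode


def archive_url(base: str, domain: str, path: str, time: Optional[date] = None) -> str:
    """Build a /s/<domain>/<path..> URL, optionally pinned to a date (hypothetical helper)."""
    # quote() keeps "/" intact by default, so path segments stay separated.
    url = f"{base.rstrip('/')}/s/{domain}/{quote(path.lstrip('/'))}"
    if time is not None:
        # isoformat() yields YYYY-MM-DD, the format of the ?time= parameter.
        url += "?" + urlencode({"time": time.isoformat()})
    return url


# The base address is an assumption; substitute the one your server listens on.
print(archive_url("http://localhost:8080", "en.wikipedia.org", "/wiki/Website", date(2024, 1, 31)))
```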