WebArc

webarc is a local website archive based on monolith.

Archive Format

A web archive is defined as a directory containing domains in this structure:

web_archive/
├─ domain.com/
│  ├─ sub/
│  │  ├─ path/
│  │  │  ├─ index_YYYY_MM_DD
├─ sub.domain.com/

Every document in the archive can then be found at web_archive/<domain>/<path...>/index_YYYY_MM_DD.
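The layout above can be sketched in Rust as a small path builder. This is only an illustration of the directory convention, not code from webarc itself: the function name `document_path` and its parameters are hypothetical, and it assumes the date components are zero-padded as the `YYYY_MM_DD` pattern suggests.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of webarc): builds the on-disk location of
/// one archived document from the archive root, domain, URL path and date.
fn document_path(archive: &Path, domain: &str, url_path: &str, date: (u16, u8, u8)) -> PathBuf {
    let mut out = archive.join(domain);
    // Each non-empty URL path segment becomes one directory level.
    // A real implementation should also reject segments such as "..".
    for segment in url_path.split('/').filter(|s| !s.is_empty()) {
        out.push(segment);
    }
    let (year, month, day) = date;
    // Assumption: year, month and day are zero-padded in the file name.
    out.push(format!("index_{year:04}_{month:02}_{day:02}"));
    out
}

fn main() {
    let path = document_path(Path::new("web_archive"), "domain.com", "/sub/path", (2025, 2, 24));
    println!("{}", path.display());
}
```

Splitting on `/` and pushing segments one by one keeps the result portable across platforms with different path separators.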

Usage

webarc provides a CLI tool to work with the archive structure.

# List domains in archive
webarc [--dir ARCHIVE] archive list [-j, --json]

# List all paths on a domain
webarc [--dir ARCHIVE] archive list [-j, --json] [DOMAIN]

# List all versions of a document
webarc [--dir ARCHIVE] archive versions [-j, --json] [DOMAIN] [PATH]

# Get a document
# `--md` will return a markdown version
webarc [--dir ARCHIVE] archive get [--md] [DOMAIN] [PATH] [VERSION]

# Archive a website
webarc [--dir ARCHIVE] archive download [URL]
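If you want to call the CLI from another Rust program, a minimal sketch could look like the following. It assumes a `webarc` binary is available on `PATH`; the helper name `list_command` is hypothetical, and because the shape of the `--json` output is not documented here, the example prints it unparsed.

```rust
use std::process::Command;

/// Hypothetical helper: builds `webarc --dir <archive> archive list --json [domain]`
/// using only the flags shown in the usage listing above.
fn list_command(archive: &str, domain: Option<&str>) -> Command {
    let mut cmd = Command::new("webarc");
    cmd.args(["--dir", archive, "archive", "list", "--json"]);
    // Without a domain the command lists domains; with one it lists paths.
    if let Some(domain) = domain {
        cmd.arg(domain);
    }
    cmd
}

fn main() {
    let mut cmd = list_command("web_archive", Some("en.wikipedia.org"));
    // Assumption: `webarc` is installed; otherwise the error branch is taken.
    match cmd.output() {
        Ok(out) => print!("{}", String::from_utf8_lossy(&out.stdout)),
        Err(err) => eprintln!("could not run webarc: {err}"),
    }
}
```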

Configuration

You can configure the application using a config file. Look at the config.toml file for more information.

Web Server

You can start a web server that serves an archive with webarc serve.

Archived pages can be viewed at /s/<domain>/<path..>.
For example, /s/en.wikipedia.org/wiki/Website serves the archived copy of /wiki/Website from en.wikipedia.org.

To select the archived version from a certain date, add the ?time=YYYY-MM-DD query parameter to the URL.
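The URL scheme can be sketched as a small Rust helper that builds a viewer link. The function name `viewer_url` is hypothetical and not part of webarc; the sketch assumes the domain and path need no percent-encoding and only follows the `/s/<domain>/<path..>` and `?time=YYYY-MM-DD` patterns described above.

```rust
/// Hypothetical helper: builds the viewer route for an archived page,
/// optionally pinned to a date given as `YYYY-MM-DD`.
fn viewer_url(domain: &str, path: &str, time: Option<&str>) -> String {
    // Avoid a double slash when the path already starts with '/'.
    let mut url = format!("/s/{}/{}", domain, path.trim_start_matches('/'));
    if let Some(day) = time {
        url.push_str("?time=");
        url.push_str(day);
    }
    url
}

fn main() {
    println!("{}", viewer_url("en.wikipedia.org", "/wiki/Website", None));
    println!("{}", viewer_url("en.wikipedia.org", "/wiki/Website", Some("2025-02-24")));
}
```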