📜 Website Archive
JMARyA dc10052c16
2025-01-03 00:20:22 +01:00

WebArc

webarc is a local website archive based on monolith.

Archive Format

A web archive is defined as a directory containing domains in this structure:

web_archive/
├─ domain.com/
│  ├─ sub/
│  │  ├─ path/
│  │  │  ├─ index_YYYY_MM_DD.html
├─ sub.domain.com/

Every document in the archive can then be found at web_archive/<domain>/<path...>/index_YYYY_MM_DD.html, where YYYY_MM_DD is the date the page was archived.
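
As a rough illustration of the layout above, the following sketch builds the on-disk location of a document and reads the date back out of a file name. The function names `document_path` and `parse_version` are hypothetical and not part of webarc's API; the sketch only assumes the directory structure described in this section.

```rust
use std::path::{Path, PathBuf};

/// Build the on-disk location of one archived document.
/// `date` is (year, month, day). Hypothetical helper, not webarc's own code.
fn document_path(root: &Path, domain: &str, url_path: &str, date: (u16, u8, u8)) -> PathBuf {
    let mut p = root.join(domain);
    // Each non-empty URL path segment becomes one directory level.
    // "." and ".." are skipped so the result stays inside the archive root.
    for segment in url_path
        .split('/')
        .filter(|s| !s.is_empty() && *s != "." && *s != "..")
    {
        p.push(segment);
    }
    let (y, m, d) = date;
    p.push(format!("index_{:04}_{:02}_{:02}.html", y, m, d));
    p
}

/// Recover the date from a file name such as `index_2025_01_02.html`.
/// Returns `None` if the name does not follow the `index_YYYY_MM_DD.html` pattern.
fn parse_version(file_name: &str) -> Option<(u16, u8, u8)> {
    let stem = file_name.strip_prefix("index_")?.strip_suffix(".html")?;
    let mut parts = stem.split('_');
    let y: u16 = parts.next()?.parse().ok()?;
    let m: u8 = parts.next()?.parse().ok()?;
    let d: u8 = parts.next()?.parse().ok()?;
    // Reject names with extra underscore-separated fields.
    if parts.next().is_some() {
        return None;
    }
    Some((y, m, d))
}
```

Calling `document_path(Path::new("web_archive"), "domain.com", "/sub/path/", (2025, 1, 2))` yields a path that ends in `domain.com/sub/path/index_2025_01_02.html`, matching the tree shown above.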

Usage

webarc provides a CLI tool to work with the archive structure.

# List domains in archive
webarc [--dir ARCHIVE] archive list [-j, --json]

# List all paths on a domain
webarc [--dir ARCHIVE] archive list [-j, --json] [DOMAIN]

# List all versions of a document
webarc [--dir ARCHIVE] archive versions [-j, --json] [DOMAIN] [PATH]

# Get a document
# `--md` will return a markdown version
webarc [--dir ARCHIVE] archive get [--md] [DOMAIN] [PATH] [VERSION]

# Archive a website
webarc [--dir ARCHIVE] archive download [URL]

Configuration

You can configure the application using a config file. See the config.toml file in the repository for the available options.

Web Server

You can start a web server that serves an archive with webarc serve.

Archived pages can be viewed at /s/<domain>/<path..>.
For example, /s/en.wikipedia.org/wiki/Website serves the archived copy of the page /wiki/Website on en.wikipedia.org.

To select an archived version from a certain date, add the ?time=YYYY-MM-DD query parameter to the URL.
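
The URL scheme above can be sketched as two small steps: split the request target into domain, path, and optional `time` value, then pick a version. Both functions below are hypothetical and are not taken from webarc's source. In particular, the selection policy (newest version dated on or before `time`, or the newest overall when `time` is absent) is an assumption, since the README does not say how a date is matched. Versions are assumed to be given as `YYYY-MM-DD` strings, which sort chronologically when compared as text.

```rust
/// Split a request target such as `/s/en.wikipedia.org/wiki/Website?time=2024-12-31`
/// into (domain, path, optional time). Hypothetical helper, not webarc's router.
fn parse_route(target: &str) -> Option<(String, String, Option<String>)> {
    // Separate the query string, if any, from the path.
    let (path_part, query) = match target.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (target, None),
    };
    let rest = path_part.strip_prefix("/s/")?;
    // The first segment is the domain; everything after it is the document path.
    let (domain, path) = match rest.split_once('/') {
        Some((d, p)) => (d, format!("/{}", p)),
        None => (rest, String::from("/")),
    };
    if domain.is_empty() {
        return None;
    }
    // Look for a `time=` pair among the `&`-separated query parameters.
    let time = query.and_then(|q| {
        q.split('&')
            .find_map(|kv| kv.strip_prefix("time="))
            .map(str::to_string)
    });
    Some((domain.to_string(), path, time))
}

/// Assumed policy: the newest version dated on or before `time`;
/// without `time`, the newest version overall.
fn select_version<'a>(versions: &[&'a str], time: Option<&str>) -> Option<&'a str> {
    versions
        .iter()
        .copied()
        .filter(|v| time.map_or(true, |t| *v <= t))
        .max()
}
```

With versions `2024-12-29` and `2025-01-02` in the archive, a request carrying `?time=2024-12-31` would select `2024-12-29` under this assumed policy, and a request without `time` would select `2025-01-02`.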