webarc/README.md
JMARyA 2e5b4fc3d2
2025-02-24 19:30:20 +01:00

WebArc

webarc is a local website archive based on monolith.

Archive Format

A web archive is defined as a directory containing domains in this structure:

web_archive/
├─ domain.com/
│  ├─ sub/
│  │  ├─ path/
│  │  │  ├─ index_YYYY_MM_DD
├─ sub.domain.com/

Every document in the archive can then be found at web_archive/<domain>/<path..>/index_YYYY_MM_DD, where the suffix is the date of the archived version.
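
To make the layout concrete, here is a minimal Python sketch that maps a domain, path and date onto the directory structure above and lists the versions stored for one document. It is an illustration only and not part of webarc; the function names `document_path` and `list_versions` are made up for this example, and it assumes file names are exactly `index_YYYY_MM_DD` with no extension.

```python
from datetime import date
from pathlib import Path


def document_path(archive_root: str, domain: str, url_path: str, day: date) -> Path:
    """Build the location of one archived document under the layout above."""
    # Drop empty segments so "/sub/path/" and "sub/path" resolve identically.
    segments = [s for s in url_path.split("/") if s]
    return Path(archive_root, domain, *segments, f"index_{day:%Y_%m_%d}")


def list_versions(archive_root: str, domain: str, url_path: str) -> list[date]:
    """Return the dates of all versions found for a document, oldest first."""
    segments = [s for s in url_path.split("/") if s]
    folder = Path(archive_root, domain, *segments)
    versions = []
    for entry in folder.glob("index_*"):
        # Assumption: the name is exactly index_YYYY_MM_DD, with no extension.
        try:
            year, month, day = entry.name.removeprefix("index_").split("_")
            versions.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip names that do not match the pattern
    return sorted(versions)
```

For instance, `document_path("web_archive", "domain.com", "/sub/path", date(2025, 2, 24))` points at `web_archive/domain.com/sub/path/index_2025_02_24`.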

Usage

webarc provides a CLI tool to work with the archive structure.

# List domains in archive
webarc [--dir ARCHIVE] archive list [-j, --json]

# List all paths on a domain
webarc [--dir ARCHIVE] archive list [-j, --json] [DOMAIN]

# List all versions of a document
webarc [--dir ARCHIVE] archive versions [-j, --json] [DOMAIN] [PATH]

# Get a document
# `--md` will return a markdown version
webarc [--dir ARCHIVE] archive get [--md] [DOMAIN] [PATH] [VERSION]

# Archive a website
webarc [--dir ARCHIVE] archive download [URL]
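
If you want to drive the CLI from a script, the commands above can be wrapped in a few lines of Python. The sketch below is hypothetical: `versions_command` and `fetch_versions` are names invented for this example, and the shape of the `--json` output is not documented here, so the code only assumes that it is a single JSON document on stdout.

```python
import json
import subprocess
from typing import Optional


def versions_command(domain: str, path: str, archive_dir: Optional[str] = None) -> list[str]:
    """Assemble the argument list for `webarc archive versions --json`."""
    argv = ["webarc"]
    if archive_dir is not None:
        # --dir is a global option, so it goes before the subcommand.
        argv += ["--dir", archive_dir]
    argv += ["archive", "versions", "--json", domain, path]
    return argv


def fetch_versions(domain: str, path: str, archive_dir: Optional[str] = None):
    """Run the command and decode its output (requires webarc on PATH)."""
    result = subprocess.run(
        versions_command(domain, path, archive_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: --json prints one JSON document on stdout.
    return json.loads(result.stdout)
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command before running it.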

Configuration

The application is configured through a config file. See the config.toml file for more information.

Web Server

Run webarc serve to start a web server that serves an archive.

Archived pages can be viewed at /s/<domain>/<path..>.
For example, /s/en.wikipedia.org/wiki/Website serves the archived copy of the page /wiki/Website on en.wikipedia.org.

To select the version archived at a certain time, add the ?time=YYYY-MM-DD query parameter to the URL.
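
As a rough illustration of the URL scheme, the following Python sketch builds a viewer link from a domain, a path and an optional date. The helper name `viewer_url` and the base address `http://localhost:8000` are assumptions made for this example; use whatever host and port your server actually listens on.

```python
from datetime import date
from typing import Optional
from urllib.parse import quote, urlencode


def viewer_url(base: str, domain: str, path: str, day: Optional[date] = None) -> str:
    """Build a /s/<domain>/<path..> URL, optionally pinned to a date."""
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    url = f"{base.rstrip('/')}/s/{domain}/{quote(path.lstrip('/'))}"
    if day is not None:
        # isoformat() yields YYYY-MM-DD, the format of the time parameter.
        url += "?" + urlencode({"time": day.isoformat()})
    return url


# Hypothetical base address; adjust to your own server.
print(viewer_url("http://localhost:8000", "en.wikipedia.org", "/wiki/Website", date(2025, 2, 24)))
```

Note that the query parameter uses dashes (YYYY-MM-DD) while the file names in the archive use underscores (YYYY_MM_DD).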