📜 Website Archive
JMARyA 07b1bb30a8
ci: advanced rust tests
2025-11-25 23:38:52 +01:00

WebArc

webarc is a local website archive tool.

Archive Format

The archive is a single directory containing a SQLite metadata database and a blob store that saves HTTP traffic, similar to WARC files.

Configuration

You can configure the application using a config file. Look at the config.yml file for more information.

Usage

Web Server

Start a local web server that serves a WebArc archive:

webarc serve

Archived pages are accessible under:

/s/<domain>/<path...>

For example:

/s/en.wikipedia.org/wiki/Website

returns /wiki/Website from the archived en.wikipedia.org.
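
As a rough illustration of the URL scheme above, the following Python sketch maps an original URL to its archive location. The helper name `archive_url` is hypothetical, and the base address assumes the server is reachable at localhost:3000. How WebArc treats ports and query strings is not described here, so the sketch simply drops them.

```python
from urllib.parse import quote, urlsplit

# Assumption: `webarc serve` is reachable at this address; adjust if yours differs.
ARCHIVE_BASE = "http://localhost:3000"


def archive_url(original_url: str, base: str = ARCHIVE_BASE) -> str:
    """Map an original URL to its /s/<domain>/<path...> archive location."""
    parts = urlsplit(original_url)
    if not parts.hostname:
        raise ValueError(f"URL has no host: {original_url!r}")
    # Keep slashes and existing percent-escapes; escape anything else.
    # Port and query string are dropped (their handling is an unknown here).
    path = quote(parts.path, safe="/%")
    return f"{base.rstrip('/')}/s/{parts.hostname}{path}"


print(archive_url("https://en.wikipedia.org/wiki/Website"))
# http://localhost:3000/s/en.wikipedia.org/wiki/Website
```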

Selecting Snapshots by Date

Use the time query parameter:

/s/en.wikipedia.org/wiki/Website?time=2021-05-01
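
A small Python sketch of adding the time parameter programmatically. The function name `with_snapshot_time` is hypothetical, and it assumes the server accepts ISO dates in the same YYYY-MM-DD form as the example above. Other formats are not covered.

```python
from datetime import date
from urllib.parse import urlencode, urlsplit, urlunsplit


def with_snapshot_time(archive_url: str, when: date) -> str:
    """Return archive_url with its query replaced by time=YYYY-MM-DD."""
    parts = urlsplit(archive_url)
    # Assumption: ISO dates, as in the 2021-05-01 example, are what the server expects.
    query = urlencode({"time": when.isoformat()})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = with_snapshot_time(
    "http://localhost:3000/s/en.wikipedia.org/wiki/Website", date(2021, 5, 1)
)
print(url)
# http://localhost:3000/s/en.wikipedia.org/wiki/Website?time=2021-05-01
```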

HTTP Proxy

WebArc can act as a transparent HTTP proxy, allowing tools and browsers to fetch data directly from the archive without changing URLs.

Start the proxy:

webarc serve

It listens on localhost:3000 by default.

Set your browser's HTTP proxy to:

http://localhost:3000

and websites will then load from the archive automatically.
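
Scripts can be pointed at the proxy in much the same way as a browser. The sketch below uses only Python's standard library. The names `archive_opener` and `fetch_archived` are hypothetical, and the proxy address is the default mentioned above.

```python
import urllib.request

PROXY = "http://localhost:3000"  # default address; change if you configured another


def archive_opener(proxy: str = PROXY) -> urllib.request.OpenerDirector:
    """Build an opener that sends http:// requests through the archive proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


def fetch_archived(url: str, timeout: float = 20.0) -> bytes:
    """Fetch url through the proxy; needs a running `webarc serve`."""
    with archive_opener().open(url, timeout=timeout) as resp:
        return resp.read()


# Example, not run here because it needs the server:
# body = fetch_archived("http://example.com/")
```

The original URL stays unchanged in the calling code. Only the opener knows about the proxy.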

On CLI

Many command-line tools respect the standard HTTP_PROXY / HTTPS_PROXY environment variables, allowing them to transparently read from your WebArc archive.

export HTTP_PROXY=http://localhost:3000
export HTTPS_PROXY=http://localhost:3000

Any tool that honors these variables will now retrieve data from the archive instead of the live internet.
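
If you would rather not export the variables for the whole shell session, a wrapper script can set them for a single command. This is a minimal Python sketch. `proxy_env` and `run_via_archive` are hypothetical helper names, and both upper- and lowercase spellings are set because tools differ in which one they read.

```python
import os
import subprocess

PROXY = "http://localhost:3000"  # default address of the WebArc proxy


def proxy_env(proxy: str = PROXY) -> dict[str, str]:
    """Copy the current environment and point the proxy variables at WebArc."""
    env = dict(os.environ)
    # Set both spellings: some tools read only the lowercase names.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        env[name] = proxy
    return env


def run_via_archive(argv: list[str]) -> subprocess.CompletedProcess:
    """Run one command with the proxy variables set, leaving the shell untouched."""
    return subprocess.run(argv, env=proxy_env(), check=False)


# Example (requires the tool to be installed and `webarc serve` running):
# run_via_archive(["cargo", "fetch"])
```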

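curl

http_proxy=http://localhost:3000 curl http://example.com/path

This returns the archived version of example.com/path. Note that curl reads only the lowercase http_proxy variable for plain HTTP URLs and ignores the uppercase HTTP_PROXY.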

wget

wget -e use_proxy=yes \
     -e http_proxy=http://localhost:3000 \
     http://example.com/

Cargo

You can archive crates from crates.io using cargo fetch:

export HTTPS_PROXY=http://localhost:3000

cargo fetch

Lychee

If you want to automatically archive every external link referenced in your documents, you can use lychee together with this archive proxy.

Lychee respects the standard proxy environment variables.
Point them at your archive proxy, replacing the placeholder address below with your own:

export http_proxy="http://your-proxy:8080"
export https_proxy="http://your-proxy:8080"

All requests made by Lychee will now pass through the archive proxy, causing every visited URL to be archived automatically.

Lychee is quite aggressive by default. The following settings slow it down and make it behave more like a real browser, reducing load on the proxy and avoiding false negatives:

lychee --method get --max-concurrency 1 --timeout 20 --max-retries 5 --retry-wait-time 5 ./

This will:

  • use GET instead of HEAD
  • run only 1 request at a time
  • wait up to 20 seconds per URL
  • retry failed URLs up to 5 times
  • wait at least 5 seconds between retries of a failed URL

Every link Lychee checks will be stored by the archive proxy.

After running the command, all links found in your Markdown files (or any scanned documents) will have been archived automatically, ensuring long-term preservation and offline access.

FUSE Filesystem

You can mount the HTTPStore archive via FUSE with webarc mount <mountpoint> and browse it via standard file tools.
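
Once mounted, the archive can also be read from scripts like any other directory. The layout of the mounted tree is not described here, so this Python sketch makes no assumption about it and simply walks whatever is present. `list_mounted_files` is a hypothetical helper name.

```python
import os


def list_mounted_files(mountpoint: str, limit: int = 20) -> list[str]:
    """Return up to `limit` file paths under mountpoint, relative to it."""
    found: list[str] = []
    for root, dirs, files in os.walk(mountpoint):
        dirs.sort()  # in-place sort makes the walk order deterministic
        for name in sorted(files):
            found.append(os.path.relpath(os.path.join(root, name), mountpoint))
            if len(found) >= limit:
                return found
    return found


# Example, assuming the archive was mounted with `webarc mount /mnt/webarc`:
# for path in list_mounted_files("/mnt/webarc"):
#     print(path)
```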