📜 Website Archive

JMARyA e3c341bbca	Merge pull request 'chore(deps): update rust crate uuid to v1.19.0' (#18) from renovate/uuid-1.x-lockfile into main	2025-12-02 20:39:46 +00:00
.woodpecker	ci: advanced rust tests	2025-11-25 23:38:52 +01:00
src	fix: urlescape filenames at import	2025-11-30 16:59:00 +01:00
.dockerignore	refactor: feature cleanup + sqlite	2025-11-15 20:48:51 +01:00
.gitignore	wip: working async fs + fixes	2025-11-20 21:39:03 +01:00
build.sh	wip: working async fs + fixes	2025-11-20 21:39:03 +01:00
Cargo.lock	Merge pull request 'chore(deps): update rust crate uuid to v1.19.0' (#18 ) from renovate/uuid-1.x-lockfile into main	2025-12-02 20:39:46 +00:00
Cargo.toml	chore(version): v0.2.1	2025-11-30 17:05:45 +01:00
cog.toml	ci: releases	2025-11-25 21:21:14 +01:00
config.yml	feat: implement outdated refetch	2025-11-17 08:01:54 +01:00
docker-compose.yml	refactor: refactor + fixes	2025-11-17 18:21:09 +01:00
flake.lock	nix	2025-09-15 18:37:23 +02:00
flake.nix	refactor: index	2025-11-23 19:22:35 +01:00
README.md	docs: add lychee use case	2025-11-23 04:50:20 +01:00
renovate.json	Add renovate.json	2025-06-21 21:49:31 +00:00

README.md

WebArc

WebArc is a local website archiving tool.

Archive Format

An archive is a single directory containing a SQLite metadata database and a blob store that saves HTTP traffic, similar to WARC files.

Configuration

You can configure the application with a config file. See the config.yml file in the repository for the available options.

Usage

Web Server

Start a local web server that serves a WebArc archive:

webarc serve

Archived pages are accessible under:

/s/<domain>/<path...>

For example:

/s/en.wikipedia.org/wiki/Website

returns /wiki/Website from the archived en.wikipedia.org.
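The mapping from an original URL to its archive path can be sketched as a small shell function. This is only an illustration of the /s/<domain>/<path...> layout described above; the function name `webarc_path` is hypothetical, and how WebArc treats ports or query strings is not covered here.

```shell
# Hypothetical helper (not part of WebArc): map an original URL to the
# path under which the WebArc web server is described as serving it.
webarc_path() {
    # Drop the scheme ("https://", "http://"), if present.
    rest=${1#*://}
    # Prefix with /s/ as in the layout /s/<domain>/<path...>.
    printf '/s/%s\n' "$rest"
}

webarc_path "https://en.wikipedia.org/wiki/Website"
# prints: /s/en.wikipedia.org/wiki/Website
```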

Selecting Snapshots by Date

Use the time query parameter:

/s/en.wikipedia.org/wiki/Website?time=2021-05-01
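As a rough sketch, the time parameter can be appended to an archive path with a helper like the one below. The name `with_time` is hypothetical, and appending `&time=` when the path already carries a query string is an assumption about how WebArc parses such URLs, not documented behaviour.

```shell
# Hypothetical helper (not part of WebArc): append the time query
# parameter to an archive path.
with_time() {
    case $1 in
        # Assumption: an existing query string is extended with '&'.
        *\?*) printf '%s&time=%s\n' "$1" "$2" ;;
        # No query string yet: start one with '?'.
        *)    printf '%s?time=%s\n' "$1" "$2" ;;
    esac
}

with_time "/s/en.wikipedia.org/wiki/Website" 2021-05-01
# prints: /s/en.wikipedia.org/wiki/Website?time=2021-05-01
```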

HTTP Proxy

WebArc can act as a transparent HTTP proxy, allowing tools and browsers to fetch data directly from the archive without changing URLs.

Start the proxy:

webarc serve

It listens on localhost:3000 by default.

Set your browser’s HTTP proxy to:

http://localhost:3000

and websites will then load from the archive automatically.

On CLI

Many command-line tools respect the standard HTTP_PROXY / HTTPS_PROXY environment variables, allowing them to transparently read from your WebArc archive.

export HTTP_PROXY=http://localhost:3000
export HTTPS_PROXY=http://localhost:3000

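Tools that honor these variables will now retrieve data from the archive instead of the live internet. Some tools, such as curl, only read the lowercase http_proxy variable for plain HTTP requests, so set the lowercase variants as well if a tool appears to bypass the proxy.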
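If you would rather not export the variables for the whole shell session, a wrapper can set them for a single command. The sketch below is an assumption-laden convenience, not something WebArc ships: `with_webarc` and the `WEBARC_PROXY` variable are hypothetical names, and the default address is the localhost:3000 mentioned above. It sets both lowercase and uppercase variants because tools differ in which spelling they read.

```shell
# Hypothetical wrapper (not part of WebArc): run one command with the
# proxy variables pointing at the archive proxy.
with_webarc() {
    # Refuse to run without a command; env would otherwise dump the environment.
    [ "$#" -gt 0 ] || return 2
    # WEBARC_PROXY is a made-up override; the default matches the README.
    proxy=${WEBARC_PROXY:-http://localhost:3000}
    env http_proxy="$proxy" https_proxy="$proxy" \
        HTTP_PROXY="$proxy" HTTPS_PROXY="$proxy" \
        "$@"
}

# Show what the wrapped command sees.
with_webarc sh -c 'echo "http_proxy=$http_proxy"'
```

The variables only exist for the wrapped command, so the rest of the session keeps talking to the live internet.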

`curl`

http_proxy=http://localhost:3000 curl http://example.com/path

This returns the archived version of example.com/path.

`wget`

wget -e use_proxy=yes \
     -e http_proxy=http://localhost:3000 \
     http://example.com/

Cargo

You can archive crates from crates.io using cargo fetch:

export HTTPS_PROXY=http://localhost:3000

cargo fetch

Lychee

If you want to automatically archive every external link referenced in your documents, you can use lychee together with this archive proxy.

Lychee respects standard proxy environment variables.
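Set the WebArc proxy as the HTTP(S) proxy, replacing the placeholder address below with your own (WebArc listens on localhost:3000 by default):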

export http_proxy="http://your-proxy:8080"
export https_proxy="http://your-proxy:8080"

All requests made by Lychee will now pass through the archive proxy, causing every visited URL to be archived automatically.

Lychee is quite aggressive by default. The following settings slow it down and make it behave more like a real browser, reducing load on the proxy and avoiding false negatives:

lychee --method get --max-concurrency 1 --timeout 20 --max-retries 5 --retry-wait-time 5 ./

This will:

use GET instead of HEAD
run only 1 request at a time
wait up to 20 seconds per URL
retry failed URLs up to 5 times
wait at least 5 seconds between retries

Every link Lychee checks will be stored by the archive proxy.

After running the command, all links found in your Markdown files (or any scanned documents) will have been archived automatically, ensuring long-term preservation and offline access.
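The proxy setup and the throttled Lychee invocation can be combined into one function, roughly as follows. This is a sketch under stated assumptions: `archive_links`, `WEBARC_PROXY` and `LYCHEE` are hypothetical names introduced here, the default proxy address is WebArc's localhost:3000, and the Lychee flags are exactly those from the command above.

```shell
# Hypothetical helper (not part of WebArc or Lychee): check all links in a
# directory through the archive proxy so that every visited URL is archived.
archive_links() {
    # WEBARC_PROXY is a made-up override for the proxy address.
    proxy=${WEBARC_PROXY:-http://localhost:3000}
    # LYCHEE lets you point at a different lychee binary (also made up).
    env http_proxy="$proxy" https_proxy="$proxy" \
        "${LYCHEE:-lychee}" --method get --max-concurrency 1 \
        --timeout 20 --max-retries 5 --retry-wait-time 5 \
        "${1:-./}"
}

# Usage (requires lychee and a running WebArc proxy):
#   archive_links ./docs
```

Scoping the proxy variables to the Lychee process avoids routing unrelated tools in the same shell through the archive.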

FUSE Filesystem

You can mount the HTTPStore archive via FUSE with webarc mount <mountpoint> and browse it via standard file tools.
