WebArc
webarc is a local website archive tool.
Archive Format
The archive format is a single directory containing a SQLite metadata database and a blob store that records HTTP traffic, similar to WARC files.
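As a rough sketch, such a directory might look like this (the file names are illustrative only, not part of any specification):

archive/
├── metadata.db   # SQLite metadata database
└── blobs/        # blob store with recorded HTTP request/response data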
Configuration
You can configure the application with a config file; see the bundled config.yml for the available options.
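As a very rough illustration of the shape of such a file (the keys below are hypothetical; the authoritative options are documented in config.yml itself):

# hypothetical keys, for illustration only; see config.yml for the real schema
archive_dir: ./archive
listen: localhost:3000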
Usage
Web Server
Start a local web server that serves a WebArc archive:
webarc serve
Archived pages are accessible under:
/s/<domain>/<path...>
For example:
/s/en.wikipedia.org/wiki/Website
returns /wiki/Website from the archived en.wikipedia.org.
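For example, assuming the server is running on its default localhost:3000 (see the HTTP Proxy section below), you can fetch an archived page with curl:

curl http://localhost:3000/s/en.wikipedia.org/wiki/Website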
Selecting Snapshots by Date
Use the time query parameter:
/s/en.wikipedia.org/wiki/Website?time=2021-05-01
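The same works from the command line; quote the URL so the shell does not mangle the query string (again assuming the default localhost:3000):

curl "http://localhost:3000/s/en.wikipedia.org/wiki/Website?time=2021-05-01"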
HTTP Proxy
WebArc can act as a transparent HTTP proxy, allowing tools and browsers to fetch data directly from the archive without changing URLs.
Start the proxy:
webarc serve
It listens on localhost:3000 by default.
Set your browser’s HTTP proxy to:
http://localhost:3000
and websites will then load from the archive automatically.
On the CLI
Many command-line tools respect the standard HTTP_PROXY / HTTPS_PROXY environment variables, allowing them to transparently read from your WebArc archive.
export HTTP_PROXY=http://localhost:3000
export HTTPS_PROXY=http://localhost:3000
Any tool using HTTP will now retrieve data from the archive instead of the live internet.
curl
HTTP_PROXY=http://localhost:3000 curl http://example.com/path
This returns the archived version of example.com/path.
wget
wget -e use_proxy=yes \
-e http_proxy=http://localhost:3000 \
http://example.com/
Cargo
You can archive crates from crates.io using cargo fetch:
export HTTPS_PROXY=http://localhost:3000
cargo fetch
Lychee
If you want to automatically archive every external link referenced in your documents, you can use lychee together with this archive proxy.
Lychee respects standard proxy environment variables.
Point the standard proxy variables at the archive proxy:
export http_proxy=http://localhost:3000
export https_proxy=http://localhost:3000
All requests made by Lychee will now pass through the archive proxy, causing every visited URL to be archived automatically.
Lychee is quite aggressive by default. The following settings slow it down and make it behave more like a real browser, reducing load on the proxy and avoiding false negatives:
lychee --method get --max-concurrency 1 --timeout 20 --max-retries 5 --retry-wait-time 5 ./
This will:
- use GET instead of HEAD
- run only 1 request at a time
- wait up to 20 seconds per URL
- retry failed URLs up to 5 times
- wait 5 seconds between retries
Every link Lychee checks will be stored by the archive proxy.
After running the command, all links found in your Markdown files (or any scanned documents) will have been archived automatically, ensuring long-term preservation and offline access.
FUSE Filesystem
You can mount the HTTPStore archive via FUSE with webarc mount <mountpoint> and browse it using standard file tools.
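For example (the mountpoint path is arbitrary):

mkdir -p ~/webarc-mnt
webarc mount ~/webarc-mnt
ls ~/webarc-mnt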