---
obj: application
website: https://lychee.cli.rs
repo: https://github.com/lycheeverse/lychee
rev: 2024-10-22
---
# lychee
A fast, async link checker.
It finds broken URLs and mail addresses in Markdown, HTML, reStructuredText, websites, and more.
## Usage
Usage: `lychee [OPTIONS] <inputs>...`
The inputs tell lychee where to collect the links to check. They can be files (e.g. `README.md`), file globs (e.g. `"~/git/*/README.md"`), remote URLs (e.g. `https://example.com/README.md`), or standard input (`-`). Note: use `--` to separate the inputs from any preceding option that accepts multiple arguments.
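As a small sketch of how the input kinds and the `--` separator fit together, the wrapper below passes one exclude pattern, then `--`, then whatever inputs it receives. The function name `check_docs` and the exclude pattern are illustrative assumptions, not part of lychee.

```sh
# Hypothetical wrapper; the function name and exclude pattern are illustrative.
check_docs() {
  # Options come first. `--` ends option parsing, so every remaining
  # argument is treated as an input: a file, a glob, a URL, or `-` for stdin.
  lychee --exclude '^https://www\.linkedin\.com' -- "$@"
}

# Example call (globs are quoted so the pattern reaches lychee unexpanded):
#   check_docs README.md 'docs/**/*.md' https://example.com/README.md
```

Keeping the `--` inside the wrapper means callers never have to think about which options accept multiple values.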
### Options
| Option | Description |
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `-c, --config <CONFIG_FILE>` | Configuration file to use [default: `lychee.toml`] |
| `-v, --verbose...` | Set verbosity level; more output per occurrence (e.g. `-v` or `-vv`) |
| `-q, --quiet...` | Less output per occurrence (e.g. `-q` or `-qq`) |
| `-n, --no-progress` | Do not show progress bar. This is recommended for non-interactive shells (e.g. for continuous integration) |
| `--cache` | Use request cache stored on disk at `.lycheecache` |
| `--max-cache-age <MAX_CACHE_AGE>` | Discard all cached requests older than this duration [default: 1d] |
| `--cache-exclude-status <CACHE_EXCLUDE_STATUS>` | A list of status codes that will be ignored from the cache |
| `--dump` | Don't perform any link checking. Instead, dump all the links extracted from inputs that would be checked |
| `--dump-inputs` | Don't perform any link extraction and checking. Instead, dump all input sources from which links would be collected |
| `--archive <ARCHIVE>` | Specify the use of a specific web archive. Can be used in combination with `--suggest` [possible values: wayback] |
| `--suggest` | Suggest link replacements for broken links, using a web archive. The web archive can be specified with `--archive` |
| `-m, --max-redirects <MAX_REDIRECTS>` | Maximum number of allowed redirects [default: 5] |
| `--max-retries <MAX_RETRIES>` | Maximum number of retries per request [default: 3] |
| `--max-concurrency <MAX_CONCURRENCY>` | Maximum number of concurrent network requests [default: 128] |
| `-T, --threads <THREADS>` | Number of threads to utilize. Defaults to number of cores available to the system |
| `-u, --user-agent <USER_AGENT>` | User agent [default: `lychee/0.16.1`] |
| `-i, --insecure` | Proceed for server connections considered insecure (invalid TLS) |
| `-s, --scheme <SCHEME>` | Only test links with the given schemes (e.g. https). If omitted, links with any supported scheme are checked. Currently supported: http, https, file, and mailto |
| `--offline` | Only check local files and block network requests |
| `--include <INCLUDE>` | URLs to check (supports regex). Has preference over all excludes |
| `--exclude <EXCLUDE>` | Exclude URLs and mail addresses from checking (supports regex) |
| `--exclude-file <EXCLUDE_FILE>` | Deprecated; use `--exclude-path` instead |
| `--exclude-path <EXCLUDE_PATH>` | Exclude file path from getting checked |
| `-E, --exclude-all-private` | Exclude all private IPs from checking. Equivalent to `--exclude-private --exclude-link-local --exclude-loopback` |
| `--exclude-private` | Exclude private IP address ranges from checking |
| `--exclude-link-local` | Exclude link-local IP address range from checking |
| `--exclude-loopback` | Exclude loopback IP address range and localhost from checking |
| `--exclude-mail` | Exclude all mail addresses from checking (deprecated; excluded by default) |
| `--include-mail` | Also check email addresses |
| `--remap <REMAP>` | Remap URI matching pattern to different URI |
| `--header <HEADER>` | Custom request header |
| `-a, --accept <ACCEPT>` | A list of accepted status codes for valid links |
| `--include-fragments` | Enable the checking of fragments in links |
| `-t, --timeout <TIMEOUT>` | Website timeout in seconds from connect to response finished [default: 20] |
| `-r, --retry-wait-time <RETRY_WAIT_TIME>` | Minimum wait time in seconds between retries of failed requests [default: 1] |
| `-X, --method <METHOD>` | Request method [default: get] |
| `-b, --base <BASE>` | Base URL or website root directory to check relative URLs e.g. <https://example.com> or `/path/to/public` |
| `--basic-auth <BASIC_AUTH>` | Basic authentication support. E.g. `http://example.com username:password` |
| `--github-token <GITHUB_TOKEN>` | GitHub API token to use when checking github.com links, to avoid rate limiting [env: `$GITHUB_TOKEN`] |
| `--skip-missing` | Skip missing input files (default is to error if they don't exist) |
| `--no-ignore` | Do not skip files that would otherwise be ignored by '.gitignore', '.ignore', or the global ignore file |
| `--hidden` | Do not skip hidden directories and files |
| `--include-verbatim` | Find links in verbatim sections like `pre`- and `code` blocks |
| `--glob-ignore-case` | Ignore case when expanding filesystem path glob inputs |
| `-o, --output <OUTPUT>` | Output file of status report |
| `--mode <MODE>` | Set the output display mode. Determines how results are presented in the terminal [default: color] [possible values: plain, color, emoji] |
| `-f, --format <FORMAT>` | Output format of final status report [default: compact] [possible values: compact, detailed, json, markdown, raw] |
| `--require-https` | When HTTPS is available, treat HTTP links as errors |
| `--cookie-jar <COOKIE_JAR>` | Tell lychee to read cookies from the given file. Cookies will be stored in the cookie jar and sent with requests. New cookies will be stored in the cookie jar and existing cookies will be updated |
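
Several of these options are commonly combined for non-interactive runs. The sketch below is one possible combination, not a recommended default: it disables the progress bar, enables the on-disk cache, skips private addresses, lowers concurrency, and writes a Markdown report. The function name `ci_link_check`, the report file name, and the chosen values are assumptions made for illustration.

```sh
# Hypothetical helper for CI-style runs; name, values, and report path
# are illustrative. Every flag used here appears in the table above.
ci_link_check() {
  lychee \
    --no-progress \
    --cache --max-cache-age 2d \
    --exclude-all-private \
    --max-concurrency 32 \
    --format markdown --output lychee-report.md \
    -- "$@"
}

# Example call:
#   ci_link_check README.md 'docs/**/*.md'
```

To see which links such a run would cover without sending any requests, the same inputs can first be passed to `lychee --dump`.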
## Configuration
The configuration file is a TOML file that can be used to specify the options that are also available on the command line. It comes in handy when you want to specify a lot of options, or when you want to configure lychee for continuous integration as part of a repository (configuration as code).
If no other configuration file is specified, `./lychee.toml` (in the current working directory) is used. Here is an example of a configuration file. The latest version is available on GitHub.
```toml
############################# Display #############################
# Verbose program output
# Accepts log level: "error", "warn", "info", "debug", "trace"
verbose = "info"
# Don't show interactive progress bar while checking links.
no_progress = false
# Path to summary output file.
output = ".config.dummy.report.md"
############################# Cache ###############################
# Enable link caching. This can be helpful to avoid checking the same links on
# multiple runs.
cache = true
# Discard all cached requests older than this duration.
max_cache_age = "2d"
############################# Runtime #############################
# Number of threads to utilize.
# Defaults to number of cores available to the system if omitted.
threads = 2
# Maximum number of allowed redirects.
max_redirects = 10
# Maximum number of allowed retries before a link is declared dead.
max_retries = 2
# Maximum number of concurrent link checks.
max_concurrency = 14
############################# Requests ############################
# User agent to send with each request.
user_agent = "curl/7.83.1"
# Website timeout from connect to response finished.
timeout = 20
# Minimum wait time in seconds between retries of failed requests.
retry_wait_time = 2
# Comma-separated list of accepted status codes for valid links.
# Supported values are:
#
# accept = ["200..=204", "429"]
# accept = "200..=204, 429"
# accept = ["200", "429"]
# accept = "200, 429"
accept = ["200", "429"]
# Proceed for server connections considered insecure (invalid TLS).
insecure = false
# Only test links with the given schemes (e.g. https).
# Omit to check links with any other scheme.
# At the moment, we support http, https, file, and mailto.
scheme = ["https"]
# When links are available using HTTPS, treat HTTP links as errors.
require_https = false
# Request method
method = "get"
# Custom request headers
headers = []
# Remap URI matching pattern to different URI.
remap = ["https://example.com http://example.invalid"]
# Base URL or website root directory to check relative URLs.
base = "https://example.com"
# HTTP basic auth support. This will be the username and password passed to the
# authorization HTTP header. See
# <https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Authorization>
basic_auth = ["example.com user:pwd"]
############################# Exclusions ##########################
# Skip missing input files (default is to error if they don't exist).
skip_missing = false
# Check links inside `<code>` and `<pre>` blocks as well as Markdown code
# blocks.
include_verbatim = false
# Ignore case of paths when matching glob patterns.
glob_ignore_case = false
# Exclude URLs and mail addresses from checking (supports regex).
exclude = ['^https://www\.linkedin\.com', '^https://web\.archive\.org/web/']
# Exclude these filesystem paths from getting checked.
exclude_path = ["file/path/to/Ignore", "./other/file/path/to/Ignore"]
# URLs to check (supports regex). Has preference over all excludes.
include = ['gist\.github\.com.*']
# Exclude all private IPs from checking.
# Equivalent to setting `exclude_private`, `exclude_link_local`, and
# `exclude_loopback` to true.
exclude_all_private = false
# Exclude private IP address ranges from checking.
exclude_private = false
# Exclude link-local IP address range from checking.
exclude_link_local = false
# Exclude loopback IP address range and localhost from checking.
exclude_loopback = false
# Check mail addresses
include_mail = true
```
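
A configuration file does not need to list every key. As a minimal sketch, the snippet below writes a small config containing a handful of keys from the example above and passes it to lychee with `--config`. The file name `lychee.ci.toml` and the selected values are assumptions for illustration.

```sh
# Write a minimal config; all keys are taken from the example above.
# The file name `lychee.ci.toml` is an illustrative choice.
cat > lychee.ci.toml <<'EOF'
no_progress = true
cache = true
max_cache_age = "2d"
accept = ["200..=204", "429"]
exclude = ['^https://www\.linkedin\.com']
exclude_all_private = true
EOF

# Point lychee at the file explicitly; without --config it would look for
# ./lychee.toml instead. The guard only skips the run if lychee is missing.
if command -v lychee >/dev/null 2>&1; then
  lychee --config lychee.ci.toml .
fi
```

Keeping a separate file for CI leaves a developer's local `lychee.toml` untouched.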
## GitHub Action
lychee is also available as a [GitHub Action](https://github.com/lycheeverse/lychee-action/). This way you can set up a job which regularly checks all links in your repository. If you like, it can open an issue when lychee finds problems with your links.
Here is a full example of a GitHub workflow file. It checks all repository links once per day and creates an issue in case of errors. Save it as `.github/workflows/links.yml`:
```yml
name: Links
on:
  repository_dispatch:
  workflow_dispatch:
  schedule:
    - cron: "00 18 * * *"
jobs:
  linkChecker:
    runs-on: ubuntu-latest
    permissions:
      issues: write # required for peter-evans/create-issue-from-file
    steps:
      - uses: actions/checkout@v4
      - name: Link Checker
        id: lychee
        uses: lycheeverse/lychee-action@v2
      - name: Create Issue From File
        if: env.exit_code != 0
        uses: peter-evans/create-issue-from-file@v5
        with:
          title: Link Checker Report
          content-filepath: ./lychee/out.md
          labels: report, automated issue
```
Arguments are passed to the action through the `with` key:
```yml
- name: Link Checker
  uses: lycheeverse/lychee-action@v2
  with:
    # Check all markdown, html and reStructuredText files in repo (default)
    args: --base . --verbose --no-progress './**/*.md' './**/*.html' './**/*.rst'
    # Use json as output format (instead of markdown)
    format: json
    # Use different output file path
    output: /tmp/foo.txt
    # Use a custom GitHub token
    token: ${{ secrets.CUSTOM_TOKEN }}
    # Don't fail action on broken links
    fail: false
```
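
When debugging a workflow, it can help to run lychee locally with similar settings. The sketch below assumes that the `format` and `output` inputs correspond to the CLI's `--format` and `--output` options and that `fail: false` amounts to ignoring lychee's exit status. That mapping is an assumption, and the function name `run_like_action` is hypothetical.

```sh
# Hypothetical helper mirroring the action inputs shown above. The mapping
# of `format`/`output` onto --format/--output is an assumption.
run_like_action() {
  lychee --base . --verbose --no-progress \
    --format json --output /tmp/foo.txt \
    './**/*.md' './**/*.html' './**/*.rst' \
    || true  # like `fail: false`: do not propagate a failing status
}

# Example call, from the repository root:
#   run_like_action
```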
## Examples
**Check All Links In Current Directory**:
The following command recursively checks all links in all supported files inside the current directory.
```sh
lychee .
```
**Check All Links On A Website**:
```sh
lychee https://example.com
```
**Check Only Specific Files**:
```sh
lychee README.md
lychee test.html info.txt
lychee test.html info.txt https://example.com
```
**Check Links In Directories, But Block All Network Requests**:
```sh
lychee --offline path/to/directory
```
**Check Links In A Remote File**:
```sh
lychee https://raw.githubusercontent.com/lycheeverse/lychee/master/README.md
```
**Check Links From Stdin**:
```sh
cat test.md | lychee -
echo 'https://example.com' | lychee -
```