knowledge/technology/applications/cli/network/wget.md

24 KiB

obj arch-wiki wiki repo
application https://wiki.archlinux.org/title/Wget https://en.wikipedia.org/wiki/Wget https://git.savannah.gnu.org/cgit/wget.git

wget

GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Wget is non-interactive, meaning that it can work in the background, while the user is not logged on. This allows you to start a retrieval and disconnect from the system, letting Wget finish the work. By contrast, most of the Web browsers require constant user's presence, which can be a great hindrance when transferring a lot of data.

Wget can follow links in HTML, XHTML, and CSS pages, to create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as "recursive downloading." While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be instructed to convert the links in downloaded files to point at the local files, for offline viewing.

Wget has been designed for robustness over slow or unstable network connections; if a download fails due to a network problem, it will keep retrying until the whole file has been retrieved. If the server supports regetting, it will instruct the server to continue the download from where it left off.

Options

Option Description
-b, --background Go to background immediately after startup. If no output file is specified via the -o, output is redirected to wget-log.
-e, --execute command Execute command as if it were a part of .wgetrc. A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. If you need to specify more than one wgetrc command, use multiple instances of -e.

Logging Options

Option Description
-o, --output-file=logfile Log all messages to logfile. The messages are normally reported to standard error.
-a, --append-output=logfile Append to logfile. This is the same as -o, only it appends to logfile instead of overwriting the old log file.
-q, --quiet Turn off Wget's output.
-i, --input-file=file Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.). If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line. However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "<base href="url">" to the documents or by specifying --base=url on the command line. If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.
-B, --base=URL Resolves relative links using URL as the point of reference, when reading links from an HTML file specified via the -i/--input-file option (together with --force-html, or when the input file was fetched remotely from a server describing it as HTML). This is equivalent to the presence of a "BASE" tag in the HTML input file, with URL as the value for the "href" attribute.

Download Options

Option Description
-t, --tries=number Set number of tries to number. Specify 0 or inf for infinite retrying. The default is to retry 20 times, with the exception of fatal errors like "connection refused" or "not found" (404), which are not retried.
-O, --output-document=file The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If - is used as file, documents will be printed to standard output, disabling link conversion. (Use ./- to print to a file literally named -.)
--backups=backups Before (over)writing a file, back up an existing file by adding a .1 suffix (_1 on VMS) to the file name. Such backup files are rotated to .2, .3, and so on, up to backups (and lost beyond that)
-c, --continue Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program.
--show-progress Force wget to display the progress bar in any verbosity.
-T, --timeout=seconds Set the network timeout to seconds seconds.
--limit-rate=amount Limit the download speed to amount bytes per second. Amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix. For example, --limit-rate=20k will limit the retrieval rate to 20KB/s. This is useful when, for whatever reason, you don't want Wget to consume the entire available bandwidth.
-w, --wait=seconds Wait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of in seconds, the time can be specified in minutes using the "m" suffix, in hours using "h" suffix, or in days using "d" suffix.
--waitretry=seconds If you don't want Wget to wait between every retrieval, but only between retries of failed downloads, you can use this option. Wget will use linear backoff, waiting 1 second after the first failure on a given file, then waiting 2 seconds after the second failure on that file, up to the maximum number of seconds you specify.
--random-wait Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the --wait option, in order to mask Wget's presence from such analysis.
--user=user, --password=password Specify the username and password for both FTP and HTTP file retrieval.
--ask-password Prompt for a password for each connection established.

Directory Options

Option Description
-nH, --no-host-directories Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.
--cut-dirs=number Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.
-P, --directory-prefix=prefix Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).

HTTP Options

Option Description
--no-cookies Disable the use of cookies.
--load-cookies file Load cookies from file before the first HTTP retrieval. file is a textual file in the format originally used by Netscape's cookies.txt file.
--save-cookies file Save cookies to file before exiting. This will not save cookies that have expired or that have no expiry time (so-called "session cookies"), but also see --keep-session-cookies.
--keep-session-cookies When specified, causes --save-cookies to also save session cookies. Session cookies are normally not saved because they are meant to be kept in memory and forgotten when you exit the browser. Saving them is useful on sites that require you to log in or to visit the home page before you can access some pages. With this option, multiple Wget runs are considered a single browser session as far as the site is concerned.
--header=header-line Send header-line along with the rest of the headers in each HTTP request. The supplied header is sent as-is, which means it must contain name and value separated by colon, and must not contain newlines.
--proxy-user=user, --proxy-password=password Specify the username user and password password for authentication on a proxy server. Wget will encode them using the "basic" authentication scheme.
--referer=url Include 'Referer: url' header in HTTP request. Useful for retrieving documents with server-side processing that assume they are always being retrieved by interactive web browsers and only come out properly when Referer is set to one of the pages that point to them.
-U, --user-agent=agent-string Identify as agent-string to the HTTP server.

HTTPS Options

Option Description
--no-check-certificate Don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate.
--ca-certificate=file Use file as the file with the bundle of certificate authorities ("CA") to verify the peers. The certificates must be in PEM format.
--ca-directory=directory Specifies directory containing CA certificates in PEM format.

Recursive Retrieval Options

Option Description
-r, --recursive Turn on recursive retrieving. The default maximum depth is 5.
-l, --level=depth Set the maximum number of subdirectories that Wget will recurse into to depth.
-k, --convert-links After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
-p, --page-requisites This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.