Using curl and wget commands to download pages from web sites


One of the most versatile tools for collecting data from a server is curl. The “url” portion of the name properly suggests that the command is built to locate data through the URL (uniform resource locater) that you provide. And it doesn’t just communicate with web servers. It supports a wide variety of protocols. This includes HTTP, HTTPS, FTP, FTPS, SCP, SFTP and more. The wget command, though similar in some ways to curl, primarily supports HTTP and FTP protocols.

Using the curl command

You might use the curl command to:

  • Download files from the internet
  • Run tests to ensure that the remote server is doing what is expected
  • Do some debugging on various problems
  • Log errors for later analysis
  • Back up important files from the server

Probably the most obvious thing to do with the curl command is to download a page from a web site for review on the command line. To do this, just enter “curl” followed by the URL of the web site like this (the content below is truncated):

$ curl https://www.networkworld.com/category/linux/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  124k    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0

<!DOCTYPE html>
<!--[if lt IE 7]>  <html lang="en" class="lt-ie9 lt-ie8 lt-ie7" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://og
p.me/ns/fb#"> <![endif]-->
<!--[if IE 7]> <html lang="en" class="lt-ie10 lt-ie9 lt-ie8" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.m
e/ns/fb#"> <![endif]-->
…

You’ll see some timing data plus the content. To save the content to a file, redirect the output to a file using a command like this:

$ curl https://www.networkworld.com/category/linux/ > linux.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  124k  100  124k    0     0  23339      0  0:00:05  0:00:05 --:--:-- 30035

The downloaded file can then be viewed on your system using cat or more to see the html content or a browser to view the web page.

In the command below, a single html file is grabbed.

$ curl https://www.networkworld.com/video/series/8559/2-minute-linux-tips > linux_tips.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 79873  100 79873    0     0  56780      0  0:00:01  0:00:01 --:--:-- 56808

Any sequence of blank lines can be reduced to one with a command like this:

$ uniq linux_tips.html > linux_tips.html

More information on using curl is available in this previous post of mine: The Joy of curl

You can also get some quick help on options for using curl with the curl –help command:

$ curl --help
Usage: curl [options...] <url>
 -d, --data <data>          HTTP POST data
 -f, --fail                 Fail fast with no output on HTTP errors
 -h, --help <category>      Get help for commands
 -i, --include              Include protocol response headers in the output
 -o, --output <file>        Write to file instead of stdout
 -O, --remote-name          Write output to a file named as the remote file
 -s, --silent               Silent mode
 -T, --upload-file <file>   Transfer local FILE to destination
 -u, --user <user:password> Server user and password
 -A, --user-agent <name>    Send User-Agent <name> to server
 -v, --verbose              Make the operation more talkative
 -V, --version              Show version number and quit

This is not the full help, this menu is stripped into categories.
Use "--help category" to get an overview of all categories.
For all options use the manual or "--help all".’

Using wget

The wget command makes it easy to download a web site recursively. While the site used in the command below is a single-page web site, it provides a quick example of how this command works.

$ wget -r http://example.com/
--2023-09-19 13:07:12--  http://example.com/
Resolving example.com (example.com)... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to example.com (example.com)|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1256 (1.2K) [text/html]
Saving to: ‘example.com/index.html’

example.com/index.html        100%[=================================================>]   1.23K  --.-KB/s    in 0s

2023-09-19 13:07:12 (56.1 MB/s) - ‘example.com/index.html’ saved [1256/1256]

FINISHED --2023-09-19 13:07:12--
Total wall clock time: 0.1s
Downloaded: 1 files, 1.2K in 0s (56.1 MB/s)

The downloaded content will include a directory with the name of the URL (example.com) and containing its contents – in this case a single file.

$ ls example.com
index.html
$ head example.com/index.html
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {

If you were to run the command below (no recursion) multiple times, generations of the file will build up.

$ wget http://example.com/
$ ls -l index.html*
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.1
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.2
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.3

The no-parent option

The no-parent options ensures that the command will not ever ascend to the parent directory when retrieving content recursively so that only the files below a certain hierarchy will be downloaded.

$ wget --no-parent -r https://uushenandoah.org/how-to-become-a-member/

Wrap-up

Both curl and wget are extremely useful commands for downloading and troubleshooting web content. Check out the man pages for information on the many options available.

Copyright © 2023 IDG Communications, Inc.



Source link