Using curl and wget commands to download pages from web sites
One of the most versatile tools for collecting data from a server is curl. The “url” portion of the name properly suggests that the command is built to locate data through the URL (uniform resource locator) that you provide. And it doesn’t just communicate with web servers. It supports a wide variety of protocols, including HTTP, HTTPS, FTP, FTPS, SCP, SFTP and more. The wget command, though similar in some ways to curl, primarily supports the HTTP, HTTPS and FTP protocols.
Using the curl command
You might use the curl command to:
- Download files from the internet
- Run tests to ensure that the remote server is doing what is expected
- Do some debugging on various problems
- Log errors for later analysis
- Back up important files from the server
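As a rough illustration of the "run tests" use above, the sketch below asks curl for nothing but the HTTP status code and reports whether it falls in the 2xx range. The function names (status_is_ok, http_status, check_url) are my own inventions for this example, not part of curl.

```shell
# status_is_ok CODE: succeed when CODE is a 2xx HTTP status.
status_is_ok() {
    case $1 in
        2[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# http_status URL: print only the HTTP status code returned for URL.
http_status() {
    # -s silences the progress meter, -o /dev/null discards the body,
    # -w '%{http_code}' prints the numeric status after the transfer.
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# check_url URL: report whether URL answers with a 2xx status.
check_url() {
    code=$(http_status "$1")
    if status_is_ok "$code"; then
        echo "OK $code $1"
    else
        echo "FAIL $code $1"
        return 1
    fi
}
```

After loading these functions into your shell, a command like check_url https://www.networkworld.com/ prints a one-line result. Redirects (3xx) are reported as failures here; adding curl's -L option to http_status would follow them instead.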
Probably the most obvious thing to do with the curl command is to download a page from a web site for review on the command line. To do this, just enter “curl” followed by the URL of the web site like this (the content below is truncated):
$ curl https://www.networkworld.com/category/linux/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  124k    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en" class="lt-ie9 lt-ie8 lt-ie7" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 7]> <html lang="en" class="lt-ie10 lt-ie9 lt-ie8" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
…
You’ll see some timing data plus the content. To save the content to a file, redirect the output to a file using a command like this:
$ curl https://www.networkworld.com/category/linux/ > linux.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  124k  100  124k    0     0  23339      0  0:00:05  0:00:05 --:--:-- 30035
The downloaded file can then be viewed on your system using cat or more to see the html content or a browser to view the web page.
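Redirection is not the only way to save a page. As a small sketch, curl's own -o option can name the output file, and -s hides the progress meter. The save_page function name is just something I made up for this example.

```shell
# save_page URL FILE: download URL into FILE without the progress meter.
save_page() {
    # -s hides the progress meter, -S still prints error messages,
    # -f makes HTTP errors (such as a 404) produce a non-zero exit status,
    # -o names the file that receives the content.
    curl -sS -f -o "$2" "$1"
}
```

You would then run something like save_page https://www.networkworld.com/category/linux/ linux.html and check the exit status to see whether the download worked.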
In the command below, a single html file is grabbed.
$ curl https://www.networkworld.com/video/series/8559/2-minute-linux-tips > linux_tips.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 79873  100 79873    0     0  56780      0  0:00:01  0:00:01 --:--:-- 56808
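Any sequence of blank lines (or of any other identical adjacent lines) can be reduced to one with uniq. Send the output to a different file, because redirecting to the input file would empty it before uniq gets a chance to read it:
$ uniq linux_tips.html > linux_tips_clean.html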
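Because uniq collapses every run of identical adjacent lines, not only blank ones, and because a file cannot safely be read and overwritten through the same redirection, here is one possible way to squeeze only the blank lines and update the file in place. It swaps in awk for uniq and goes through a temporary file. The squeeze_blank_lines name is hypothetical.

```shell
# squeeze_blank_lines FILE: reduce each run of empty lines in FILE to one.
squeeze_blank_lines() {
    tmp=$(mktemp) || return 1
    # Non-empty lines are always printed. An empty line is printed only
    # when the previous line was not empty.
    awk '/./ { print; blank = 0; next } !blank { print; blank = 1 }' "$1" > "$tmp" &&
        cat "$tmp" > "$1"    # copy back only if awk succeeded
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Running squeeze_blank_lines linux_tips.html leaves repeated non-blank lines alone, which is usually what you want for HTML.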
More information on using curl is available in this previous post of mine: The Joy of curl
You can also get some quick help on options for using curl with the curl --help command:
$ curl --help
Usage: curl [options...] <url>
 -d, --data <data>          HTTP POST data
 -f, --fail                 Fail fast with no output on HTTP errors
 -h, --help <category>      Get help for commands
 -i, --include              Include protocol response headers in the output
 -o, --output <file>        Write to file instead of stdout
 -O, --remote-name          Write output to a file named as the remote file
 -s, --silent               Silent mode
 -T, --upload-file <file>   Transfer local FILE to destination
 -u, --user <user:password> Server user and password
 -A, --user-agent <name>    Send User-Agent <name> to server
 -v, --verbose              Make the operation more talkative
 -V, --version              Show version number and quit

This is not the full help, this menu is stripped into categories.
Use "--help category" to get an overview of all categories.
For all options use the manual or "--help all".
Using wget
The wget command makes it easy to download a web site recursively. While the site used in the command below is a single-page web site, it provides a quick example of how this command works.
$ wget -r http://example.com/
--2023-09-19 13:07:12--  http://example.com/
Resolving example.com (example.com)... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to example.com (example.com)|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1256 (1.2K) [text/html]
Saving to: ‘example.com/index.html’

example.com/index.html  100%[=================================================>]   1.23K  --.-KB/s    in 0s

2023-09-19 13:07:12 (56.1 MB/s) - ‘example.com/index.html’ saved [1256/1256]

FINISHED --2023-09-19 13:07:12--
Total wall clock time: 0.1s
Downloaded: 1 files, 1.2K in 0s (56.1 MB/s)
The downloaded content will include a directory named after the host in the URL (example.com) that holds the retrieved files, in this case a single file.
$ ls example.com
index.html
$ head example.com/index.html
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
If you run the command below (no recursion) multiple times, wget will not overwrite the existing file. Instead, numbered copies of the file will build up.
$ wget http://example.com/
$ ls -l index.html*
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.1
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.2
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.3
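If those numbered copies are not what you want, wget has options to avoid them. The sketch below wraps two of them, --no-clobber and --timestamping, and adds a small cleanup helper for copies that already exist. All three function names are hypothetical, chosen only for this example.

```shell
# fetch_if_missing URL: skip the download when the local file already exists.
fetch_if_missing() {
    wget --no-clobber "$1"
}

# fetch_if_newer URL: download again only when the server copy is newer
# than the local one, replacing the local file instead of numbering it.
fetch_if_newer() {
    wget --timestamping "$1"
}

# remove_numbered_copies FILE: delete FILE.1, FILE.2 and so on, keeping
# FILE itself and anything whose suffix is not purely numeric.
remove_numbered_copies() {
    for f in "$1".*; do
        [ -e "$f" ] || continue
        suffix=${f#"$1".}
        case $suffix in
            '' | *[!0-9]*) ;;      # not a plain number, leave it alone
            *) rm -- "$f" ;;
        esac
    done
}
```

For the listing above, remove_numbered_copies index.html would remove index.html.1 through index.html.3 and leave index.html in place.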
The no-parent option
The --no-parent option ensures that the command will never ascend to the parent directory when retrieving content recursively, so only the files below a certain point in the hierarchy will be downloaded.
$ wget --no-parent -r https://uushenandoah.org/how-to-become-a-member/
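Recursive downloads can grow quickly, so it may help to combine --no-parent with a depth limit and a pause between requests. The following is only a sketch. The mirror_subtree name, the depth of 2 and the one-second wait are arbitrary choices of mine, not recommendations from the wget documentation.

```shell
# mirror_subtree URL: recursively fetch pages at or below URL.
mirror_subtree() {
    # --recursive   follow links found in the retrieved pages
    # --no-parent   never climb above the starting directory
    # --level=2     stop two links away from the start (wget's default is 5)
    # --wait=1      pause one second between retrievals
    wget --recursive --no-parent --level=2 --wait=1 "$1"
}
```

With this loaded, mirror_subtree https://uushenandoah.org/how-to-become-a-member/ would behave like the command above but stop after two levels of links.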
Wrap-up
Both curl and wget are extremely useful commands for downloading and troubleshooting web content. Check out the man pages for information on the many options available.
Copyright © 2023 IDG Communications, Inc.