How to convert multiple websites into PDFs with Wget

There are many online tools, browser extensions, and desktop plugins for turning websites into PDFs. If you use these tools often, you may run into situations where you need to convert multiple links in one go. Doing this for each link individually is tedious and time-consuming.

You can automate this task through a simple command line utility called Wget. With the help of some scripts or applications and this tool, the article will show you how to save multiple websites into a PDF file.

How to use Wget to convert multiple websites into PDF

  1. Why choose Wget?
  2. Install Wget
    1. On macOS
    2. On Windows 10
  3. Save the link in a text file
  4. Set up a directory
  5. Details of the Wget commands
  6. Put the commands into use
  7. Convert HTML to PDF
    1. Windows 10
    2. macOS

Why choose Wget?

Wget is a free software package for downloading files from the web. It is also a perfect tool for mirroring entire websites to your computer. Here is why Wget is a good choice for this job:

  1. Wget requires no interaction, which means it can work in the background and transfer data without requiring your presence.
  2. Wget understands the directory structure of websites. It can follow links in HTML pages to create a local version of the remote site, and it honors the robots.txt file while doing so.
  3. Wget works even over slow or unstable network connections. If the server supports resuming, it keeps retrying until the entire file reaches your computer.
  4. Wget supports proxy servers, which can reduce network load, speed up access, and provide access from behind firewalls.

Install Wget

On macOS

The fastest way to install Wget is through Homebrew. Homebrew is a package manager for macOS, which installs useful Unix applications and utilities. Refer to the article: How to install and use wget on Mac for more details. Then type:

 brew install wget 

Homebrew will also install any additional tools (if any) that Wget needs to run on your Mac. If you installed Wget earlier, run brew upgrade wget to make sure you have the latest version of the utility.

On Windows 10

There are multiple builds of Wget available for Windows 10. Go to the Eternallybored site to download the latest 64-bit build. Place the executable file in its own directory on drive C:, for example C:\wget.

Now, we will add the Wget path to the system environment variables so that the tool can be run from any directory. Navigate to Control Panel > System and click Advanced System Settings. In the window that opens, click Environment Variables.

Select Path under System Variables and click Edit. Then click the New button in the upper right corner of the window. Enter C:\wget and click OK.

Open Command Prompt and type wget -h to check that everything works. In PowerShell, wget is an alias for Invoke-WebRequest, so type wget.exe -h to display Wget's own help menu.

Save the link in a text file

When dealing with many links, pasting each one by hand is tedious. Thankfully, there are browser extensions that can help you accomplish this task.

  1. Link Klipper: Extracts all links on a web page as a CSV or TXT file. You can even drag a rectangular area on the page to copy links selectively. The files are saved to the Downloads folder.

  2. Snap Links Plus: This extension lets you lasso elements on a page and act on all of them at once. Hold down the right mouse button and drag a rectangular selection around the links, then press the Ctrl key to copy the links to the clipboard.
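Whichever extension you use, the saved text file should contain one URL per line, which is the format Wget's -i option expects. A minimal sketch (the file name and URLs here are hypothetical examples):

```shell
# Hypothetical links.txt: one URL per line, as wget's -i option expects.
cat > links.txt <<'EOF'
https://processedword.net/writing-workflows/index.html
https://processedword.net/writing-workflows/chapter1.html
EOF
wc -l < links.txt   # count the queued links
```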

Set up a directory

Wget works like a web crawler, extracting website assets from HTML files, including JavaScript files, logos, fonts, image thumbnails, and CSS. It also tries to recreate the directory structure of the remote server. Create a separate directory for Wget downloads to keep your saved pages together and avoid clutter.

In the Mac Terminal or the Windows Command Prompt, type:

 mkdir Wgetdown 

This step creates a new folder in your Home folder. You can name it whatever you want. Next, type:

 cd Wgetdown 

This changes the current working directory to Wgetdown.
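The two steps above can also be combined into one line; a small sketch (the -p flag simply avoids an error if the folder already exists):

```shell
# Create the download folder (if needed) and move into it in one line.
mkdir -p Wgetdown && cd Wgetdown
pwd   # confirm the current working directory
```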

Details of the Wget commands

After creating the directory, we will use the actual Wget command:

 wget -E -H -k -K -p -i [path to your text file] 

Wget uses GNU getopt to handle command-line arguments. Each option has two versions, one long and one short. The long options are easier to remember but take longer to type, and you can freely mix the two styles. Let's dive into the details of these options:

  1. -E (--adjust-extension): If a file of type 'application/xhtml+xml' or 'text/html' is downloaded and its URL does not end in .html, this option appends .html to the local file name.
  2. -H (--span-hosts): By default, a recursive retrieval stays on the starting host so that you don't pull in loads of unnecessary data. This option enables host spanning, which lets Wget visit any server that a link references, for example images served from another host.
  3. -p (--page-requisites): Downloads all the files needed to display the HTML page properly, including audio, referenced stylesheets, and inline images.
  4. -k (--convert-links): Converts the links in downloaded documents to make them suitable for offline viewing, including links to stylesheets, hyperlinks to non-HTML content, embedded images, and so on.
  5. -K (--backup-converted): When converting a file, backs up the original version with a .orig suffix.
  6. -i (--input-file): Reads URLs from a local or external file.
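Since every short flag has a long equivalent, the same command can also be written in a more self-documenting form (links.txt stands in for your saved URL list; this is a command sketch, not something to run outside your download folder):

```shell
# Long-form equivalent of: wget -E -H -k -K -p -i links.txt
wget --adjust-extension --span-hosts --convert-links --backup-converted \
     --page-requisites --input-file=links.txt
```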

Put the commands into use

To put these commands into practice, consider a site called Writing Workflows (link: https://processedword.net/writing-workflows/index.html# ). This guide has a table of contents with links to individual chapters, and the goal is to create a separate PDF of each of those sections.

Step 1: Open Terminal and create a new folder, as discussed above.

Step 2: Use the Link Klipper extension to save the links as a text file. Save the file to the Downloads folder.

Step 3: From inside the Wgetdown folder, enter:

 wget -E -H -k -K -p -i /Users/rahulsaigal/Downloads/links.txt 

Step 4: Press Enter. Wait for the process to complete.

Step 5: Navigate to the Wgetdown folder. You will see a processedword.net directory for the main domain, containing all of the site's assets along with files such as chapter1.html.

Convert HTML to PDF

Converting a website into a PDF is quite simple, but making the result look like the original site is the hard part. The quality of the outcome depends on:

  1. The fonts available on your system.
  2. The integrity of the HTML markup.
  3. Support for features such as floating tables and images.
  4. Table of contents options.

Windows 10

PrinceXML is a fast application for converting HTML files to PDF. It lets you typeset, format, and print HTML content with configurable layouts, and it supports web standards. It comes with many useful fonts and also lets you customize the PDF output. The application is free for non-commercial use only.
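Prince also ships a command-line tool, so a folder of downloaded chapters can be converted in a loop. A hedged sketch (the file names are hypothetical examples, and prince is assumed to be on your PATH):

```shell
# Batch-convert HTML files to PDF with Prince's command-line tool.
for f in chapter1.html chapter2.html; do    # example file names
  out="${f%.html}.pdf"                      # chapter1.pdf, chapter2.pdf
  if command -v prince >/dev/null && [ -e "$f" ]; then
    prince "$f" -o "$out"                   # -o names the output file
  fi
done
echo "$out"   # prints "chapter2.pdf"
```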

macOS

On a Mac, you can create an Automator service to convert a batch of HTML files to PDF. Open Automator and create a Quick Action document. Set the service to receive files or folders from Finder. Next, drag in a Run Shell Script action and set Pass input to 'as arguments'. Then paste this script into the body:

 for theFileToProcess in "$@"; do
     cupsfilter "$theFileToProcess" > "${theFileToProcess%.*}.pdf"
 done

Save the file as HTML2PDF.

Now, select all the HTML files in Finder, right-click, and choose Services > HTML2PDF. Wait a moment while all the files convert.

At first glance, the steps involved in converting many websites into PDFs may seem complicated, but once you understand the procedure, it will save you time in the long run. You don't need to pay for any expensive web service or PDF converter.

If you're looking to turn a web page into a PDF, read the article: Save the entire site's content as a PDF for more details.

Hope you are successful.

Update 17 September 2019