Check a site for 404 errors with wget


2013-10-17

No one wants broken links on their site. Here's an easy way to check for 404 errors among the links on your website. First, you'll want to run this wget command:

wget -o ~/output.txt -r -l 10 --spider http://yoursite.com

This runs wget recursively up to ten links deep: it finds each link on http://yoursite.com and checks that it exists, then follows each link on those pages, and so on, up to ten levels from the original page. The --spider flag tells wget not to download the pages, only to check that they exist, and -o writes the crawl log to ~/output.txt.
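
If the single-letter flags are hard to remember, the same crawl can be written with wget's long option names; these should be equivalent, but check your local wget documentation:

wget --output-file=output.txt --recursive --level=10 --spider http://yoursite.com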

Be aware that this may take a long time depending on the number of pages on your site. One site I crawled took multiple hours, checking over 5,000 dynamically built pages. This site, with just over 300 static URLs to check, takes about 3 seconds to crawl.

To look through the output file for 404 errors, use this chain of commands:

grep -B 2 "404 Not Found" ~/output.txt | grep http:// | cut -d " " -f 4 | sort -u

This searches the log for "404 Not Found", grabs the requested URL from the lines just above each match, and displays only the unique URLs that were encountered.
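
If you run this check regularly, the two steps can be combined into a small shell script. The filename check-404.sh and the http://yoursite.com URL are just placeholders for illustration:

#!/bin/sh
# check-404.sh: crawl a site and report any URLs that came back 404
# crawl up to ten levels deep without downloading anything, logging to ~/output.txt
wget -o ~/output.txt -r -l 10 --spider http://yoursite.com
# pull the unique URLs that returned "404 Not Found" out of the log
grep -B 2 "404 Not Found" ~/output.txt | grep http:// | cut -d " " -f 4 | sort -u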

You can also feed wget a cookie if your site uses one for authentication. I used a Chrome extension called cookie.txt export to grab my session cookie; it even outputs the cookies in a wget-compatible text format.
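
Once you have the exported cookie file, wget's --load-cookies option will send those cookies with every request during the crawl; cookies.txt here is just whatever filename you saved the export as:

wget --load-cookies ~/cookies.txt -o ~/output.txt -r -l 10 --spider http://yoursite.com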

You can easily adjust the second command to look for other HTTP error codes by replacing "404 Not Found" with your desired code ("500 Internal Server Error", etc.).
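
For example, to list the pages that triggered server errors instead:

grep -B 2 "500 Internal Server Error" ~/output.txt | grep http:// | cut -d " " -f 4 | sort -u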