Broken Chain

Broken link checker analysis with R

You've found a few broken links on your blog or web site, and now you wonder whether there are more.

Here's a quick way to get a sense of how big the problem is.

There are basically two approaches you can take:

  1. The exhaustive approach.
  2. The user-driven approach.

The Exhaustive Approach

This approach boils down to checking every link on the site.

This can quickly get out of hand depending on the size of your site. My blog isn't that large, yet the link checker still checked close to 32,000 URLs. I loaded the results into R, broke the data down by status code, and quickly got a sense of where the problems were.

The tool I used is called LinkChecker, and it's in the Debian repository (for those who are using Debian). Otherwise, download the source tarball (*.tar.gz) and build it from source.

After installing LinkChecker, I ran it from the command line to exhaustively go through the links on my site and write the results to a CSV file.
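The exact command isn't reproduced here, but a minimal LinkChecker run that produces a CSV file looks roughly like the sketch below; https://example.com is a placeholder for your own site, and the output file name follows LinkChecker's default rather than anything from this post.

    # Crawl the site and write the results as CSV.
    # By default -F csv writes to linkchecker-out.csv in the current
    # directory; rename it (or point your analysis at it) as needed.
    # Add --check-extern if you also want external links verified.
    linkchecker -F csv https://example.com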

It took about 5-6 minutes to generate the output.csv file. Once you have the output file, try out the reproducible data analysis.
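Along those lines, here is a minimal sketch of the breakdown in R. It assumes LinkChecker's semicolon-delimited CSV output with columns such as result, valid, urlname and parentname; check the header of your own file, since column names can differ between versions.

    # Rough sketch, not the post's exact analysis.
    # Assumes a semicolon-delimited LinkChecker CSV with columns
    # result, valid, urlname and parentname.
    links <- read.csv2("output.csv", comment.char = "#",
                       stringsAsFactors = FALSE)

    # Break the checked URLs down by result/status code
    sort(table(links$result), decreasing = TRUE)

    # Broken links alongside the pages that reference them
    broken <- subset(links, valid == "False",
                     select = c(parentname, urlname, result))
    head(broken)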

The investigation turned up 2 main issues (a misconfigured WordPress plugin and a missing stylesheet) plus a number of typos in links, all of which I was able to fix.

The User-Driven Approach

The user-driven approach involves mining your web server logs for URLs associated with bad status codes.

These are the invalid URLs that users are actually encountering.

The fundamental idea is to fix the URLs that users are actually hitting problems with, rather than trying to fix every URL on the site. This approach makes sense for a large web site. I may try the user-driven approach in a follow-up post.
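For a flavor of what that mining might look like, here is a rough sketch in R against a combined-format access log. The log path, format and regular expression are assumptions about a typical Apache/Nginx setup, not details from this post.

    # Rough sketch: count the request paths that returned 404 in a
    # combined-format access log (path and regex are assumptions).
    log_lines <- readLines("access.log")

    # Pull the request path and HTTP status code out of each line
    m  <- regmatches(log_lines,
                     regexec('"[A-Z]+ ([^ ]+) HTTP/[^"]*" ([0-9]{3})',
                             log_lines))
    ok <- lengths(m) == 3
    hits <- data.frame(url    = vapply(m[ok], `[`, "", 2),
                       status = vapply(m[ok], `[`, "", 3),
                       stringsAsFactors = FALSE)

    # The invalid URLs users are actually encountering, most frequent first
    head(sort(table(hits$url[hits$status == "404"]), decreasing = TRUE), 10)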


About the Author

Ray Li

Ray is a software engineer and data enthusiast who has been blogging for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking.
