
Broken Links


View the latest version of this manual online at http://mindprod.com/application/brokenlinks.manual.html.
Introduction
Why use Xenu?
How to Use Xenu
Configuring Brokenlinks
Running Brokenlinks
Presumed Good File
Leave File
Sample Text Report
Sample HTML Export
Links Presumed Good
Repairing Broken Links
Automatically Repairing Redirects
CSVReplaceURLs
TidyURLs
Getting Fancy
Troubleshooting
Futures
Acquiring BrokenLinks
Links

Introduction

BrokenLinks is a tool to help you find and track broken links on your website, namely URLs (Uniform Resource Locators) that no longer point to anything useful. It is a back end to the Xenu broken link detector that compensates for XENU’s weakness of overwhelming you with reports of links that are not really broken. BrokenLinks whittles XENU’s giant list of broken links down to the ones you should look at first. This saves you immense amounts of time researching links that are not really broken.
Both XENU and BrokenLinks share a common limitation. They can’t detect a broken link that has been redirected to a working place-holder site, e.g. one advertising that the domain is up for sale. Similarly, some sites just quietly redirect all broken links to the home page. BrokenLinks cannot detect that. Most embarrassingly, BrokenLinks can’t detect a domain bought out by a pornography company. You can still have people threaten to sue or kill you for deliberately trying to send them to a porn site.

Why use Xenu?

Finding the broken links is only 10% of the work. Fixing them is what is so labour intensive. If you let your website deteriorate with broken links, visitors become frustrated and stop visiting. Having clean links also encourages Google to take your site more seriously.

How to Use Xenu

Download and install a free copy of the special version of Xenu Link Sleuth, or get a copy from my website.

First you spider your local copy of your website with Xenu. Read the Xenu documentation on how to do that. You first have to be sure XENU is working properly before BrokenLinks will work. Use XENU directly to find orphans.

Once you are pretty sure you have XENU configured correctly, run it on your local website, with external link checking turned on.

Be careful to verify the check external links option is on at the very last moment before you start the spidering.
When it has finished spidering your website and checking all the links, click Export Page Map to TAB-separated File. (Don’t confuse this with Export to TAB-separated File). You may optionally get XENU to also produce an HTML (Hypertext Markup Language) report.

Configuring Brokenlinks

Download and install a free copy of Brokenlinks.

The first time you use BrokenLinks you must configure it by creating a text file with a text editor. It will look something like this:

Configure it according to the embedded comments. Then save the file, giving it a name of the form xxxx.properties.

The properties are all pretty straightforward except for brokenForgivenessDays=7.

  1. If you have only a handful of broken links and you religiously run XENU/BrokenLinks every day, you might set brokenForgivenessDays=2, though I still set it to 6. One advantage of running every day is you stay on top of researching and repairing broken links. You are never faced with large numbers of them to fix all at once. I personally run BrokenLinks twice a day so that I test sites at different times of day, avoiding treating them as dead when they are just temporarily down for backup. Further, that way I rarely have more than a couple of links to research at any one time.
  2. If you have only a handful of broken links and you religiously run XENU/BrokenLinks twice a week, use brokenForgivenessDays=5
  3. If you don’t want to think about brokenForgivenessDays, leave this property out and accept the default: brokenForgivenessDays=7
  4. If you have only a handful of broken links and you religiously run XENU/BrokenLinks every week, use brokenForgivenessDays=8
  5. If you have hundreds of broken links and you run XENU/BrokenLinks only every once in a while, use brokenForgivenessDays=14
  6. You can experiment setting it to various values. The smaller the brokenForgivenessDays number, the sooner and the more broken links will be revealed to you. However, you will be pestered with more temporarily broken links. If you are feeling overwhelmed by broken links, increase the value to show you only the deadest links. The minimum value that makes much sense is 1. XENU itself effectively uses 0.
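
As a rough illustration of how a forgiveness window like this can work, here is a minimal Java sketch. It reads brokenForgivenessDays from properties text, falling back to the default of 7, and reports a link only when it last tested good at least that many days ago. The class name, the method names and the exact comparison rule are assumptions made for illustration; the real BrokenLinks logic may well differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Properties;

// Hypothetical sketch, not the actual BrokenLinks source.
class ForgivenessSketch {
    // Default documented in the manual.
    static final int DEFAULT_FORGIVENESS_DAYS = 7;

    // Read brokenForgivenessDays from properties text; use the default if it is absent.
    static int forgivenessDays(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = p.getProperty("brokenForgivenessDays");
        return value == null ? DEFAULT_FORGIVENESS_DAYS : Integer.parseInt(value.trim());
    }

    // Assumed rule: report a link only if it last tested good at least `days` days ago.
    static boolean shouldReport(LocalDate lastTestedGood, LocalDate today, int days) {
        return ChronoUnit.DAYS.between(lastTestedGood, today) >= days;
    }

    public static void main(String[] args) {
        int days = forgivenessDays("brokenForgivenessDays=5\n");
        LocalDate today = LocalDate.of(2014, 4, 3);
        // Last good 2 days ago: still forgiven.
        System.out.println(shouldReport(LocalDate.of(2014, 4, 1), today, days));
        // Last good 7 days ago: reported.
        System.out.println(shouldReport(LocalDate.of(2014, 3, 27), today, days));
    }
}
```

With a smaller value, the second condition is met sooner, which is why lowering brokenForgivenessDays reveals more links, including ones that are only temporarily down.
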
BrokenLinks Files
file type Description
include.html output⇒ List of broken links that have remained broken for a number of days. In HTML format so that you can embed them in an HTML page to view and research them with a browser.
brokenlinks.csv output⇒ List of broken links that have remained broken for a number of days. In CSV (Comma-Separated Value) format, so that you can further process the file with the CSV utilities.
brokenlinks.properties ⇒input Master BrokenLinks configuration file. Names and locates other files. You might rename it to some other *.properties name. You specify the name of this file on the BrokenLinks command line. It contains links to the names and locations of the other files.
DESCRIPT.ION ⇒input Optional TCC (Take Command Command line) file descriptions for the TCC Describe program.
history.bin ⇒input/output⇒ Link checking history database. In binary, not human readable. It contains a record of all the links on your website, when they were last tested good and last tested bad (echoes of Santa Claus). It gets updated each time you run BrokenLinks with information from the XENU spider and from BrokenLinks’s own slower but more reliable tests.
permanentredirects.csv output⇒ URLs that have been permanently redirected. You will likely want to update most of these to the new value with CSVReplaceURLs. Most of the redirects are:
  • Adding or removing www. from the front.
  • Adding or removing a / from the end.
  • Converting http:// to https://
presumedgood.csv ⇒input Optional list of presumed good URLs that BrokenLinks will not check because they fail even though they are actually OK.
leave.csv ⇒input Optional list of URLs that BrokenLinks will not check because you know they are broken, but you don’t want to repair them just now.
report.txt output⇒ Report from BrokenLinks on how the last run went.
temporaryredirects.csv output⇒ URLs that have been temporarily redirected. You might want to update a few of these to the new value with CSVReplaceURLs.
xenupage.csv ⇒input Output from XENU version 1.3.9 beta that BrokenLinks uses for input, created with Export Page Map to TAB-separated File, not Save. The special version of XENU you want is available free. Download it from http://home.snafu.de/tilman/tmp/xenubeta.zip or get a copy from my website. Install it in X:\Program Files (x86)\Xenu. (Older XENU versions will not work, even older ones marked 1.3.9 beta.) Older versions of BrokenLinks, version 2.4 and earlier, used an older version of XENU.
_O_V_E_R_V_I_E_W.txt generated by Take Command An optional one-line description of each file.
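
The table above notes that most permanent redirects are trivial: a www. added or removed, a trailing / added or removed, or http:// turned into https://. The following Java sketch shows one way to sort an old/new URL pair into those three kinds, so that you can spot the unusual redirects that deserve a closer look. The class, the enum and the method names are hypothetical and are not part of BrokenLinks.

```java
// Hypothetical sketch for classifying a redirect; not part of BrokenLinks.
class RedirectKindSketch {
    enum Kind { WWW, TRAILING_SLASH, HTTPS, OTHER }

    // Classify the change from oldUrl to newUrl as one of the three common kinds.
    static Kind classify(String oldUrl, String newUrl) {
        if (toggleWww(oldUrl).equals(newUrl)) {
            return Kind.WWW;
        }
        if (toggleSlash(oldUrl).equals(newUrl)) {
            return Kind.TRAILING_SLASH;
        }
        if (oldUrl.startsWith("http://")
                && newUrl.equals("https://" + oldUrl.substring("http://".length()))) {
            return Kind.HTTPS;
        }
        return Kind.OTHER;
    }

    // Add www. after the scheme if it is missing, otherwise remove it.
    static String toggleWww(String url) {
        int i = url.indexOf("://");
        if (i < 0) {
            return url;
        }
        String scheme = url.substring(0, i + 3);
        String rest = url.substring(i + 3);
        return rest.startsWith("www.") ? scheme + rest.substring(4) : scheme + "www." + rest;
    }

    // Add a trailing slash if it is missing, otherwise remove it.
    static String toggleSlash(String url) {
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url + "/";
    }

    public static void main(String[] args) {
        System.out.println(classify("http://example.com/a", "http://www.example.com/a"));
        System.out.println(classify("http://example.com/dir", "http://example.com/dir/"));
        System.out.println(classify("http://example.com/", "https://example.com/"));
        System.out.println(classify("http://example.com/a", "http://other.example/b"));
    }
}
```

Anything classified as OTHER is a candidate for manual checking before you let a tool rewrite it.
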

Running Brokenlinks

Now run BrokenLinks like this:
CD directory where brokenlinks.properties lives
java.exe -jar brokenlinks.jar
If you have Jet, you simplify that to:
brokenlinks.exe

You will get a report of the critical broken links to research both in text and html form in files in the current directory. Embed the html in a web page somewhere. Here is my list of broken links for mindprod.com. The layout is designed to make it easy to research the problems. You can click to get the page where the broken link is, or click to where it was trying to go.

Then research the broken links and fix them. Then run XENU again, click Export Page Map to TAB-separated File and run BrokenLinks. Run this cycle at different times of the day, since some websites shut down part of the day for maintenance. You want to catch them when they are up. Run the cycle after repairing a batch of links to see how you did. After you get the list whittled down to none, run the cycle weekly, twice weekly or daily to stay on top of the broken links. I find running it daily works best since you never get overwhelmed with work and thus are not tempted to postpone the work.

If you are pressed for time, you can also rerun BrokenLinks without a new XENU run. This will catch most, but not all, of the problems that rerunning XENU would.

If you erase the history.bin file, it will automatically start over from scratch collecting history.

It is best to run BrokenLinks at various times of day so that you won’t think a site is down that is just offline for an hour each day for backup. I am a bit compulsive. I run it twice a day.

Presumed Good File

If you find a link that XENU/BrokenLinks thinks is broken, but which is actually ok, or it doesn’t matter for some reason, add it to your list of presumed good links. The presumedgood.csv CSV file will look something like this:

Thereafter that presumed good link will be excluded from the broken links list.

Leave File

You may not be prepared to repair a link just now and don’t want broken links pestering you about it day after day. You can add it to this file to make BrokenLinks ignore the error. The leave.csv CSV file will look something like this:

Sample Text Report

Here is roughly what the text report that BrokenLinks produces will look like:

Sample HTML Export

Here is roughly what the combined broken links and presumed good HTML report that BrokenLinks produces will look like:

Broken Links Sorted by Error Code

There are 7 links that have been broken for at least 5 days yet to be fixed. Last revised: 2014-04-03

Broken Links by Status Code
Status Code Links To
    Linked From
500 : Internal server error http://old.richarddawkins.net/articles/3534
  /quote/religion.html
500 : Internal server error http://old.richarddawkins.net/articles/511240-religious-outlier
  /quote/religionbyroedy.html
500 : Internal server error http://old.richarddawkins.net/videos/3373-why-we-believe-in-gods
  /religion/books.html
  /religion/god.html
500 : Internal server error http://old.richarddawkins.net/videos/3410-richard-dawkins-interviews-father-george-coyne
  /religion/god.html
500 : Internal server error http://old.richarddawkins.net/videos/3414-richard-dawkins-interviews-derren-brown
  /quote/religionbyroedy.html
500 : Internal server error http://old.richarddawkins.net/videos/486298-christianity-debate
  /religion/kristianity.html
500 : Internal server error http://old.richarddawkins.net/videos/512601-drunk-on-religion
  /religion/kristianity.html

Links to Leave As Is

The following links are known to be broken, but they are deliberately not being repaired for now.

There are 8 links marked to be left as is. Last revised: 2014-04-03

Links to Leave As Is
Link To
http://aztlan.net/oiltanker.htm
http://www.aztlan.net/du_deformed_iraqi_babies.htm
http://www.aztlan.net/iraqi_women_raped.htm
http://www.cpac.ca/eng/forms/index.asp?dsp=template&act=view3&pagetype=vod&lang=e&clipID=1748
http://www.discgear.com/Products/DiscGear/PID-DD20S(DiscGearStaging).aspx
http://www.enterprisedeveloper.com/jcertify/
https://www.youtube.com/watch?feature=player_embedded&v=AD32OdIOea0
https://www.kanguru.com/index.php/

Links Presumed Good

Xenu claims the following links are broken, but they have been manually found to be good. They should be manually rechecked from time to time. The problem may be an unknown SSL certificate authority which needs to be OKed manually, (a missing/unknown/uninstalled certificate root authority) or it may be the website sends the data, but with not-found status.

There are 53 links marked as presumed good despite what Xenu says. Last revised: 2014-04-03

Links Presumed Good
Link To
http://cgi.omroep.nl/cgi-bin/streams?/rnw/smac/2004/amsterdam_forum__chomsky_on_iraq_and_war_on_terror_20051216_low.rm
http://itshareware.com/index-idx_dev.htm
http://us.acer.com/ac/en/US/content/home
http://www.akademika.no/
http://www.amnesty.org/en/library/asset/AMR51/145/2004/en/b6ab0f58-d570-11dd-bb24-1fb85fe8fa05/amr511452004en.html
http://www.bechtel.com/
http://www.desmogblog.com/directory/
http://www.downloadplex.com/Submit-Software.html
http://www.glish.com/css/7.asp
http://www.gov.ph/
http://www.house.gov/representatives/find/
http://www.leadnow.ca/stop-the-sell-out
http://www.mrsmays.com/products_mrsmays.html
http://www.networksolutions.com/index-v2.jsp
http://www.networksolutions.com/whois/index.jsp
http://www.os2site.com/sw/internet/time/clock2.htm
http://www.post.at/index.htm
http://www.qantas.com.au/travel/airlines/home/au/en
http://www.similarsitecheck.com/submit_website/
http://www.telegraph.co.uk/news/yourview/1562772/David-Cameron-answers-your-questions.html
http://www.thefreedictionary.com/
http://www.thethinkingatheist.com/blog
http://www.waterman.com/en/style/pens/expert
http://xn--fdbk5d8ap9b8a8d.xn--deba0ad/%D7%94%D7%95%D7%99%D7%A4%D6%BC%D7%98_%D7%96%D7%B2%D6%B7%D7%98
https://sites.fastspring.com/excelsior/instant/jet-ent-lin32-ps
https://sites.fastspring.com/excelsior/instant/jet-ent-lin32-ss
https://sites.fastspring.com/excelsior/instant/jet-ent-win32-ps
https://sites.fastspring.com/excelsior/instant/jet-ent-win32-ss
https://sites.fastspring.com/excelsior/instant/jet-pro-lin32-bs
https://sites.fastspring.com/excelsior/instant/jet-pro-lin32-ps
https://sites.fastspring.com/excelsior/instant/jet-pro-lin32-ss
https://sites.fastspring.com/excelsior/instant/jet-pro-win32-bs
https://sites.fastspring.com/excelsior/instant/jet-pro-win32-ps
https://sites.fastspring.com/excelsior/instant/jet-pro-win32-ss
https://sites.fastspring.com/excelsior/instant/jet-std-lin32-bs
https://sites.fastspring.com/excelsior/instant/jet-std-lin32-ss
https://sites.fastspring.com/excelsior/instant/jet-std-win32-bs
https://sites.fastspring.com/excelsior/instant/jet-std-win32-ss
https://tsa.aloaha.com/
https://weblogs.java.net/blog/ixmal/archive/2008/05/introducing_jwe.html
https://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html
https://webservices.amazon.ca/onca/soap
https://webservices.amazon.cn/onca/soap
https://webservices.amazon.co.uk/onca/soap
https://webservices.amazon.com/onca/soap
https://webservices.amazon.de/onca/soap
https://webservices.amazon.es/onca/soap
https://webservices.amazon.fr/onca/soap
https://webservices.amazon.it/onca/soap
https://webservices.amazon.jp/onca/soap
https://www.atheistnexus.org/
https://www.eecs.harvard.edu/mailman/listinfo/jopt-users
https://www.phpbb.com/downloads/

The problem may be an unknown SSL (Secure Sockets Layer) certificate authority which needs to be OKed manually (a missing/unknown/uninstalled certificate root authority), or it may be that the website sends the data, but with not-found status.

There is a similar file called leave.csv. presumedgood.csv is for sites/links that are actually working, but for some reason Xenu or Brokenlinks thinks they are broken, most commonly because of problems with SSL. leave.csv is for sites/links that are definitely broken, but which you do not want to bother fixing just now.
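
To make the effect of the two lists concrete, here is a small Java sketch that drops any URL found in either list from a set of apparently broken links. It assumes the two files have already been read into sets of URL strings. The class and method names are made up for illustration, and the real file format and matching rules in BrokenLinks may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of applying the two exclusion lists; not the BrokenLinks source.
class ExclusionSketch {
    // Keep only the broken URLs that appear in neither exclusion list.
    static List<String> toReport(List<String> broken, Set<String> presumedGood, Set<String> leave) {
        List<String> result = new ArrayList<>();
        for (String url : broken) {
            if (!presumedGood.contains(url) && !leave.contains(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> broken = List.of(
                "https://tsa.aloaha.com/",
                "http://aztlan.net/oiltanker.htm",
                "http://old.richarddawkins.net/articles/3534");
        Set<String> presumedGood = Set.of("https://tsa.aloaha.com/");
        Set<String> leave = Set.of("http://aztlan.net/oiltanker.htm");
        // Only the richarddawkins.net link is left to research.
        System.out.println(toReport(broken, presumedGood, leave));
    }
}
```
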

Repairing Broken Links

Here are some tips to help you find a replacement link for a broken one.

Automatically Repairing Redirects

BrokenLinks can automatically repair permanently redirected URLs. Websites often reorganise and leave behind tombstones on the old page that describe where the information is now. Your browser will automatically follow these chains to find the new information. You know this has happened when the URL displayed when the page is found does not match the original. It is best to update your web pages with the new link since they browse faster by going direct to the link and because they will continue to work if the tombstone is deleted.

BrokenLinks has a feature to automatically maintain these changes for you. BrokenLinks automatically exports a redirects.csv CSV file that gives the old URL, the new URL, and the pages where the old URL appears. It is best to manually examine this list to prune any changes you don’t want to apply, e.g. Yahoo’s replacement links that go preposterously on and on and on. Then use CSVReplaceURLs to process that file and apply the changes to your local website mirror. Best take a backup before you try it out. If you generate URLs with code or import them from databases, CSVReplaceURLs will correct your website and its HTML macros embedded in comments, so that your changes will not be undone the next time you regenerate your HTML. CSVReplaceURLs can deal with & encoded in the replacing URLs as either & or &amp;, but it expects & to be encoded as &amp; in the website. It also works when one URL has a trailing / and the candidate match does not.
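
The two matching rules just described, tolerating & written as &amp; and tolerating a trailing /, can be sketched in Java roughly as follows. This is only an illustration of the idea with invented class and method names. It is not how CSVReplaceURLs is actually implemented.

```java
// Hypothetical sketch of lenient URL matching; not the CSVReplaceURLs source.
class UrlMatchSketch {
    // Turn the HTML entity &amp; back into a plain ampersand.
    static String decodeAmp(String url) {
        return url.replace("&amp;", "&");
    }

    // Drop a single trailing slash, if there is one.
    static String stripTrailingSlash(String url) {
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
    }

    // Two URLs match if they are equal after both normalisations.
    static boolean sameUrl(String a, String b) {
        return stripTrailingSlash(decodeAmp(a)).equals(stripTrailingSlash(decodeAmp(b)));
    }

    public static void main(String[] args) {
        System.out.println(sameUrl("http://example.com/a?x=1&amp;y=2", "http://example.com/a?x=1&y=2"));
        System.out.println(sameUrl("http://example.com/dir/", "http://example.com/dir"));
        System.out.println(sameUrl("http://example.com/a", "http://example.com/b"));
    }
}
```
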

You can use the CSVRecode utility to automatically replace URLs in CSV files as well.

Here is the TakeCommand script I use to run BrokenLinks, automatically discard some of the redirects I won’t apply, let me edit the list of both permanent and temporary links and also use them to update two CSV files, hassle.csv and air.csv.

I also scan the temporary redirects looking for redirects to pages with names containing words like error or suspended. I then manually check these out. Usually it means the website owner has not paid his ISP (Internet Service Provider) bills and the account has been suspended. Sometimes sites have died, or not paid bills, and the owner or ISP redirects them to another living site, sometimes the ISP’s or someone else’s parking site. The owner should use a permanent redirect, but uses a temporary one instead. I can catch these by eyeballing the list. The list is mostly just internal housekeeping junk, so I don’t scan it carefully every day. It sometimes contains broken links masquerading as temporary redirects or permanent redirects masquerading as temporary redirects.
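
If you would rather not eyeball the whole list, a scan like the one described above is easy to automate. The Java sketch below flags redirect targets whose URL contains one of a few warning words. The word list contains only the two examples mentioned above, and the class and method names are hypothetical. Extend the list to suit what you see in your own temporaryredirects.csv.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch for flagging suspicious redirect targets; not part of BrokenLinks.
class SuspiciousRedirectSketch {
    // Words that hint at a dead or suspended site.
    static final List<String> WARNING_WORDS = List.of("error", "suspended");

    // True if the target URL contains any warning word, ignoring case.
    static boolean looksSuspicious(String targetUrl) {
        String lower = targetUrl.toLowerCase(Locale.ROOT);
        return WARNING_WORDS.stream().anyMatch(lower::contains);
    }

    // Return only the targets worth checking by hand.
    static List<String> suspicious(List<String> targets) {
        return targets.stream()
                .filter(SuspiciousRedirectSketch::looksSuspicious)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(suspicious(List.of(
                "http://example.com/cgi-sys/suspendedpage.cgi",
                "http://example.com/Error404.html",
                "http://example.com/products/index.html")));
    }
}
```
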

CSVReplaceURLs

CSVReplaceURLs is a command line utility that takes only one parameter, the name of the file of redirects. e.g.
rem run csvreplaceurls to update all the redirected URLs on a website
java.exe -jar J:\com\mindprod\csv\csvreplaceurls.jar E:\redirects.csv
You don’t have to tell CSVReplaceURLs where your local website mirror is. The names of the files that need changing are in redirects.csv. You told it earlier when you configured BrokenLinks where your website files were and you also told XENU.

You might want to repair some of the links manually. You want to make sure the new link truly points to the original information, not some parking page. Just prune the ones you want to ignore or handle manually and feed the remainder to CSVReplaceURLs.

CSVReplaceURLs presumes all your URLs are pure lower case. It won’t find them if they are mixed or all upper case (except for the tail end path part). Some validator programs will complain about URLs not in all lower case. You can condition your website to use all lower case URLs by running TidyURLs.
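
To show what lower case except for the tail end path part means in practice, here is a minimal Java sketch that lower-cases only the scheme and host of an absolute URL and leaves the path, query and fragment untouched. It is an illustration with an invented class name, not the TidyURLs source, and it deliberately ignores rare cases such as user names embedded in the URL.

```java
import java.util.Locale;

// Hypothetical sketch of lower-casing only the scheme and host; not the TidyURLs source.
class LowerCaseHostSketch {
    static String lowerCaseHost(String url) {
        int schemeEnd = url.indexOf("://");
        // Leave relative links, and anything without a plain scheme prefix, untouched.
        if (schemeEnd <= 0 || !url.substring(0, schemeEnd).chars().allMatch(Character::isLetterOrDigit)) {
            return url;
        }
        // The host ends at the first '/', '?' or '#' after the scheme.
        int end = url.length();
        for (int i = schemeEnd + 3; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        return url.substring(0, end).toLowerCase(Locale.ROOT) + url.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseHost("HTTP://WWW.Example.COM/Path/File.HTML"));
        System.out.println(lowerCaseHost("/quote/Religion.html"));
    }
}
```

The path is left alone because, as noted under Futures, case matters once the files sit on a Unix-based webserver.
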

TidyURLs

TidyURLs will clean up the links on your website, making sure they are lower case (just the host part). It will put quotes around URLs that are missing them. It will replace spaces in URLs with %20. There are many other cleanups and validations. It is a command line utility that allows the switches -s for subdirectories too, -q for quiet, -v for verbose, and -dry for dry run (does not actually change your files, just tells you what it would do if the -dry option were not there). It allows you to specify which files or file trees you want to process. It automatically ignores all files except *.html files. Here is how you typically use it:

Getting Fancy

I don’t expect you to follow all the detail, but here is what I do myself in postprocessing with a Take Command script. It gives you an idea of the sort of thing you can do.

Troubleshooting

In my own use of BrokenLinks, it has never misbehaved, so there is not much I can say about troubleshooting.

It works by processing all its information about links in RAM (Random Access Memory). If you had a large website, you might run out of RAM. If that happened, use a 64-bit OS (Operating System) and use the 64-bit version of Java. Make sure you have plenty of RAM and a fat pagefile.sys for virtual RAM. Then adjust the java.exe command line parameters, doubling the various RAM requesting parameters. If you have trouble, email me and I will coach you through it.
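
The RAM-requesting parameter that matters most is usually -Xmx, the standard JVM option that sets the maximum heap size. How much BrokenLinks needs for a given site is not something this manual can state, so treat any value as a starting guess. The small Java sketch below, with a made-up class name, simply prints the maximum heap the JVM will use, which lets you confirm that a changed -Xmx setting actually took effect.

```java
// Hypothetical helper for checking the JVM heap limit; not part of BrokenLinks.
class HeapCheckSketch {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to Java: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it once with your usual settings and once with a larger -Xmx value (for example -Xmx2g) to see the difference. A 32-bit Java cannot use a very large heap, which is one reason to prefer the 64-bit version.
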

Futures

Here are various ways I hope eventually to improve BrokenLinks:
  1. Convert to Java Web Start. This will make the program easier to use by novices since it will not require configuration. The Configuration properties file will be replaced by a GUI (Graphic User Interface). The user will not have to manually allocate a directory for the history file.
  2. Remove the dependence on XENU. Handle everything it does in BrokenLinks. This will as a side effect make BrokenLinks notice local links that are in the wrong case. Wrong case links work under Windows and XENU, but fail after you upload to a Unix-based webserver.
  3. Avoid checking links that recently checked OK to vastly speed up link checking. You could then afford to do it daily or even before every upload. XENU rechecks everything from scratch every time you run it.
  4. Tools to insert warning styles on broken links so they will have an icon next to them warning your visitors of the problem and letting them know you are aware of it.

Acquiring BrokenLinks

Package      Version  Released    Licence  Language  Notes
BrokenLinks  3.1      2017-03-15  free     Java

A 1 Website Analyser
Download BrokenLinks
Google sitemap
HTML Broken link fixer student project
HTTP
HTTP redirection
Xenu: the front end to BrokenLinks

Canadian Mind Products
Please read the feedback from other visitors, or send your own feedback about the site.
Contact Roedy. Please feel free to link to this page without explicit permission.
