This essay does not describe an existing computer program, just one that should exist. It describes a suggested student project in Java programming and gives a rough overview of how it might work. I have no source code, object code, specifications, file layouts or anything else useful for implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned, and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or to give you any additional materials. I have too many other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
This project is vaguely related to the HTML Disturbed Link Patcher. This project finds broken links, whereas the link patcher prevents them from being created in the process of reorganising your website. The point of this program is to check the HREF= links in your website to make sure they are valid.
It should handle local websites, e.g. a URL (Uniform Resource Locator) of the form file://localhost/E:/mindprod/index.html. Checking the local hard disk copy could be hundreds of times faster than checking the ISP (Internet Service Provider)'s copy, at least for checking the internal links.
The process should be restartable. Further, it should retain what it has learned for future scans so it can save work rescanning unchanged pages.
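One way to make the scan restartable is to persist what has already been verified between runs. Here is a minimal sketch, assuming a simple tab-separated cache file of URL, timestamp and result; the file layout and the ScanCache name are inventions for illustration, not part of the project spec:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Remembers which links were checked, when, and with what result,
 *  so an interrupted or repeated run can skip links verified recently. */
public class ScanCache {
    private final Path file;
    private final Map<String, String> results = new ConcurrentHashMap<>();

    public ScanCache(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file)) {
                int tab = line.indexOf('\t');
                if (tab > 0) results.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
    }

    /** true if this URL checked out OK within maxAgeMillis, so it can be skipped. */
    public boolean recentlyOk(String url, long maxAgeMillis) {
        String value = results.get(url);               // "epochMillis OK" or "epochMillis BROKEN"
        if (value == null) return false;
        String[] parts = value.split(" ");
        return "OK".equals(parts[1])
                && System.currentTimeMillis() - Long.parseLong(parts[0]) < maxAgeMillis;
    }

    public void record(String url, boolean ok) {
        results.put(url, System.currentTimeMillis() + (ok ? " OK" : " BROKEN"));
    }

    /** Flush the cache so a later run can resume where this one left off. */
    public void save() throws IOException {
        List<String> lines = new ArrayList<>();
        results.forEach((url, value) -> lines.add(url + "\t" + value));
        Files.write(file, lines);
    }
}
```

A fancier version would also store each page's last-modified date so unchanged pages need not be rescanned at all.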
It should not have a heart attack if someone uploads a new file to the website while you are scanning it.
It should also produce its reports in a very simple ASCII (American Standard Code for Information Interchange) file format so that you can write your own programs to process the report file and mark or delete the links. Alternatively, it could export its findings in CSV (Comma-Separated Values) format.
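A minimal sketch of such a report writer; the three-column layout (status, page, link) is an assumption, since the essay only asks for a simple format that other programs can post-process:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReportWriter {
    /** Quote a field only if it contains a comma or a quote. */
    private static String csv(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return '"' + field.replace("\"", "\"\"") + '"';
        }
        return field;
    }

    /** Write one row per finding: status, page the link was found on, and the link itself. */
    public static void write(Path out, List<String[]> rows) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            w.write("status,page,link");
            w.newLine();
            for (String[] row : rows) {
                w.write(csv(row[0]) + "," + csv(row[1]) + "," + csv(row[2]));
                w.newLine();
            }
        }
    }
}
```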
There should be several threads simultaneously checking URLs, each working on a URL to a different site. This way you can get on with checking something else while you wait for a slow site to respond. Your program can monitor itself to home in on the optimal number of threads: it tries adding or subtracting a thread and sees whether that makes things faster or slower. It should then jitter about the optimal number of threads, which may change over time.
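A sketch of the self-tuning thread count, assuming a ThreadPoolExecutor drives the link checks; the 30-second tuning interval and the 50-thread cap are arbitrary guesses:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Periodically compares throughput with the previous interval and keeps
 *  nudging the pool size in whichever direction helped. */
public class ThreadTuner implements Runnable {
    private final ThreadPoolExecutor pool;
    private long lastCompleted = 0;
    private long lastRate = 0;
    private int direction = +1;   // +1 = try adding a thread, -1 = try removing one

    public ThreadTuner(ThreadPoolExecutor pool) { this.pool = pool; }

    @Override public void run() {
        long completed = pool.getCompletedTaskCount();
        long rate = completed - lastCompleted;          // links checked this interval
        if (rate < lastRate) direction = -direction;    // the last change hurt: reverse it
        int newSize = Math.max(1, Math.min(50, pool.getCorePoolSize() + direction));
        if (newSize > pool.getMaximumPoolSize()) {      // growing: raise the max first
            pool.setMaximumPoolSize(newSize);
            pool.setCorePoolSize(newSize);
        } else {                                        // shrinking: lower the core first
            pool.setCorePoolSize(newSize);
            pool.setMaximumPoolSize(newSize);
        }
        lastCompleted = completed;
        lastRate = rate;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(new ThreadTuner(pool), 30, 30, TimeUnit.SECONDS);
        // ... submit link-checking tasks to pool here ...
    }
}
```

Because the optimum drifts as slow sites come and go, the tuner keeps nudging the pool size rather than settling on a fixed number, which matches the jitter described above.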
When a link automatically redirects you somewhere else, your clone should correct your original HTML (Hypertext Markup Language) to point directly to the new location. It has to be a bit clever: you don’t want to replace valid URLs (Uniform Resource Locators) with 500-character CGI (Common Gateway Interface) references that will change the next minute. This applies only to permanent redirects, not temporary ones.
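A minimal sketch of the redirect test, using HttpURLConnection with redirect-following switched off so the 301/308 status and the Location header are visible; the 200-character cut-off for rejecting monster CGI targets is an arbitrary guess:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Optional;

public class RedirectProbe {
    /** Returns the new location only for permanent redirects (301, 308);
     *  temporary ones (302, 303, 307) and suspicious CGI targets are ignored. */
    public static Optional<String> permanentTarget(String link) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setInstanceFollowRedirects(false);   // we want to see the redirect itself
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        int status = conn.getResponseCode();
        if (status == HttpURLConnection.HTTP_MOVED_PERM || status == 308) {
            String location = conn.getHeaderField("Location");
            // Skip huge query-string targets that will change the next minute.
            if (location != null && location.length() < 200 && !location.contains("?")) {
                return Optional.of(location);
            }
        }
        return Optional.empty();
    }
}
```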
When a link is broken, your clone should try to fix it for you. For example, if http://oberon.ark.com/~Zeugma has disappeared, it should try http://www.zeugma.com and http://www.zeugma.org. Failing that, it might be able to make some guesses by combing the URLs in some search engine results. It marks its corrections with a special *.gif. Someone can then manually check these corrections out.
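A sketch of the guessing step, using the ~Zeugma example above. The heuristics (pull out the ~name or last path segment, then try it as a .com, .org or .net domain) are assumptions; every guess still has to be probed and then reviewed by a human:

```java
import java.util.ArrayList;
import java.util.List;

public class LinkGuesser {
    /** Generate candidate replacement URLs for a vanished page. */
    public static List<String> candidates(String deadUrl) {
        List<String> guesses = new ArrayList<>();
        // Pull out a plausible name: the part after a ~, or the last path segment.
        String name = deadUrl.replaceAll(".*[~/]", "")
                             .replaceAll("[^A-Za-z0-9]", "")
                             .toLowerCase();
        if (!name.isEmpty()) {
            for (String tld : new String[] { "com", "org", "net" }) {
                guesses.add("http://www." + name + "." + tld + "/");
            }
        }
        return guesses;
    }

    public static void main(String[] args) {
        System.out.println(candidates("http://oberon.ark.com/~Zeugma"));
        // prints [http://www.zeugma.com/, http://www.zeugma.org/, http://www.zeugma.net/]
    }
}
```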
You want to be able to review any changes to your HTML before they are applied and selectively turn off the ones you don’t want. There are three types of change:
You want to be able to use it without Internet access, just checking the files available on the local hard disk. Similarly, you want to be able to limit it to checking only links within a website, and to exclude regions of that website.
You should be able to give it a list of pages to check (or wildcards, or lists of directories), a list of pages to avoid (or negative wildcards), whether you want the indirectly-linked pages also checked (/I) and whether you want subdirectories checked (/S). So, for example, you might want it just to do a quick check of all your offsite amazon.com links, checking internal links only enough to effect that, i.e. not checking any gif links, internal # links, or other offsite links. It takes a long time to manually research a dead link, so you don’t want to be told again about dead links you already know about.
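A minimal sketch of the include/exclude wildcard filter for local files, using Java's built-in glob PathMatcher; the essay leaves the exact command-line syntax up to you, so the pattern style here is just one possibility:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;

public class PageFilter {
    private final List<PathMatcher> includes;
    private final List<PathMatcher> excludes;

    public PageFilter(List<String> includeGlobs, List<String> excludeGlobs) {
        includes = includeGlobs.stream()
                .map(g -> FileSystems.getDefault().getPathMatcher("glob:" + g)).toList();
        excludes = excludeGlobs.stream()
                .map(g -> FileSystems.getDefault().getPathMatcher("glob:" + g)).toList();
    }

    /** A page is checked only if it matches some include pattern and no exclude pattern. */
    public boolean shouldCheck(Path page) {
        return includes.stream().anyMatch(m -> m.matches(page))
                && excludes.stream().noneMatch(m -> m.matches(page));
    }
}
```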
You want it to be able to find orphans, files on your website with nothing pointing to them. To find these, you need to specify a list of root landing points, e.g. index.html, where visitors start off. If you can’t get to a file, directly or indirectly, from one of these landing points, it is an orphan. Further, all links you can get to from a landing point should be checked.
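A minimal sketch of the orphan test, assuming the spider has already recorded which files each page links to:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class OrphanFinder {
    /** Anything in allFiles that is not reachable from a root landing point is an orphan. */
    public static Set<String> orphans(Set<String> allFiles,
                                      Set<String> roots,
                                      Map<String, Set<String>> linksFrom) {
        Set<String> reachable = new HashSet<>();
        Deque<String> toVisit = new ArrayDeque<>(roots);
        while (!toVisit.isEmpty()) {
            String page = toVisit.pop();
            if (!reachable.add(page)) continue;          // already visited
            for (String target : linksFrom.getOrDefault(page, Set.of())) {
                toVisit.push(target);
            }
        }
        Set<String> orphans = new HashSet<>(allFiles);
        orphans.removeAll(reachable);
        return orphans;
    }
}
```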
The basic way you spider is to add the master root web page URLs to a queue of web pages to be checked. Then you spawn N threads (you determine the optimal value for N by experiment) that start working, each grabbing an item off the queue to process, or waiting for one to become available. The thread then reads that web page with an HTTP GET. See the File I/O Amanuensis for how. Then it uses a regex to find all the <a href=xxxxx> links on that page. Keep in mind that HTML can be ugly, with extra blank spaces and unexpected attributes. As it finds each href link, it adds it to the queue, but only if it is not already there. The process stops when there are no more links in the queue to check.

You might want to randomise the queue so that you don’t repetitively hammer one site. Another optimisation: when you find that a domain can’t be found, make a note of that and avoid testing any further links to it.

A very clever version might validate mailto links first by ensuring the names have the proper format and a registered domain under DNS (Domain Name System), then by starting a conversation with the target’s mail server. This could be quite tricky, since your software has to simulate some of the functions of a mailserver. You don’t actually want to send mail, just find out if you probably could. You have to be able to talk to any flavour of mailserver. At the very least you could ensure the email address conforms to RFC 5322. I have code for such email validation as part of a bulk email program I wrote.
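A minimal sketch of that spider loop, using the java.net.http client. Relative links are resolved against the page they appear on; robots.txt handling, redirects, mailto validation, the local file:// case and the report writing are all left out:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider {
    // Tolerates sloppy HTML: extra spaces, single/double/no quotes, other attributes.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final HttpClient http = HttpClient.newHttpClient();

    public void crawl(Set<String> roots, int threads) throws InterruptedException {
        roots.forEach(this::enqueue);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(this::worker);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void enqueue(String url) {
        if (!url.startsWith("http")) return;   // skip mailto:, javascript: etc. in this sketch
        if (seen.add(url)) queue.add(url);     // queue each link only once
    }

    private void worker() {
        try {
            String url;
            // Give up after the queue has been quiet for 30 seconds: everything is checked.
            while ((url = queue.poll(30, TimeUnit.SECONDS)) != null) {
                checkPage(url);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void checkPage(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 400) {
                System.out.println("BROKEN " + url);
                return;
            }
            Matcher m = HREF.matcher(response.body());
            while (m.find()) {
                // Resolve relative links against the page they were found on.
                enqueue(URI.create(url).resolve(m.group(1)).toString());
            }
        } catch (IOException | InterruptedException | IllegalArgumentException e) {
            System.out.println("BROKEN " + url + " (" + e.getMessage() + ")");
        }
    }
}
```

The 30-second poll timeout is a crude way of deciding the queue has drained; a more careful version would track pages still in flight so it could shut down precisely.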
You should be able to control it completely either from the command line or from a GUI (Graphic User Interface).
Before declaring a link dead, it should probe it several ways.
Broken Link Handling Options

Command | How to Process the Link | Display | HTML
---|---|---|---
 | Original broken link | Defunct Inc. | `<a class=offsite href=http://www.defunct.com>Defunct Inc.</a>`
L | Leave the link alone, just mark it with a comment. | Defunct Inc. | `<!-- broken link L http://www.defunct.com --> <a class=offsite href=http://www.defunct.com>Defunct Inc.</a>`
R | Repair the link, replacing it with a new one. | Defunct Inc. | `<!-- repaired link R http://www.defunct.com --> <a class=offsite href=http://www.defunct.org>Defunct Inc.</a>`
F | Flag the link as broken. | Defunct Inc. | `<!-- flagged link F http://www.defunct.com --> <a class=broken href=http://www.defunct.com>Defunct Inc.</a>`
D | Deactivate the link, namely flag it as broken and remove it. | Defunct Inc. | `<!-- broken link D http://www.defunct.com --> <img src=../image/stylesheet/brokenlink.png width=32 height=32 alt=broken_link border=0 /> Defunct Inc.`
W | Wipe out the link entirely. | Defunct Inc. | `<!-- broken link W http://www.defunct.com --> Defunct Inc.`
The commands to the link-fixing utility might consist of comma-separated values: a command letter, the filename where the link occurs and the link itself (what goes inside the <a href=…>). The utility would sort them by filename.
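For example, a command file might look like the following; the filenames are hypothetical, and the fourth field on the R command (the replacement URL) is an assumption, since the essay does not spell out where the repair target goes:

```
L,products/widgets.html,http://www.defunct.com
F,products/widgets.html,http://oberon.ark.com/~Zeugma
R,about/partners.html,http://www.defunct.com,http://www.defunct.org
W,about/partners.html,http://www.defunct.com
```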
The point of adding the <!-- broken link comments is threefold: to keep an audit trail of what you did, to help the link checker avoid pestering you about broken links you have already dealt with, and to make it easier to reprocess a link.
Since I wrote this proposal, I have discovered a number of link checking utilities. None conforms 100% to this wish list, but Xenu Link Sleuth comes close. Its big problem is that it ignores <applet tags and it does not mark broken links; it just finds them.
Product | Notes
---|---
HTMLValidator | Trialware. Only checks links within a page. Primarily checks for other HTML syntax errors.
Linkcop | Duke Engineering no longer supports it. It throws everything away and starts from scratch if you have to stop and restart it.
NetMechanic | Commercial. Variable price depending on your website size, charged per URL. Also checks and repairs HTML syntax.
Xenu Link Sleuth | Free. Best program of the lot. Uses multiple threads. Will recheck links over a period of days. Lets you configure external link checking on or off. Will work off a local hard disk if you give it the index.html file to check, or a URL of the form file://localhost/E:\mindprod\index.html. Will export to a tab-separated file. Extremely fast checking of local links. Erroneously reports every APPLET reference to a class or jar as broken; the author has no plans to correct this. Does not fix or mark broken links, just lists them for manual attention. On a local site, it can detect orphans, files that nothing points to. It can even detect orphan files on a website if you give it your FTP (File Transfer Protocol) password.
Sometimes a link is not dead, but the page it points to has changed so much that it no longer says what it did when you linked to it. This is one reason I tend to use specific links rather than home page links, even though they go out of date faster.
To deal with these, maintain a private cache of the offsite pages your links point to. Then periodically check them with some sort of automated tool to make sure each link still points to something roughly the same as before. If it does not, the automated tool at least has a target for the sort of page it is looking for. It might even find a similar one on a totally unrelated site, using search engines to help. The replacement(s) it finally suggests might not even be from the original author, just on the same subject.
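A minimal sketch of the "roughly the same as before" test, comparing the cached copy and a freshly fetched copy as sets of words; the Jaccard measure and the 0.5 threshold are arbitrary choices:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class PageSimilarity {
    // Strip tags and punctuation so only the words of the page are compared.
    private static final Pattern TAGS_AND_PUNCT = Pattern.compile("<[^>]*>|[^A-Za-z0-9 ]");

    private static Set<String> words(String html) {
        String text = TAGS_AND_PUNCT.matcher(html).replaceAll(" ").toLowerCase();
        Set<String> set = new HashSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) set.add(word);
        }
        return set;
    }

    /** Jaccard similarity of the two word sets: |intersection| / |union|, from 0.0 to 1.0. */
    public static double similarity(String cachedHtml, String currentHtml) {
        Set<String> a = words(cachedHtml);
        Set<String> b = words(currentHtml);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    public static boolean stillRoughlyTheSame(String cachedHtml, String currentHtml) {
        return similarity(cachedHtml, currentHtml) > 0.5;   // threshold is a guess
    }
}
```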
I have also discovered that links to news stories are notoriously volatile. Almost all of them stop working within a year or two, and controversial stories disappear even faster. It is an Orwellian world when the history of what happened melts away. This is very distressing when you use newspaper stories to back up your assertions. You are left dangling, looking as if you made it all up.