HTML Splitter & Boilerplate Refresher


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

This project is really two separate utilities, but since they have to mesh together I discuss them together. You could write either one and it would still be useful on its own. This project has several uses:
  1. splitting up giant .html documents into manageable pieces.
  2. replacing your standard headers and footers on a set of HTML (Hypertext Markup Language) files, with macro-style replacement to customise them.
  3. organising your scattered thoughts into categories.
  4. replacing computer-generated tables automatically sandwiched inside hand-generated HTML.
This is a generalisation of the scheme I posted earlier. I now allow multiple fragments per document created and I allow arbitrary sections of boilerplate, not just headers and footers. I have integrated the include and document splitting logic and separated out the boilerplate regeneration.

HTML Documents, such as this one, tend to grow fat and unwieldy. At some point they need to be split into smaller documents. Have a look at methods.html. It has been split up, with a menu put on the front to link to the various pieces. Breaking up even one big document manually is at least a day’s work, by the time you get all the minor HTML adjustments done and proofread.

I would like to automate, or at least semi-automate the process. Here is how it would work. You take your monster document and insert magic tags in it showing how you want it split up. Then you run a utility that creates the new files.

There are features to deal with boilerplate, such as standard headers and footers and for customising those standard headers and footers.

The utility also generates a menu for you that can be used to jump to all the different documents created.

SPLIT Tags

Here is what the magic markers you insert look like to control how the splitting is done:

However, when you propagate boilerplate you want it slightly customised. You can arrange to get it customised by embedding magic comment tags that will be expanded when you run the REFRESH utility.

Customising your boilerplate: REFRESH Tags

You can embed tags in your files that are expanded like simple macros. After the REFRESH expansion is complete, there is no sign you did not key the expansion manually.

Regenerating Boilerplate

What if the standard boilerplate you included during SPLIT changes? How can you get it reinserted at various points in all your files? How can you get it recustomised? By this point the SPLIT and THIS tags are all long gone. You need special tags inserted that are ignored during SPLIT processing, but processed during REFRESH processing.
<!--! BEGIN REFRESH FROM="fred" -->...
<!--! END REFRESH FROM="fred" -->
Your boilerplate will contain THIS and MOTHER tags, that will be expanded when the text between BEGIN REFRESH and END REFRESH is replaced. You must be sure these BEGIN REFRESH and END REFRESH tags are still in place after a regeneration so that future regenerations will still work. You could do that by including BEGIN REFRESH and END REFRESH as the first and last thing of every chunk of includable boilerplate as well as wherever you want that code included.
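
Here is a minimal sketch of the regeneration step described above, for orientation only. It assumes the boilerplate lives in an in-memory map keyed by the FROM name (a stand-in for reading files), and that each chunk of boilerplate carries its own BEGIN REFRESH and END REFRESH lines so the markers survive the replacement. The class and method names are hypothetical. It does not expand THIS or MOTHER tags and does not cope with nested regions that share the same FROM name.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RefreshRegionSketch {
    // Matches a whole region, tags included. The backreference \1 requires
    // the END tag to name the same source as the BEGIN tag.
    private static final Pattern REGION = Pattern.compile(
        "<!--! BEGIN REFRESH FROM=\"([^\"]+)\" -->.*?<!--! END REFRESH FROM=\"\\1\" -->",
        Pattern.DOTALL);

    // Replaces each region with the current boilerplate for its FROM name.
    // Regions whose name is not in the map are left untouched (an assumption).
    static String refresh(String html, Map<String, String> boilerplate) {
        Matcher m = REGION.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            String fresh = boilerplate.get(m.group(1));
            out.append(fresh != null ? fresh : m.group());
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String fred = "<!--! BEGIN REFRESH FROM=\"fred\" -->new footer<!--! END REFRESH FROM=\"fred\" -->";
        String page = "<p>body</p>\n<!--! BEGIN REFRESH FROM=\"fred\" -->old footer<!--! END REFRESH FROM=\"fred\" -->\n";
        System.out.println(refresh(page, Map.of("fred", fred)));
    }
}
```

Because the replacement text brings its own markers with it, running the sketch a second time over its own output finds the same regions again.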

You thus have two options: boilerplate with embedded REFRESH tags, which can be regenerated later, and boilerplate without such tags, which will not be regenerated, presumably because you intend to hand customise it and don’t want your customisations overwritten.

You might invent a shortcut

<!--! INCLUDE FROM="header2" -->

that acts like a BEGIN REFRESH / END REFRESH pair. You would use it to get the boilerplate included in the first place, when you are not using SPLIT. It would be deleted after processing.

During SPLIT processing, all tags except SPLIT are ignored. They are just treated as ordinary text. They can be expanded/refreshed later with a REFRESH run. The SPLIT process leaves behind a line like this in each file which is useful in expansion of tags:

You can manually edit the INFO tag generated by SPLIT or insert it manually. This enables you to use the REFRESH utility without ever using the SPLIT utility.

THIS, MOTHER and ICON tags that are not in boilerplate text are expanded once and cannot be refreshed, since there is no tag left behind. If you want them refreshable, they must appear in boilerplate text, enclosed in REFRESH tags. In a pinch you could create a boilerplate file to include that had almost nothing in it but a THIS or MOTHER tag. It will be expanded in the context of the file where it finally appears.

Your head is probably hurting by now. The basic problem is how to run REFRESH multiple times. You want to get rid of what was included/expanded earlier, before you re-expand the tags. You need a way of identifying where earlier expansion material started and ended. You also must leave embedded notes around about how to re-expand. For your first cut, only worry about running REFRESH once and leave no trace of the tags. Once you have that working you may feel ready to tackle the problem of running REFRESH multiple times to freshen your boilerplate expansions.

Dealing With Broken Links

Manual Touchup

You, as end user, do some further manual polishing, e.g. patching <dl>…</dl> broken in the middle, adding more menu items so that the menu may point to several different places in one of your fragment files, adjusting the menu to a perfect square by making some elements multiple columns and reordering the menu items. Most other polishing can be done manually with Funduc Search and Replace. I clean up most of the loose ends with HTML Validator, in a bulk validation pass of newly created files.

AutoGlue

If you are looking for still another challenge after you finish that, consider writing the reverse, a utility to temporarily glue a set of files together in one big file, so that you can easily globally edit the text using a tool like SlickEdit. The gluing process inserts magic tags so that after the edit is finished, you can quickly split them back up again. You could also use this tool to reorder a giant document.
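
A rough sketch of the glue and unglue round trip follows. The `<!--! GLUE FILE="..." -->` marker is invented for illustration; the essay does not define the magic tag. Files are represented as an in-memory map from name to text rather than read from disk, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GlueSketch {
    // Hypothetical marker line written in front of each file's text.
    private static final Pattern MARK =
        Pattern.compile("<!--! GLUE FILE=\"([^\"]+)\" -->\n");

    // Concatenates the files in map order, each preceded by its marker line.
    static String glue(Map<String, String> files) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : files.entrySet()) {
            sb.append("<!--! GLUE FILE=\"").append(e.getKey()).append("\" -->\n");
            sb.append(e.getValue());
        }
        return sb.toString();
    }

    // Splits glued text back into files; text before the first marker is dropped.
    static Map<String, String> unglue(String glued) {
        Map<String, String> files = new LinkedHashMap<>();
        Matcher m = MARK.matcher(glued);
        String name = null;
        int start = 0;
        while (m.find()) {
            if (name != null) {
                files.put(name, glued.substring(start, m.start()));
            }
            name = m.group(1);
            start = m.end();
        }
        if (name != null) {
            files.put(name, glued.substring(start));
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.html", "<p>first</p>\n");
        files.put("b.html", "<p>second</p>\n");
        System.out.println(unglue(glue(files)).equals(files));
    }
}
```

Reordering a giant document then amounts to moving whole marked blocks around in the glued file before ungluing.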

I wrote a pair of split/glue utilities like this to help me manage Pascal source code on the PDP-11 many years ago. Except for fixing NAME links, this project is actually easier to code than to explain.

Implementation

SPLIT needs to be done logically in two passes. You might use a parser or, since the grammar is so simple, you could write your own. You could simplify things by insisting on a standard ordering of parameters. The first pass finds the tags and creates a list of objects that represent the SPLIT tags it found. You must run through this list creating a list without duplicates of all the files you will be creating, so that you will know how to interpret the TO=ALL parameter. You could then proceed to produce all the files simultaneously in one pass, or one file at a time. The first method is faster; the second is more parsimonious of RAM (Random Access Memory). You would not need to reparse no matter which approach you used. Everything you need about filenames and offsets would be captured on the first pass. You need to create a new mother file with a new temporary name. When you are done, rename it back. There is no recursive processing of files brought in from the outside. You just include the text. You don’t parse the included text for commands. If the user wants them processed, he can run REFRESH again.
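
The two passes might be sketched as follows. The tag syntax `<!--! SPLIT TO="..." -->` is an assumption, modelled on the BEGIN REFRESH tags shown earlier; only the TO parameter and its ALL value come from the description above. The sketch treats the text after each tag, up to the next tag, as belonging to that tag's file, ignores text before the first tag, and builds the output in memory instead of writing files. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitScanSketch {
    // One SPLIT tag found on the first pass: its target and its offsets.
    static final class SplitTag {
        final String to;
        final int start;
        final int end;
        SplitTag(String to, int start, int end) {
            this.to = to;
            this.start = start;
            this.end = end;
        }
    }

    private static final Pattern SPLIT = Pattern.compile("<!--! SPLIT TO=\"([^\"]+)\" -->");

    // Pass 1: record every SPLIT tag and where it sits in the mother document.
    static List<SplitTag> findTags(String html) {
        List<SplitTag> tags = new ArrayList<>();
        Matcher m = SPLIT.matcher(html);
        while (m.find()) {
            tags.add(new SplitTag(m.group(1), m.start(), m.end()));
        }
        return tags;
    }

    // Duplicate-free list of files to create; needed before TO=ALL can be interpreted.
    static Set<String> targetFiles(List<SplitTag> tags) {
        Set<String> files = new LinkedHashSet<>();
        for (SplitTag t : tags) {
            if (!t.to.equals("ALL")) {
                files.add(t.to);
            }
        }
        return files;
    }

    // Pass 2: hand each chunk to its file, or to every file for TO=ALL. No reparsing.
    static Map<String, String> split(String html) {
        List<SplitTag> tags = findTags(html);
        Map<String, StringBuilder> out = new LinkedHashMap<>();
        for (String f : targetFiles(tags)) {
            out.put(f, new StringBuilder());
        }
        for (int i = 0; i < tags.size(); i++) {
            SplitTag t = tags.get(i);
            int chunkEnd = (i + 1 < tags.size()) ? tags.get(i + 1).start : html.length();
            String chunk = html.substring(t.end, chunkEnd);
            if (t.to.equals("ALL")) {
                for (StringBuilder sb : out.values()) {
                    sb.append(chunk);
                }
            } else {
                out.get(t.to).append(chunk);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : out.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<!--! SPLIT TO=\"ALL\" --><h1>Big</h1>"
            + "<!--! SPLIT TO=\"a.html\" --><p>one</p>"
            + "<!--! SPLIT TO=\"b.html\" --><p>two</p>";
        System.out.println(split(html));
    }
}
```

This is the produce-everything-at-once variant. The one-file-at-a-time variant would loop over `targetFiles` and walk the same tag list once per file, trading speed for memory.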

REFRESH can be handled in a single pass. When you find an INFO tag you remember the information for use in later tag expansion. You discard the INFO tag itself. When you find an INCLUDE tag, you copy in data from the FROM file and recursively process it for embedded tags, including more INCLUDES. The only file you change is the one you are processing. You don’t refresh or expand the included boilerplate files themselves. You discard the INCLUDE tag itself. If you see a BEGIN REFRESH, you discard up to and including the corresponding END REFRESH. You may discard some nested BEGIN END pairs in the process! Then you treat it like an INCLUDE. Don’t automatically generate new BEGIN END REFRESH pairs. If they are wanted, they will be inside the included text. If you see a THIS, ICON or MOTHER, replace it from the information in the most recent INFO tag. Discard the tag itself. This way refreshable boilerplate can be composed of refreshable boilerplate. When you refresh your documents, the latest and greatest will be recursively refreshed.
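
The recursive part of REFRESH might look something like the sketch below. It covers only INCLUDE and the simple THIS, MOTHER and ICON tags, not BEGIN REFRESH regions. Several things here are assumptions: the spelling `<!--! THIS -->` for the simple tags, the `info` map standing in for whatever the most recent INFO tag recorded, the in-memory `boilerplate` map standing in for the FROM files, and the depth limit, which is a guard against boilerplate that includes itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeExpandSketch {
    // Group 1 is the FROM name of an INCLUDE; group 2 is a simple tag name.
    private static final Pattern TAG = Pattern.compile(
        "<!--! (?:INCLUDE FROM=\"([^\"]+)\"|(THIS|MOTHER|ICON)) -->");
    private static final int MAX_DEPTH = 20;

    static String expand(String text, Map<String, String> boilerplate,
                         Map<String, String> info, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalStateException("INCLUDE nesting too deep, possible cycle");
        }
        Matcher m = TAG.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start()); // copy ordinary text; the tag itself is dropped
            if (m.group(1) != null) {
                String included = boilerplate.get(m.group(1));
                if (included == null) {
                    throw new IllegalArgumentException("unknown boilerplate: " + m.group(1));
                }
                // Included text is processed in the context of the including file.
                out.append(expand(included, boilerplate, info, depth + 1));
            } else {
                out.append(info.getOrDefault(m.group(2), ""));
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> boilerplate = new HashMap<>();
        boilerplate.put("header", "<h1><!--! THIS --></h1><!--! INCLUDE FROM=\"nav\" -->");
        boilerplate.put("nav", "<a href=\"<!--! MOTHER -->\">up</a>");
        Map<String, String> info = new HashMap<>();
        info.put("THIS", "Part 1");
        info.put("MOTHER", "index.html");
        System.out.println(expand("<!--! INCLUDE FROM=\"header\" --><p>body</p>", boilerplate, info, 0));
    }
}
```

Note that only the returned string changes. The entries in `boilerplate` are never modified, which matches the rule that the included files themselves are not refreshed.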

As a first cut, you might use Funduc Search and Replace as your scanning and replacing engine. The key is the *[] regular expression marker that will match anything, e.g. the old boilerplate sandwiched between two markers and the binary replace mode that lets you insert arbitrary multiline text.

Special Purpose Solutions

Inserting all the tags to control the split is a lot of work. You could also invent splits that were clever, splitting on <h2> tags for example and taking the title of the new fragment from the heading and the file from the old anchor. The new filename could be massaged from the current filename and the anchor. While you are at it, you can squirt out a Funduc Search and Replace script of old and new anchors so you can fix up your entire site for links into the middle of the original big document. I used a little one-shot program of that sort to split ects.html into 142 fragments.
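
As an illustration of such a clever split, here is a small sketch that cuts a document at each heading. It assumes every heading has exactly the form `<h2><a name="anchor">Title</a></h2>`, and it builds the new filename as the mother's stem, an underscore and the anchor. Both the heading form and the naming scheme are just one possible choice, and real pages will need a more forgiving pattern. Text before the first heading is left out, on the assumption that it stays in the mother file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingSplitSketch {
    // One output fragment: its filename, its title and its HTML body.
    static final class Fragment {
        final String file;
        final String title;
        final String body;
        Fragment(String file, String title, String body) {
            this.file = file;
            this.title = title;
            this.body = body;
        }
    }

    private static final Pattern H2 = Pattern.compile(
        "<h2><a name=\"([^\"]+)\">(.*?)</a></h2>", Pattern.CASE_INSENSITIVE);

    // Cuts html at each matching heading; each fragment starts with its heading.
    static List<Fragment> split(String motherFile, String html) {
        String stem = motherFile.replaceFirst("\\.html$", "");
        List<Fragment> fragments = new ArrayList<>();
        Matcher m = H2.matcher(html);
        String anchor = null;
        String title = null;
        int start = -1;
        while (m.find()) {
            if (start >= 0) {
                fragments.add(new Fragment(stem + "_" + anchor + ".html", title,
                    html.substring(start, m.start())));
            }
            anchor = m.group(1);
            title = m.group(2);
            start = m.start();
        }
        if (start >= 0) {
            fragments.add(new Fragment(stem + "_" + anchor + ".html", title, html.substring(start)));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p><h2><a name=\"init\">Initialisation</a></h2><p>one</p>"
            + "<h2><a name=\"final\">Finalisation</a></h2><p>two</p>";
        for (Fragment f : split("methods.html", html)) {
            // Old link into the big document, then the new fragment that replaces it.
            System.out.println("methods.html#" + f.file.replaceFirst("^methods_", "").replaceFirst("\\.html$", "")
                + " -> " + f.file + " (" + f.title + ")");
        }
    }
}
```

The pairs printed by `main` are the raw material for the search and replace script of old and new anchors.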

Outstanding Questions

Just what is legal syntax? E.g. what if you find an INFO tag in some boilerplate? What if REFRESH tags are not balanced? What if you specify both ALL and NONE on a TO option? What if you specified FROM=here or FROM=here or FROM=HERE? Is that a file or the HERE option?

Implementations

I have written an implementation of the boilerplate refresher problem using the general purpose htmlmacros package. An include looks like this:
<!-- macro Include jgloss/include/example.html -->
In using it, I discovered that included text often needs some minor adjustments, e.g. changing image/red.gif to ../image/red.gif. I will work on a smarter include that adjusts depending on the directory of the file in which it is being included and perhaps a general parm replacement mechanism that does not require you to compile new macros.
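
One way the directory adjustment could work is sketched below. This is not how htmlmacros does it; it is an illustration using the standard `java.nio.file.Path.relativize` method. It assumes both the including file and the link target are given relative to the site root with forward slashes. The class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class RelativeLinkSketch {
    // Rewrites a root-relative target so it works from the directory of includingFile.
    static String adjust(String includingFile, String target) {
        Path dir = Paths.get(includingFile).getParent();
        if (dir == null) {
            return target; // the including file sits in the site root; nothing to adjust
        }
        // relativize gives the path from dir to target; normalise separators for HTML.
        return dir.relativize(Paths.get(target)).toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        System.out.println(adjust("project/htmlsplitter.html", "image/red.gif"));
        System.out.println(adjust("index.html", "image/red.gif"));
    }
}
```

A smarter include would find each link in the included text and run it through a function of this kind before inserting the text.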

I have also done some specific solutions to the splitter problem, to split up large glossaries, making each <dt> its own file.


This page is posted
on the web at:

http://mindprod.com/project/htmlsplitter.html

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.
