HTML Splitter & Boilerplate Refresher
by Roedy Green ©1996-2008 Canadian Mind Products
This essay is about a suggested student project in Java programming. It gives a
rough overview of how the project might work. It does not describe an actual
complete program. I have no source, object, specifications, file layouts or
anything else useful to implementing this project. Everything I have to say to
help you with this project is written below. I am not prepared to help you
implement it; I have too many other projects of my own.
I do contract work for a living, which could include writing a program such as
this. However, I don’t do people’s homework
for them. That just robs them of an education.
You have my full permission to implement this project any way you please.
This project is really two separate utilities, but since they have to mesh
together I discuss them together. You could write either one and it would still
be useful on its own. This project has several uses:
- splitting up giant .html documents into manageable pieces.
- replacing your standard headers and footers on a set of HTML files, with macro-style
replacement to customise them.
- organising your scattered thoughts into categories.
- replacing computer-generated tables automatically sandwiched inside hand-generated
HTML.
This is a generalisation of the scheme I posted earlier. I now allow multiple
fragments per document created, and I allow arbitrary sections of boilerplate,
not just headers and footers. I have integrated the include and document
splitting logic, and separated out the boilerplate regeneration.
HTML Documents, such as this one, tend to grow fat and unwieldy. At some point
they need to be split into smaller documents. Have a look at methods.html.
It has been split up, with a menu put on the front to link to the various pieces.
Breaking up even one big document manually is at least a day’s work, by
the time you get all the minor HTML adjustments done and proofread.
I would like to automate, or at least semi-automate the process. Here is how it
would work. You take your monster document and insert magic tags in it showing
how you want it split up. Then you run a utility that creates the new files.
There are features to deal with boilerplate, such as standard headers and
footers, and for customising those standard headers and footers.
The utility also generates a menu for you that can be used to jump to all the
different documents created.
SPLIT Tags
Here is what the magic markers you insert look like to control how the splitting
is done:
- TO="file" : The name of the file to be split off, without the .html
extension. You may have several chunks directed to the same file. They will just
be tacked one after the other. You can also specify a list of files separated by
commas. Note that the quotes are mandatory. That way there is no confusion
between the HERE option and a file called HERE.
TO=HERE : means leave the chunk where it is, in the mother document.
TO=ALL : means insert the fragment in all the files created, e.g. a header or
footer or other standard boilerplate.
TO=NONE : means the chunk is discarded.
- FROM="file" : The name of the file where to find the fragment to
include, without the .html extension. You always include entire files. You may
also specify a list of files separated by commas. This would typically be some
boilerplate header or footer.
FROM=HERE : means you can find the chunk immediately following. The end of it
will be marked by the next SPLIT tag or EOF whichever comes first. This is the
default. This chunk will be directed wherever the TO parameter directs it.
FROM=MENU : means you generate a menu with links to all the child files, using
their titles and icons and insert this where specified by TO.
- TITLE : description to use in a generated menu for the TO file, or metatags, or
H1 headers.
- ICON : optional icon descriptor to be used to represent the fragment in the
master menu. The height, width, and name can all be generated from the icon file
name or the icon file contents directly. See the Image
Amanuensis Project.
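As a rough illustration of the tag parameters listed above, here is a minimal Java sketch that pulls the parameters out of a single SPLIT tag. The essay does not show the exact spelling of a SPLIT tag, so the comment form `<!--! SPLIT TO="swim" TITLE="Swimming" -->` is an assumption, modelled on the `<!--! THIS -->` style. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: extracts the parameters of one SPLIT tag. */
class SplitTagParser {
    // Assumed tag form (not given in the essay): <!--! SPLIT TO="swim" TITLE="Swimming" -->
    private static final Pattern TAG =
            Pattern.compile("<!--!\\s*SPLIT\\b(.*?)-->", Pattern.DOTALL);

    // A parameter is NAME="quoted value" or NAME=KEYWORD (HERE, ALL, NONE, MENU).
    private static final Pattern PARAM =
            Pattern.compile("(\\w+)\\s*=\\s*(\"[^\"]*\"|\\w+)");

    /**
     * Returns parameter name -> raw value. Quotes are kept on purpose,
     * so the file "HERE" stays distinct from the keyword HERE.
     */
    static Map<String, String> parse(String tag) {
        Matcher m = TAG.matcher(tag);
        if (!m.find()) {
            throw new IllegalArgumentException("not a SPLIT tag: " + tag);
        }
        Map<String, String> params = new LinkedHashMap<>();
        Matcher p = PARAM.matcher(m.group(1));
        while (p.find()) {
            params.put(p.group(1).toUpperCase(), p.group(2));
        }
        // The essay says FROM=HERE is the default.
        params.putIfAbsent("FROM", "HERE");
        return params;
    }

    /** True when the value is a quoted file list rather than a keyword. */
    static boolean isFileList(String value) {
        return value.startsWith("\"");
    }
}
```

Keeping the quotes in the raw value is one way to honour the rule that quotes are mandatory for file names: a caller can tell a keyword from a file list with `isFileList` before deciding what to do with the chunk.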
However when you propagate boilerplate you want it slightly customised. You can
arrange to get it customised by embedding magic comment tags that will be
expanded when you run the REFRESH utility.
Customising your boilerplate: REFRESH Tags
You can embed tags in your files that are expanded like simple macros. After the
REFRESH expansion is complete, there is no sign you did not key the expansion
manually.
- THIS :
The name of the document often appears in the document itself. You might want it
left the same to continue to refer to the mother document or you might want it
changed to refer instead to the name of the child document. You mark such places
with a <!--! THIS --> . The text
<!--! THIS --> will be replaced by swim
in the child fragment swim.html. This lets you generate strings like http://mindprod.com/swim.html
or swim.cnt . The utility looks at the current
directory to determine if extra levels of directory need to be included in the
generated file name.
- ICON : This works just like THIS so you can make references to the fragment’s
icon in the standard headers or footers, or even the fragment body.
- MOTHER : This works just like THIS so you can make references to the fragment’s
mother file in the standard headers or footers, or even the fragment body. The
mother file is the file from which the fragment was split off, or the file just
above it in the document hierarchy.
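The macro-style expansion described in this list could be sketched in Java roughly as follows. The `<!--! THIS -->` form comes from the essay; spelling ICON and MOTHER the same way is an assumption. The map of values stands in for whatever the utility knows about the current fragment, and the class name is hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: replaces THIS, ICON and MOTHER tags with values for the current fragment. */
class TagExpander {
    // <!--! THIS --> is shown in the essay; the ICON and MOTHER spellings are assumed to match it.
    private static final Pattern TAG =
            Pattern.compile("<!--!\\s*(THIS|ICON|MOTHER)\\s*-->");

    /** info maps "THIS", "ICON" and "MOTHER" to their replacement text. */
    static String expand(String html, Map<String, String> info) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = info.get(m.group(1));
            if (value == null) {
                throw new IllegalStateException("no value known for " + m.group(1));
            }
            // quoteReplacement stops '$' and '\' in the value being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With THIS mapped to swim, the text `http://mindprod.com/<!--! THIS -->.html` expands to `http://mindprod.com/swim.html`, leaving no sign of the tag.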
Regenerating Boilerplate
What if your standard boilerplate you included during SPLIT changes? How can you
get it reinserted at various points in all your files? How can you get it
recustomised? By this point the SPLIT and THIS tags are all long gone. You need
special tags inserted, that are ignored during SPLIT processing, but which are
processed during REFRESH processing.
...
Your boilerplate will contain THIS and MOTHER tags, that will be expanded when
the text between BEGIN REFRESH and END REFRESH is replaced. You must be sure
these BEGIN REFRESH and END REFRESH tags are still in place after a regeneration
so that future regenerations will still work. You could do that by including
BEGIN REFRESH and END REFRESH as the first and last thing of every chunk of
includable boilerplate as well as wherever you want that code included.
You thus have the option of either boilerplate that is regeneratable with
embedded REFRESH tags and that which should not be regenerated later, without
such tags, presumably because you intend to hand customise it, and don’t
want your customisations overwritten.
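Here is a minimal Java sketch of the regeneration idea: everything from a BEGIN REFRESH marker to its END REFRESH marker is thrown away and replaced by the current boilerplate, which carries its own pair of markers so the next run still works. The marker spelling `<!--! BEGIN REFRESH FROM="name" -->` is an assumption, since the essay does not show it. The boilerplate is held in a map rather than read from files, and nested pairs are not handled.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: swaps each BEGIN REFRESH ... END REFRESH region for fresh boilerplate. */
class RegionRefresher {
    // Assumed marker forms; the non-greedy match means nested pairs are NOT supported here.
    private static final Pattern REGION = Pattern.compile(
            "<!--!\\s*BEGIN REFRESH FROM=\"([^\"]+)\"\\s*-->.*?<!--!\\s*END REFRESH\\s*-->",
            Pattern.DOTALL);

    /** boilerplate maps a FROM name to the current text, which should include its own markers. */
    static String refresh(String html, Map<String, String> boilerplate) {
        Matcher m = REGION.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fresh = boilerplate.get(m.group(1));
            if (fresh == null) {
                throw new IllegalArgumentException("unknown boilerplate: " + m.group(1));
            }
            // The old markers are discarded too; no new ones are generated automatically.
            m.appendReplacement(out, Matcher.quoteReplacement(fresh));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the markers live inside the boilerplate itself, running `refresh` a second time with the same boilerplate leaves the document unchanged.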
You might invent a shortcut, such as an INCLUDE tag,
that acts like a BEGIN REFRESH / END REFRESH pair. You would use it to get the
boilerplate included in the first place, when you are not using SPLIT. It would
be deleted after processing.
During SPLIT processing, all tags except SPLIT are ignored. They are just
treated as ordinary text. They can be expanded/refreshed later with a REFRESH
run. The SPLIT process leaves behind an INFO tag line in each file, which is
useful in the expansion of tags.
You can manually edit the INFO tag generated by SPLIT or insert it manually.
This enables you to use the REFRESH utility without ever using the SPLIT utility.
THIS, MOTHER and ICON tags that are not in boilerplate text are expanded once and cannot be
refreshed since there is no tag left behind. If you want them refreshable, they
must appear in boilerplate text, enclosed in REFRESH tags. In a pinch you could
create a boilerplate file to include that had almost nothing in it but a THIS or
MOTHER tag. It will be expanded in the context of the file where it finally
appears.
Your head is probably hurting by now. The basic problem is how to run REFRESH
multiple times. You want to get rid of what was included/expanded earlier,
before you re-expand the tags. You need a way of identifying where earlier
expansion material started and ended. You also must leave embedded notes around
about how to re-expand. For your first cut, only worry about running REFRESH
once, and leave no trace of the tags. Once you have that working you may feel
ready to tackle the problem of running REFRESH multiple times to freshen your
boilerplate expansions.
Dealing With Broken Links
- The process generates a global script for Funduc
Search and Replace. You run this script and it fixes all the <a href="FILE#NAME">…</a>
references, not only in the files you have just split, but in other related
files that used to refer to spots in your old giant mother document. If you left
out this step, the dangling links would all just take you to the top of the new
mother document.
- Better still it would produce an old:new file for the HTML
Disturbed Link Patcher utility.
- On your first cut, you might leave this out, and just search and replace all
references to the mother document or internal references manually. The catch is,
you are bound to make errors if you do it manually, unless there are just a
handful.
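As an alternative to generating a script for another tool, the link repair could be done directly in Java. The sketch below is that swapped-in approach, not the script generation described above. It assumes you already have a map from each anchor NAME to the child file that now holds it; the class and parameter names are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: redirects href="mother.html#name" links to the child file that now holds the anchor. */
class LinkFixer {
    /**
     * mother is the old file name without .html; anchorToFile maps an anchor
     * name to the new child file name, also without .html.
     */
    static String fix(String html, String mother, Map<String, String> anchorToFile) {
        Pattern link = Pattern.compile(
                "href=\"" + Pattern.quote(mother) + "\\.html#([^\"]+)\"");
        Matcher m = link.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String anchor = m.group(1);
            String child = anchorToFile.get(anchor);
            // Anchors that were not moved are left pointing at the mother document.
            String replacement = (child == null)
                    ? m.group()
                    : "href=\"" + child + ".html#" + anchor + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

You would run `fix` over every file on the site that might link into the old mother document, not just the newly split files.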
Manual Touchup
You, as end user, do some further manual polishing, e.g. patching <dl>…</dl>
broken in the middle, adding more menu items so that the menu may point to
several different places in one of your fragment files, adjusting the menu to a
perfect square by making some elements multiple columns, and reordering the menu
items. Most other polishings can be done manually with Funduc Search and
Replace. I clean up most of the loose ends with HTML Validator, in a bulk
validation pass of newly created files.
AutoGlue
If you are looking for still another challenge after you finish that, consider
writing the reverse, a utility to temporarily glue a set of files together in
one big file, so that you can easily globally edit the text using a tool like SlickEdit.
The gluing process inserts magic tags so that after the edit is finished, you
can quickly split them back up again. You could also use this tool to reorder a
giant document.
I wrote a pair of split/glue utilities like this to help me manage Pascal source
code on the PDP 11 many years ago. Except for fixing NAME links, this project is
actually easier to code than to explain.
Implementation
SPLIT needs to be done logically in two passes. You might use a parser
or, since the grammar is so simple, you could write your own. You could simplify
things by insisting on a standard ordering of parameters. The first pass finds
the tags and creates a list of objects that represent the SPLIT tags it found.
You must run through this list creating a list without duplicates of all the
files you will be creating, so that you will know how to interpret the TO=ALL
parameter. You could
then proceed to produce all the files simultaneously in one pass, or one file at
a time. The first method is faster; the second more parsimonious of RAM. You
would not need to reparse no matter which approach you used. Everything you need
about filenames and offsets would be captured on the first pass. You need to
create a new mother file with a new temporary name. When you are done, rename it
back. There is no recursive processing of files brought in from the outside. You
just include the text. You don’t parse the included text for commands.
If the user wants them processed, he can run REFRESH again.
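The bookkeeping between the two SPLIT passes might look something like this Java sketch. It takes the raw TO values found on the first pass (quoted file lists or bare keywords), builds the duplicate-free list of files to create, and then resolves each chunk's destinations. Treating ALL as "every child file, but not the mother" is an assumption, as are all the names used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: works out which files SPLIT will create and where each chunk goes. */
class SplitPlanner {
    /** Collects every file named in a quoted TO list, without duplicates, in first-seen order. */
    static Set<String> collectFiles(List<String> toValues) {
        Set<String> files = new LinkedHashSet<>();
        for (String to : toValues) {
            if (to.startsWith("\"")) {
                files.addAll(splitList(to));
            }
        }
        return files;
    }

    /** Resolves one TO value to the list of files that receive the chunk. */
    static List<String> destinations(String to, Set<String> allFiles, String mother) {
        if (to.startsWith("\"")) {
            return splitList(to);
        }
        switch (to) {
            case "ALL":  return new ArrayList<>(allFiles);   // assumed: children only
            case "HERE": return Collections.singletonList(mother);
            case "NONE": return Collections.emptyList();
            default: throw new IllegalArgumentException("unknown TO value: " + to);
        }
    }

    /** Turns "\"swim,dive\"" into [swim, dive]; expects the surrounding quotes to be present. */
    private static List<String> splitList(String quoted) {
        String inner = quoted.substring(1, quoted.length() - 1);
        List<String> names = new ArrayList<>();
        for (String name : inner.split(",")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        return names;
    }
}
```

Only after `collectFiles` has seen every tag can `destinations` answer the ALL case, which is why the essay calls for two logical passes.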
REFRESH can be handled in a single pass. When you find an INFO tag you remember
the information for use in later tag expansion. You discard the INFO tag itself.
When you find an INCLUDE tag, you copy in data from the FROM file, and
recursively process it for embedded tags, including more INCLUDES. The only file
you change is the one you are processing. You don’t refresh or expand the
included boilerplate files themselves. You discard the INCLUDE tag itself. If
you see a BEGIN REFRESH, you discard up to and including the corresponding
END REFRESH. You may discard some nested BEGIN END pairs in the process! Then
you treat it like an INCLUDE. Don’t automatically generate new BEGIN END
REFRESH pairs. If they are wanted, they will be inside the included text. If you
see a THIS, ICON or MOTHER, replace it from the information in the most recent
INFO tag. Discard the tag itself. This way refreshable boilerplate can be
composed of refreshable boilerplate. When you refresh your documents, the
latest and greatest will be recursively refreshed.
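The recursive part of REFRESH could be sketched in Java along these lines. Only INCLUDE handling is shown; INFO, THIS and BEGIN/END REFRESH are left out. The tag spelling `<!--! INCLUDE FROM="name" -->` is an assumption, the boilerplate comes from a map rather than from files, and the depth limit is an added safeguard against boilerplate that includes itself.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: expands INCLUDE tags, recursively processing the included text. */
class Refresher {
    // Assumed tag form; the essay names the INCLUDE tag but does not show it.
    private static final Pattern INCLUDE =
            Pattern.compile("<!--!\\s*INCLUDE FROM=\"([^\"]+)\"\\s*-->");

    // Hypothetical safety limit so a self-including file fails instead of looping forever.
    private static final int MAX_DEPTH = 20;

    /** Call with depth 0. Only the returned text changes; the boilerplate map is never modified. */
    static String expandIncludes(String text, Map<String, String> boilerplate, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalStateException("includes nested too deeply; possible cycle");
        }
        Matcher m = INCLUDE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = boilerplate.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("unknown boilerplate: " + m.group(1));
            }
            // Included text may itself contain INCLUDE tags, so process it before inserting.
            String expanded = expandIncludes(body, boilerplate, depth + 1);
            // The INCLUDE tag itself is discarded by the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(expanded));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```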
As a first cut, you might use Funduc
Search and Replace as your scanning and replacing engine. The key is the *[]
regular expression marker that will match anything, e.g. the old boilerplate
sandwiched between two markers, and the binary replace mode that lets you insert
arbitrary multiline text.
Special Purpose Solutions
Inserting all the tags to control the split is a lot of work. You could also
invent splits that were clever, splitting on <h2> tags for example and
taking the title of the new fragment from the heading, and the file from the old
anchor. The new filename could be massaged from the current filename and the
anchor. While you are at it, you can squirt out a Funduc search and replace
script of old and new anchors so you can fix up your entire site for links into
the middle of the original big document. I used a little one shot program of
that sort to split ects.html into 142 fragments.
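A clever split on <h2> tags might start from something like the Java sketch below. It is not the one-shot program mentioned above, just a minimal illustration under simple assumptions: headings are well-formed <h2>…</h2> pairs, titles are unique, and deriving the new filename from the old anchor is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: cuts a document into fragments at each <h2>, titled by the heading text. */
class HeadingSplitter {
    private static final Pattern H2 = Pattern.compile(
            "<h2[^>]*>(.*?)</h2>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns title -> fragment in document order. Text before the first
     * heading is stored under the empty title. Duplicate titles would overwrite
     * each other, so a real tool would need a better key.
     */
    static Map<String, String> split(String html) {
        Map<String, String> fragments = new LinkedHashMap<>();
        Matcher m = H2.matcher(html);
        String title = "";
        int start = 0;
        while (m.find()) {
            // Close off the previous fragment just before this heading.
            fragments.put(title, html.substring(start, m.start()));
            // Strip any markup inside the heading, such as the old anchor.
            title = m.group(1).replaceAll("<[^>]+>", "").trim();
            start = m.start();
        }
        fragments.put(title, html.substring(start));
        return fragments;
    }
}
```

Each fragment keeps its own <h2> line, so the heading is still available for a TITLE or a generated menu entry.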
Outstanding Questions
Just what is legal syntax? E.g. what if you find an INFO tag in some boilerplate?
What if REFRESH tags are not balanced? What if you specify both ALL and NONE on
a TO option? What if you specified FROM=here or FROM="here" or FROM="HERE"?
Is that a file or the HERE option?
Implementations
I have written an implementation of the boilerplate refresher problem using the
general purpose htmlmacros package. An include looks like this:
<!-- macro Include e:\mindprod\jgloss\include\example.html -->
In using it, I discovered that included text often needs some minor adjustments,
e.g. changing image/red.gif to ../image/red.gif. I will work on a smarter
include that adjusts depending on the directory of the file in which it is being
included, and perhaps a general parm replacement mechanism that does not require
you to compile new macros.
I have also done some specific solutions to the splitter problem, to
split up large glossaries, making each <dt> its own file.