Regex Proofreader
by Roedy Green ©1996-2008 Canadian Mind Products
This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. It does not describe an actual complete program. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; I have too many other projects of my own.

I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project any way you please.

If you don’t know what regular expressions are, have a look in the Java & Internet Glossary. Basically they are search/replace patterns.

If you have ever used Funduc Search and Replace or SlickEdit or any program using regular expressions, you know what a beast they can be to proofread. You are often quite amazed by what they do to your files when you turn them loose.

The main problem is quoting. Does < mean the literal character < or is it a command? In Funduc Search and Replace, it is a literal in the search argument, but a command in the replace argument. Arrgh! Every implementation of regexes uses a slightly different set of commands and command characters. In Java 1.4, the character \ , when used literally to represent data, has to be written as "\\\\" because regex and Java string quoting gang up on you.
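
To make the double quoting concrete, here is a minimal sketch using java.util.regex. The class and method names are made up for illustration. Note that Pattern.quote only arrived in Java 1.5, so it is not available on Java 1.4.

```java
import java.util.regex.Pattern;

class BackslashQuoting {
    /** Returns true if the text contains at least one literal backslash. */
    static boolean containsBackslash(String text) {
        // The Java source "\\\\" is the two-character string \\ ,
        // which the regex engine in turn reads as one literal backslash.
        return Pattern.compile("\\\\").matcher(text).find();
    }

    /** Searches for the needle literally, letting the library do the regex quoting. */
    static boolean containsLiteral(String text, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBackslash("C:\\temp"));
        System.out.println(containsLiteral("a*b", "*"));
    }
}
```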

If a character is reserved as a command, when you want to use it literally, you must precede it with a \ . This turns expressions into unreadable nightmares like this:

*[ \\\*<\r\nabc]
A very simple regex proofreader simply colour codes each letter as a literal or as a command. For example in Funduc search expressions the following characters need to be preceded with \ when used as literals:
- + * ? ( ) [ ] \ | $ ^ !
and the following characters in replace expressions need to be preceded with \ :
% \ < >
You must not precede any other literals with \ .
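
The colour coding for search expressions could be sketched roughly as follows. This is only an illustration: the class name, the marker letters and the decision to treat \t, \r and \n as a separate "control" kind are my assumptions, and the reserved set is simply the search list given above. It makes no attempt to understand what the commands mean.

```java
class RegexColourCoder {
    // Characters listed above as needing a backslash in Funduc search expressions.
    static final String RESERVED = "-+*?()[]\\|$^!";

    /**
     * Returns one marker per input character:
     * 'C' = command (red), 'L' = literal (blue),
     * 'Q' = quoting backslash, 'G' = t, r or n after a backslash (green).
     */
    static String classify(String regex) {
        StringBuilder kinds = new StringBuilder(regex.length());
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // A backslash quotes the character that follows it.
                char next = regex.charAt(++i);
                kinds.append('Q');
                kinds.append("trn".indexOf(next) >= 0 ? 'G' : 'L');
            } else if (RESERVED.indexOf(c) >= 0) {
                // Unquoted reserved character (including a dangling final backslash).
                kinds.append('C');
            } else {
                kinds.append('L');
            }
        }
        return kinds.toString();
    }

    public static void main(String[] args) {
        // The expression a\*b+ : the star is quoted, the plus is a command.
        System.out.println(classify("a\\*b+")); // prints LQLLC
    }
}
```

A display layer would then paint each character in the colour its marker calls for.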

You might display command characters in red and literal characters in blue. Instead of seeing an expression like this:

*[ \\\*<\r\nabc]
you would see them like this:
*[\\\*<\r\nabc]

A slightly cleverer proofreader would let you hide the \ characters (except those preceded by another \ ) and just display the raw colour-coded literals. You could think of this as unquoting. The unquoted regex expressions would then look much more like the actual strings they are intended to match. The expression may then look like this:

*[\*<rnabc]
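
A minimal sketch of the unquoting step might look like this. The class and method names are hypothetical. On its own the result is ambiguous, so a real proofreader would have to keep the colour of each character alongside it.

```java
class RegexUnquoter {
    /** Removes each quoting backslash, keeping the character it protects. */
    static String unquote(String regex) {
        StringBuilder shown = new StringBuilder(regex.length());
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Skip the backslash and copy the protected character verbatim.
                shown.append(regex.charAt(++i));
            } else {
                shown.append(c);
            }
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        // The Java literal below is the expression a\*b\\c
        System.out.println(unquote("a\\*b\\\\c")); // prints a*b\c
    }
}
```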

What then would you do with tab= \t , cr= \r and newline= \n ? You might cook up some special glyphs to represent them more literally, much the way MS Word can be persuaded to make visible the spaces, line and paragraph ends with special symbols, like this:

*[\*<¦abc]
You might just use the letters t , r and n but display them in green.
*[\*<rn abc]
If you can view them that way, why not type them that way? Instead of worrying about which letters need to be quoted, just use ctrl-R to load your "pen" with red ink (commands), ctrl-B to load it with blue ink (literal data), and ctrl-G to load it with green ink (control chars), or tap one of three coloured inkwells with your mouse, as if dipping an old goose quill pen. Just type your literals naturally and behind the scenes, your Applet will insert the \ characters as needed. If you typed an invalid command in red-command mode, it would just beep at you. You could even use the Enter key to more directly represent \n or the \r\n pair.
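
Going the other way, from inked characters back to a quoted expression, could be sketched like this. Everything here is an assumption made for illustration: the class name, the parallel string of ink letters (R, B, G), and the use of an exception to stand in for the beep. The reserved set is the Funduc search list given earlier.

```java
class InkEncoder {
    // Characters listed earlier as commands in Funduc search expressions.
    static final String RESERVED = "-+*?()[]\\|$^!";

    /** Builds the quoted expression from typed characters and one ink letter per character. */
    static String encode(String chars, String inks) {
        if (chars.length() != inks.length()) {
            throw new IllegalArgumentException("need exactly one ink per character");
        }
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            char ink = inks.charAt(i);
            if (ink == 'R') {
                // Red: only reserved characters are valid commands. The backslash
                // is excluded because the encoder inserts those itself.
                if (c == '\\' || RESERVED.indexOf(c) < 0) {
                    throw new IllegalArgumentException("beep: not a command: " + c);
                }
                regex.append(c);
            } else if (ink == 'B') {
                // Blue: literal data, quoted only when the character is reserved.
                if (RESERVED.indexOf(c) >= 0) {
                    regex.append('\\');
                }
                regex.append(c);
            } else if (ink == 'G') {
                // Green: control characters typed as the letters t, r or n.
                if ("trn".indexOf(c) < 0) {
                    throw new IllegalArgumentException("beep: not a control char: " + c);
                }
                regex.append('\\').append(c);
            } else {
                throw new IllegalArgumentException("unknown ink: " + ink);
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a*b", "BRB")); // star as a command: a*b
        System.out.println(encode("a*b", "BBB")); // star as a literal: a\*b
    }
}
```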

Everything so far is pretty easy. You don’t need to know anything much about regex expressions to write such a simple proofreader. The only problem is that Java 1.1 does not support rich text (multiple font colours) in a TextArea, though Swing does. Unfortunately you can’t handle the colours by generating HTML <FONT COLOR= commands and having something render them. You need to handle your colours at a lower level, for example by painting on a Canvas.

For a more advanced proofreader, things get a little tougher. The user should be able to submit some sample strings and see what the regex would do to them. Do they match the pattern? What do they get converted to? You have to simulate Funduc Search and Replace, SlickEdit or whatever other regex engine you are proofing for. If you dig about the web, you may find source code to do this for you that you could cannibalise. Java 1.4.1 also has Perl-like regexes built in, as part of the java.util.regex package. You have one advantage over the authors of the commercial regex programs: your parsing and scanning code can be very slow and no one will mind. The user should be able to maintain little libraries of test strings that he can quickly test out. The before and after views should use colour or bold to highlight the changes.
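
If the dialect you are proofing happens to be java.util.regex itself, trying out a sample string takes only a few lines. The sketch below assumes exactly that. The class, the Result holder and the sample pattern are invented for illustration, and it will not reproduce Funduc or SlickEdit behaviour, whose syntax differs.

```java
import java.util.regex.Pattern;

class SampleTester {
    /** Outcome of trying one sample string against a pattern. */
    static final class Result {
        final String before;   // the sample as submitted
        final boolean found;   // did the pattern match anywhere in it?
        final String after;    // the sample after search/replace

        Result(String before, boolean found, String after) {
            this.before = before;
            this.found = found;
            this.after = after;
        }
    }

    /** Applies the regex and replacement to a single sample string. */
    static Result tryOut(String regex, String replacement, String sample) {
        Pattern pattern = Pattern.compile(regex);
        boolean found = pattern.matcher(sample).find();
        String after = pattern.matcher(sample).replaceAll(replacement);
        return new Result(sample, found, after);
    }

    public static void main(String[] args) {
        String[] library = { "colour chart", "grey scale" };
        for (String sample : library) {
            Result r = tryOut("colou?r", "color", sample);
            System.out.println(r.before + " | matched=" + r.found + " | " + r.after);
        }
    }
}
```

Highlighting the changes would then be a matter of comparing the before and after fields.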

Once you have that working, you can now try something even tougher: generate a sample set of strings to feed to the regex that exercise each part of the regex expression. Some should match, some should not. By examining the results on that test set of strings, the author of the regex expressions should have a pretty good idea of all the things that regex will do when it is turned loose in the real world on files. To do it very well, your generated sample set of strings should exercise every feature of the regex expression. You should generate strings that pass and fail each filtering point in the regex.
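
Deriving test strings from the structure of the regex is the hard part, and I have no code for it. As a much cruder stand-in, the sketch below starts from one seed string supplied by the user and tries every variant with a single character deleted, recording which ones still match. The class and method names are hypothetical, and it assumes the java.util.regex dialect.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class SampleGenerator {
    /**
     * Maps the seed and each single-character-deletion variant of it
     * to whether the whole string matches the regex.
     */
    static Map<String, Boolean> deletionVariants(String regex, String seed) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put(seed, pattern.matcher(seed).matches());
        for (int i = 0; i < seed.length(); i++) {
            // Drop the character at position i and test what is left.
            String variant = seed.substring(0, i) + seed.substring(i + 1);
            results.put(variant, pattern.matcher(variant).matches());
        }
        return results;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Boolean> e : deletionVariants("ab+c", "abbc").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With the seed abbc and the pattern ab+c, deleting one b still matches while deleting the a or the c does not, which already tells the author which parts of the expression are optional.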

If your regex engine is sufficiently fast, you might consider turning your proofreader into a full-blown search/replace engine that can run scripts of commands or accept them one at a time via a GUI. You might also consider writing your own text editor with this as the crown jewel. For blinding speed, generate byte codes to parse the regex expression. These may then be JITed on the fly, which should give Perl-like speeds. However, the easy way is just to use java.util.regex.

If you want to tackle a simplified version of this project, try creating a proofreader for Java source code string literals. So that instead of seeing "C:\\mydir\\myfile.txt" you would see "C:\mydir\myfile.txt" . Or abc and def on separate lines instead of "abc\ndef".
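
The core of that simplified project might be sketched as below. The names are hypothetical, and only the common escapes are handled. Octal and \u escapes are deliberately left untouched.

```java
class LiteralProofreader {
    /** Converts the body of a Java string literal (without the surrounding quotes) to the text it denotes. */
    static String unescape(String literalBody) {
        StringBuilder out = new StringBuilder(literalBody.length());
        for (int i = 0; i < literalBody.length(); i++) {
            char c = literalBody.charAt(i);
            if (c != '\\' || i + 1 == literalBody.length()) {
                out.append(c);
                continue;
            }
            char next = literalBody.charAt(++i);
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 'r':  out.append('\r'); break;
                case 't':  out.append('\t'); break;
                case '\\':
                case '"':
                case '\'': out.append(next); break;
                // Anything else (octal, unicode, ...) is passed through unchanged.
                default:   out.append('\\').append(next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument is the literal body as it appears in source: C:\\mydir\\myfile.txt
        System.out.println(unescape("C:\\\\mydir\\\\myfile.txt")); // prints C:\mydir\myfile.txt
    }
}
```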

regex
Regex Composer
Regex Debugger
Regex Utility
