a program that analyses syntax. It might for example look at a piece
of Java source code and find all the variable names, method names and
operators in order to compile it into JVM (Java Virtual Machine) byte
code, or it might analyse HTML (Hypertext Markup Language), or your own
invented language. The original LEX/YACC/Bison generated C code. There
are now variants that generate Java code. My personal favourite, based
mainly on the accessible documentation, is JavaCC.
People who write parsers have a strange language all their own. The
writers of these tools are academics and are not interested in teaching
you anything, just impressing you with how brilliant their programs
are. This means the manuals are almost useless. You have to study
examples, particularly the simple ones, and gradually the manuals will
begin to make sense. Another learning technique is to examine the Java
code generated from some sample grammars. Authors took six years of
university courses to get to their level of parser understanding. Why
should they make it any easier for you?
Roughly what happens is you describe your grammar in some Mickey Mouse
syntax. Then a utility converts that into a Java program that will
analyse text conforming to that grammar. I must admit I am shocked at
how ugly the specification languages are. I would have thought they
would be the most beautiful and regular of all languages, being composed
by aficionados of language analysis.
Java has four simple built-in parsers: java.util.StringTokenizer,
java.io.StreamTokenizer, java.text.BreakIterator
and java.util.regex.Pattern.
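As a rough illustration of what each of the four built-in classes is
good for, here is a minimal sketch. The class name BuiltInParsers, the
method names and the sample strings are invented for this example; only
the JDK classes themselves are real.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuiltInParsers {
    // StringTokenizer: split on a fixed set of delimiter characters.
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, " ,");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // StreamTokenizer: tells words, numbers and single characters apart.
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("word:" + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("number:" + st.nval); // nval is a double
                } else {
                    out.add("char:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // BreakIterator: locale-aware word boundaries; keep only the pieces
    // that start with a letter or digit, dropping spaces and punctuation.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    // Pattern: pull out everything that looks like an identifier.
    static List<String> identifiers(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a, b c"));
        System.out.println(scan("x = 42"));
        System.out.println(words("Hello, world"));
        System.out.println(identifiers("int count = x1 + 2"));
    }
}
```

None of these understands nesting or grammar rules. They only chop text
into pieces, which is often all a simple job needs.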
Java version 1.5 or later also has a number of XML (Extensible Markup
Language) parsers built in. Check out the DOM, SAX, XSD, XPath and
Schema entries.
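To give a feel for the built-in XML support, here is a minimal sketch
using the DOM parser from javax.xml.parsers. The class name DomSketch,
the method titles and the sample element name "title" are assumptions
made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomSketch {
    // Parse an XML string into a DOM tree and collect the text of every
    // <title> element. Malformed XML is rejected outright.
    static List<String> titles(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("title");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book><title>A</title></book>"
                + "<book><title>B</title></book></books>";
        System.out.println(titles(xml));
    }
}
```

Note that a single unclosed tag makes the whole parse fail with an
exception. The parser makes no attempt to salvage the rest of the
document.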
Hand Rolled Parsers
I wrote a number of parsers as part of JDisplay, the tool that
pretties up listings on this website. The problem I faced was that the
parsers had to cope with deliberately erroneous code and code
fragments.
The traditional parsers are totally unforgiving. They want perfect,
complete programs or data files to parse and give up totally on the
first hiccough.
So I wrote my own as finite state automata, with an enum constant
representing each state.
Download and have a look at the source.
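The general idea of a finite state automaton with enum states can be
sketched as follows. This is not the JDisplay source, just a toy made up
for illustration: the class StateMachineSketch, its four states and the
marker characters are all assumptions. It labels each character of a
Java-like snippet as code, string literal or line comment, and it never
gives up, even on an unterminated string.

```java
class StateMachineSketch {
    // One enum constant per state of the automaton.
    enum State { CODE, STRING, ESCAPE, COMMENT }

    // Returns one marker per input character:
    // '.' ordinary code, 's' string literal, 'c' line comment.
    static String classify(String src) {
        StringBuilder marks = new StringBuilder(src.length());
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char ch = src.charAt(i);
            switch (state) {
                case CODE:
                    if (ch == '"') {
                        state = State.STRING;
                        marks.append('s');
                    } else if (ch == '/' && i + 1 < src.length()
                            && src.charAt(i + 1) == '/') {
                        state = State.COMMENT;
                        marks.append('c');
                    } else {
                        marks.append('.');
                    }
                    break;
                case STRING:
                    marks.append('s');
                    if (ch == '\\') {
                        state = State.ESCAPE;
                    } else if (ch == '"' || ch == '\n') {
                        // A newline inside a string is an error in the
                        // input; recover quietly instead of giving up.
                        state = State.CODE;
                    }
                    break;
                case ESCAPE:
                    // The character after a backslash is part of the string.
                    marks.append('s');
                    state = State.STRING;
                    break;
                case COMMENT:
                    if (ch == '\n') {
                        state = State.CODE;
                        marks.append('.');
                    } else {
                        marks.append('c');
                    }
                    break;
            }
        }
        return marks.toString();
    }

    public static void main(String[] args) {
        System.out.println(classify("x=\"a\";//hi")); // prints ..sss.cccc
        System.out.println(classify("s=\"oops"));     // prints ..sssss
    }
}
```

Because every state has a transition for every possible character,
there is no input the automaton can reject. Broken code simply gets a
best-guess labelling.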
Limitations
If you are colourising code, or rearranging code, ideally you want
your parser to work even if the code contains syntax errors, or if you
just have a snippet. Traditional parsers only work on syntactically
perfect complete programs. This is why I used finite state automata
instead of ANTLR (Another Tool for Language Recognition) for the
colourising parsers on this site.