regex : Java Glossary
2008-04-28 by Roedy Green ©1996-2008 Canadian Mind Products
Regex  regex
regular expression: a system of pattern masks to describe strings to search for, sort of a more generalised wildcarding. You can use them to scan text to extract data embedded in boilerplate. You can use them to replace boilerplate patterns.
Introduction
Other Regex Engines
Quoting, why you need \\\\
Recipes for Quoting
Regex Variations Table
Multiple Characters
Awkward Characters
Terminology
Pattern Flags
Examples
Negative Regex
Matching vs Finding vs LookingAt
Splitting
Tips
String
Books
Learning More
Links

Introduction

JDK 1.4 introduced the java.util.regex package. If regexes don’t work, use Wassup to check which version of Java you are using. You may be inadvertently using an old one. Perl-like regex expressions are compiled into a Pattern (parsed into an internal state machine format, not byte code). You don’t use a constructor to create a Pattern; you use the static method Pattern.compile(String). Then you create a Matcher object with Pattern.matcher(String), feeding it the String you want to test against the pattern. Finally, you call Matcher.matches to see if the String fits the pattern. There are many other things you can do, for example, find multiple matches in your String.
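The compile, matcher, matches sequence described above might look something like this. It is only a sketch: the class name, the method name and the pattern (a rough Canadian postal code shape) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BasicMatch {
    // Compile once with the static factory; there is no public constructor.
    // Hypothetical pattern: letter digit letter, space, digit letter digit.
    private static final Pattern POSTAL =
            Pattern.compile("[A-Z][0-9][A-Z] [0-9][A-Z][0-9]");

    /** True only if the ENTIRE candidate fits the pattern. */
    static boolean fits(String candidate) {
        Matcher m = POSTAL.matcher(candidate);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(fits("V8T 4G7"));
        System.out.println(fits("V8T-4G7"));
    }
}
```

Because a compiled Pattern is immutable, keeping it in a static final field lets you reuse it for every String you test.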

Regex cannot do tasks like look for balanced ( ) or deal with a simple precedence grammar. For that you need a parser.

Other Regex Engines

Daniel Savarese has written a second regex package based on Perl regexes. Look at the Apache Jakarta project. IBM Alphaworks has one. Search for regex. Jakarta-ORO (née OROMatcher) lets you add regex ability to your own Java programs. Funduc Search and Replace is a utility for doing global search and replace on files using regular expressions. The Quoter Amanuensis helps you compose regex expressions for Funduc Search and Replace. SlickEdit® is a text editor that supports several kinds of regular expressions for global search and replace. Forté Agent newsreader has a regular expression scheme for describing junk filters. However, it is completely unlike Java regex. It is more like Google search expressions.

Quoting

Reserved characters, aka meta characters, are command characters that have special meaning in regexes. They must be quoted when you mean them literally, as just characters. This does not mean you must enclose them in quotation marks, but rather you must specially mark them as meant literally by preceding them with a \, e.g. \- \+ \?. If you are unsure, quote. It won’t hurt to quote punctuation that does not need it. However, don’t quote : in Vslick since \:… has special meaning.

Unfortunately, the regex people used the same quoting character \ as the designers of Java did for String literals. In a non-regex Java String literal, every literal \ must be doubled. In a regex every literal \ must be doubled. So when you express a regex as a Java String literal, every literal \ must be quadrupled and written as \\\\.
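Here is a small sketch of the quadrupling rule in action. The class and method names are invented for the example. The four backslashes in the source collapse to two in the String the regex engine receives, and those two match one literal backslash in the data.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    /** Java literal "\\\\" is the two-char regex \\ which matches ONE backslash. */
    static final String ONE_BACKSLASH = "\\\\";

    static boolean isSingleBackslash(String s) {
        return Pattern.matches(ONE_BACKSLASH, s);
    }

    /** Convert Windows-style separators to forward slashes. */
    static String toForwardSlashes(String path) {
        // search string is a regex, so the backslash is quadrupled
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        System.out.println(ONE_BACKSLASH.length());
        System.out.println(toForwardSlashes("C:\\temp\\a.txt"));
    }
}
```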

When you compose a regex String on the fly, character by character, then Java String literal quoting is no longer at play. There you merely need to double each \. Be especially careful with File.separatorChar in regexes composed on the fly. If it is \ it must be doubled.
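One way to handle the separator when building a regex at run time is sketched below. The helper names separatorRegex and splitPath are hypothetical. Passing the separator in as a parameter, rather than reading File.separatorChar inside the helper, makes the backslash case easy to try on any platform.

```java
import java.io.File;

class SeparatorRegex {
    /** Regex fragment that matches the given separator char literally. */
    static String separatorRegex(char sep) {
        StringBuilder sb = new StringBuilder();
        if (sep == '\\') {
            // a backslash in the data must be doubled for the regex engine
            sb.append('\\');
        }
        sb.append(sep);
        return sb.toString();
    }

    /** Split a path into its pieces on the given separator. */
    static String[] splitPath(String path, char sep) {
        return path.split(separatorRegex(sep));
    }

    public static void main(String[] args) {
        String path = "usr" + File.separatorChar + "local" + File.separatorChar + "bin";
        for (String piece : splitPath(path, File.separatorChar)) {
            System.out.println(piece);
        }
    }
}
```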

Java 1.4.1+ also offers \Q…\E for quoting long passages without having to quote command characters individually. You still have to quote for String literals though.
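A minimal sketch of \Q…\E quoting follows. The phrase being searched for and the class name are made up for illustration. Note the backslashes of \Q and \E are still doubled because the regex is written as a Java String literal.

```java
import java.util.regex.Pattern;

class QuoteSpan {
    // Regex seen by the engine: \Q1+1=2 (approx.)\E
    // Everything between \Q and \E is taken literally, so + ( ) and .
    // lose their command meaning.
    static final Pattern LITERAL = Pattern.compile("\\Q1+1=2 (approx.)\\E");

    static boolean contains(String text) {
        return LITERAL.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("We know 1+1=2 (approx.) here"));
        // Would match if the pattern were left unquoted, but not here:
        System.out.println(contains("11=2 approx!"));
    }
}
```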

The quoter amanuensis will let you compose your literal regex strings then convert them to deal with both regex and Java \\ quoting.

In JDK 1.5+, Pattern.quote will do the same thing the quoter amanuensis does to a String to give you the equivalent regex, properly quoted to match it literally. It just mindlessly sandwiches the string in \Q … \E, whether it needs it or not.
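Here is a sketch of Pattern.quote at work, assuming JDK 1.5+. The helper replaceLiteral is a hypothetical name. It also uses Matcher.quoteReplacement, the companion method for the replace side, so that a $ in the replacement is not taken as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    /** Replace every literal occurrence of target, treating nothing as a regex command. */
    static String replaceLiteral(String text, String target, String replacement) {
        return text.replaceAll(Pattern.quote(target),
                               Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Pattern.quote simply sandwiches the string in \Q ... \E
        System.out.println(Pattern.quote("a.b"));
        System.out.println(replaceLiteral("cost: $5 (each)", "$5 (each)", "$6"));
    }
}
```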

Again, it won’t hurt to quote punctuation that doesn’t need it. Note that " and ' don’t need regex quoting, though they need Java quoting.

Recipes for Quoting Awkward Characters in Java Regexes in Java source code String Literals

How to Write Awkward Characters Literally in Java Regex String Literals
Character name | Character | Java literal | Regex | Java literal + Regex
left bracket, acting as a regex command character | [ | "[" | [ | "["
left bracket, reserved regex command character acting as a literal [ | [ | "[" | \[ | "\\["
A literal newline character | | "\n" | \n | "\\n"
A literal carriage return character | | "\r" | \r | "\\r"
A literal double quote character, magic to Java, nothing special to regex. | " | "\"" | " | "\""
A literal single quote character, magic to Java, nothing special to regex. | ' | "\'" | ' | "\'"
A literal backslash character, magic to both Java and regex. | \ | "\\" | \\ | "\\\\"

Regex Variations Table

I use three different regex engines many times a day. I have a heck of a time remembering which commands work with which one. So I composed this table. Lucky I don’t need Perl too.
Regex Variations Table
Use Java 1.4+ SlickEdit®
Unix
Funduc SR Function
search
reserved
$ ( ) * + - . ? [ \ ] ^ { | } * + . ? \ { | } ! $ ( ) * + - ? [ \ ] ^ | Reserved metachars in search strings must be \-quoted if used as data chars. e.g. \+ \* \|. If in doubt, quote. It won't hurt.
replace
reserved
\ \ % < > \ Reserved metachars in replace strings must be \-quoted if used as data chars. e.g. \% \\ \< \> If in doubt, quote. It won't hurt.
0+ * * * Zero or More of the preceding thing. .* matches anything. In Funduc, the * comes before the thing repeated, e.g. *[] to match anything even over multiple lines. In Java and SlickEdit, the * comes after, e.g. [a-z]*.
1+ + + + One or More of the preceding thing. In Funduc, the + comes before the thing repeated, e.g. +[0-9\,\.\+\-] to crudely match a number. In Java and SlickEdit, the + comes after, e.g. [0-9\,\.\+\-]+.
1 {1} {1} default Exactly One of the preceding things, similarly for any {n}.
Here is a cute trick to use this Java feature to count characters, inserting a dash between pairs of characters:
// insert a dash between pairs of chars
String cute = "AA54BG4G3G".replaceAll( "(\\w{2})(?!$)", "$1-" );
// cute is "AA-54-BG-4G-3G"
0 or 1 ? ?   Zero or One of the preceding thing. e.g. (abc)? will match "" or "abc"
not char ^ ~ ?!() Not character operator, e.g. In Java, [^abc] means anything but a, b or c. In other contexts ^ means start of line. In VSlick [~abc] means the same. In Funduc works only on expressions. xref?!(=) finds the letters xref followed by anything but =
not exp (?!X) ~ ! Not expression operator. In Java, anything but X, via zero width negative lookahead. After the non-match, you continue where you left off, not at the end of the non-matching string. In Funduc xref?!(=) finds the letters xref followed by anything but =
or | | | infix or Operator, (cat|dog) matches cat or dog.
any . . ? any char but newline. To make newline also match dot, in Java, embed (?s). (?s) does not match anything, it just switches mode. You can also turn the mode on with a Pattern.compile flag DOTALL.
nl \r\n \n \r\n newline, given for Windows.
sol ^ ^ ^ Start of Line. In other contexts means not.
eol $ $ $ End of Line. For Windows, matches a pair of characters \r\n. For Linux matches \n. For Mac matches \r.
sof     ^^ Start of File
eof     $$ End of File
range [] [] [] Range Operator, list of chars, [ab] means match a or b. [a-z] matches any character in range a through z. [0-9] is a digit. [a-z] is lower case. [A-Z] is upper case. [ -~] (space dash tilde) is any printable ASCII char.
In Funduc, you don't need parentheses around [a-z] in the search string.
negation [^, ] [~, ] n/a any character except a comma or space
intersection [a-z&&[^bc]] n/a n/a a through z, except for b and c
sub () () () Sub-Expression.
In Funduc, you don't need parentheses around [a-z] in the search string.
col     +n Column Operator
replace $1 \1
\2 etc.
%1
%2 etc.
%1< (to lower case)
back reference to tagged expression #1, in () for replace.
E.g. in SlickEdit to replace all occurrences of
<span class="jmethod">
used before an upper case name, converting them to
<span class="jclass"> .
Search string : <span class="jmethod">([A-Z])
Replace string : <span class="jclass">\1
Remember to turn exact case matching on for these to work.
In Funduc, you don't need parentheses around [a-z] in the search string.

Java regex has only very primitive replace ability. Every match must be replaced by the same string. However, in Java you can also use \1 in the match string to insist on a match for some expression found earlier in the string, i.e. a repeated pattern.

replace
example
search: \(([a-zA-Z\(\"])
replace: \( \1
search: \(([a-zA-Z\(\"])
replace: \( \1
search: \([a-zA-Z\(\"]
replace: \( %1
Replace all (x with ( x but only if x is alphabetic or ( or "
space \s [ \t\n] [ \t\r\n] single white space
spaces \s+ \:b +[ \t\r\n] one or more white spaces, [ \t\n\x0B\f\r] Watch out, matches line end as well!
black \S [^ \t\n] [! \t\r\n] single non white space (blank, tab)
blacks \S+ [^ \n\t]+ +[! \r\n\t] one or more non-white spaces
word (\p{Alpha}+) \:w +[A-Za-z] alphabetic word (string of A-Z a-z )
number ([0-9\,\.\+\-]+) ([0-9\,\.\+\-]+) +[0-9\,\.\+\-] number (string of digits, commas, decimal points and signs)
quoted \"(\\\"|([ A-Za-z\'\[\]\+\=\!\@\#
\$\%\^\&\*\(\)
\<\>\:\;\?\|\\]*))\"
\:q \"(\\\"(*[ A-Za-z\'\[\]\+\=\!\@\#
\$\%\^\&\*\(\)
\<\>\:\;\?\|\\]*))\"
quoted String. It is sometimes easier just to quote all punctuation. It is easier to proofread. Don’t quote : in Vslick since \:… has special meaning.
special \d = digit
\D = non digit
\s = single whitespace char
\S = not whitespace
\w = single alphanumeric char (or underscore)
\W = not alphanumeric
\p{Lower}
\p{Upper}
\p{ASCII}
\p{Alpha}
\p{Digit}
\p{Alnum}
\p{Punct}
\p{Graph}
\p{Print}
\p{Blank}
\p{Cntrl}
\p{XDigit}
\p{Space}
\p{Lu}
\p{InGreek}
\p{Sc}
\P{InGreek}
(?i) turn on case insensitive mode
(?-i) turn on case sensitive mode
\:a alphanumeric char
= [A-Za-z0-9]

\:b blanks
= ([ \t]+)

\:c alpha char
= [A-Za-z]

\:d digit
= [0-9]

\:f filename part

\:h hex
= ([0-9A-Fa-f]+)

\:i int
= ([0-9]+)

\:n float

\:p path

\:q quoted string

\:v C language variable name
= ([A-Za-z_$][A-Za-z0-9_$]*)

\:w word
= ([A-Za-z]+)

  predefined match strings, e.g. \:w = ([A-Za-z]+) matches a word. Those are braces in \p{Alnum} not parentheses. It can be hard to tell in some typefaces. The strings are case sensitive, and when used in Java source code such strings must be coded as "\\p{Alnum}". Typically these abbreviations are not designed to work inside […].
capture X{n,m}
capturing
/
non-capturing
constructs
  %%srpath%%
%%srfile%%
%%srfiledate%%
%%srfiletime%%
%%srfilesize%%
%%srdate%%
%%srtime%%
%%envvar=fruit%%
X{n,m} means X appears at least n and at most m times.
This table only covers the most common magic characters. See the documentation for each Regex package for details.
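Two of the Java-only rows in the table, character class intersection and the \1 back reference in the search string, can be tried out with a sketch like this. The class name, method names and sample text are invented for the example.

```java
import java.util.regex.Pattern;

class TableFeatures {
    // intersection: a through z, except for b and c
    static final Pattern NOT_B_OR_C = Pattern.compile("[a-z&&[^bc]]+");

    // \1 in the SEARCH string insists on a repeat of whatever group 1 matched,
    // here the same word twice in a row.
    static final Pattern DOUBLED_WORD =
            Pattern.compile("\\b(\\p{Alpha}+)\\s+\\1\\b");

    static boolean avoidsBandC(String s) {
        return NOT_B_OR_C.matcher(s).matches();
    }

    /** Collapse "the the" to "the"; $1 in the REPLACE string is group 1. */
    static String collapseDoubledWords(String s) {
        return DOUBLED_WORD.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(avoidsBandC("deaf"));
        System.out.println(avoidsBandC("cab"));
        System.out.println(collapseDoubledWords("the the cat sat on on the mat"));
    }
}
```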

Multiple Characters

Multiples in Java Regex
[A-Z] A single upper-case letter
[A-Z]* zero or more upper-case letters
[A-Z]+ one or more upper-case letters
[A-Z][A-Z] Exactly two upper-case letters
[A-Z]{2} Exactly two upper-case letters (same as above)
[A-Z]{2,} Two or more upper-case letters
[A-Z]{2,10} Between 2 and 10 (inclusive) upper-case letters
[a-zA-Z] A single letter, upper- or lower-case
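You can check the rows of this table for yourself with a tiny helper such as the one sketched here. The class and method names are made up. Pattern.matches tests the whole String against the regex.

```java
import java.util.regex.Pattern;

class Multiples {
    /** True if the whole of s fits regex. */
    static boolean fits(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(fits("[A-Z]*", ""));                // zero letters is fine for *
        System.out.println(fits("[A-Z]+", ""));                // but not for +
        System.out.println(fits("[A-Z]{2}", "AB"));            // exactly two
        System.out.println(fits("[A-Z]{2,}", "ABC"));          // two or more
        System.out.println(fits("[A-Z]{2,10}", "ABCDEFGHIJK")); // eleven is too many
    }
}
```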

Awkward Characters

Here is how to represent various awkward characters. They represent the combined quoting needs for Java String literals and Regex Patterns.
How To Encode Awkward Characters
How Desired
\\\\ \ The literal backslash character. You must double the \ twice since \ is the quoting character in both Java and Regex literals.
\\xhh The character with hexadecimal value 0xhh, e.g. \\xff. Only works with two hex digits!
\uhhhh The character with hexadecimal value 0xhhhh, e.g. \u20ac. Must always have exactly four hex digits. Don’t use for control characters e.g. 0..ff since \u expansion happens prior to compilation. In other words \u000a will start a new line in your program. Note there is only one lead \.
\\t The tab character \u0009
\\n The newline (line feed) character \u000a
\\r The carriage-return character \u000d
\\f The form-feed character \u000c
\\a The alert (bell) character \u0007
\\e The escape character \u001b
\\cx control characters, e.g. \\cq for ctrl-q.
\\- Literal -, not a regex range operator.
\\+ Literal +, not a regex operator.
\\* Literal *, not a regex operator.
\\? Literal ?, not a regex operator.
\\( Literal (, not a regex expression bracketer.
\\) Literal ), not a regex expression bracketer.
\\[ Literal [, not a regex expression bracketer.
\\] Literal ], not a regex expression bracketer.
\\{ Literal {, not a regex expression bracketer.
\\} Literal }, not a regex expression bracketer.
\\| Literal |, not a regex operator.
\\$ Literal $, not a regex end of line.
\\^ Literal ^, not regex operator.
\\< Literal <, not regex operator.
\\= Literal =, not regex operator.
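As a rough illustration of a few of these encodings used together, consider the sketch below. The patterns and names are hypothetical. PRICE quotes $ and . so they are data, not commands. HEX_TAB_EURO mixes \\xhh, \\t and a single-backslash \uhhhh.

```java
import java.util.regex.Pattern;

class Awkward {
    // Regex seen by the engine: \$\d+\.\d{2}   e.g. $12.50
    static final Pattern PRICE = Pattern.compile("\\$\\d+\\.\\d{2}");

    // \\x41 is the letter A, \\t is a tab. \u20ac has only one lead \
    // because the Java compiler turns it into the euro sign before
    // the regex engine ever sees it.
    static final Pattern HEX_TAB_EURO = Pattern.compile("\\x41\\t\u20ac");

    public static void main(String[] args) {
        System.out.println(PRICE.matcher("$12.50").matches());
        System.out.println(PRICE.matcher("$12x50").matches());
        System.out.println(HEX_TAB_EURO.matcher("A\t\u20ac").matches());
    }
}
```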

Terminology

Pattern.CASE_INSENSITIVE is a flag you can feed to Pattern.compile to do case insensitive searches. This is much easier than trying to do them directly in the regex strings.

Java 1.4.1+ regexes have assertions, extra conditions placed on the match. Colourful regex terminology includes:

The easiest way to understand these terms is to experiment with the various regex operators on simple strings. You can make yourself a test program that reads strings from the console. That way, at least you can avoid having to deal with Java \ string quoting. You only need concern yourself with regex \ quoting. You can also use the Quoter Amanuensis to first apply regex quoting then Java string quoting and let you paste the result into your program.

Pattern Flags

You can specify flags with Pattern.compile( String regex, int flags ).

By default regexes are case-sensitive.

Possible Pattern flags
Flag Alternate Embedded Code Notes
CASE_INSENSITIVE (?i) Makes case irrelevant when matching, so s matches S.
MULTILINE (?m) Makes ^ and $ match at embedded newlines. You might expect them to match there by default, but they don’t.
DOTALL (?s) Makes . match any character, including a line terminator. By default . does not match line terminators.
UNICODE_CASE (?u) Used in conjunction with CASE_INSENSITIVE to use the elaborate case-folding schemes to compare Unicode upper and lower case. By default, the presumption is all characters being matched are US-ASCII.
CANON_EQ   Treats canonically equivalent accented characters, written as a single char or as a pair, as equal, e.g. å: the pair "a\u030A" is treated the same as the single character "\u00E5".
UNIX_LINES (?d) Only \n is recognised as a line terminator in ., ^ and $ processing.
LITERAL \Q\E Treat all characters as ordinary literals rather than as commands. You don’t then quote with \.
COMMENTS (?x) Makes whitespace ignored, and allows embedded comments starting with # that are ignored until the end of a line.
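Several of these flags are tried out in the sketch below, both as Pattern.compile arguments and, for (?i), as an embedded code. The patterns and field names are made up for illustration. Flags are int bit masks, so you combine them with |.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    static final Pattern ANY_CASE =
            Pattern.compile("regex", Pattern.CASE_INSENSITIVE);
    static final Pattern ANY_CASE_EMBEDDED = Pattern.compile("(?i)regex");

    // ^ matches after an embedded newline only with MULTILINE
    static final Pattern LINE_START_TODO =
            Pattern.compile("^todo", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // . crosses line terminators only with DOTALL
    static final Pattern SPAN = Pattern.compile("<b>.*</b>", Pattern.DOTALL);
    static final Pattern SPAN_DEFAULT = Pattern.compile("<b>.*</b>");

    // COMMENTS: whitespace is ignored and # starts a comment to end of line
    static final Pattern YEAR_MONTH = Pattern.compile(
            "\\d{4}  # year\n"
          + "-       # separator\n"
          + "\\d{2}  # month",
            Pattern.COMMENTS);

    public static void main(String[] args) {
        System.out.println(ANY_CASE.matcher("ReGeX").matches());
        System.out.println(LINE_START_TODO.matcher("done\nTodo: x").find());
        System.out.println(SPAN.matcher("<b>a\nb</b>").matches());
        System.out.println(SPAN_DEFAULT.matcher("<b>a\nb</b>").matches());
        System.out.println(YEAR_MONTH.matcher("2008-04").matches());
    }
}
```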

Examples

The following examples use the Java conventions. For use on the command line, undouble the \\.

Negative Regex
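
One way to say "not followed by X" in Java is the zero width negative lookahead (?!X) listed in the variations table. The sketch below is only an illustration with made-up names. The first pattern mirrors the Funduc xref example. The second anchors a lookahead at the start to accept only strings that do not contain a given word.

```java
import java.util.regex.Pattern;

class NegativeDemo {
    // xref NOT followed by =
    static final Pattern XREF_NOT_EQ = Pattern.compile("xref(?!=)");

    // whole string must not contain the word error anywhere
    static final Pattern NO_ERROR = Pattern.compile("(?!.*error).*");

    static boolean hasBareXref(String s) {
        return XREF_NOT_EQ.matcher(s).find();
    }

    static boolean lacksError(String s) {
        return NO_ERROR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasBareXref("xref 5"));
        System.out.println(hasBareXref("xref=5"));
        System.out.println(lacksError("all ok"));
        System.out.println(lacksError("fatal error here"));
    }
}
```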

Matching vs Finding vs LookingAt
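
Matcher offers three ways to test: matches must fit the entire String, lookingAt must fit starting at the beginning but need not reach the end, and find will accept a fit anywhere and can be called repeatedly to walk through successive hits. Here is a sketch of the difference. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKinds {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    /** Entire string must be digits. */
    static boolean whole(String s) {
        return DIGITS.matcher(s).matches();
    }

    /** String must start with digits; anything may follow. */
    static boolean prefix(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    /** Digits may appear anywhere. */
    static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    /** Collect every run of digits by calling find repeatedly. */
    static List<String> all(String s) {
        List<String> hits = new ArrayList<String>();
        Matcher m = DIGITS.matcher(s);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(whole("123abc"));
        System.out.println(prefix("123abc"));
        System.out.println(anywhere("abc123"));
        System.out.println(all("a1b22c333"));
    }
}
```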

Splitting

Regexes can be used to break phrases into individual words. Here is an example:
Beware, split treats leading, embedded and trailing separators differently. It ignores trailing separators unless you use split ( string, -1 /* limit */ ). It inherited this oddity from Perl.
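A sketch of splitting, including the trailing-separator oddity, might look like this. The class and method names and the sample data are made up for illustration.

```java
import java.util.regex.Pattern;

class SplitDemo {
    static final Pattern COMMA = Pattern.compile(",");

    /** Break a phrase into words on runs of whitespace. */
    static String[] words(String phrase) {
        return phrase.trim().split("\\s+");
    }

    /** Default split: trailing empty fields are silently dropped. */
    static String[] fieldsDroppingTrailing(String line) {
        return COMMA.split(line);
    }

    /** A negative limit keeps trailing empty fields. */
    static String[] fieldsKeepingTrailing(String line) {
        return COMMA.split(line, -1);
    }

    public static void main(String[] args) {
        System.out.println(words("the quick  brown fox").length);
        System.out.println(fieldsDroppingTrailing("a,b,,").length);
        System.out.println(fieldsKeepingTrailing("a,b,,").length);
        // a leading separator, by contrast, yields a leading empty field
        System.out.println(fieldsDroppingTrailing(",a").length);
    }
}
```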

Tips

String

The String class borrows some convenience regex methods, such as split, matches, replaceAll and replaceFirst. Normally you would use the more efficient java.util.regex methods such as Matcher.replaceFirst and Matcher.replaceAll where you precompile your Pattern and reuse it. The String versions are for one-shot use where efficiency is not a concern. Note that String.replace does not use regexes.
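The contrast between the one-shot String methods and a precompiled Pattern can be sketched as follows. The names are hypothetical. The last two methods show why it matters that String.replace is not a regex method: a dot is just a dot to replace, but means any character to replaceAll.

```java
import java.util.regex.Pattern;

class Reuse {
    // compiled once, reused on every call
    private static final Pattern WS_RUN = Pattern.compile("\\s+");

    /** One-shot: the regex is recompiled on each call. */
    static String squeezeOneShot(String s) {
        return s.replaceAll("\\s+", " ");
    }

    /** Same result using the precompiled Pattern. */
    static String squeeze(String s) {
        return WS_RUN.matcher(s).replaceAll(" ");
    }

    /** String.replace takes its arguments literally. */
    static String dotsToDashes(String s) {
        return s.replace('.', '-');
    }

    /** String.replaceAll treats . as the regex any-char command. */
    static String dotsAsRegex(String s) {
        return s.replaceAll(".", "-");
    }

    public static void main(String[] args) {
        System.out.println(squeeze("a  b\t\nc"));
        System.out.println(dotsToDashes("a.b"));
        System.out.println(dotsAsRegex("a.b"));
    }
}
```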

Books

Recommended book: Mastering Regular Expressions, Powerful Techniques for Perl and Other Tools, Second Edition
paperback
ISBN10: 0-596-00289-0
ISBN13: 978-0-596-00289-3
publisher: O’Reilly
published: 2002-07-15
by: Jeffrey E. Friedl, Andy Oram
The Owl Book. Includes scripting languages such as Perl, Tcl, awk and Python. Does not specifically cover Java, though Java regexes were modeled on Perl. More a book for regex experts to hone their skills than for a newbie to learn regexes. It is a good place to find regex solutions to standard problems. While it isn’t made up in cookbook style, the examples are usually real-life problems that can be put into practical use.
Recommended book: Regular Expression Pocket Reference
paperback
ISBN10: 0-596-00415-X
ISBN13: 978-0-596-00415-6
publisher: O’Reilly
published: 2003-05
by: Tony Stubblebine
The Owl Cheat Sheet. Pocket reference companion to Mastering Regular Expressions, which also has an owl on the cover.

Learning More

Sun’s Javadoc on the Regex package
Sun’s Javadoc on the Pattern class
Sun’s Javadoc on the Matcher class
Sun’s Javadoc on String.matches
Sun’s Javadoc on String.replaceAll
Sun’s Javadoc on String.replaceFirst
Sun’s Javadoc on String.split

Slick Edit documentation available from Help | contents ⇒ Search and Replace ⇒ Regular Expressions ⇒ Unix Regular Expressions.

Funduc search and replace documentation is available from Help ⇒ contents ⇒ Regular Expressions | Search Operators.

4NT documentation is available from help | contents ⇒ wildcards ⇒ advanced wildcards


The information on this page is for non-military use only. Military use includes use by defence contractors.
You can get a fresh copy of this page from: or possibly from your local J: drive (Java virtual drive/Mindprod website mirror)
http://mindprod.com/jgloss/regex.html J:\mindprod\jgloss\regex.html