By default, Guile supports POSIX extended regular expressions. That means that the characters ‘(’, ‘)’, ‘+’ and ‘?’ are special, and must be escaped if you wish to match the literal characters. It also means that there is no support for “non-greedy” variants of ‘*’, ‘+’ or ‘?’.
This regular expression interface was modeled after that implemented by SCSH, the Scheme Shell. It is intended to be upwardly compatible with SCSH regular expressions.
Zero bytes (#\nul) cannot be used in regex patterns or input strings, since the underlying C functions treat a zero byte as the end of the string. If there’s a zero byte, an error is thrown.
Internally, patterns and input strings are converted to the current locale’s encoding, and then passed to the C library’s regular expression routines (see Regular Expressions in The GNU C Library Reference Manual). The returned match structures always point to characters in the strings, not to individual bytes, even in the case of multi-byte encodings.
Scheme Procedure: string-match pattern str [start]
Compile the string pattern into a regular expression and compare it with str. The optional numeric argument start specifies the position of str at which to begin matching.
string-match returns a match structure which describes what, if anything, was matched by the regular expression. See Match Structures. If str does not match pattern at all, string-match returns #f.
Two examples of a match follow. In the first example, the pattern matches the four digits in the match string. In the second, the pattern matches nothing.
    (string-match "[0-9][0-9][0-9][0-9]" "blah2002")
    ⇒ #("blah2002" (4 . 8))

    (string-match "[A-Za-z]" "123456")
    ⇒ #f
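The optional start argument can be illustrated with a short sketch. It assumes the (ice-9 regex) module is loaded for string-match and the match accessors; the input string is made up for the example. The positions in the returned match structure are still counted from the beginning of the whole string, not from start.

```scheme
(use-modules (ice-9 regex))

;; Begin searching at index 2, which skips the "1" at index 1.
(define m (string-match "[0-9]+" "a1 b22" 2))

(match:substring m)  ; ⇒ "22"
(match:start m)      ; ⇒ 4, an index into the whole string
```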
Each time string-match is called, it must compile its pattern argument into a regular expression structure. This operation is expensive, which makes string-match inefficient if the same regular expression is used several times (for example, in a loop). For better performance, you can compile a regular expression in advance and then match strings against the compiled regexp.
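As a minimal sketch of that advice, the pattern below is compiled once with make-regexp and then reused for every element of a list with regexp-exec. The names digits-rx and first-numbers are hypothetical, chosen only for this illustration.

```scheme
(use-modules (ice-9 regex))

;; Compile once, outside the loop.
(define digits-rx (make-regexp "[0-9]+"))

;; Hypothetical helper: the first run of digits in each line, or #f.
(define (first-numbers lines)
  (map (lambda (line)
         (let ((m (regexp-exec digits-rx line)))
           (and m (match:substring m))))
       lines))

(first-numbers '("blah2002" "no digits" "x 7 y 8"))
;; ⇒ ("2002" #f "7")
```

Calling string-match inside the lambda would give the same result, but would recompile the pattern for every line.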
Scheme Procedure: make-regexp pat flag…
Compile the regular expression described by pat, and return the compiled regexp structure. If pat does not describe a legal regular expression, make-regexp throws a regular-expression-syntax error.
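One way to handle that error is to catch it by its key. The sketch below assumes only the regular-expression-syntax key named above; try-make-regexp is a hypothetical helper, not part of Guile.

```scheme
;; Hypothetical helper: compile pat, returning #f instead of throwing.
(define (try-make-regexp pat)
  (catch 'regular-expression-syntax
    (lambda () (make-regexp pat))
    (lambda (key . args) #f)))

(regexp? (try-make-regexp "[0-9]+"))  ; ⇒ #t
(try-make-regexp "(")                 ; ⇒ #f, unbalanced parenthesis
```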
The flag arguments change the behavior of the compiled regular expression. The following values may be supplied:
Scheme Variable: regexp/icase
Consider uppercase and lowercase letters to be the same when matching.
Scheme Variable: regexp/newline
If a newline appears in the target string, then permit the ‘^’ and ‘$’ operators to match immediately after or immediately before the newline, respectively. Also, the ‘.’ and ‘[^...]’ operators will never match a newline character. The intent of this flag is to treat the target string as a buffer containing many lines of text, and the regular expression as a pattern that may match a single one of those lines.
Scheme Variable: regexp/basic
Compile a basic (“obsolete”) regexp instead of the extended (“modern”) regexps that are the default. Basic regexps do not consider ‘|’, ‘+’ or ‘?’ to be special characters, and require the ‘{...}’ and ‘(...)’ metacharacters to be backslash-escaped (see Backslash Escapes). There are several other differences between basic and extended regular expressions, but these are the most significant.
Scheme Variable: regexp/extended
Compile an extended regular expression rather than a basic regexp. This is the default behavior; this flag will not usually be needed. If a call to make-regexp includes both regexp/basic and regexp/extended flags, the one which comes last will override the earlier one.
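The effect of the compile-time flags can be seen in a small sketch. It assumes (ice-9 regex) is loaded for the match accessors; the three-line text is made up for the example. Each flag is passed to make-regexp as a separate argument.

```scheme
(use-modules (ice-9 regex))

(define text "alpha\nBETA\ngamma")

;; Without flags, ^ and $ anchor only at the ends of the whole string,
;; and letter case is significant.
(regexp-exec (make-regexp "^beta$") text)  ; ⇒ #f

;; With regexp/newline and regexp/icase, the middle line matches.
(define m
  (regexp-exec (make-regexp "^beta$" regexp/newline regexp/icase) text))

(match:substring m)  ; ⇒ "BETA"
(match:start m)      ; ⇒ 6
```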
Scheme Procedure: regexp-exec rx str [start [flags]]
Match the compiled regular expression rx against str. If the optional integer start argument is provided, begin matching from that position in the string. Return a match structure describing the results of the match, or #f if no match could be found.
The flags argument changes the matching behavior. The following flag values may be supplied; use logior (see Bitwise Operations) to combine them:
Scheme Variable: regexp/notbol
Consider that the start offset into str is not the beginning of a line and should not match operator ‘^’.
If rx was created with the regexp/newline option above, ‘^’ will still match after a newline in str.
Scheme Variable: regexp/noteol
Consider that the end of str is not the end of a line and should not match operator ‘$’.
If rx was created with the regexp/newline option above, ‘$’ will still match before a newline in str.
    ;; Regexp to match uppercase letters
    (define r (make-regexp "[A-Z]*"))

    ;; Regexp to match letters, ignoring case
    (define ri (make-regexp "[A-Z]*" regexp/icase))

    ;; Search for bob using regexp r
    (match:substring (regexp-exec r "bob"))
    ⇒ ""                  ; only the empty string matches

    ;; Search for bob using regexp ri
    (match:substring (regexp-exec ri "Bob"))
    ⇒ "Bob"               ; matched case insensitive
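The match-time flags regexp/notbol and regexp/noteol are easiest to see with anchored patterns. The following is a sketch under the assumption that (ice-9 regex) is loaded; matched is a hypothetical helper used only to make the results readable.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: the matched text, or #f when there is no match.
(define (matched m) (and m (match:substring m)))

(define rx (make-regexp "^b"))

;; By default the start offset counts as the beginning of a line.
(matched (regexp-exec rx "ab" 1))                ; ⇒ "b"
;; With regexp/notbol it does not, so ^ cannot match there.
(matched (regexp-exec rx "ab" 1 regexp/notbol))  ; ⇒ #f

;; Likewise, regexp/noteol stops $ from matching at the end of the string.
(matched (regexp-exec (make-regexp "b$") "ab" 0 regexp/noteol))  ; ⇒ #f

;; Flags are combined with logior; an unanchored pattern is unaffected.
(define both (logior regexp/notbol regexp/noteol))
(matched (regexp-exec (make-regexp "b") "ab" 0 both))  ; ⇒ "b"
```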
Scheme Procedure: regexp? obj
Return #t if obj is a compiled regular expression, or #f otherwise.
Scheme Procedure: list-matches regexp str [flags]
Return a list of match structures which are the non-overlapping matches of regexp in str. regexp can be either a pattern string or a compiled regexp. The flags argument is as per regexp-exec above.
    (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
    ⇒ ("abc" "def")
Scheme Procedure: fold-matches regexp str init proc [flags]
Apply proc to the non-overlapping matches of regexp in str, to build a result. regexp can be either a pattern string or a compiled regexp. The flags argument is as per regexp-exec above.
proc is called as (proc match prev) where match is a match structure and prev is the previous return from proc. For the first call prev is the given init parameter. fold-matches returns the final value from proc.
For example, to count matches:
    (fold-matches "[a-z][0-9]" "abc x1 def y2" 0
                  (lambda (match count)
                    (1+ count)))
    ⇒ 2
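The accumulator need not be a counter. As a further sketch (the input string is made up, and (ice-9 regex) is assumed to be loaded), fold-matches can sum the numbers in a string or collect the matched text. Because matches are visited from left to right, consing onto the accumulator yields the matches in reverse order.

```scheme
(use-modules (ice-9 regex))

;; Sum every run of digits.
(fold-matches "[0-9]+" "abc 42 def 78" 0
              (lambda (match sum)
                (+ sum (string->number (match:substring match)))))
;; ⇒ 120

;; Collect the matched words; the last match ends up first.
(fold-matches "[a-z]+" "abc 42 def 78" '()
              (lambda (match acc)
                (cons (match:substring match) acc)))
;; ⇒ ("def" "abc")
```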
Regular expressions are commonly used to find patterns in one string and replace them with the contents of another string. The following functions are convenient ways to do this.
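Scheme Procedure: regexp-substitute port match item …
Write to port selected parts of the match structure match. Or if port is #f then form a string from those parts and return that.
Each item specifies a part to be written, and may be one of the following:
A string is written as given.
An integer causes the submatch with that number to be written (see match:substring). Zero is the entire match.
The symbol pre causes the portion of the string preceding the match to be written (see match:prefix).
The symbol post causes the portion of the string following the match to be written (see match:suffix).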
For example, changing a match and retaining the text before and after,
    (regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
                       'pre "37" 'post)
    ⇒ "number 37 is good"
Or matching a YYYYMMDD format date such as ‘20020828’ and re-ordering and hyphenating the fields.
    (define date-regex
      "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
    (define s "Date 20020429 12am.")
    (regexp-substitute #f (string-match date-regex s)
                       'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
    ⇒ "Date 04-29-2002 12am. (20020429)"
Scheme Procedure: regexp-substitute/global port regexp target item…
Write to port selected parts of matches of regexp in target. If port is #f then form a string from those parts and return that. regexp can be a string or a compiled regexp.
This is similar to regexp-substitute, but allows global substitutions on target. Each item behaves as per regexp-substitute, with the following differences:
A function is called as (item match) with the match structure for the regexp match; it should return a string to be written to port.
The symbol post does not write anything itself, but instead causes regexp-substitute/global to recurse on the unmatched portion of target.
This must be supplied to perform a global search and replace on target; without it regexp-substitute/global returns after a single match and output.
For example, to collapse runs of tabs and spaces to a single hyphen each,
    (regexp-substitute/global #f "[ \t]+" "this is the text"
                              'pre "-" 'post)
    ⇒ "this-is-the-text"
Or using a function to reverse the letters in each word,
    (regexp-substitute/global #f "[a-z]+" "to do and not-do"
      'pre (lambda (m) (string-reverse (match:substring m))) 'post)
    ⇒ "ot od dna ton-od"
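A function item is also useful when the replacement has to be computed from the matched text. The following sketch doubles every number in a string; the input is made up for the example and (ice-9 regex) is assumed to be loaded.

```scheme
(use-modules (ice-9 regex))

;; Replace each run of digits with twice its numeric value.
(define doubled
  (regexp-substitute/global
   #f "[0-9]+" "3 apples and 12 pears"
   'pre
   (lambda (m)
     (number->string (* 2 (string->number (match:substring m)))))
   'post))

doubled  ; ⇒ "6 apples and 24 pears"
```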
Without the post symbol, just one regexp match is made. For example the following is the date example from regexp-substitute above, without the need for the separate string-match call.
    (define date-regex
      "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
    (define s "Date 20020429 12am.")
    (regexp-substitute/global #f date-regex s
                              'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
    ⇒ "Date 04-29-2002 12am. (20020429)"