The Perl Tutorial: What's Perl?(7)

Pattern Matching

Pattern matching involves searching for a sequence of characters within a character string. When carrying out pattern matching, if a pattern is found then a match is said to have occurred.

Perl uses three main operators for pattern matching (although patterns can also be used in functions such as split()). They are m//, s///, and tr///.

The m// operator is the match operator. This operator will let us know if a match was found, and the syntax for using this operator is:

m/PATTERN/OPTIONS

PATTERN refers to the character sequence we're searching for, and OPTIONS are optional modifiers that change how the match is carried out. When using the match operator, you can omit the leading m if you delimit the pattern with forward slashes. Or, if you don't wish to use forward slashes, you can substitute another delimiter character. But remember: with any delimiter other than the forward slash, you must keep the m to direct Perl to use the match operator. If you use a pattern delimiter that's normally a special-pattern character, then you won't be able to use that character with its special meaning within the pattern you specify for matching. Here's an example of an alternative delimiter:

m!PATTERN!OPTIONS

The match operator has certain options that can be used:

OPTION  DESCRIPTION
g       Match all possible patterns
i       Ignore case
m       Test string as multiple lines
o       Only evaluate once
s       Treat string as single line
x       Ignore white space in pattern
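As a rough illustration of the delimiters and a few of the options above, here is a minimal sketch. The sample string and the variable names are made up for this example, and the =~ operator used to point each pattern at $text is covered further down.

```perl
use strict;
use warnings;

# Sample data invented for this sketch.
my $text = "Hello World, hello Perl";

# =~ applies the pattern to $text rather than to the default variable.
# With forward slashes as delimiters, the leading m may be omitted.
my $found_plain = ($text =~ /World/) ? 1 : 0;

# With any other delimiter (here !), the m is required.
my $found_bang = ($text =~ m!World!) ? 1 : 0;

# i: ignore case, so WORLD matches World.
my $found_nocase = ($text =~ m/WORLD/i) ? 1 : 0;

# g in list context returns every match; with i it finds both spellings.
my @hellos = ($text =~ m/hello/gi);

# x: white space inside the pattern is ignored.
my $found_spaced = ($text =~ m/ Wor ld /x) ? 1 : 0;

print "plain=$found_plain bang=$found_bang nocase=$found_nocase spaced=$found_spaced\n";
print "hellos=@hellos\n";   # hellos=Hello hello
```

Each flag variable ends up as 1 here; dropping the i option from the third match would make it 0.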

The s/// operator is known as the substitution operator. The syntax for this operator is:

s/PATTERN/REPLACEMENT/OPTIONS

PATTERN holds the pattern that we want to search for, and REPLACEMENT holds the value that we want to use as our replacement value when the pattern that we're searching for is found. For example:

s/abc/xyz/

In the above example we identify that we want to search for "abc" in that order, and replace it with "xyz". You can also use Pattern-Sequence variables in substitutions, which will be discussed later. The substitution operator also has options that can be used:

OPTION  DESCRIPTION
g       Change all occurrences of the pattern
i       Ignore case in pattern
e       Evaluate replacement strings as expression
m       Treat string to be matched as multiple lines
o       Evaluate only once
s       Treat string to be matched as single line
x       Ignore white space in pattern
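The effect of the g, i and e options is easiest to see side by side. The following is only a sketch: the sample string and variable names are invented, and each substitution works on its own copy so the results can be compared.

```perl
use strict;
use warnings;

# Sample data invented for this sketch.
my $original = "abc abc ABC";

# Each copy is modified in place; $original is left untouched.
(my $once       = $original) =~ s/abc/xyz/;     # no options: first occurrence only
(my $all        = $original) =~ s/abc/xyz/g;    # g: every occurrence
(my $all_nocase = $original) =~ s/abc/xyz/gi;   # g and i: every occurrence, any case
(my $computed   = $original) =~ s/abc/2 * 21/e; # e: replacement is evaluated as Perl code

print "$once\n";        # xyz abc ABC
print "$all\n";         # xyz xyz ABC
print "$all_nocase\n";  # xyz xyz xyz
print "$computed\n";    # 42 abc ABC
```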

Finally, the Translation operator provides us with another method to substitute one group of characters for another. The translation operator syntax is:

tr/STRING1/STRING2/OPTIONS

Here STRING1 contains a list of characters to be replaced, and STRING2 contains the characters that replace them. The first character in STRING1 is replaced by the first character in STRING2, the second character by the second character in STRING2, and so on.

tr/abc/def/

In the above example, abc is STRING1 and def is STRING2: a is replaced by d, b is replaced by e, and c is replaced by f. If you wanted to convert all the characters from uppercase to lowercase, you would use:

tr/A-Z/a-z/;

As you can see, character ranges such as A-Z are supported in the translation operator. Once again, the translation operator also has options that can be used:

OPTION  DESCRIPTION
c       Translate all characters not specified
d       Delete all specified characters
s       Replace multiple identical output characters with a single character
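A short sketch of the translation operator and its three options follows. The strings and variable names are invented for the example. It also relies on one behaviour not mentioned above: tr returns the number of characters it matched, and an empty STRING2 without the d option leaves the string unchanged, which makes it a handy counter.

```perl
use strict;
use warnings;

# Sample data invented for this sketch.
my $word = "Hello World";

# Plain translation with ranges: uppercase to lowercase.
(my $lower = $word) =~ tr/A-Z/a-z/;

# Empty STRING2 and no options: nothing changes, but the count is returned.
my $vowel_count = ($word =~ tr/aeiou//);

# d: delete every character listed in STRING1.
(my $no_l = $word) =~ tr/l//d;

# s: squeeze runs of the same translated character into one.
(my $squeezed = "aaabbbccc") =~ tr/a-c//s;

# c: translate everything NOT listed in STRING1 (here, the space).
(my $masked = $word) =~ tr/a-zA-Z/*/c;

print "$lower\n";        # hello world
print "$vowel_count\n";  # 3
print "$no_l\n";         # Heo Word
print "$squeezed\n";     # abc
print "$masked\n";       # Hello*World
```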

Remember that if you are using the forward slash as the delimiter and your pattern also contains a forward slash, then you must escape it using the escape character "\".
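To make the escaping rule concrete, here is a small sketch with an invented path string. It shows the same pattern written twice, once with escaped slashes and once with a different delimiter (braces are an arbitrary choice here).

```perl
use strict;
use warnings;

# Sample data invented for this sketch.
my $path = "/usr/local/bin";

# Forward-slash delimiters: each / inside the pattern is escaped with a backslash.
my $escaped = ($path =~ m/\/usr\/local/) ? 1 : 0;

# A different delimiter (here braces) avoids the escaping entirely.
my $unescaped = ($path =~ m{/usr/local}) ? 1 : 0;

# The same rule applies to substitution: turn every / into a colon.
(my $colons = $path) =~ s/\//:/g;

print "$escaped $unescaped $colons\n";   # 1 1 :usr:local:bin
```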

Now let's see how we build patterns:

By default, a pattern is matched against the contents of the default variable ($_). To use a different variable, we have to use a binding operator (=~ or !~) along with one of the three operators mentioned before.

The =~ operator binds the pattern to the string on the left hand side of the operator. This says that the pattern should be searched for in the scalar variable.

$string =~ m/hello/;

As the above example demonstrates, we're searching for "hello" in the $string variable. If the pattern is found then a true value (1) is returned; otherwise a false value is returned (the empty string, which counts as 0 in numeric context).

The !~ operator binds the pattern to the string on the left hand side of the operator, and will return true when the pattern is not found.
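The three ways of choosing what a pattern is matched against can be compared in a few lines. This is a minimal sketch with invented strings and variable names.

```perl
use strict;
use warnings;

# Sample data invented for this sketch.
my $string = "hello there";

# Without a binding operator, the match is applied to $_.
$_ = "say hello";
my $in_default = m/hello/ ? "yes" : "no";

# =~ is true when the pattern is found in $string.
my $in_string = ($string =~ m/hello/) ? "yes" : "no";

# !~ is true when the pattern is NOT found in $string.
my $no_goodbye = ($string !~ m/goodbye/) ? "yes" : "no";
my $no_hello   = ($string !~ m/hello/)   ? "yes" : "no";

print "$in_default $in_string $no_goodbye $no_hello\n";   # yes yes yes no
```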

Now let's discuss special characters in patterns:

The + character means "one or more of the preceding". That means if we have a pattern:

m/abc+/

it should return 'true' if the string contains abc, abcc, abccc, abcccc and so on. Note that the + applies only to the c that immediately precedes it, not to the whole of "abc".

The [ ] characters allow you to define patterns that match a group. This means that whatever is contained in the brackets is treated as a group from which we can take our pick:

m/a[bc]d/

The above pattern says that we will find a match if we find either "abd" or "acd".

Another special character is the * character, which means "zero or more of the preceding". This means that if we have a pattern:

m/ab*/

We will find a match for "a", "ab", "abb", "abbb", and so on.

The ? character means "zero or one occurrence of the preceding". This pattern character works the same way as the * operator, except that the maximum number of characters that are accepted is 1.
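The four special characters covered so far can be tried out against short word lists. In this sketch the word lists are invented, and each pattern is wrapped in the ^ and $ anchors (explained next) so that the whole word has to fit the pattern, not just part of it.

```perl
use strict;
use warnings;

# ^ and $ tie each pattern to the whole word, so only the special
# character under test decides which words pass.
my @plus     = grep { /^abc+$/ }    qw(ab abc abcc abccc);
my @brackets = grep { /^a[bc]d$/ }  qw(abd acd aed abcd);
my @star     = grep { /^ab*$/ }     qw(a ab abb b);
my @question = grep { /^colou?r$/ } qw(color colour colouur);

print "@plus\n";      # abc abcc abccc
print "@brackets\n";  # abd acd
print "@star\n";      # a ab abb
print "@question\n";  # color colour
```

Note how "abcd" fails the bracket test: [bc] picks exactly one character from the group, not both.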

You can also anchor patterns using the ^ and $ operators.

The ^ operator anchors a pattern to the beginning of a line:

m/^The/

In the above example we'll find a match if the string starts with "The" (or, when the 'm' option is in use, if any line in the string starts with "The"). If the ^ character is used inside square brackets, and is at the beginning, it means 'anything not in that group.' So be careful: the meaning of the ^ character changes when it is enclosed in brackets.

The $ operator anchors a pattern at the end of the line, therefore:

m/end$/

In the above example, we'll find a match only if the string ends with "end", optionally followed by a final newline (or, when the 'm' option is in use, if any line in the string ends with "end").
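The sketch below runs both anchors against an invented three-line string. It also shows how the 'm' option changes what ^ and $ attach to, and how ^ behaves at the start of a bracketed group.

```perl
use strict;
use warnings;

# Sample data invented for this sketch: three lines in one string.
my $text = "The start\nmiddle line\nthe end";

# Without options, ^ and $ refer to the start and end of the whole string.
my $starts      = ($text =~ /^The/)    ? 1 : 0;
my $ends        = ($text =~ /end$/)    ? 1 : 0;
my $mid_default = ($text =~ /^middle/) ? 1 : 0;

# With the m option, ^ and $ also match at the start and end of each line.
my $mid_multi      = ($text =~ /^middle/m) ? 1 : 0;
my $line_end_multi = ($text =~ /start$/m)  ? 1 : 0;

# Inside brackets, a leading ^ negates the group: first character is not a vowel.
my $not_vowel = ("xyz" =~ /^[^aeiou]/) ? 1 : 0;

print "$starts $ends $mid_default $mid_multi $line_end_multi $not_vowel\n";   # 1 1 0 1 1 1
```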

Word-boundary pattern anchors specify whether a matched pattern must be on a word boundary or inside a word. The \b pattern anchor matches only at the beginning or end of a word, while the \B pattern anchor matches only at positions that are not word boundaries, such as inside a word.
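Counting matches is one way to see the difference between the two word-boundary anchors. In this sketch the sentence and variable names are invented; "cat" appears once as a word of its own and once buried inside "concatenates".

```perl
use strict;
use warnings;

# Sample data invented for this sketch.
my $sentence = "the cat concatenates";

# With g in list context, each array holds one entry per match.
my @anywhere   = ($sentence =~ /cat/g);       # no anchors: both occurrences
my @whole_word = ($sentence =~ /\bcat\b/g);   # boundaries on both sides
my @inside     = ($sentence =~ /\Bcat\B/g);   # no boundary on either side
my @word_start = ($sentence =~ /\bc/g);       # words beginning with c

print scalar(@anywhere), " ", scalar(@whole_word), " ",
      scalar(@inside), " ", scalar(@word_start), "\n";   # 2 1 1 2
```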

When using character ranges within brackets, you can shorten up the process by using special character-range escape sequences:

Sequence  Description                      Range Equiv.
\d        Any digit                        [0-9]
\D        Not a digit                      [^0-9]
\w        Any word character               [_0-9a-zA-Z]
\W        Anything not a word character    [^_0-9a-zA-Z]
\s        White space                      [ \r\t\n\f]
\S        Anything other than white space  [^ \r\t\n\f]
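The escape sequences are usually combined with + and the g option to pull pieces out of a string. The following sketch uses an invented line of text containing a tab; the variable names are also made up for the example.

```perl
use strict;
use warnings;

# Sample data invented for this sketch; \t is a tab character.
my $line = "Order 66 shipped\ton 2024-01-15";

# \d+ : every run of digits.
my @numbers = ($line =~ /\d+/g);

# \w+ : every run of word characters (the hyphens split the date).
my @words = ($line =~ /\w+/g);

# \s+ : split on any run of white space, spaces and tabs alike.
my @fields = split /\s+/, $line;

# \D : delete everything that is not a digit.
(my $only_digits = $line) =~ s/\D//g;

print "@numbers\n";      # 66 2024 01 15
print scalar(@words), "\n";   # 7
print "$fields[4]\n";    # 2024-01-15
print "$only_digits\n";  # 6620240115
```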

If you want to match any single character, use the '.' (period) character. This will match any character (except the newline character when the 's' option is not in use).

If you want to match a specified number of occurrences, then use the {x,y} characters after the character for which you want to specify the number of occurrences. Here, x is the minimum and y is the maximum.
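Both the '.' character and the {x,y} counts can be checked with a few one-line tests. This is only a sketch with invented sample strings; the ^ and $ anchors are there so that the count applies to the whole word.

```perl
use strict;
use warnings;

# '.' matches any one character, but not a newline unless s is in use.
my $dot         = ("axb"  =~ /a.b/)  ? 1 : 0;
my $dot_newline = ("a\nb" =~ /a.b/)  ? 1 : 0;
my $dot_single  = ("a\nb" =~ /a.b/s) ? 1 : 0;

# {2,4}: at least two and at most four digits.
my @two_to_four = grep { /^\d{2,4}$/ } qw(1 12 123 1234 12345);

print "$dot $dot_newline $dot_single\n";   # 1 0 1
print "@two_to_four\n";                    # 12 123 1234
```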

To specify alternatives, use the '|' character.

You can group portions of a pattern using the ( ) characters. These also allow you to reuse the matched text later on. To do this, use \number, where number is the position of the ( ) group, counting opening parentheses from left to right. Characters grouped with ( ) may also have operators such as +, *, ? and {x,y} applied to them as a group.
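Alternatives, groups and \number references tend to be used together, so here is one sketch covering all three. The sample strings are invented. The last two lines also use $1 and $2, Perl's variables that hold the text matched by the first and second ( ) group after a successful match.

```perl
use strict;
use warnings;

# | inside ( ): either alternative, followed by an optional s.
my @pets = grep { /^(cat|dog)s?$/ } qw(cat dogs bird cats);

# \1 must match the same text that the first ( ) group matched.
my $repeated = ("say hello hello world" =~ /(\w+) \1/) ? $1 : "none";

# A quantifier after ( ) applies to the whole group.
my $grouped_ok  = ("ababab" =~ /^(ab){2,3}$/) ? 1 : 0;
my $grouped_bad = ("ab"     =~ /^(ab){2,3}$/) ? 1 : 0;

# In a replacement, $1 and $2 refer to the groups of the pattern.
(my $swapped = "John Smith") =~ s/(\w+) (\w+)/$2, $1/;

print "@pets\n";                     # cat dogs cats
print "$repeated\n";                 # hello
print "$grouped_ok $grouped_bad\n";  # 1 0
print "$swapped\n";                  # Smith, John
```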

This is not intended to be a complete tutorial on pattern matching, but an introduction to the subject. For more details on pattern matching, try Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly.