Article Difficulty Rating: 6
Text Management
There are many features available in MetaCard for doing complex matches on text. From the simple find and replace search routines to the extraction of key entries from complex formatted text, MetaCard has the features you need. This article examines the matchText, matchChunk and replaceText functions. It also discusses splitting a large file into chunks suitable for processing, using an array. For information about the more basic search and replace features see the article on finding and replacing text.
The matchText Function
The matchText function can be used for advanced text matching. It has two main areas of functionality. First, it validates whether text contains a given pattern, and second, if it makes a match, it can return the parts of that text that match certain sub-patterns to specific variables. If the text passed to it is a multiple-line container, the validation applies to the entire container. For output to variables, however, it applies only to one occurrence of the match, so if you want to repeatedly extract the same pattern, you need to apply the matchText function to each line. The syntax is as follows:
matchText(<source>, <regularExpression>[, <output variable 1>,
<output variable 2>...])
A basic example of validation:
matchText("hello there", "hello")
This
will return true since
the string "hello" is a
substring of "hello
there".
matchText(field "My Field", "hello")
Similarly,
if there is a field
called "My Field"
containing the string
"hello", this will
return true.
But matchText is not really intended to perform such basic matching. To specify the exact characteristics of the text match, <regularExpression> can be formed from any combination of special characters, each of which has a specific function.
The Special Characters for use with the matchChunk, matchText, and replaceText functions
The following table shows all the special characters that can be used in the <regularExpression> pattern, together with their functions. The table might be worth saving for future reference - don't expect to remember all the symbols the first time!
Special Character | Description
(exp) | matches the expression, and puts the result in a variable*
. (a period) | matches any character
^ | forces the match to be at the beginning of the string
$ | forces the match to be at the end of the string
[chars] | matches any of the characters in the set of chars. The characters can be either characters allowed to match or, if ^ is the first character in chars, characters not allowed to match. You can specify a range of characters by putting a - between them. For example, [a-z] matches any lower case alphabetic character.
* | matches zero or more of the preceding special character
+ | matches one or more of the preceding special character
? | matches zero or one of the preceding special character
regEx1|regEx2 (a vertical bar between two expressions) | matches either regular expression
*It is important to remember that everything enclosed within parentheses in the expression denotes an output to a variable. The first set of parentheses is output to the first variable <output 1>, the second set of parentheses is output to the second variable <output 2>, and so on.
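As a quick illustration of the table above, the following sketch tries a few of the special characters on short strings. The handler and the variable tResults are illustrative assumptions, not part of the original article; the results noted in the comments follow from the standard meaning of each character.

```metatalk
on mouseUp
  local tResults
  -- ^ and $ anchor the match; . stands for any single character
  put matchText("cat", "^c.t$") & return after tResults             -- true
  put matchText("concat", "^c.t$") & return after tResults          -- false
  -- character ranges combined with +
  put matchText("abc123", "^[a-z]+[0-9]+$") & return after tResults -- true
  -- ? makes the preceding character optional
  put matchText("color", "colou?r") & return after tResults         -- true
  -- | matches either alternative
  put matchText("dog", "cat|dog") & return after tResults           -- true
  answer tResults
end mouseUp
```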
Example 1: matchText with Email Addresses
IMPORTANT: Unlike commands such as the put command, the matchText function does not automatically declare variables. Therefore, remember to declare the variables prior to usage.
local actualName, addressName, ispName, classification
put "From: Joe Bloggs <jbloggs@someISP.com>" into source
put matchText(source, "^From: (.*) <(.+)@([^\.]+)\.(.+)>",\
actualName, addressName,ispName, classification)
The <regularExpression> used in the above example is broken down in the table below.
Component of String | Match in Source | Description of Special Chars
^From: | From: | the ^ character forces the string "From:" to match the beginning of the <source> expression
(.*) | Joe Bloggs | the () enclose an output to the variable actualName; the . (period) character matches any character and the * character allows zero or more unspecified subsequent characters before the next specified character (in this case < is next specified)
< | < | a literal match of the < character
(.+) | jbloggs | an output to the variable addressName, in this case matching if there are any characters before the @ symbol. Note the use of + instead of *. The + requires that at least one character be present. E.g. matchText("", ".+") returns false, but matchText("", ".*") returns true. This example of matchText will match a string with no name before the actual email address, but requires the email address to be present.
@ | @ | a literal match
([^\.]+) | someISP | the [ ] enclose a character match; the ^ specifies that the string must not match the period; the \ specifies that the . is escaped since it is a special character but a literal match is wanted; the () specify the value is to be output to the variable ispName
\. | . | a literal match of the period .
(.+) | com | an output to the variable classification
> | > | a literal match of the > character
Thus in the above example output to variables is as follows:
Variable | Contents of Variable
actualName | Joe Bloggs
addressName | jbloggs
ispName | someISP
classification | com
Example 2: matchText with Phone Numbers
local international, county, district
put "+44131 672 2909" & return & "0131 554 2961" into source
repeat for each line l in source
  -- matchText must be applied to each line
  if matchText(l, "(^\+[0-9]+|[0-9]+) ([0-9]+) ([0-9]+)", \
      international, county, district) then
    put "national code:" && international & return after output
    put "county code:" && county & return after output
    put "district code:" && district & return & return after output
  end if
end repeat
answer output
Component of string | Match in string 1 | Match in string 2 | Description of special chars
(^\+[0-9]+ | +44131 |  | ^\+ literally matches the + character at the start of the string - the ^ forces the match to be at the start of the string, the \ escapes the + character; [0-9]+ matches any numeric characters
"|" (a vertical bar) |  |  | allows a match of the string either before or after it; whichever matches is put into the variable international
[0-9]+) |  | 0131 | matches any numeric characters
([0-9]+) | 672 | 554 | matches any numeric characters; the matching string is put into the variable county
([0-9]+) | 2909 | 2961 | matches any numeric characters; the matching string is put into the variable district
TIP: To match the " (quote) character use the quote constant. To match one of the special characters literally, you must first escape that character by inserting the \ character before it.
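The tip above can be sketched as follows. The handler, the sample strings and the variable names (tSource, tQuoted, tPrice) are assumptions made for illustration only.

```metatalk
on mouseUp
  local tSource, tQuoted, tPrice
  -- build the pattern with the quote constant to match literal " characters
  put "He said" && quote & "hello" & quote into tSource
  if matchText(tSource, quote & "(.*)" & quote, tQuoted) then
    answer tQuoted  -- hello
  end if
  -- $ is a special character, so it is escaped with \ to match it literally
  if matchText("Total: $42", "\$([0-9]+)", tPrice) then
    answer tPrice  -- 42
  end if
end mouseUp
```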
The matchChunk Function
matchChunk(<source>, <regularExpression>[, <output1FirstChar>,
<output1LastChar>, <output2FirstChar>, <output2LastChar>...])
The matchChunk function works in a similar fashion to the matchText function. The same special characters work, and an output is made to variables, but the matchChunk function outputs a chunk expression to describe the matched text, rather than the actual text matched. Output occurs to pairs of variables. The first of the pair is the number of the first character in the matching string, the second is the number of the last character in the matching string. Thus, for equivalent matchChunk and matchText calls, matchChunk takes twice the number of output variables.
Example 3: matchChunk with Email Addresses
If Example 1 is modified to use matchChunk instead of matchText we have the following:
local actualNameFirst, actualNameLast, addressNameFirst,\
addressNameLast, ispNameFirst, ispNameLast, classificationFirst,\
classificationLast
put "From: Joe Bloggs <jbloggs@someISP.com>" into source
put matchChunk(source, "^From: (.*) <(.+)@([^\.]+)\.(.+)>",\
actualNameFirst, actualNameLast, addressNameFirst,\
addressNameLast, ispNameFirst, ispNameLast, classificationFirst,\
classificationLast)
Now the output to variables is as follows:
String | Matching Text | Variable Pair | Contents of Variable
(.*) | Joe Bloggs | actualNameFirst, actualNameLast | 7, 16
(.+) | jbloggs | addressNameFirst, addressNameLast | 19, 25
([^\.]+) | someISP | ispNameFirst, ispNameLast | 27, 33
(.+) | com | classificationFirst, classificationLast | 35, 37
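Because matchChunk returns character positions, the matched text can be read or changed in place with an ordinary chunk expression. The following is a minimal sketch under that assumption; the handler, the shortened pattern and the variable names (tSource, tFirst, tLast) are illustrative and not taken from the original example.

```metatalk
on mouseUp
  local tSource, tFirst, tLast
  put "From: Joe Bloggs <jbloggs@someISP.com>" into tSource
  if matchChunk(tSource, "^From: (.*) <", tFirst, tLast) then
    -- tFirst and tLast hold 7 and 16, the positions of "Joe Bloggs"
    answer char tFirst to tLast of tSource
    -- the same positions can be used to replace the matched text
    put "J. Bloggs" into char tFirst to tLast of tSource
    answer tSource  -- From: J. Bloggs <jbloggs@someISP.com>
  end if
end mouseUp
```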
The replaceText Function
replaceText(<source>, <regularExpression>, <replacement>)
The replaceText function can be used to replace all occurrences of a pattern in a source. All the same special characters as in matchText and matchChunk can be used in the regularExpression, and the same rules apply for escaping special characters. However, for most uses, the replace command is better because of its speed and simplicity.
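To illustrate the difference, the sketch below uses replaceText where a pattern is needed and the replace command for a plain literal substitution. The handler, the sample string and the variable tSource are assumptions made for the example.

```metatalk
on mouseUp
  local tSource
  put "one   two  three" into tSource
  -- a pattern is needed here: collapse every run of spaces into one space
  answer replaceText(tSource, " +", " ")  -- one two three
  -- a literal substitution needs no pattern, so the replace command will do;
  -- it changes the container in place and leaves the spacing untouched
  replace "two" with "2" in tSource
  answer tSource
end mouseUp
```

Note that replaceText is a function and returns the changed text, while the replace command modifies the container directly.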
Splitting up large amounts of data to perform text matching operations
Say you're importing a database, or some other large file with a lot of data in it. What do you do with it when you've read it in? In most cases, the first thing to do before processing anything is to split it up into manageable chunks, rather than have it all sitting in one large variable. If you want to use chunk expressions or text matching on the data, e.g. searching for particular strings using lineOffset(), you'll find it's inefficient to work with large quantities of data in one variable. If you're using lineOffset() to pick out all the lines with a particular string, for example, you have to be aware that it *always* starts from the start of the container. Even if you specify a number of lines to skip, whilst these lines aren't searched, they still have to be read through, so this does not lead to performance improvements. By the time you've got a reasonable way down the file, you'll be unnecessarily reading through large amounts of data at the start of the file each time you pick out another line.
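The repeated-search pattern described above might look like the following sketch. The sample data and the variable names (tSkip, tFound, tMatches) are illustrative assumptions. The third parameter of lineOffset is the number of lines to skip, and the sketch assumes the result is counted from the first line after those skipped.

```metatalk
on mouseUp
  local tFileContents, tSkip, tFound, tMatches
  put "start rec1" & return & "contents" & return & \
      "start rec2" & return & "contents" into tFileContents
  put 0 into tSkip
  repeat forever
    -- each call still has to read past the skipped lines before searching
    put lineOffset("start", tFileContents, tSkip) into tFound
    if tFound = 0 then exit repeat
    add tFound to tSkip
    put line tSkip of tFileContents & return after tMatches
  end repeat
  answer tMatches  -- the two lines beginning with "start"
end mouseUp
```

On a four-line container this is harmless, but on a large file every pass re-reads everything before the current position, which is the cost the paragraph above warns about.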
Typically,
you'll want to use an array to
split the data up into neat
chunks, each of which can be
accessed as an independent unit.
The time it takes to split the
data up like this is usually
insignificant compared to the
time savings made on processing
the data afterwards.
Let's say that we're dealing with a return-delimited file, where each database entry begins with a line containing the word "start" and a name or identifying string for each entry:
start myDataBaseRecord1
record 1 contents...
record 1 contents...
more lines of contents...
start myDataBaseRecord2
record 2 contents
etc...
The following
script will split up the data,
waiting until the word "start"
appears in a line then placing
the text (up until the next time
it meets the word start) into a
separate array
element:
repeat for each line l in tFileContents
  if word 1 of l is "START" then
    put l into currentobj
  else
    put l&cr after gBigArray[currentobj]
  end if
end repeat
The first line tells MetaCard to cycle through each line in tFileContents, putting each line into the variable l. An alternative way of doing this would have been to use:
repeat with i = 1 to the num of lines in tFileContents
However, the latter method should only be used when you absolutely need to know the line number of each line for some reason. Using the "repeat for each line..." construct, as in the first example, is always very considerably faster.
The second line checks if the line starts with the word "start". You can use anything here, e.g. a delimiter you expect to find in the text, or even a complex match on the line (as described earlier in this article). If the data you're picking out is fairly simple and doesn't require more than simple processing (e.g. it just gets split up into fields), you may want to do that here instead of splitting the data up into an array for further processing at all.
The third line stores the name of the array element currently in use in the variable currentobj. This gets updated every time the word "start" is encountered. Because arrays in MetaCard can be indexed with strings, there is no need to condense this line into anything - the entire line can be used to reference the array element.
The fifth line appends the current line in the variable l to the element of the array currently in use (named in currentobj).
The result of using that repeat loop on the small set of example data above would be an array with two elements.
Element Name: start myDataBaseRecord1
Contents:
record 1 contents...
record 1 contents...
more lines of contents...

Element Name: start myDataBaseRecord2
Contents:
record 2 contents
etc...
To process the data now that it's been split up, we need a list of all the elements in the array:
put keys(gBigArray) into tListOfElements
The variable
tListOfElements now contains a
complete list of each record in
the array. Processing that is a
simple matter of cycling through
each line and passing the array
element to a function that does
the processing. For
example:
repeat for each line l in tListOfElements
  doMyDataProcessingRoutine gBigArray[l]
end repeat
The end result? doMyDataProcessingRoutine is sent each chunk of the original file in turn, that is, the text found between successive lines beginning with the word "start". Because you're now dealing with smaller chunks of data, it's efficient to run any intensive processing routines on each record: e.g. chunk expressions or regular expression matching (REGEX) to draw graphs, display information, build objects, or whatever you like...
If you've been following along closely this far, you'll have spotted one problem. The elements won't be passed to doMyDataProcessingRoutine in the order in which they appeared in the original file. Why? The keys() function doesn't return the elements in the order they were created in the array. If you need to process the file in order, then alter the original script to:
put 1 into tCounter
repeat for each line l in tFileContents
  if word 1 of l is "START" then
    put l && tCounter into currentobj
    add 1 to tCounter
  else
    put l&cr after gBigArray[currentobj]
  end if
end repeat
The variable tCounter increments each time a new element is created, and the current element number is appended as a separate word to the end of the element name. When getting the list of array elements, you need to sort them:
put keys(gBigArray) into tListOfElements
sort lines of tListOfElements numeric by last word of each
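Putting the pieces together, the following sketch splits a small sample into an array, sorts the keys by the appended counter and walks the records in their original order. The handler, the sample data and the variable names (tRecords, tKeys, tReport) are assumptions for illustration; a local array is used here in place of gBigArray, and a simple report stands in for doMyDataProcessingRoutine.

```metatalk
on mouseUp
  local tFileContents, tCounter, tCurrentObj, tRecords, tKeys, tReport
  put "start recB" & return & "b contents" & return & \
      "start recA" & return & "a contents" into tFileContents
  put 1 into tCounter
  repeat for each line l in tFileContents
    if word 1 of l is "start" then
      -- the counter becomes the last word of the element name
      put l && tCounter into tCurrentObj
      add 1 to tCounter
    else
      put l & cr after tRecords[tCurrentObj]
    end if
  end repeat
  put keys(tRecords) into tKeys
  sort lines of tKeys numeric by last word of each
  repeat for each line tKey in tKeys
    put tKey & ":" && line 1 of tRecords[tKey] & return after tReport
  end repeat
  answer tReport  -- recB (counter 1) is listed before recA (counter 2)
end mouseUp
```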