Text Management

There are many features available in MetaCard for doing complex matches on text. From the simple find and replace search routines to the extraction of key entries from complex formatted text, MetaCard has the features you need. This article examines the matchText, matchChunk and replaceText functions. It also discusses splitting a large file into chunks suitable for processing, using an array. For information about the more basic search and replace features see the article on finding and replacing text.

The matchText Function

The matchText function can be used for advanced text matching. It has two main areas of functionality. First, it validates whether text contains a given pattern, and second, if it makes a match, it can return parts of the matched text to specific variables. If the text passed to it is a multiple-line container, the validation applies to the entire container. Output to variables, however, applies only to one occurrence of the match, so if you want to repeatedly extract the same pattern, you need to apply the matchText function to each line. The syntax is as follows:

matchText(<source>, <regularExpression>[, <output variable 1>, <output variable 2>...])

A basic example of validation:

matchText("hello there", "hello")

This will return true since the string "hello" is a substring of "hello there".

matchText(field "My Field", "hello")

Similarly, if there is a field called "My Field" containing the string "hello", this will return true.

But matchText is not really intended to perform such basic matching. To specify the exact characteristics of the text to match, <regularExpression> can be formed from any combination of special characters, each of which has a specific function.

The Special Characters for use with the matchChunk, matchText, and replaceText functions.

The following table shows all the special characters that can be used in the <regularExpression> pattern, together with their functions. The table might be worth saving for future reference - don't expect to remember all the symbols the first time!

Special Character

Description

(exp)
matches the expression, and puts result in a variable*

. (a period)
matches any character

^
forces match to be at beginning of string

$
forces match to be at end of string

[chars]
matches any of the characters in the set of chars. The characters can be either characters allowed to match, or if ^ is the first character in chars, not allowed to match. You can specify a range of characters by putting a - between them. For example [a-z] matches any lower case alphabetic character.

*
matches zero or more occurrences of whatever the preceding character or special character matches

+
matches one or more occurrences of whatever the preceding character or special character matches

?
matches zero or one occurrence of whatever the preceding character or special character matches

regEx 1 | regEx 2
matches either regular expression

*It is important to remember that everything enclosed within parentheses in this expression denotes an output to a variable. The first set of parentheses is output to the first variable <output 1>, the second set of parentheses is output to the second variable <output 2>, and so on.
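As a minimal sketch of how a few of these special characters combine for pure validation, the handler below checks that a value consists only of digits. The handler name isWholeNumber and its parameter are illustrative assumptions, not built-in MetaCard names.

```metatalk
-- Hypothetical helper: true only if tValue consists entirely of digits
function isWholeNumber tValue
  -- ^ and $ anchor the pattern to both ends of the string,
  -- and [0-9]+ requires one or more digits in between
  return matchText(tValue, "^[0-9]+$")
end isWholeNumber
```

With this pattern, isWholeNumber("2909") should return true, while isWholeNumber("29a09") and isWholeNumber("") should both return false: the first because "a" is not in the set [0-9], the second because + needs at least one digit.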

Example 1: matchText with Email Addresses

IMPORTANT: Unlike commands such as the put command, the matchText function does not automatically declare variables. Therefore, remember to declare the variables prior to usage.

local actualName, addressName, ispName, classification
put "From: Joe Bloggs <jbloggs@someISP.com>" into source
put matchText(source, "^From: (.*) <(.+)@([^\.]+)\.(.+)>",\
    actualName, addressName, ispName, classification)

The <regularExpression> component in the above pattern is broken up in the table below.

Component of string: ^From:
Match in source: From:
Description of special chars: the ^ character forces the string "From:" to match at the beginning of the <source> expression

Component of string: (.*)
Match in source: Joe Bloggs
Description of special chars: the () enclose an output to the variable actualName. The . (period) character matches any character, and the * character allows zero or more such characters before the next specified characters (in this case, the space and the <).

Component of string: <
Match in source: <
Description of special chars: a literal match of the < character

Component of string: (.+)
Match in source: jbloggs
Description of special chars: an output to the variable addressName, in this case matching if there are any characters before the @ symbol. Note the use of + instead of *. The + requires that at least one character be present. E.g. matchText("", ".+") returns false, but matchText("", ".*") returns true. This example of matchText will match a string with no name before the actual email address, but requires the email address to be present.

Component of string: @
Match in source: @
Description of special chars: a literal match

Component of string: ([^\.]+)
Match in source: someISP
Description of special chars: the [ ] enclose a set of characters to match. The ^ specifies that the characters matched must not be a period. The \ escapes the . since it is a special character but a literal match is wanted. The () specify that the value is to be output to the variable ispName.

Component of string: \.
Match in source: .
Description of special chars: a literal match of the period .

Component of string: (.+)
Match in source: com
Description of special chars: an output to the variable classification

Component of string: >
Match in source: >
Description of special chars: a literal match of the > character

Thus in the above example output to variables is as follows:

Variable: Contents of Variable

actualName: Joe Bloggs
addressName: jbloggs
ispName: someISP
classification: com

Example 2: matchText with Phone Numbers

local international, county, district
put "+44131 672 2909" & return & "0131 554 2961" into source
repeat for each line l in source -- matchText must be applied to each line
  if matchText(l, "(^\+[0-9]+|[0-9]+) ([0-9]+) ([0-9]+)", \
      international, county, district) then
    put "national code:" && international & return after output
    put "county code:" && county & return after output
    put "district code:" && district & return & return after output
  end if
end repeat
answer output

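Component of string: (^\+[0-9]+
Match in string 1: +44131
Match in string 2: (none)
Description of special chars: ^\+ literally matches the + character at the start of the string. The ^ forces the match to be at the start of the string, and the \ escapes the + character. [0-9]+ matches one or more numeric characters.

Component of string: |
Match in string 1: (none)
Match in string 2: (none)
Description of special chars: allows a match of the expression either before or after it; whichever matches is put into the variable international

Component of string: [0-9]+)
Match in string 1: (none)
Match in string 2: 0131
Description of special chars: matches one or more numeric characters

Component of string: ([0-9]+)
Match in string 1: 672
Match in string 2: 554
Description of special chars: matches one or more numeric characters; the matching string is put into the variable county

Component of string: ([0-9]+)
Match in string 1: 2909
Match in string 2: 2961
Description of special chars: matches one or more numeric characters; the matching string is put into the variable district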

TIP: To match the " (quote) character use the quote constant. To make an actual match with one of the special characters, you must first escape that character by inserting the \ character before it.
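The tip above can be sketched as follows. The sample sentence and the variable names tLine and tSpoken are assumptions made up for illustration.

```metatalk
local tSpoken
-- builds the text: He said "stop".
put "He said" && quote & "stop" & quote & "." into tLine
-- the pattern is assembled by concatenation so that the quote constant
-- can supply the " characters; \. matches the literal full stop
if matchText(tLine, quote & "(.*)" & quote & "\.$", tSpoken) then
  answer tSpoken
end if
```

If the match succeeds, tSpoken should hold the text between the two quote characters. Without the backslash, the period would match any character rather than only a full stop.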

The matchChunk Function

matchChunk(<source>, <regularExpression>[, <output1FirstChar>, <output1LastChar>, <output2FirstChar>, <output2LastChar>...])

The matchChunk function works in a similar fashion to the matchText function. The same special characters work, and an output is made to variables, but the matchChunk function outputs a chunk expression to describe the matched text, rather than the actual text matched. Output occurs to pairs of variables. The first of the pair is the number of the first character in the matching string, the second is the number of the last character in the matching string. Thus in equivalent matchChunk and matchText strings the matchChunk has twice the number of output variables.

Example 3: matchChunk with Email Addresses

If Example 1 is modified to use matchChunk instead of matchText we have the following:

local actualNameFirst, actualNameLast, addressNameFirst,\
    addressNameLast, ispNameFirst, ispNameLast, classificationFirst,\
    classificationLast
put "From: Joe Bloggs <jbloggs@someISP.com>" into source
put matchChunk(source, "^From: (.*) <(.+)@([^\.]+)\.(.+)>",\
    actualNameFirst, actualNameLast, addressNameFirst,\
    addressNameLast, ispNameFirst, ispNameLast, classificationFirst,\
    classificationLast)

Now the output to variables is as follows:

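String: (.*)
Matching text: Joe Bloggs
Variable pair: actualNameFirst, actualNameLast
Contents of variables: 7, 16

String: (.+)
Matching text: jbloggs
Variable pair: addressNameFirst, addressNameLast
Contents of variables: 19, 25

String: ([^\.]+)
Matching text: someISP
Variable pair: ispNameFirst, ispNameLast
Contents of variables: 27, 33

String: (.+)
Matching text: com
Variable pair: classificationFirst, classificationLast
Contents of variables: 35, 37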
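Because each pair of variables describes a character range, it can be used directly in a chunk expression. The following is a minimal sketch that rewrites part of the source in place; the variable names tStart and tEnd and the replacement text "otherISP" are illustrative assumptions.

```metatalk
local tStart, tEnd
put "From: Joe Bloggs <jbloggs@someISP.com>" into source
-- one pair of variables for the single parenthesised expression
if matchChunk(source, "@([^\.]+)\.", tStart, tEnd) then
  -- the pair forms a chunk expression, so the matched text
  -- can be replaced where it sits in the container
  put "otherISP" into char tStart to tEnd of source
end if
```

Here the pair should come back as 27 and 33, the same range shown for someISP above, so the put command changes only the ISP name and leaves the rest of the line untouched.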

The replaceText Function

replaceText(<source>, <regularExpression>, <replacement>)

The replaceText function can be used to replace all occurrences of a pattern in a source, returning the changed text. All the same special characters as in matchText and matchChunk can be used in the regularExpression, and the same rules apply for escaping special characters. However, for most uses, the replace command is better because of its speed and simplicity.
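A short sketch of the difference between the two, using made-up sample text and the assumed variable names tText and tMasked:

```metatalk
put "call 0131 554 2961 or 0131 672 2909" into tText
-- replaceText takes a pattern and returns the changed text:
-- here every run of digits becomes a single # character
put replaceText(tText, "[0-9]+", "#") into tMasked
-- the replace command takes a literal string, not a pattern,
-- and changes the container itself
replace "0131" with "+44131" in tText
```

Only replaceText can express "any run of digits"; when the text to find is a fixed string, as in the last line, the replace command is the simpler choice.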

Splitting up large amounts of data to perform text matching operations

Say you're importing a database, or some other large file with a lot of data in it. What do you do with it when you've read it in? In most cases, the first thing to do before processing anything is to split it up into manageable chunks, rather than have it all sitting in one large variable. If you want to use any chunk expressions or text matching on the data, e.g. searching for particular strings using lineOffset(), you'll find it's inefficient to work with large quantities of data in one variable. If you're using lineOffset() to pick out all the lines with a particular string, for example, you have to be aware that it *always* starts reading from the start of the container. Even if you specify a number of lines to skip, these lines aren't searched but they still have to be read through, so skipping does not lead to performance improvements. By the time you've got a reasonable way down the file, you'll be unnecessarily reading through large amounts of data at the start of the file each time you pick out another line.

Typically, you'll want to use an array to split the data up into neat chunks, each of which can be accessed as an independent unit. The time it takes to split the data up like this is usually insignificant compared to the time savings made on processing the data afterwards.

Let's say that we're dealing with a return-delimited file, where each database entry begins with a line containing the word "start" and a name or identifying string for each entry:

start myDataBaseRecord1
record 1 contents...
record 1 contents...
more lines of contents...
start myDataBaseRecord2
record 2 contents
etc...

The following script will split up the data, waiting until the word "start" appears in a line then placing the text (up until the next time it meets the word start) into a separate array element:

repeat for each line l in tFileContents
  if word 1 of l is "START" then
    put l into currentobj
  else
    put l&cr after gBigArray[currentobj]
  end if
end repeat

The first line tells MetaCard to cycle through each line in the file contents, putting each line into the variable l. An alternative way of doing this would have been to use:

repeat with i = 1 to the num of lines in tFileContents

However, the latter method should only be used when you absolutely need to know the line number of each line for some reason. Using the "repeat for each line..." construct, as in the first example, is considerably faster.
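To make the comparison concrete, here is a sketch of both forms. The loop bodies are placeholders, and the counter variable i in the second loop is an assumption showing one way to keep a line number without giving up the faster construct.

```metatalk
-- slower: "line i of" has to count lines from the top of
-- tFileContents on every pass through the loop
repeat with i = 1 to the number of lines in tFileContents
  put line i of tFileContents into l
  -- process l here, using i as the line number
end repeat

-- faster: each line is handed over once, and a counter
-- maintained by hand supplies the line number if it is needed
put 0 into i
repeat for each line l in tFileContents
  add 1 to i
  -- process l here, using i as the line number
end repeat
```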

The second line checks if the line starts with the word "start". You can use anything here, e.g. a delimiter you expect to find in the text, or even a complex match on the line (as described earlier in this article). If the data you're picking out is fairly simple and doesn't require more than simple processing (e.g. it just gets split up into fields), you may want to do that here instead of splitting the data up into an array at all.

The third line contains the name of the array element currently in use. This gets updated every time the word "start" is encountered. Because arrays in MetaCard can be indexed with strings, there is no need to condense this line into anything - the entire line can be used to reference the array element.

The fifth line appends the current line in the variable l, followed by a return, to the element of the array currently in use (named in currentobj).

The result of using that repeat loop on the small set of example data above would be an array with two elements.

Element name: start myDataBaseRecord1
Contents:
record 1 contents...
record 1 contents...
more lines of contents...

Element name: start myDataBaseRecord2
Contents:
record 2 contents
etc...

To process the data now that it's been split up, we need a list of all the elements in the array:

put keys(gBigArray) into tListOfElements

The variable tListOfElements now contains a complete list of each record in the array. Processing that is a simple matter of cycling through each line and passing the array element to a function that does the processing. For example:

repeat for each line l in tListOfElements
  doMyDataProcessingRoutine gBigArray[l]
end repeat

The end result? Each chunk of the original file, separated at the lines containing the word "start", is sent to doMyDataProcessingRoutine. Because you're now dealing with a smaller chunk of data, it's efficient to run any intensive processing routines on each record: e.g. chunk expressions or regular expression matching to draw graphs, display information, build objects, or whatever you like...
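What the processing routine does depends entirely on your data. As one hedged sketch, the handler below assumes each content line of a record has the form "key: value" and collects the pairs into a field; the line format, the variable names and the field "Results" are all assumptions made for illustration.

```metatalk
-- Hypothetical handler: the name comes from the loop above,
-- the body is only one possible way to process a record
on doMyDataProcessingRoutine tRecord
  local tKey, tValue, tOutput
  repeat for each line l in tRecord
    -- assumes each content line looks like "key: value"
    if matchText(l, "^([^:]+): (.*)$", tKey, tValue) then
      put tKey & tab & tValue & return after tOutput
    end if
  end repeat
  -- write to the (assumed) field once, after the loop has finished
  put tOutput after field "Results"
end doMyDataProcessingRoutine
```

Building the result in the variable tOutput and touching the field only once keeps the loop working on small in-memory chunks, in line with the approach described above.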

If you've been following along closely this far, you'll have spotted one problem. The elements passed to doMyDataProcessingRoutine won't be passed in the order in which they appeared in the original file. Why? The keys() function doesn't return the elements in the order they were created in the array. If you need to process the file in order, then alter the original script to:

put 1 into tCounter
repeat for each line l in tFileContents
  if word 1 of l is "START" then
    put l && tCounter into currentobj
    add 1 to tCounter
  else
    put l&cr after gBigArray[currentobj]
  end if
end repeat

The variable tCounter increments each time a new element is created, and the current element number is appended as a separate word to the end of the element name. When getting the list of array elements, you need to sort them:

put keys(gBigArray) into tListOfElements
sort lines of tListOfElements numeric by last word of each
