Erlang Central

Difference between revisions of "Matching Words"

From ErlangCentral Wiki

 
(answer a bit differently, also removing ref to PCRE. The Cook Book really isn't the place to ask for language enhancements.)

Revision as of 18:52, 24 September 2006

Problem

You want to select words from a string.

Solution

Determine the defining features of a word for your specific application, then write a regular expression that models this idea.

matches(H,{match,M}) -> matches(H,M,[]).
matches(_,[],Acc) -> lists:reverse(Acc);   % matches are accumulated in reverse, so restore input order
matches(H,[{I,L}|T],Acc) ->
    matches(H,T,[lists:sublist(H,I,L)|Acc]).

words(String, Regexp) -> matches(String,regexp:matches(String, Regexp)).

Words_1 = "[^ ]+".        % as many non-space characters as possible (tabs and newlines still count)
Words_2 = "[A-Za-z'-]+".  % as many letters, apostrophes, and hyphens

1> words("'alpha-beta gamma theta", Words_1).
["'alpha-beta","gamma","theta"]
2> words("'alpha-beta&or gamma theta", Words_2).
["'alpha-beta","or","gamma","theta"]
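On Erlang/OTP releases that ship the re module instead of regexp, the same idea can be sketched with re:run/3. This is a minimal sketch, not part of the original recipe: the module name words_re is hypothetical, and it assumes plain (non-Unicode) character lists as input.

```erlang
-module(words_re).
-export([words/2]).

%% Return every non-overlapping match of Regexp in String, left to right.
%% Assumption: the re module is available; words_re is an illustrative name.
words(String, Regexp) ->
    %% global                -> find all matches, not only the first one
    %% {capture,first,list}  -> return only the whole match, as a string
    case re:run(String, Regexp, [global, {capture, first, list}]) of
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch          -> []
    end.
```

With the two patterns from above, words_re:words("'alpha-beta gamma theta", "[^ ]+") should give ["'alpha-beta","gamma","theta"], and words_re:words("'alpha-beta&or gamma theta", "[A-Za-z'-]+") should give ["'alpha-beta","or","gamma","theta"]. Because re:run/3 already returns matches in input order, no reversal step is needed here.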

Discussion

Erlang does not have a built-in definition for words in strings. On the one hand, this is inconvenient, since you have to define your own meaning of "word". On the other hand, it is the right behavior, since the concept of a word varies significantly between applications, locales, encodings, and input sources.

What counts as a "word" depends on the application. Natural languages pluralize nouns, attach possessive modifiers, allow hyphenated word combinations, and so forth. The regular expression used must reflect the range of words you expect to encounter.
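To see how much the choice of pattern matters, the following sketch runs several candidate definitions of "word" over the same input. It is only an illustration under stated assumptions: the module name word_defs, the function names, and the pattern labels are all hypothetical, and it uses re:run/3 rather than the regexp module shown in the Solution.

```erlang
-module(word_defs).
-export([split/2, compare/1]).

%% Extract all matches of Regexp from String, in input order.
split(String, Regexp) ->
    case re:run(String, Regexp, [global, {capture, first, list}]) of
        {match, Matches} -> [W || [W] <- Matches];
        nomatch          -> []
    end.

%% Apply every candidate pattern to String and label each result.
compare(String) ->
    [{Name, split(String, Regexp)} || {Name, Regexp} <- patterns()].

%% Candidate definitions of "word" (illustrative, not exhaustive).
patterns() ->
    [{non_space,                   "[^ ]+"},        % anything but a space
     {non_whitespace,              "\\S+"},         % anything but whitespace
     {letters_only,                "[A-Za-z]+"},    % ASCII letters only
     {letters_apostrophes_hyphens, "[A-Za-z'-]+"}]. % keeps dog's, well-known
```

For the input "the dog's well-known\ttrick", the non_space pattern keeps "well-known\ttrick" together as one word because a tab is not a space, letters_only splits "dog's" into "dog" and "s", and the last pattern yields ["the","dog's","well-known","trick"].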