behavior of regexp() function
Daniel J Sebald
daniel.sebald at ieee.org
Thu Jan 1 00:34:17 CST 2009
Below are some results from regexp() that seem questionable given what the documentation says (or I'm misunderstanding it). Say I want to pull the substrings from a tab-separated data file. Let
octave:6> a = sprintf('20\t50\tcelcius\t80')
a = 20 50 celcius 80
octave:7> b = sprintf('20\t50\t\t80')
b = 20 50 80
be some sample lines that might come from a data file. String a has at least one character between each pair of tabs; b has a case where there are zero characters between two tabs. For regexp, the character class [^\t] matches any single character other than a tab. The quantifier + means match one or more times. Here are the results for a and b processed with this pattern:
octave:8> regexp(a, '[^\t]+', 'match')
ans =
{
[1,1] = 20
[1,2] = 50
[1,3] = celcius
[1,4] = 80
}
Looks good.
octave:9> regexp(b, '[^\t]+', 'match')
ans =
{
[1,1] = 20
[1,2] = 50
[1,3] = 80
}
I'll go along with that result too. There are zero characters between the second and third tab, and + requires at least one character to match.
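To see why b yields only three non-empty fields, it can help to locate the tabs directly. The sketch below is only an illustration and not part of the original session; the variable names (tab_pos_a, has_empty_field_b, and so on) are made up for it.

```octave
% Rebuild the two sample lines from the post.
a = sprintf('20\t50\tcelcius\t80');
b = sprintf('20\t50\t\t80');

% Indices of every tab character in each line.
tab_pos_a = find(a == sprintf('\t'));
tab_pos_b = find(b == sprintf('\t'));

% Two tabs at consecutive indices mean an empty field between them.
has_empty_field_a = any(diff(tab_pos_a) == 1);
has_empty_field_b = any(diff(tab_pos_b) == 1);
```

Here tab_pos_b contains two adjacent indices (6 and 7), so has_empty_field_b is true, while has_empty_field_a is false.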
Now, according to the documentation, * is similar to + in concept, but it matches _zero_ or more times. Here are the results for a and b processed with that quantifier:
octave:10> regexp(a, '[^\t]*', 'match')
ans =
{
[1,1] = 20
}
Doesn't look correct. I'm thinking this should be pretty much the same result as with metacharacter +, i.e.,
[1,1] = 20
[1,2] = 50
[1,3] = celcius
[1,4] = 80
because + means one or more matches, and "one or more" is a subset of "zero or more". Next result:
octave:11> regexp(b, '[^\t]*', 'match')
ans =
{
[1,1] = 20
}
Same as previous, but the way I see it, this case should result in
[1,1] = 20
[1,2] = 50
[1,3] = []
[1,4] = 80
where the empty third string comes from the fact that there are zero characters between two tabs, i.e., "zero or more".
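If the goal is one entry per field, empty fields included, an alternative to matching the fields is to split on the delimiter. This is a minimal sketch, assuming an Octave version whose regexp accepts the 'split' option (it may not exist in the release used for the session above); the names fields and lens are illustrative only.

```octave
% Sample line with an empty field between the second and third tab.
b = sprintf('20\t50\t\t80');

% Split at each tab; the text between two consecutive tabs
% becomes an empty string in the resulting cell array.
fields = regexp(b, '\t', 'split');

% Length of each field, so the empty one shows up as 0.
lens = cellfun(@numel, fields);
```

With a regexp that supports 'split', fields should have four entries and lens should be [2 2 0 2], which corresponds to the four-field result described above.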
Am I correctly understanding what "zero or more" means?
Dan