On Wed, 13 Oct 2021 16:02:14 -0500
J Leslie Turriff <jlturriff(a)mail.com> wrote:
> On 2021-10-13 13:07:13 E. Liddell wrote:
> > On Wed, 13 Oct 2021 16:46:20 +0000
> > That being said, test 9 is a raw grep being performed on an XML file. This
> > means that it could easily be latching onto something in a comment, because
> > following the full XML spec for determining whether a given line is inside
> > a comment or not using a simple text-matching tool is . . . well, let's say
> > it isn't something I'd want to try, and I deal in regexes a fair amount in
> > my day job. It really needs to be run through a full parser that constructs
> > a DOM tree.
>
> Filter to throw away comments first, then filter for what it should look for.

Correctly throwing away comments isn't as simple as tossing away everything
between a start marker and an end marker, though, because if the comment
marker is inside a CDATA section, it doesn't actually affect whether or not
the text is a comment. I suspect a comment marker found between quotes in
a text-format attribute value doesn't count either, but I'd have to check the spec
to be sure. And there may be more quirks that I've forgotten. (Oh, and you
could *easily* embed the value the grep expression is looking for in the file
without triggering the grep by using CDATA, now that I think about it.)
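
To make that concrete, here is a rough Python sketch. Everything in it is made up for illustration: the sample document, the option name, and the helper names are my own assumptions, not the file or pattern that test 9 actually uses. It compares a raw line grep, a naive "strip comments, then grep" filter, and a real parser (the standard library's xml.etree.ElementTree, which drops comments by default).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one option commented out, one real option,
# and comment markers hidden inside CDATA sections around the real one.
SAMPLE = """<config>
  <!-- <option name="debug">on</option> -->
  <note><![CDATA[ <!-- not a comment ]]></note>
  <option name="debug">off</option>
  <note><![CDATA[ --> still not a comment ]]></note>
</config>"""


def grep_lines(text, pattern):
    """Raw grep: return every line that matches, comment or not."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


def naive_strip_comments(text):
    """Delete everything between a start marker and the next end marker."""
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)


def parsed_options(text):
    """Let the parser decide what is a comment, then read the elements."""
    root = ET.fromstring(text)
    return [(el.get("name"), el.text) for el in root.iter("option")]


pattern = r'name="debug"'
print(grep_lines(SAMPLE, pattern))                        # raw grep
print(grep_lines(naive_strip_comments(SAMPLE), pattern))  # strip, then grep
print(parsed_options(SAMPLE))                             # full parser
```

On this sample the raw grep reports two lines, one of which is the commented-out option. The strip-then-grep version reports nothing at all, because the marker inside the first CDATA section is treated as the start of a comment that swallows the real option. Only the parser returns the single live option with the value "off".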

There's a reason that man perlfaq6 contains the following:

    How do I match XML, HTML, or other nasty, ugly things with a regex?

    Do not use regexes. Use a module and forget about the regular expressions.

E. Liddell