Re: [trinity-users] Hopeing I can find a regex expert here

23 Mar 2016


      On Wednesday 23 March 2016 07:22:03 E. Liddell wrote:
...
On Wed, 23 Mar 2016 15:58:39 +0900
Michele Calgaro michele.calgaro@yahoo.it wrote:
...
On 2016/03/23 02:19 PM, Gene Heskett wrote:
...
On Wednesday 23 March 2016 00:32:17 Michele Calgaro wrote:
...
On 2016/03/23 12:44 PM, Gene Heskett wrote:
...
Greetings;
I use mailfilter as a prefilter in front of fetchmail to nuke
some spam while it's still on the server.
But it's missing hits on what I suspect are the From: or
Return-Path: strings that have quotation marks in the string
because the string is being spec'd by being surrounded by "show
this name" bs.
I've added the character < as part of the string it's to search
for, so the search string now looks like
"From:.*<*.unwanted-tld".  Does this stand that famous
snowball's chance in hell of working well with or without a quoted
"some funkity name" in front of the real url with the <> around
it?
I just love the lack of documentation on how this string
comparison stuff works as shown by the man pages for grep and
regex.  All sorts of control options are well covered, but
figuring out how to write a search expression must be one of
the world's better guarded secrets.
So if someone could show me, or give a url that actually has the
full docs, I'd be grateful.
Thanks.
Cheers, Gene Heskett
Hi Gene,
"From:.*<*.unwanted-tld" will match a string like this (I have
put one section per line to be clearer):
From:
zero or more of any character
0 or more <
any single character (the unescaped "."), then unwanted-tld
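To make the breakdown above concrete, here is a minimal sketch using Python's re module as a stand-in. That is an assumption: mailfilter's own regex engine may behave differently, and the sample headers are made up for illustration, with "bid" standing in for the unwanted TLD.

```python
import re

# Pattern from the thread, with "bid" standing in for the unwanted TLD.
PATTERN = re.compile(r"From:.*<*.bid")

# Hypothetical header lines, not taken from real mail.
headers = [
    'From: "Some Name" <user@example.bid>',  # quoted display name plus <>
    'From: user@example.bid',                # bare address, no <>
    'From: user@examplexbid',                # no literal dot before "bid"
    'From: user@example.com',                # different TLD
]

def matches(header):
    """Return True if the pattern is found anywhere in the header line."""
    return PATTERN.search(header) is not None

results = [matches(h) for h in headers]
for header, hit in zip(headers, results):
    print(hit, header)
```

The quoted display name does not get in the way, since .* swallows it. The third line also matches, because the unescaped dot accepts any character in front of "bid".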
I thought I wanted 1 only, but the way these lowlifes change
addresses and names hourly, they may remove the <> surrounding the
real source address and screw me up.  But the fact that they often
put dbl-quotes around the throwaway part of the url is, I think,
screwing me regardless.
What we need is the ability to specify the quote character by the
first non-space character after the DENY =, which is currently a
"^ or a <> which apparently inverts the logic.  So a typical line
would be
DENY = "^From:.*<*.bid"
Substitute for bid any of the new TLDs that gets obnoxious, like
xyz or pro; heck, that new list is several dozen TLDs.
But AFAIK, we're stuck with the dblquote wrapper around the string
to match.  Grrrr.
...
It is greedy, so it will scan until the last < if there is more
than one. Not sure if this is what you need or not. If you can
post an example of what you need to match, I can work out another
regex if required.
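A small sketch of what "greedy" means here, again assuming Python's re module rather than mailfilter's engine. The header line is invented, and the lazy form .*? is a Perl/Python-style feature that a POSIX engine may not offer.

```python
import re

# Hypothetical header with a "<" inside the display name as well.
header = 'From: "Offers <today>" <deals@example.bid>'

# Greedy: .* grabs as much as it can, so the match runs up to the LAST "<".
greedy = re.search(r"From:(.*)<", header).group(1)

# Lazy: .*? grabs as little as it can, so the match stops at the FIRST "<".
lazy = re.search(r"From:(.*?)<", header).group(1)

print(repr(greedy))
print(repr(lazy))
```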
Try this:
"-Bed Bugs-" -BedBugs-@agma69.top
with Return-Path.* or From.* in front of it.  Or does that - sign,
4 of them, need escaping with a \ ? IDK.
Hyphens should only need an escape if within a character class,
denoted by square brackets.
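The hyphen rule can be checked with a short sketch like this one, assuming Python's re module (mailfilter's engine may differ in details). The sample string is the one quoted above; the two bracket patterns are made up for the demonstration.

```python
import re

sample = '"-Bed Bugs-" -BedBugs-@agma69.top'

# Outside square brackets a hyphen is an ordinary literal character.
literal = re.compile(r"-BedBugs-@agma69\.top")
literal_hit = literal.search(sample) is not None

# Inside square brackets an unescaped hyphen between two characters is a range.
range_class = re.compile(r"[a-z]")    # any lowercase ASCII letter
# Escaping it makes it a literal member of the class.
hyphen_class = re.compile(r"[a\-z]")  # only "a", "-", or "z"

range_hits = range_class.findall("a-m-z")
hyphen_hits = hyphen_class.findall("a-m-z")

print(literal_hit, range_hits, hyphen_hits)
```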
...
...
I converted about 3 lines of the filterdata file that way, and I'm
now waiting for the next blast of spam to serve as test data.
mailfilter is a picky twit, but that hasn't given it a tummy ache
either, so I am hopeful.
...
PS: by the way, the internet is full of excellent documentation
about regex ;-) For example
"http://www.regular-expressions.info/"
Cheers, Gene Heskett
Hi Gene,
so if I understand correctly, you already had a set of rules like
DENY = "^From:.*.bid"  (bid stands for any tld of your choice)
but it was missing some entries because of the "..." entry before
the domain. So you put the < in the string as well.
Right?
Assuming so, it surprises me that the original version missed some
entries, since the additional "..." field would have already been
matched by the .* part of the pattern.
I think there is a different reason for missing entries. Perhaps a
blank character before "From:"? Could it be? You could try this
other version:
DENY = "^\s*From:.*.bid"  which ignores any separator before From:
That would also sweep up, say, fred@mail.bidders.com, or
"I.bid" ibid@nowhere.org
...
or
DENY = "^\s*From:.*.bid>" which also makes explicit that the tld is
followed by a >.
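The false positives mentioned above are easy to reproduce. The sketch below assumes Python's re module, and assumes that \s is understood the same way by mailfilter, which is not guaranteed. The "strict" pattern is a hypothetical tightening that nobody in the thread proposed, and the spam header is invented.

```python
import re

loose = re.compile(r"^\s*From:.*.bid")      # as posted: "." matches any character
loose_gt = re.compile(r"^\s*From:.*.bid>")  # as posted, with the closing ">"
# Hypothetical: literal dot, optional ">", nothing but whitespace afterwards.
strict = re.compile(r"^\s*From:.*\.bid>?\s*$")

headers = {
    "spam": 'From: "Cheap Stuff" <deals@example.bid>',  # invented example
    "bidders": 'From: fred@mail.bidders.com',
    "ibid": 'From: "I.bid" ibid@nowhere.org',
}

def hits(pattern):
    """Return the set of header names the pattern matches."""
    return {name for name, line in headers.items() if pattern.search(line)}

loose_hits = hits(loose)
loose_gt_hits = hits(loose_gt)
strict_hits = hits(strict)
print(sorted(loose_hits), sorted(loose_gt_hits), sorted(strict_hits))
```

Requiring the ">" already drops both false positives in this sample, but only for senders who keep the angle brackets. Anchoring at the end of the line does not depend on them.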
I'd cover the example as
^\W*((From:)|(Return-Path:)).*.bid\W*$
which works out to zero or more non-word characters at the beginning
of the string, followed by "From:" or "Return-Path:", followed by zero
or more unknowns, followed by ".bid", followed by zero or more
non-word characters, followed by the end of the string.  "Word"
characters are alphanumerics, the underscore, and possibly some
non-ASCII depending on the implementation, so "non-word" covers
stuff like punctuation (hyphens included) and whitespace.  Marking the
end of the string makes it more likely you're getting the TLD and not
some random bit in the middle that was designed as a parser
torture-test.
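Here is that pattern run against a few lines, as a sketch that assumes Python's re module (which is Perl-like but not identical, and may not match what mailfilter does). The header lines are invented for illustration.

```python
import re

PATTERN = re.compile(r"^\W*((From:)|(Return-Path:)).*.bid\W*$")

# Hypothetical header lines, not taken from real mail.
samples = [
    '  From: "Cheap Stuff" <deals@example.bid>',  # leading blanks, trailing ">"
    'Return-Path: <deals@example.bid>',
    'From: fred@mail.bidders.com',                # "bid" is not at the end
    'Subject: place your bid',                    # wrong header name
]

results = [PATTERN.search(s) is not None for s in samples]

# In Python, \w covers letters, digits and the underscore, but not the hyphen.
word_chars = re.findall(r"\w", "a_-9")

for line, hit in zip(samples, results):
    print(hit, line)
print(word_chars)
```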
If you want to get really silly,
^\W*((From:)|(Return-Path:)).*.[^cCoOnN][a-zA-Z][a-zA-Z]+\W*$
ought to catch the majority of TLDs with a 3+ ASCII character
extension that isn't .com, .org, or .net, but without a larger sample
of "good" and "bad" addresses, I can't guarantee no false positives.
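A caution on that last pattern, based on a quick check with Python's re module (an assumption; other engines should agree on this point, but test against your own). As posted, the unescaped "." lets [^cCoOnN] match the dot itself, so .com addresses appear to match too. The "escaped" variant below is a hypothetical fix, not something from the thread. Even then the class rejects every TLD that starts with c, o, or n, not only com, org, and net. The sample addresses are invented apart from the agma69.top one quoted earlier.

```python
import re

PREFIX = r"^\W*((From:)|(Return-Path:)).*"
TAIL = r"[^cCoOnN][a-zA-Z][a-zA-Z]+\W*$"

as_posted = re.compile(PREFIX + r"." + TAIL)  # "." matches any character
escaped = re.compile(PREFIX + r"\." + TAIL)   # hypothetical: literal dot

samples = [
    "From: fred@example.com",
    "From: deals@example.bid",
    "From: -BedBugs-@agma69.top",
    "From: promo@example.club",
]

posted_hits = [as_posted.search(s) is not None for s in samples]
escaped_hits = [escaped.search(s) is not None for s in samples]

for line, posted, fixed in zip(samples, posted_hits, escaped_hits):
    print(posted, fixed, line)
```

With the dot escaped, .com is left alone and .bid and .top are caught, but .club slips through because it starts with "c".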
I write a lot of regexes in my day job (which is not to say that I get
them right the first time, every time!)  Assuming a Perl-compatible
implementation (which most of them are, more or less), "man perlre" is
a decent reference for the complicated bits.  Just scroll past the
section on modifiers.
E. Liddell
Now that looks like the regex bible, Thanks a bunch.  That needs printed 
and placed in the middle of the house little room. :)
...

Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page http://geneslinuxbox.net:6309/gene
