[perl.git] branch smoke-me/trie2, created. v5.15.7-490-gda18153

demerphq

2012-02-20 14:51:44 UTC

Just in case anybody cares, I rebased this against the most recent
blead to see what happens to the Win32 errors which I do not think are
related to the patch.

yves

Post by Yves Orton
In perl.git, the branch smoke-me/trie2 has been created
<http://perl5.git.perl.org/perl.git/commitdiff/da181531d429d85afc7242ab6c2b763d42fad35f?hp=0000000000000000000000000000000000000000>
at da181531d429d85afc7242ab6c2b763d42fad35f (commit)
- Log -----------------------------------------------------------------
commit da181531d429d85afc7242ab6c2b763d42fad35f
Date: Sun Feb 19 21:32:05 2012 +0100
rework how the trie logic handles the newer EXACT nodetypes
This cleans up and simplifies and extends how the trie
logic interacts with the new node types. This change ultimately
makes the EXACTFU, EXACTFU_SS, EXACTFU_NO_TRIE (renamed to
EXACTFU_TRICKYFOLD) work properly with the trie engine regardless
of whether the string is utf8 or latin1.
    EXACT              => utf8 or "binary" text
    EXACTFU            => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
    EXACTFU_SS         => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
    EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
    EXACTF             => "old style fold logic" untriable nodetype
    EXACTFA            => (currently) untriable nodetype
    EXACTFL            => (currently) untriable nodetype
See the comments in regcomp.sym for these fold types.
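As a rough illustration of the table above, the triable/untriable split can be written as a lookup. This is a Python sketch and not code from regcomp.c: the names FAMILY and trie_family are hypothetical, and it assumes (the message does not say so explicitly) that a single trie is built from one family of nodes, either plain EXACT or the EXACTFU variants.

```python
# Map each triable node type to the trie "family" it would belong to.
# EXACTF, EXACTFA and EXACTFL are deliberately absent: the list above
# describes them as untriable.
FAMILY = {
    "EXACT": "EXACT",
    "EXACTFU": "EXACTFU",
    "EXACTFU_SS": "EXACTFU",
    "EXACTFU_TRICKYFOLD": "EXACTFU",
}

def trie_family(nodetypes):
    """Return the shared family of a sequence of node types, or None.

    None means the sequence is empty, contains an untriable node type,
    or mixes families (an assumption of this sketch).
    """
    families = {FAMILY.get(t) for t in nodetypes}
    if len(families) == 1:
        return families.pop()  # still None if every node was untriable
    return None
```

For example, trie_family(["EXACTFU", "EXACTFU_SS"]) gives "EXACTFU", while any sequence containing EXACTFL gives None.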
This patch involves a number of distinct, but related parts. Starting
* Simplify how we detect a triable sequence given the new nodetypes.
This also probably fixed some "bugs" in how we detected certain
sequences, like /||foo|bar/.
* Simplify how we read EXACTFU nodes under utf8 by removing the now
redundant folding logic (EXACTFU nodes under utf8 are prefolded).
Also extend this logic to handle latin1 patterns properly (in
conjunction with other changes).
* Part of the problem with EXACTFU_SS and EXACTFU_TRICKYFOLD has to
do with how the trie logic interacts with the minlen logic. This
change handles both by pessimising the minlen when encountering
these nodetypes. One observation is that the minlen logic is basically
broken, and works only because it conflates bytes and codepoints in
such a way that we more or less always get a value small enough that
things work out anyway. Fixing that properly is the job of another
patch.
* Part of the problem of doing folding under Unicode rules is that
there are a lot of foldings possible, some with strange rules. This
means that the bitmap logic does not work correctly in all cases,
as we currently do not have any way to populate it properly.
So this patch disables the bitmap entirely when folding is involved
until that is fixed.
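To make the minlen point concrete, here is a small Python sketch of why a folded pattern's length is not a safe minimum length for the subject string. It is not how regcomp.c computes minlen. The helper names and the bound used in pessimistic_minlen are assumptions for illustration; the bound relies on full case folding expanding one codepoint to at most three.

```python
import math

# Assumption of this sketch: one codepoint folds to at most 3 codepoints.
MAX_FOLD_EXPANSION = 3

def fold_lengths(text):
    """Return (codepoints, codepoints after folding, UTF-8 bytes after folding)."""
    folded = text.casefold()  # full Unicode case folding
    return len(text), len(folded), len(folded.encode("utf-8"))

def pessimistic_minlen(folded_words):
    """Lower bound, in codepoints, on any subject that can match under folding."""
    return min(
        (math.ceil(len(word) / MAX_FOLD_EXPANSION) for word in folded_words),
        default=0,
    )

# "\xDF" is one codepoint but folds to the two-codepoint string "ss",
# so a pattern stored as "ss" must still be allowed to match one codepoint.
print(fold_lengths("\xdf"))
print(pessimistic_minlen(["ss"]))
```

The same string also has different lengths in bytes depending on the encoding (one byte in latin1, two in utf8 for "\xDF"), which is the bytes-versus-codepoints conflation mentioned above.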
The end result of this is: we can TRIE/AHOCORASICK any sequence of
EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable
the bitmap when folding.
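The end result can be sketched in Python as a trie with an optional first-character set standing in for the bitmap. This is only a toy model under stated assumptions: build_trie and match_prefix are hypothetical names, the trie is a nested dict rather than the engine's state tables, and dropping the bitmap is modelled by returning None for it.

```python
def build_trie(words, folding=False):
    """Build a nested-dict trie; return (trie, bitmap).

    bitmap is the set of possible first characters, or None when it
    cannot be trusted (folding is in effect, or a word is empty).
    """
    trie = {}
    for word in words:
        key = word.casefold() if folding else word
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = word  # the empty key marks an accepting state
    if folding or "" in words:
        bitmap = None
    else:
        bitmap = {word[0] for word in words}
    return trie, bitmap

def match_prefix(trie, bitmap, text, folding=False):
    """Return the longest word matching at the start of text, or None."""
    if bitmap is not None and (not text or text[0] not in bitmap):
        return None  # cheap rejection without walking the trie
    subject = text.casefold() if folding else text
    node, best = trie, trie.get("")
    for ch in subject:
        node = node.get(ch)
        if node is None:
            break
        best = node.get("", best)
    return best
```

With folding on, a bitmap built from the pattern's own first character would contain only "s" for the word "ss" and would wrongly reject a subject starting with "\xDF", which is why the sketch drops it in that case.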
A note for follow-up relating to this patch: given the way EXACTFU_XXX
nodes are currently dealt with, we won't build the "maximal" trie because
of their presence, instead creating a "jumptrie" consisting of either a
leading EXACTFU node followed by an EXACTFU_XXX node, or vice versa. We
should eventually address that.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
M regcomp.sym
M regexec.c
M regnodes.h
M t/re/fold_grind.t
commit 69be4c6300d5b40e0d7a3562b2bcd870cc5f205a
Date: Sun Feb 19 21:04:44 2012 +0100
make test.pl show test number and name in failure diagnostics output
The old output would show only the line number as diagnostics
but not the test number, nor the test name, which often contains
very useful information. This patch makes sure this is visible in
the diagnostics output of test failures.
M t/test.pl
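The idea of the test.pl change can be sketched as follows. This is Python rather than the Perl in t/test.pl, and both the function name failure_diag and the exact wording of the line are hypothetical; the real output format may differ.

```python
def failure_diag(test_num, name, filename, line):
    """Format a failure diagnostic that carries the test number and name.

    The layout here is illustrative only, not the format t/test.pl emits.
    """
    label = f"{test_num} - {name}" if name else str(test_num)
    return f"# Failed test {label} at {filename} line {line}"
```

Compared with a diagnostic that reports only the file and line, this keeps the test name next to the failure, which is where the useful information usually is.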
-----------------------------------------------------------------------
--
Perl5 Master Repository

--
perl -Mre=debug -e "/just|another|perl|hacker/"