Discussion:
[perl.git] branch smoke-me/trie2, created. v5.15.7-490-gda18153
Yves Orton
2012-02-20 14:34:07 UTC
In perl.git, the branch smoke-me/trie2 has been created

<http://perl5.git.perl.org/perl.git/commitdiff/da181531d429d85afc7242ab6c2b763d42fad35f?hp=0000000000000000000000000000000000000000>

at da181531d429d85afc7242ab6c2b763d42fad35f (commit)

- Log -----------------------------------------------------------------
commit da181531d429d85afc7242ab6c2b763d42fad35f
Author: Yves Orton <***@gmail.com>
Date: Sun Feb 19 21:32:05 2012 +0100

rework how the trie logic handles the newer EXACT nodetypes

This cleans up and simplifies and extends how the trie
logic interacts with the new node types. This change ultimately
makes the EXACTFU, EXACTFU_SS, EXACTFU_NO_TRIE (renamed to
EXACTFU_TRICKYFOLD) work properly with the trie engine regardless
of whether the string is utf8 or latin1.

This patch depends on the following:

EXACT => utf8 or "binary" text

EXACTFU => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
EXACTFU_SS => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules

EXACTF => "old style fold logic" untriable nodetype
EXACTFA => (currently) untriable nodetype
EXACTFL => (currently) untriable nodetype

See the comments in regcomp.sym for these fold types.

This patch involves a number of distinct, but related parts. Starting
from compilation:

* Simplify how we detect a triable sequence given the new nodetypes.
This probably also fixed some "bugs" in how we detected certain
sequences, like /||foo|bar/.

* Simplify how we read EXACTFU nodes under utf8 by removing the now
redundant folding logic (EXACTFU nodes under utf8 are prefolded).
Also extend this logic to handle latin1 patterns properly (in
conjunction with other changes)

* Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD
have to do with how the trie logic interacts with the minlen logic.
This change handles both by pessimising the minlen when encountering
these nodetypes. One observation is that the minlen logic is basically
broken, and works only because it conflates bytes and codepoints in
such a way that we more or less always get a value small enough that
things work out anyway. Fixing that properly is the job of another patch.

* Part of the problem of doing folding under unicode rules is that
there are a lot of foldings possible, some with strange rules. This
means that the bitmap logic does not work correctly in all cases,
as we currently do not have any way to populate it properly.
So this patch disables the bitmap entirely when folding is involved
until that is fixed.
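
To make the minlen point above concrete, here is a minimal Python sketch (not the regcomp.c logic) of why a fold such as \xDF/ss forces the minimum length down. The helper names and the MULTI_CHAR_FOLDS table are hypothetical and exist only for this illustration; the table holds just the one fold named in the commit message.

```python
# Hypothetical table for this sketch: folded sequence -> shortest source length
# in codepoints. Only the \xDF/ss case from the commit message is listed.
MULTI_CHAR_FOLDS = {"ss": 1}


def sizes(s):
    """Return (codepoints, utf8_bytes, latin1_bytes) for a string."""
    return len(s), len(s.encode("utf-8")), len(s.encode("latin-1"))


def fold_equal(a, b):
    """Compare two strings under full Unicode case folding."""
    return a.casefold() == b.casefold()


def naive_minlen(pattern):
    """Codepoint length of the folded pattern; too large when folds expand."""
    return len(pattern.casefold())


def pessimistic_minlen(pattern):
    """Lower the naive bound for every sequence a shorter source could produce."""
    folded = pattern.casefold()
    length = len(folded)
    for seq, shortest in MULTI_CHAR_FOLDS.items():
        length -= folded.count(seq) * (len(seq) - shortest)
    return length


sharp_s = "\u00df"
print(sizes(sharp_s))            # one codepoint, but the byte count depends on encoding
print(fold_equal("ss", sharp_s)) # the two-character pattern matches a one-character target
print(naive_minlen("ss"), pessimistic_minlen("ss"))
```

The target \xDF is one codepoint long, so a minlen of 2 taken from the folded pattern "ss" would wrongly rule it out; the pessimistic bound of 1 does not.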

The end result of this is: we can TRIE/AHOCORASICK any sequence of
EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable
the bitmap when folding.
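
As a rough illustration of why the bitmap is disabled when folding, the following Python sketch builds a small trie over folded alternatives and uses a set of first characters as a stand-in for the start bitmap. This is a simplified model and an assumption of this note, not the data structure used in regcomp.c or regexec.c; all names (END, build_trie, match_at, search) are hypothetical.

```python
END = "\0end"  # hypothetical sentinel key marking an accepting state


def build_trie(words):
    """Build a nested-dict trie; each accepting node stores its word under END."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def match_at(trie, text):
    """Return the longest word in the trie that is a prefix of text, or None."""
    node, best = trie, None
    for ch in text:
        if ch not in node:
            break
        node = node[ch]
        best = node.get(END, best)
    return best


def search(words, text, use_bitmap):
    """Find the first position where any folded word matches the folded text."""
    folded_words = [w.casefold() for w in words]
    trie = build_trie(folded_words)
    bitmap = {w[0] for w in folded_words}  # populated from the folded pattern only
    for pos in range(len(text)):
        if use_bitmap and text[pos] not in bitmap:
            continue  # prefilter looks at the raw, unfolded character
        hit = match_at(trie, text[pos:].casefold())
        if hit is not None:
            return pos, hit
    return None


print(search(["ss", "perl"], "stra\u00dfe", use_bitmap=False))
print(search(["ss", "perl"], "stra\u00dfe", use_bitmap=True))
```

With the bitmap on, the search never tries position 4 because \xDF is not among the first characters of the folded alternatives, so a valid match is missed. Turning the bitmap off is the conservative choice until it can be populated with every character that can start a fold.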

A note for follow-up: because of the way EXACTFU_XXX nodes are currently
handled, we won't build the "maximal" trie when they are present. Instead
we create a "jumptrie" consisting of either a leading EXACTFU node
followed by an EXACTFU_XXX node, or vice versa. We should eventually
address that.

M embed.fnc
M embed.h
M proto.h
M regcomp.c
M regcomp.sym
M regexec.c
M regnodes.h
M t/re/fold_grind.t

commit 69be4c6300d5b40e0d7a3562b2bcd870cc5f205a
Author: Yves Orton <***@gmail.com>
Date: Sun Feb 19 21:04:44 2012 +0100

make test.pl show test number and name in failure diagnostics output

The old output would show only the line number as diagnostics
but not the test number, nor the test name, which often contains
very useful information. This patch makes sure this is visible in
the diagnostics output of test failures.
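
The idea can be sketched as follows. This is a hypothetical Python illustration of a failure diagnostic that carries the test number and name as well as the location; the function name and the exact line format are assumptions of this note and are not taken from t/test.pl.

```python
def failure_diag(test_number, name, filename, line):
    """Build TAP-style diagnostic lines for one failed test.

    The layout is illustrative only; it is not the literal t/test.pl output.
    """
    label = f"Failed test {test_number}"
    if name:
        label += f" - {name}"
    return [f"# {label}", f"#   at {filename} line {line}"]


for diag_line in failure_diag(7, "fold of \\xDF under /i", "t/re/fold_grind.t", 123):
    print(diag_line)
```

The test number, name, file, and line in the usage above are made-up values for demonstration.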

M t/test.pl
-----------------------------------------------------------------------

--
Perl5 Master Repository
demerphq
2012-02-20 14:51:44 UTC
Just in case anybody cares, I rebased this against the most recent
blead to see what happens to the Win32 errors, which I do not think are
related to the patch.

yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"