[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

Earl Hood earl at earlhood.com
Sun Jun 12 08:07:47 JST 2005


> > This is a "+from:earl" search.  Notice how the subject links in the
> > results are clipped.  The first parts of the subject text is not
> > printed.  However, examining NMZ.fields.subject shows that the complete
> > subjects are present.

After some further analysis (and many hours), I have determined
what triggers the malformed UTF-8 errors.

The problem is in html::decode_numbered_entity, which is invoked in the
regexes used by html::decode_entity.  If it is passed a number >= 160,
the call to sprintf() causes $$contref to get the UTF-8 flag set on
it (even though the locale is set to 'C'), causing Perl to do
subsequent UTF-8 checks.
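
The flag propagation can be reproduced outside of Namazu.  The following
is only a minimal sketch: decode_numbered_entity_sketch is a hypothetical
stand-in that assumes the real routine boils down to sprintf("%c", $num),
and $content stands in for $$contref.  It uses 12395, a code point well
above 255, and checks the flag with the core utf8::is_utf8 function.

```perl
use strict;
use warnings;

# Hypothetical stand-in for html::decode_numbered_entity; the real
# Namazu routine may do more than this.
sub decode_numbered_entity_sketch {
    my ($num) = @_;
    return sprintf("%c", $num);
}

# Returns 1 if Perl has marked the string as UTF-8 internally, else 0.
sub utf8_flag {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 1 : 0;
}

my $content = "Subject: ";                          # plain byte string
my $before  = utf8_flag($content);

# Appending one decoded wide character upgrades the whole string.
$content .= decode_numbered_entity_sketch(12395);
my $after = utf8_flag($content);

print "flag before: $before, after: $after\n";
```

Once a single wide character is appended, the whole buffer carries the
flag, not just the appended part.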

For example, the input data contains strings like:

  について助けてください!

When I add the following to html::decode_numbered_entity:

      return ""
          if $num >= 255;

All problems go away.  Search results no longer clip subjects,
and searching for "PHP" returns hits.

I believe the subject clipping occurs due to length offsets being wrong.
I think the offset written by mknmz does not take into account the
UTF-8 encoding of the text written (due to the UTF-8 flag getting set).
I.e., when computing the "size" it is actually getting the number
of _characters_ in the data and not the number of _octets_ that are
actually written.  This could cause the funny "clipping" of subjects,
or the wrong subjects being listed in search results.
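
To illustrate the character/octet mismatch, here is a small sketch.  The
sample subject and variable names are made up for illustration, and I am
assuming the text ends up on disk as UTF-8.  It compares what length()
reports for a flagged string with the number of octets actually written.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Four Japanese characters followed by "PHP", written as code points
# so the encoding of this source file does not matter.
my $subject = "\x{306B}\x{3064}\x{3044}\x{3066}PHP";

my $chars  = length $subject;                   # counts characters
my $octets = length encode('UTF-8', $subject);  # counts octets as written

print "characters: $chars, octets: $octets\n";
```

Here the two counts are 7 and 15.  An offset computed from the first
number falls short of the real position in the file, which would be
consistent with subjects appearing clipped or shifted.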

Now, it must be determined what the proper fix for this is.  Is
the above hack sufficient, or is something more robust needed?

--ewh

