Namazu-users-en(old)



Re: Malformed UTF-8 character ...



I had the same problem with Perl; setting the locale as follows fixed it:
LANG=en_US.ISO8859-1

Earl Hood wrote:
Namazu version: 2.0.13
Perl version: 5.8.4
OS: Linux 2.4.21-4.ELsmp #1 SMP Fri Oct 3 17:52:56 EDT 2003 i686 i686 i386 GNU/Linux

Running mknmz generates the following message repeatedly:

Malformed UTF-8 character (unexpected continuation byte 0xa4, with no preceding start byte) in pattern match (m//) at /usr/local/share/namazu/filter/mailnews.pl
line 216, <GEN5> line 71.
...



Figuring it was a LANG environment variable setting, I explicitly set LANG to en_US (it had defaulted to en_US.UTF-8), but that did not fix it. Maybe I should try en_US.ISO-8859-1?

To suppress the message I added a "use bytes" pragma to mailnews.pl
to avoid Perl doing any character processing:

--- mailnews.pl.20040505        2004-05-05 14:52:23.000000000 -0700
+++ mailnews.pl 2004-05-05 14:53:56.000000000 -0700
@@ -209,6 +209,7 @@ sub mailnews_citation_filter ($$) {
     $$contref = "";
     my $i = 0;
     for my $line (@tmp) {
+	use bytes;
	# Complete excluding is impossible. I tnink it's good enough.
         # Process only first five paragrahs.
	# And don't handle the paragrah which has five or longer lines.

I put the pragma just within the block that was generating the
warnings.

I'm unsure if this is the best fix, but since mailnews.pl contains
8-bit values in a regex, something should be done to avoid Perl
trying to interpret the octets under a character encoding.
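
As a rough illustration of why the pragma can sit inside just one block: "use bytes" is lexically scoped, so byte semantics apply only until the enclosing block ends. The sketch below is not taken from mailnews.pl and does not reproduce the warning; the sample string and variable names are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical sample: one character (U+263A) that Perl stores
# internally as three UTF-8 bytes.
my $str = "\x{263A}";

# Outside the pragma, length() counts characters.
my $chars_before = length($str);

my $bytes_inside;
{
    use bytes;    # byte semantics apply only inside this block
    $bytes_inside = length($str);
}

# Character semantics are back once the block is closed.
my $chars_after = length($str);

print "before: $chars_before, inside: $bytes_inside, after: $chars_after\n";
```

Because the scope is lexical, code elsewhere in the same file keeps its normal character handling.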

It may be better to make the code conditional on the language
setting, i.e., have a different regex for each supported locale.
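
A minimal sketch of that idea, assuming the language is taken from the leading two letters of LANG. The table, the function name quote_re_for_lang, and both patterns (including the \xA4\xCF octets) are hypothetical placeholders, not the regexes actually used in mailnews.pl.

```perl
use strict;
use warnings;

# Hypothetical per-language patterns; the octets are written as hex
# escapes so the source file itself stays 7-bit clean.
my %quote_re = (
    en => qr/^\s*>/,
    ja => qr/^\s*(?:>|\xA4\xCF)/,
);

# Pick a pattern from a LANG-style value such as "ja_JP.eucJP".
# Unknown or missing values fall back to the "en" pattern.
sub quote_re_for_lang {
    my ($lang) = @_;
    $lang = '' unless defined $lang;
    my ($code) = $lang =~ /^([a-z]{2})(?:[_.@]|$)/;
    return $quote_re{en} unless defined $code && exists $quote_re{$code};
    return $quote_re{$code};
}

my $re = quote_re_for_lang($ENV{LANG});
print "quoted line\n" if "> some quoted text" =~ $re;
```

Keeping the 8-bit patterns in a table keyed by language means a UTF-8 locale never has to compile a regex written for another encoding.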

--ewh


--
Pankaj K Garg                         garg@xxxxxxxxxxxxx
1684 Nightingale Avenue               408-373-4027
Suite 201                             408-733-2737(fax)
Sunnyvale, CA 94087

		http://www.zeesource.net