Namazu-users-en(old)


Re: Malformed UTF-8 character ...

From: Pankaj K Garg <garg@xxxxxxxxxxxxx>
Date: Wed, 05 May 2004 15:30:45 -0700
X-ml-name: namazu-users-en
X-mail-count: 00499
References: <200405052211.i45MBXx19118@gator.earlhood.com>

I hit the same problem with Perl; it went away after setting
LANG=en_US.ISO8859-1

Earl Hood wrote:

Namazu version: 2.0.13
Perl version: 5.8.4
OS: Linux 2.4.21-4.ELsmp #1 SMP Fri Oct 3 17:52:56 EDT 2003 i686 i686 i386 GNU/Linux

Running mknmz generates the following message repeatedly:

Malformed UTF-8 character (unexpected continuation byte 0xa4, with no preceding start byte) in pattern match (m//) at /usr/local/share/namazu/filter/mailnews.pl line 216, <GEN5> line 71. ...
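
For context, the byte named in the message can be checked by hand. The sketch below is Python rather than the Perl used by mailnews.pl, and the helper names are made up for illustration. It only shows why a lone 0xa4 cannot be valid UTF-8: its bit pattern is 10xxxxxx, which is reserved for continuation bytes.

```python
# Illustration only: Python, not the Perl code from mailnews.pl.
# The function names here are hypothetical helpers.

def is_utf8_continuation(byte: int) -> bool:
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # 0xa4 on its own is a continuation byte with no start byte.
    print(is_utf8_continuation(0xA4))
    print(decodes_as_utf8(b"\xa4"))
    # Preceded by the start byte 0xc2 it forms a valid sequence (U+00A4).
    print(decodes_as_utf8(b"\xc2\xa4"))
```

In a single-byte encoding such as ISO-8859-1 the same octet is an ordinary character, which is consistent with the warning depending on the locale in effect.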


Figuring it was a LANG environment variable setting, I explicitly set LANG
to en_US (it had defaulted to en_US.UTF-8), but that did not fix it.
Maybe I should try en_US.ISO-8859-1?

To suppress the message I added a "use bytes" pragma to mailnews.pl
to avoid Perl doing any character processing:

--- mailnews.pl.20040505        2004-05-05 14:52:23.000000000 -0700
+++ mailnews.pl 2004-05-05 14:53:56.000000000 -0700
@@ -209,6 +209,7 @@ sub mailnews_citation_filter ($$) {
     $$contref = "";
     my $i = 0;
     for my $line (@tmp) {
+	use bytes;
	# Complete excluding is impossible. I tnink it's good enough.
         # Process only first five paragrahs.
	# And don't handle the paragrah which has five or longer lines.

I put the pragma just within the block that was generating the
warnings.

I'm unsure if this is the best fix, but since mailnews.pl contains
8-bit values in a regex, something should be done to avoid Perl
trying to interpret the octets under a character encoding.
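
As a rough analogue of what "use bytes" is meant to achieve, here is a small Python sketch of matching on raw octets so that no character decoding takes place. The pattern is a made-up placeholder containing one 8-bit value; it is not the regex from mailnews.pl, and the function name is hypothetical.

```python
import re

# Hypothetical byte pattern with an 8-bit value; NOT the regex in mailnews.pl.
# Compiling from a bytes literal makes the match operate on octets only.
HIGH_OCTET_PREFIX = re.compile(rb"^\xa4+")


def starts_with_high_octet(line: bytes) -> bool:
    """Match against undecoded bytes, so no encoding is ever assumed."""
    return HIGH_OCTET_PREFIX.match(line) is not None


if __name__ == "__main__":
    print(starts_with_high_octet(b"\xa4\xa4 some text"))
    print(starts_with_high_octet(b"plain ascii"))
```

Because both the pattern and the input stay as bytes, a stray 0xa4 is just a value to compare, not something that has to be valid in any character encoding.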

It may be better to make the code conditional on the language
setting, i.e., have a different regex for each supported locale.
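
One possible shape for a per-locale regex, sketched in Python rather than Perl. Everything here is an assumption for illustration: the table, the helper names, and the patterns (a ">" or "»" quote marker) are placeholders and do not come from mailnews.pl.

```python
import re

# Hypothetical fallback and per-codeset patterns; placeholders only.
DEFAULT_PATTERN = re.compile(rb"^>")
PATTERNS_BY_CODESET = {
    "utf-8": re.compile(rb"^(?:>|\xc2\xbb)"),   # U+00BB encoded as UTF-8
    "iso8859-1": re.compile(rb"^(?:>|\xbb)"),   # U+00BB encoded as ISO-8859-1
}


def codeset_of(lang: str) -> str:
    """Return the lowercased codeset part of a LANG value, or '' if absent."""
    _, _, rest = lang.partition(".")
    codeset, _, _ = rest.partition("@")
    # Treat "ISO-8859-1" and "ISO8859-1" as the same spelling.
    return codeset.lower().replace("iso-8859", "iso8859")


def pattern_for(lang: str):
    """Pick the pattern for the LANG value, falling back to the default."""
    return PATTERNS_BY_CODESET.get(codeset_of(lang), DEFAULT_PATTERN)


if __name__ == "__main__":
    print(codeset_of("en_US.UTF-8"))
    print(pattern_for("en_US.ISO8859-1").pattern)
    print(pattern_for("en_US") is DEFAULT_PATTERN)
```

The point of the lookup is that each pattern only ever contains octets that are valid in its own encoding, so nothing has to be reinterpreted at match time.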

--ewh

--
Pankaj K Garg                         garg@xxxxxxxxxxxxx
1684 Nightingale Avenue               408-373-4027
Suite 201                             408-733-2737(fax)
Sunnyvale, CA 94087

		http://www.zeesource.net
