Namazu-users-en(old)



Re: Polish characters in Namazu



On April 15, 2003 at 14:40, Bartosz Feński wrote:

> > By the way, the actual article in the mharc-users list seems to be
> > in quoted-printable encoding. Namazu couldn't handle such data.
> So the point is that Namazu can't work with ISO-8859-2 characters?

The quoted-printable encoding is irrelevant.  Namazu is indexing the
MHonArc message pages, so the quoted-printable data will already have
been decoded by MHonArc before Namazu indexes the files.
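
To illustrate what that decoding step amounts to, here is a small Python
sketch. It is not MHonArc's own code (MHonArc is written in Perl), and
the sample word is made up for the example. It uses the standard quopri
module to turn a quoted-printable sequence back into the raw ISO-8859-2
byte that would end up in the message page:

```python
import quopri

# Hypothetical sample: the Polish word "ręka" as it might appear in a
# quoted-printable mail body written in ISO-8859-2 (ę is byte 0xEA).
encoded = b"r=EAka"

# Decoding restores the raw 8-bit byte; no "=EA" sequence remains.
raw = quopri.decodestring(encoded)

# Interpreting the raw bytes as ISO-8859-2 yields the original word.
text = raw.decode("iso-8859-2")
print(raw, text)
```

By the time an indexer reads the page, it sees the decoded form, not the
"=EA" sequence.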

> > And character entity references in HTML files are also supported by
> > Namazu. I think the problem is in this case.
> Is there any way to fix it?
> I've got the locale set to pl_PL (ISO-8859-2).
> This is the only hint I've found in the Namazu documentation.

A simple experiment suggests that Namazu is 8-bit charset agnostic.
Looking at the Perl filters for Namazu, however, shows some potential
problems with numeric character entity references such as &#281;,
latin small letter e with ogonek (the code point 0xEA in ISO-8859-2
maps to the Unicode code point 0x119).

In Namazu's html.pl filter, the routine decode_numbered_entity() does
not appear to support numeric entities greater than 127.  Therefore,
something like &#281; gets mapped to the empty string.

The routine could (should?) be changed to allow values up to decimal
255, but that would not help in this case, since 0x119 is 281 decimal,
which does not fit in an 8-bit value.
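
The behavior described above, and one possible alternative, can be
sketched in Python. This is not Namazu's html.pl code (which is Perl);
the function names decode_ascii_only and decode_for_charset are
hypothetical, and the pattern only handles decimal references. The
first function drops anything above 127; the second decodes a reference
whenever the character exists in the target charset and leaves it
untouched otherwise:

```python
import re

# Matches decimal numeric character references such as "&#281;".
_NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def decode_ascii_only(text: str) -> str:
    """Mimic the described behavior: references above 127 become ""."""
    def repl(match: re.Match) -> str:
        code = int(match.group(1))
        return chr(code) if code <= 127 else ""
    return _NUMERIC_ENTITY.sub(repl, text)


def decode_for_charset(text: str, charset: str = "iso-8859-2") -> str:
    """Decode a reference only if its character exists in `charset`."""
    def repl(match: re.Match) -> str:
        code = int(match.group(1))
        try:
            char = chr(code)
            char.encode(charset)  # raises if not representable
        except (ValueError, OverflowError):
            return match.group(0)  # keep the reference as written
        return char
    return _NUMERIC_ENTITY.sub(repl, text)


sample = "r&#281;ka"  # hypothetical input: "ręka" with ę as an entity
print(repr(decode_ascii_only(sample)))
print(decode_for_charset(sample).encode("iso-8859-2"))
```

A plain "up to 255" limit would still drop &#281;, whereas looking the
character up in the target charset yields the single byte 0xEA.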

Therefore, you could configure MHonArc so that it does not convert
8-bit iso-8859-2 characters into entity references, which makes
iso-8859-2 the default character set of the pages.  For example:

<CharsetConverters>
iso-8859-2; mhonarc::htmlize
</CharsetConverters>

If you do this, you should change the IDXPGBEGIN, TIDXPGBEGIN, and
MSGPGBEGIN resources to include the following:

  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

so that browsers know iso-8859-2 is the document character set.
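
A short Python sketch shows why the declaration matters: the same byte
reads as a different letter depending on the charset assumed.
ISO-8859-1 is used here only as an example of a wrong guess; it is an
assumption for illustration, not a statement about what any particular
browser defaults to.

```python
# The single byte stored in the page for "ę" when entity conversion
# is turned off and the page is written in ISO-8859-2.
raw = b"\xea"

as_latin2 = raw.decode("iso-8859-2")  # the intended reading
as_latin1 = raw.decode("iso-8859-1")  # the reading under a wrong guess

print(as_latin2, hex(ord(as_latin2)))
print(as_latin1, hex(ord(as_latin1)))
```

With the meta element in place, the reader gets e with ogonek rather
than e with circumflex.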

--ewh