namazu-dev(ring)



Re: mail from the author of xpdf



Satoru Takabayashi <satoru-t@xxxxxxxxxxxxxxxxxx> wrote:

>The following mail arrived. I have to study for an exam
>today, so I will reply tomorrow or later.

I sent the following reply. Please point out any problems
with it.


Thank you for emailing me with the enquiry below.

"Derek B. Noonburg" <derekn@xxxxxxxxxxx> wrote:

>I received email from Arumugam-san asking about using your Namazu search
>software to index and search PDF files.  I'm the author of xpdf, which
>includes a program called pdftotext that extracts the text from PDF
>files.  Currently, xpdf can display Japanese text, but pdftotext cannot
>extract it (pdftotext only handles 8-bit fonts).

Wow, that's good news not only for me but for all Japanese
UNIX users struggling to handle PDF files in text
processing!


>It should not be too hard for me to add support for Japanese text to
>pdftotext.  One thing I need to know is: what encoding does Namazu use
>for Japanese text?  PDF files use Adobe Japan1-2 (and variations)
>internally.  I already have a mapping from Japan1-2 to JIS X 0208-1983.
>Is this useful?  Also, is there some way of distinguishing 8-bit and
>16-bit characters in the same text file?

Namazu uses a tool called NKF[*1] for reading Japanese
text. NKF can handle Japanese text encoded in ISO-2022-JP
(RFC 1468), EUC-JP (Extended UNIX Code), and Shift_JIS
(created by Microsoft).

  1. <ftp://ftp.ie.u-ryukyu.ac.jp/pub/software/kono/nkf171.shar>

The internal encoding of Namazu is EUC-JP. I chose EUC-JP
because it makes handling Japanese text in Perl very easy.
(ISO-2022-JP and Shift_JIS are cumbersome to handle.)

If you want to remove all Japanese characters from a text,
you can just do this (in Perl):

  # $content holds text encoded in EUC-JP,
  # possibly containing Japanese characters.
  # Remove all Japanese characters:
  $content =~ s/[\xa1-\xfe][\xa1-\xfe]//g;

In short, the regex "[\xa1-\xfe][\xa1-\xfe]" matches one
Japanese character, which takes 16 bits (two bytes) in the
text. In an EUC-JP encoded text, the charset for Japanese
characters is JIS X 0208-1983, and the MSB of every byte of
such a character is set to 1.

Since a single-byte character in EUC-JP can only be an
ASCII character in the range [\x00-\x7f], and bytes in the
range [\x80-\xff] appear only inside multi-byte characters,
you can easily distinguish single-byte ASCII characters
from two-byte JIS X 0208-1983 characters.
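Not part of the original mail, but the same stripping can be sketched in Python with a bytes regex; the sample string below is a made-up EUC-JP fragment ("abc" plus two hiragana, then "def"):

```python
import re

# Hypothetical EUC-JP sample: ASCII "abc", two JIS X 0208
# characters (each two bytes in 0xA1-0xFE), then "def".
data = b"abc\xa4\xa2\xa4\xa4def"

# Remove every two-byte Japanese character: both bytes of
# such a character fall in the range 0xA1-0xFE.
ascii_only = re.sub(b"[\xa1-\xfe][\xa1-\xfe]", b"", data)
print(ascii_only)  # b'abcdef'
```

This works on raw bytes, just as the Perl one-liner above does, without decoding the text first.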

In other words, EUC-JP is an encoding constructed by the
following rules:

  * For single-byte characters, EUC-JP uses the ASCII
    charset, which takes 8 bits per character and has the
    code range [\x00-\x7f].
  * For two-byte characters, EUC-JP uses the JIS X 0208-1983
    charset, which takes 16 bits per character with the MSB
    of each byte set to 1, so each byte falls in the range
    [\xa1-\xfe].
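As an illustration (not from the original mail), the two rules above can be written as a small Python routine that walks EUC-JP bytes and classifies each character; it deliberately ignores the JIS X 0201 and JIS X 0212 cases, as the mail suggests:

```python
def split_euc_jp(data: bytes):
    """Split EUC-JP bytes into (kind, char_bytes) pairs.

    kind is 'ascii' for single-byte characters and
    'jisx0208' for two-byte ones. Simplified sketch:
    JIS X 0201 (SS2) and JIS X 0212 (SS3) are not handled.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:                    # MSB 0: single-byte ASCII
            out.append(("ascii", data[i:i + 1]))
            i += 1
        elif 0xA1 <= b <= 0xFE:          # first byte of a JIS X 0208 pair
            out.append(("jisx0208", data[i:i + 2]))
            i += 2
        else:
            raise ValueError(f"unexpected byte 0x{b:02x} at offset {i}")
    return out

print(split_euc_jp(b"a\xa4\xa2"))  # [('ascii', b'a'), ('jisx0208', b'\xa4\xa2')]
```

Because the two ranges never overlap, the scanner never needs lookahead beyond the current character.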

Strictly speaking, there are additional rules for encoding
JIS X 0201 (the so-called Hankaku-Kana) and JIS X 0212 (the
so-called Hojo-Kanji). But those are rarely used and,
moreover, difficult to follow, so I think you can ignore
them. (I always do.)

If you want accurate information on Japanese text
processing, I recommend consulting this book:

  * CJKV Information Processing : Chinese, Japanese, Korean & Vietnamese
    <http://www.oreilly.com/catalog/cjkvinfo/noframes.html>

and check out its author's webpage:

  * Ken Lunde's Home Page
    <http://www.ora.com/people/authors/lunde/>


Anyway, I know of another tool that extracts text from PDF
files. PDF2TXT, written in Perl by <ishida@xxxxxxxxxxxxxxx>,
can handle Japanese text. I think it will be helpful for
learning how to support Japanese text. You can get PDF2TXT
from the following URI.

  * /pub/person/ishida/freeware/pdf2txt directory
    <ftp://paprika.noc.intec.co.jp/pub/person/ishida/freeware/pdf2txt/>


If you have any questions, please feel free to email me.

Regards,
-- Satoru Takabayashi

>I'm Cc'ing mimasa-san. (I would like his advice.)

What has mimasa-san been up to lately? If you are free, why
not come to the party on June 5? (Though mimasa-san's work
probably never leaves him free time ;-)