[Namazu-users-ja 524] File names end up in NMZ.field.subject

野宮 賢 / NOMIYA Masaru nomiyac360 @ mg.point.ne.jp
Sat Sep 24 16:11:06 JST 2005


This is Nomiya.

My environment is as follows.

# mknmz -C
Loaded rcfile: /etc/namazu/mknmzrc /home/masaru/.mknmzrc
System: linux
Namazu: 2.0.14
Perl: 5.008006
File-MMagic: 1.22
NKF: module_nkf
KAKASI: module_kakasi -ieuc -oeuc -w
ChaSen: module_chasen -j -F '%m '
Wakatigaki: module_kakasi -ieuc -oeuc -w
Lang_Msg: ja_JP
Lang: ja_JP
Coding System: euc
CONFDIR: /etc/namazu
LIBDIR: /usr/share/namazu/pl
FILTERDIR: /usr/share/namazu/filter
TEMPLATEDIR: /usr/share/namazu/template
Supported media types:   (26)
Unsupported media types: (7) marked with minus (-) when the required tool is not in $path
  application/excel: excel.pl
  application/ichitaro5: taro56.pl
  application/ichitaro6: taro56.pl
- application/ichitaro7: taro7_10.pl
  application/macbinary: macbinary.pl
  application/msword: msword.pl
  application/pdf: pdf.pl
- application/postscript: postscript.pl
  application/powerpoint: powerpoint.pl
- application/rtf: rtf.pl
  application/vnd.sun.xml.calc: ooo.pl
  application/vnd.sun.xml.draw: ooo.pl
  application/vnd.sun.xml.impress: ooo.pl
  application/vnd.sun.xml.writer: ooo.pl
  application/x-apache-cache: apachecache.pl
  application/x-bzip2: bzip2.pl
  application/x-compress: compress.pl
- application/x-deb: deb.pl
  application/x-dvi: dvi.pl
  application/x-gzip: gzip.pl
- application/x-js-taro: taro7_10.pl
  application/x-rpm: rpm.pl
- application/x-tex: tex.pl
- audio/mpeg: mp3.pl
  message/news: mailnews.pl
  message/rfc822: mailnews.pl
  text/hnf: hnf.pl
  text/html: html.pl
  text/html; x-type=mhonarc: mhonarc.pl
  text/plain
  text/plain; x-type=rfc: rfc.pl
  text/x-hdml: hdml.pl
  text/x-roff: man.pl

Incidentally, I have the following installed:

nkf 2.04
kakasi 2.3.4
Text::Kakasi 1.05

The targets of mknmz are laid out as follows:

  ~/var/news/foo
          ../foo001
          ../foo002
           :
          ../bar
          ../bar001
          ../bar002
           :

These are MH-format files under the directory structure above. Every file
name is a number from 1 to 500, and every file's charset is ISO-2022-JP.
In this environment, I ran

  # mknmz -aEK -O ~/.News.nmz -f ~/.mknmzrc ~/var/news

and NMZ.field.subject ended up containing

1
2
3
:  
500
1
2
:

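that is, the file names rather than each file's Subject.
With ordinary mail as the target, however, NMZ.field.subject correctly holds
the Subject.
So I suspect the problem may lie in the internal layout of the target files.
The files in question look like this: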



Subject: =?ISO-2022-JP?B?GyRCQ09KfTh4TDMwdyRONWtNPzpvODohIjojRy8kTzpHGyhC?=
 =?ISO-2022-JP?B?GyRCOWIkTiMxIzQjNSMxMi8xXxsoQg==?=
From: =?ISO-2022-JP?B?GyRCbCZHZD83SjkbKEIgKBskQkAvPCMbKEIp?=
 <webmaster @ www.yomiuri.co.jp>
Date: Sat, 24 Sep 2005 03:06 +0900
Message-ID: <20050924.ia01%politics.yomiuri.co.jp>
Lines: 0
Xref: http://www.yomiuri.co.jp/politics/news/20050924ia01.htm
X-Face: Ygq$6P.,%Xt$U)DS)cRY @ k$VkW!7(X'X'?U{{osjjFG"E]hND;SPJ-J?O?R|a?Lg2$0rVng=O3Lt}?~IId8Jj&vP^3*o=LKUyk(`t%0c!;t6REk=JbpsEn9MrN7gZ%
Content-Type: text/plain; charset=ISO-2022-JP
MIME-Version: 1.0

 今年4月1日現在の地方公務員給与の削減額が前年同期比46億円増の1451億円に
[...]
どの給与削減を行っていた。昨年は44%だった。削減率が高かったのは、島根県の10〜
6%、長野県の10〜5%など。

(2005年9月24日3時6分  読売新聞)



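As shown, these files do not have the header section that ordinary mail has
(they begin directly with the Subject: line).
If there is a way to have the Subject stored in NMZ.field.subject even for
files like these, I would be grateful if you could tell me.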
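For reference, here is a minimal sketch in Python (not part of Namazu, and not a statement about how the mailnews.pl filter works) that checks whether a file beginning directly with Subject: can still be read as an RFC 822-style message, and that the folded, MIME-encoded Subject decodes cleanly. The SAMPLE text is an abbreviated, hypothetical copy of the file above, and decoded_subject is a helper name made up for this illustration.

```python
from email import message_from_string
from email.header import decode_header, make_header

# Hypothetical, abbreviated copy of one target file: it starts directly
# with "Subject:" and carries only a handful of header fields.
SAMPLE = (
    "Subject: =?ISO-2022-JP?B?GyRCQ09KfTh4TDMwdyRONWtNPzpvODohIjojRy8kTzpHGyhC?=\n"
    " =?ISO-2022-JP?B?GyRCOWIkTiMxIzQjNSMxMi8xXxsoQg==?=\n"
    "Date: Sat, 24 Sep 2005 03:06 +0900\n"
    "Message-ID: <20050924.ia01%politics.yomiuri.co.jp>\n"
    "Content-Type: text/plain; charset=ISO-2022-JP\n"
    "MIME-Version: 1.0\n"
    "\n"
    "body text\n"
)


def decoded_subject(text: str) -> str:
    """Return the Subject of an RFC 822-style text with encoded words decoded."""
    msg = message_from_string(text)
    raw = msg.get("Subject", "")
    # decode_header splits the RFC 2047 encoded words; make_header joins
    # them back into a single Unicode string.
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    print(decoded_subject(SAMPLE))
```

If this prints a readable Japanese subject line, the header block itself is well-formed, which would suggest the question is how mknmz classifies these files rather than whether the Subject can be parsed.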

My .mknmzrc is as follows:

#
# This is a Namazu configuration file for mknmz.
#
package conf;  # Don't remove this line!

#===================================================================
#
# Administrator's email address
#
# $ADDRESS = 'webmaster @ snell.suse.de';


#===================================================================
#
# Regular Expression Patterns
#

#
# This pattern specifies HTML suffixes.
#
$HTML_SUFFIX = "html?|[ps]html|html\\.[a-z]{2}";

#
# This pattern specifies file names which will be targeted.
# NOTE: It can be specified by --allow=regex option.
#       Do NOT use `$' or `^' anchors.
#       Case-insensitive.
#
$ALLOW_FILE =   ".*\\.(?:$HTML_SUFFIX)|.*\\.txt" . # HTML, plain text
                "|.*\\.gz|.*\\.Z|.*\\.bz2" .       # Compressed files
                "|.*\\.pdf|.*\\.ps" .              # PDF, PostScript
                "|.*\\.tex|.*\\.dvi" .             # TeX, DVI
                "|.*\\.rpm|.*\\.deb" .             # RPM, DEB
                "|.*\\.doc|.*\\.xls|.*\\.pp[st]" . # Word, Excel, PowerPoint
                "|.*\\.j[sabf]w|.*\\.jtd" .        # Ichitaro 4, 5, 6, 7, 8
                "|.*\\.sx[widc]" .                 # OpenOffice Writer,Calc,Impress,Draw
                "|.*\\.rtf" .                      # Rich Text Format
                "|.*\\.hdml" .                     # HDML
                "|.*\\.mp3" .                      # MP3 
                "|\\d+|[-\\w]+\\.[1-9n]";          # Mail/News, man

#
# This pattern specifies file names which will NOT be targeted.
# NOTE: It can be specified by --deny=regex option.
#       Do NOT use `$' or `^' anchors.
#       Case-insensitive.
#
$DENY_FILE = ".*\\.(gif|png|jpg|jpeg)|.*\\.tar\\.gz|core|.*\\.bak|.*~|\\..*|\x23.*";

#
# This pattern specifies PATHNAMEs which will NOT be targeted.
# NOTE: Usually specified by --exclude=regex option.
#
$EXCLUDE_PATH = undef;

#
# This pattern specifies file names which can be omitted 
# in URI.  e.g., 'index.html|index.htm|Default.html'
#
# NOTE: This is similar to Apache's "DirectoryIndex" directive.
#
$DIRECTORY_INDEX = "";

#
# This pattern specifies Mail/News's fields in its header which 
# should be searchable.  NOTE: case-insensitive
#
$REMAIN_HEADER = "From|Date|Message-ID";

#
# This pattern specifies fields which are used for field-specified 
# searching.  NOTE: case-insensitive
# 
$SEARCH_FIELD = "message-id|subject|from|date|uri|newsgroups|to|summary|size";

#
# This pattern specifies meta tags which are used for field-specified 
# searching.  NOTE: case-insensitive
#
$META_TAGS = "keywords|description";

#
# This pattern specifies aliases for NMZ.field.* files.
# NOTE: Editing NOT recommended.
#
%FIELD_ALIASES = ('title' => 'subject', 'author' => 'from');

#
# This pattern specifies HTML elements which should be replaced with 
# null string when removing them. Normally, the elements are replaced 
# with a single space character.
#
$NON_SEPARATION_ELEMENTS = 'A|TT|CODE|SAMP|KBD|VAR|B|STRONG|I|EM|CITE|FONT|U|'.
                       'STRIKE|BIG|SMALL|DFN|ABBR|ACRONYM|Q|SUB|SUP|SPAN|BDO';

#
# This pattern specifies attribute of a HTML tag which should be 
# searchable.
#
$HTML_ATTRIBUTES = 'ALT|SUMMARY|TITLE';

#===================================================================
# 
# Critical Numbers
# 

# 
# The max size of files which can be loaded in memory at once.
# If you have much memory, you can increase the value.
# If you have less memory, you can decrease the value.
#
$ON_MEMORY_MAX   = 5000000;

#
# The max file size for indexing. Files larger than this 
# will be ignored.
# NOTE: This value is usually larger than TEXT_SIZE_MAX because 
#       binary-formatted files such as PDF, Word are larger.
#
$FILE_SIZE_MAX   = 2000000;

#
# The max text size for indexing. Files larger than this 
# will be ignored.
#
$TEXT_SIZE_MAX   =  600000;

#
# The max length of a word. Words longer than this will be ignored.
#
$WORD_LENG_MAX   = 128;


#
# Weights for HTML elements which are used for term weighting.
#
# %Weight = 
#     (
#      'html' => {
#          'title'  => 16,
#          'h1'     => 8,
#          'h2'     => 7,
#          'h3'     => 6,
#          'h4'     => 5,
#          'h5'     => 4,
#          'h6'     => 3,
#          'a'      => 4,
#          'strong' => 2,
#          'em'     => 2,
#          'kbd'    => 2,
#          'samp'   => 2,
#          'var'    => 2,
#          'code'   => 2,
#          'cite'   => 2,
#          'abbr'   => 2,
#          'acronym'=> 2,
#          'dfn'    => 2,
#      },
#      'metakey' => 32, # for <meta name="keywords" content="foo bar">
#      'headers' => 8,  # for Mail/News' headers
# );

#
# The max length of a HTML-tagged string which can be processed for
# term weighting. 
# NOTE: Quite a few people misuse <h[1-6]> merely to change
#       the font size.
#
$INVALID_LENG = 128; 

#
# The max length of a field.
# This MUST be smaller than libnamazu.h's BUFSIZE (usually 1024).
#
$MAX_FIELD_LENGTH = 200;


#===================================================================
#
# Software for handling Japanese text
#

#
# Network Kanji Filter nkf v1.62 or later
#
$NKF = "module_nkf"; 

#
# KAKASI
#
$KAKASI = "module_kakasi -ieuc -oeuc -w";

#
# ChaSen 1.51 or later (simple wakatigaki)
#
# $CHASEN = "module_chasen -j -F '\%m '";

#
# ChaSen 1.51 or later (with noun words extraction)
#
# $CHASEN_NOUN = "module_chasen -j -F '\%m %H\\n'";

#
# Default Japanese processor: KAKASI or ChaSen.
#
$WAKATI  = $KAKASI;


#===================================================================
#
# Directories
#
$LIBDIR = "/usr/bin/perl5";
$FILTERDIR = "usr/share/namazu/filter";
$TEMPLATEDIR = "/usr/share/namazu/template";

# 1;

That is the whole of my .mknmzrc.
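As a sanity check on the patterns above, the following Python sketch tests whether purely numeric file names are accepted by $ALLOW_FILE and not rejected by $DENY_FILE. It assumes that the patterns are applied to the whole file name, case-insensitively, as the comments in the rc file imply; the function name is_targeted is made up for this illustration, and the code only approximates whatever mknmz does internally.

```python
import re

# Patterns transcribed from the .mknmzrc above (Perl string escapes removed).
HTML_SUFFIX = r"html?|[ps]html|html\.[a-z]{2}"

ALLOW_FILE = (
    r".*\.(?:" + HTML_SUFFIX + r")|.*\.txt"
    + r"|.*\.gz|.*\.Z|.*\.bz2"
    + r"|.*\.pdf|.*\.ps"
    + r"|.*\.tex|.*\.dvi"
    + r"|.*\.rpm|.*\.deb"
    + r"|.*\.doc|.*\.xls|.*\.pp[st]"
    + r"|.*\.j[sabf]w|.*\.jtd"
    + r"|.*\.sx[widc]"
    + r"|.*\.rtf"
    + r"|.*\.hdml"
    + r"|.*\.mp3"
    + r"|\d+|[-\w]+\.[1-9n]"
)

DENY_FILE = r".*\.(gif|png|jpg|jpeg)|.*\.tar\.gz|core|.*\.bak|.*~|\..*|\x23.*"


def is_targeted(name: str) -> bool:
    """Assumed rule: the whole name must match ALLOW_FILE and not DENY_FILE."""
    allowed = re.fullmatch(ALLOW_FILE, name, re.IGNORECASE)
    denied = re.fullmatch(DENY_FILE, name, re.IGNORECASE)
    return allowed is not None and denied is None


if __name__ == "__main__":
    for name in ["1", "500", "index.html", "foo.tar.gz", "core", ".mknmzrc"]:
        print(name, is_targeted(name))
```

Under that assumption, names such as 1 and 500 pass through the `\d+` alternative, which is consistent with the files being indexed at all; the open question is only which field value gets recorded for them.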

---
  野宮  賢             mail-to: nomiyac360 @ mg.point.ne.jp

       "A society bound to e-mail and mobile phones takes away the freedom
        to face oneself and to lose oneself in daydreams."
                                                  -- M. Crichton --


