Namazu-users-ja (old)



Behavior with File::MMagic



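  This is Can.

  On a RedHat 9 server, mknmz dumps core while (or possibly just before)
  parsing PDF files, so the mknmz run cannot complete.

  The same files were processed without problems on Solaris 8, which is
  why this puzzles me.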

  The two configurations are:

  RedHat 9              Solaris 8
  Perl 5.8.0            Perl 5.6.1
  xpdf 2.02pl1          xpdf 2.00
  namazu 2.0.12         namazu 2.0.12

  The files in question are:

% file *
5968-5161E.pdf: Macintosh MacBinary data, type "PDF " (Portable Document Format), creator "CARO"
5968-5162E.pdf: PDF document, version 1.2
5968-5163E.pdf: Macintosh MacBinary data, type "PDF " (Portable Document Format), creator "CARO"

  Around 2000, Takabayashi-san and Nokubi-san exchanged messages about
  file type checking. I installed the script from that exchange and tried
  it on these files.

% cat File
#! /usr/bin/perl -w

use strict;
use FileHandle;    # the script calls FileHandle->new below
use File::MMagic; 
use Compress::Zlib;

for my $filename (@ARGV) {
	my $mm = File::MMagic->new;
    $mm->addSpecials("text/plain; x-type=rfc",
                        "^Network Working Group",
                        "^Request for Comments:",
                        "^Obsoletes:",
                        "^Category:",
                        "^Updates:");
    $mm->addSpecials("application/x-tex",
        '^\\\\document(style|class)');
    $mm->addFileExts('\\.tex$', 'application/x-tex');

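	# Skip unreadable files, and read in binary mode.
	my $fh = FileHandle->new("< $filename")
		or do { warn "cannot open $filename: $!\n"; next };
	binmode($fh);
	my $cont = join('', <$fh>);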
	my $type = $mm->checktype_contents($cont);
	if ($type =~ /^application\/x-gzip/) {
		{
			my $offset = 0;
			$offset += 3;
			my $flags = unpack('C', substr($cont, $offset, 1));
			$offset += 1;
			$offset += 6;
			$cont = substr($cont, $offset);
			# FEXTRA: skip the 2-byte little-endian length and the extra field itself
			$cont = substr($cont, 2 + unpack('v', $cont)) if ($flags & 0x04);
			$cont =~ s/^[^\0]*\0// if ($flags & 0x08);
			$cont =~ s/^[^\0]*\0// if ($flags & 0x10);
			$cont = substr($cont, 2) if ($flags & 0x02);
		}
		my $x = inflateInit(-WindowBits     =>  - MAX_WBITS()) ;
		my ($inf, $stat) = $x->inflate($cont);
		$cont = $inf if $stat == Z_OK or $stat == Z_STREAM_END ;
		$type = $mm->checktype_contents($cont);
		print "Compressed:";
	}
	print "$filename: $type\n";
}
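
The gzip branch of the script above steps over the gzip header by hand before handing the rest to inflate. As a rough illustration of that step only, the following sketch computes where the raw deflate data begins, following the header layout in RFC 1952. The function name gzip_data_offset is made up for this example and is not part of File::MMagic, Compress::Zlib, or Namazu.

```perl
use strict;
use warnings;

# Return the byte offset of the raw deflate stream inside a gzip member,
# following the header layout in RFC 1952, or -1 if the header is not a
# valid (or complete) gzip header. The function name is hypothetical.
sub gzip_data_offset {
    my ($data) = @_;
    return -1 if length($data) < 10;
    my ($id1, $id2, $cm, $flags) = unpack('C4', $data);
    return -1 unless $id1 == 0x1f && $id2 == 0x8b && $cm == 8;

    my $offset = 10;    # fixed part: ID1 ID2 CM FLG MTIME(4) XFL OS
    if ($flags & 0x04) {    # FEXTRA: 2-byte little-endian length, then data
        return -1 if length($data) < $offset + 2;
        $offset += 2 + unpack('v', substr($data, $offset, 2));
    }
    for my $bit (0x08, 0x10) {    # FNAME, then FCOMMENT: NUL-terminated
        next unless $flags & $bit;
        my $nul = index($data, "\0", $offset);
        return -1 if $nul < 0;
        $offset = $nul + 1;
    }
    $offset += 2 if $flags & 0x02;    # FHCRC: 2-byte header CRC
    return $offset <= length($data) ? $offset : -1;
}
```

With such a helper, the body to inflate would be substr($cont, gzip_data_offset($cont)), after checking that the offset is not -1.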

  The result is:

manager:/home/manager# File *
5968-5161E.pdf: text/plain
5968-5162E.pdf: application/pdf
5968-5163E.pdf: text/plain

  I do not know why this then ends in a core dump, but I have confirmed
  that pdftotext has no trouble once a file is recognized as PDF.

  What kind of fix is needed?
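
One possible direction, offered only as a sketch: the two files reported as text/plain are the ones that file identifies as MacBinary. A MacBinary file carries a 128-byte header in front of the data fork, so the PDF data would not start at offset 0 of the file. The code below strips that header when the leading bytes look like a MacBinary header. The function name strip_macbinary and the plausibility checks it applies are assumptions made for this example. It is not part of File::MMagic or Namazu, and it has not been tried against mknmz.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the data fork if $cont looks like a MacBinary file, otherwise
# return $cont unchanged. Header layout used here: byte 0 is zero, byte 1
# is the file name length (1..63), bytes 74 and 82 are zero, and bytes
# 83..86 hold the data fork length (big-endian). The data fork starts at
# offset 128. The function name is hypothetical.
sub strip_macbinary {
    my ($cont) = @_;
    return $cont if length($cont) <= 128;
    my ($zero, $namelen) = unpack('C C', $cont);
    return $cont unless $zero == 0 && $namelen >= 1 && $namelen <= 63;
    return $cont unless substr($cont, 74, 1) eq "\0"
                     && substr($cont, 82, 1) eq "\0";
    my $data_len = unpack('N', substr($cont, 83, 4));
    return $cont if $data_len == 0 || 128 + $data_len > length($cont);
    return substr($cont, 128, $data_len);
}

# Report, for each file given on the command line, whether the contents
# start with a PDF signature once a MacBinary header has been removed.
for my $filename (@ARGV) {
    open(my $fh, '<', $filename) or do { warn "$filename: $!\n"; next };
    binmode($fh);
    my $cont = do { local $/; <$fh> };
    close($fh);
    $cont = '' unless defined $cont;
    my $body = strip_macbinary($cont);
    printf "%s: %s\n", $filename,
        ($body =~ /^%PDF-/ ? 'starts with a PDF signature'
                           : 'no PDF signature at start');
}
```

In the test script above, the equivalent would be to call $cont = strip_macbinary($cont); before $mm->checktype_contents($cont). Whether the same idea can be applied inside mknmz, and whether it avoids the core dump, is not something this sketch establishes.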

  For reference, I am also including the output of the following command:

  perl -MFile::MMagic -e '$m = new File::MMagic; print "$File::MMagic::VERSION\n"; $m->check_magic();'

0       string  =BZh    application/x-bzip2
0       string  =#VRML V1.0 ascii       model/vrml
0       string  =#VRML V2.0 utf8        model/vrml
0       short   =51966
>2      short   =47806  application/java
0       string  =.snd
>12     belong  =1      audio/basic
>12     belong  =2      audio/basic
>12     belong  =3      audio/basic
>12     belong  =4      audio/basic
>12     belong  =5      audio/basic
>12     belong  =6      audio/basic
>12     belong  =7      audio/basic
>12     belong  =23     audio/x-adpcm
0       lelong  =6583086
>12     lelong  =1      audio/x-dec-basic
>12     lelong  =2      audio/x-dec-basic
>12     lelong  =3      audio/x-dec-basic
>12     lelong  =4      audio/x-dec-basic
>12     lelong  =5      audio/x-dec-basic
>12     lelong  =6      audio/x-dec-basic
>12     lelong  =7      audio/x-dec-basic
>12     lelong  =23     audio/x-dec-adpcm
8       string  =AIFF   audio/x-aiff
8       string  =AIFC   audio/x-aiff
8       string  =8SVX   audio/x-aiff
0       string  =MThd   audio/unknown
0       string  =CTMF   audio/unknown
0       string  =SBI    audio/unknown
0       string  =Creative Voice File    audio/unknown
0       string  =RIFF
>8      string  =WAVE   audio/x-wav
0       string  =/* XPM image/x-xbm
0       string  =/*     text/plain
0       string  =//     text/plain
0       string  =^_\235 application/x-compress
0       string  =^_\213 application/x-gzip
0       string  =^_^^   application/octet-stream
0       short   =7967   application/octet-stream
0       short   =8191   application/octet-stream
0       string  =\377^_ application/octet-stream
0       short   =51973  application/octet-stream
0       string  =<MakerFile     application/x-frame
0       string  =<MIFFile       application/x-frame
0       string  =<MakerDictionary       application/x-frame
0       string  =<MakerScreenFon        application/x-frame
0       string  =<MML   application/x-frame
0       string  =<Book  application/x-frame
0       string  =<Maker application/x-frame
0       string  =<HEAD  text/html
0       string  =<head  text/html
0       string  =<TITLE text/html
0       string  =<title text/html
0       string  =<html  text/html
0       string  =<HTML  text/html
0       string  =<!--   text/html
0       string  =<h1    text/html
0       string  =<H1    text/html
0       string  =P1     image/x-portable-bitmap
0       string  =P2     image/x-portable-greymap
0       string  =P3     image/x-portable-pixmap
0       string  =P4     image/x-portable-bitmap
0       string  =P5     image/x-portable-greymap
0       string  =P6     image/x-portable-pixmap
0       string  =IIN1   image/x-niff
0       string  =MM     image/tiff
0       string  =II     image/tiff
0       string  =GIF94z image/unknown
0       string  =FGF95a image/unknown
0       string  =PBF    image/unknown
0       string  =GIF    image/gif
0       beshort =65496  image/jpeg
0       string  =BM     image/bmp
0       string  =\211PNG        image/png
0       string  =;;     text/plain
0       string  =^J(    application/x-elc
0       string  =;ELC^S^@^@^@   application/x-elc
0       string  =Relay-Version: message/rfc822
0       string  =#! rnews       message/rfc822
0       string  =N#! rnews      message/rfc822
0       string  =Forward to     message/rfc822
0       string  =Pipe to        message/rfc822
0       string  =Return-Path:   message/rfc822
0       string  =Path:  message/news
0       string  =Xref:  message/news
0       string  =From:  message/rfc822
0       string  =Article        message/news
0       string  =\3767^@#       application/msword
0       string  =\333\245-^@^@^@        application/msword
0       string  =%!     application/postscript
0       string  =^D%!   application/postscript
0       string  =%PDF-  application/pdf
38      string  =Spreadsheet    application/x-sc
0       string  =\367^B application/x-dvi
0       string  =\input texinfo text/x-texinfo
0       string  =This is Info file      text/x-info
0       leshort =759    application/x-dvi
0       string  ={\rtf  application/rtf
0       string  =^@^@^A\263     video/mpeg
0       byte    =1      video/unknown
0       byte    =2      video/unknown
0       string  =DOC
>43     byte    =20     application/ichitaro4
>144    string  =JDASH  application/ichitaro4
0       string  =DOC
>43     byte    =21     application/ichitaro5
0       string  =DOC
>43     byte    =22     application/ichitaro6
2080    string  =Microsoft Excel 5.0 Worksheet  application/excel
2114    string  =Biff5  application/excel
0       string  =\224\246.      application/msword
0       belong  =834535424      application/msword
0       string  =PO^Q`  application/msword
0       string  =\320\317^Q\340\241\261^Z\341
>546    string  =bjbj   application/msword
>546    string  =jbjb   application/msword
512     string  =R^@o^@o^@t^@ ^@E^@n^@t^@r^@y   application/msword
2080    string  =Microsoft Word 6.0 Document    application/msword
2080    string  =Documento Microsoft Word 6     application/msword
2112    string  =MSWordDoc      application/msword
0       string  =\320\317^Q\340\241\261^Z\341   application/msword
0       belong  =435    video/mpeg
0       belong  =442    video/mpeg
0       beshort &65504  audio/mpeg
0       string  =MOVI   video/quicktime
4       string  =moov   video/quicktime
4       string  =mdat   video/quicktime
128     string  =PE^@^@ application/octet-stream
0       string  =PE^@^@ application/octet-stream
0       string  =LZ     application/octet-stream
0       string  =MZ
>24     string  =@      application/octet-stream
0       string  =MZ
>30     string  =Copyright 1989-1990 PKWARE Inc.        application/x-zip
0       string  =MZ
>30     string  =PKLITE Copr.   application/x-zip
0       string  =MZ
>36     string  =LHa's SFX      application/x-lha
0       string  =MZ
>36     string  =LHA's SFX      application/x-lha
0       string  =MZ     application/octet-stream
2       string  =-lh
>6      string  =-      application/x-lha
0       string  =PK     application/x-zip
257     string  =ustar^@        application/x-tar
257     string  =ustar  ^@      application/x-gtar
1.12

-- 

					ADVANTEST corp.
					Taiji.Can@xxxxxxxxxxxxxxxxxxx