namazu-ml(avocado)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: HTML detection

From: Satoru Takabayashi <satoru@xxxxxxxxxxxxx>
Date: Thu, 26 Nov 1998 04:18:01 +0900
X-ml-name: namazu
X-mail-count: 01660
References: <199811250834.RAA05942@libpc01.pb.ascii.co.jp>

SHIOZAKI Takehiko <takehi-s@xxxxxxxxxxx> wrote:

>mknmzのロボット除けのところで、$DEFAULT_FILEが考慮されていないことに気づ
>きました。
>HTMLファイルの検出を何ヶ所も書くのも何なので、こっそりと(嘘)例のCGIのも
>含めてサブルーチンにしてみました。お試しください。

ありがたく使わせていただきます。

-- Satoru Takabayashi

[namazu:01621] の末尾に添付した Perl版わかち書きプログラムの改良版
です。

使い方

  1. まず最初に kakasidict を DBM化する (時間がかかります)
     % perl wakati.pl -i kakasidict
     % ls kakasidict.*
     kakasidict.dir kakasi.pag
  
  2. 一度 DBMを作ってしまえば後は普通に使うだけです
     % perl wakati.pl kakasidict < input > output

注意事項

  1. EUC-JPしか受け付けません
  2. 送り仮名の処理はしていません。

性能

  % time kakasi -w < manual.html > /dev/null 
  real    0m1.877s
  user    0m1.640s
  sys     0m0.240s
  
  % time perl wakati.pl kakasidict < manual.html > /dev/null 
  real    0m3.980s
  user    0m3.780s
  sys     0m0.200s

  どなたか高速化に挑戦してみませんか?


#!/usr/bin/perl
require 5.004;
use strict;
use IO::File;
use Fcntl;
use SDBM_File;
my $CHAR  = "(?:[\x21-\x7e]|[\xa1-\xfe][\xa1-\xfe])";
my $NONKANJI = "(?:[\x21-\x7e]|[\xa1-\xaf][\xa1-\xfe])";
my $KIGOU = "(?:[\xa1\xa2\xa6-\xa8][\xa1-\xfe])";
my $ALNUM = "(?:\xa3[\xa1-\xfe])";
my $CHOON    = "(?:[\xa1][\xbc])";
my $HIRAGANA = "(?:(?:[\xa4][\xa1-\xf3])|$CHOON)";
my $KATAKANA = "(?:(?:[\xa5][\xa1-\xf6])|$CHOON)";
my $KANJI    = "(?:[\xb0-\xfe][\xa1-\xfe]|\xa1\xb9)";
my %dict;

unless (defined($ARGV[0])) {
    print STDERR <<USAGE;
    usage: wakati [-i] <kakasidict>
       -i: convert  KAKASI dictionary into a DBM file
    example: cat hoge.txt | wakati kakasidict > kekka.txt

USAGE
    exit 1;
}

STDIN->autoflush(1);

if ($ARGV[0] eq "-i") {
    shift @ARGV;
    tie %dict, "SDBM_File", $ARGV[0], O_RDWR|O_CREAT, 0666 or
	    die "$!\n";
    load_dict($ARGV[0]);
    exit;
}

tie %dict, "SDBM_File", $ARGV[0], O_RDWR|O_CREAT, 0666 or
	    die "$!\n";

main();

sub main() {
    my $content = join('', <STDIN>);
    while (1) {
#	if ($content =~ /\G($KANJI(?:$KANJI|$HIRAGANA)*)(\s*)/gc) {
	if ($content =~ /\G($KANJI+)\s*/gc) {
	    print wakatize($1), $2 ? $2 : " ";
	} elsif ($content =~ 
		 /\G
		 ([\x21-\x7e]+|$HIRAGANA+|$KATAKANA+|$ALNUM+|$KIGOU+|\S+)
		 (\s*)
		 /gcx) 
	{
	    print $1, $2 ? $2 : " ";
	} elsif ($content =~ /\G(\s+)/gc) {
	    print $1;
	} else {
	    last;
	}

    }
}
untie %dict;

sub wakatize($) {
    my ($string) = @_;
    my $rest_string = $string;
    my @parts = ();

    if (length($string) <= 4) { # too short to wakatize
	return $string;
    }
    while (length($rest_string) > 0) {
	my $tmp = $rest_string;
	my $try = "";
	my $matched_part;

	# get the longest match
	while ($tmp =~ /\G($CHAR)/gc) { 
	    $try .= $1;
	    if (defined($dict{$try})) {
		$matched_part = $try;
	    } 
	}
	if (defined($matched_part)) {  # matched!
	    $rest_string =~ s/^$matched_part//;
	    push(@parts, $matched_part);
	} else {
	    last;
	}
    }
    push(@parts, $rest_string) if $rest_string;
    join(' ', @parts);  # return with value
}

sub load_dict($)
{
    my ($dictfile) = @_;
    my ($fh) = new IO::File;

    $fh->open("$dictfile") || die "$!: $dictfile\n";
    while (<$fh>) {
	next if /^;/;

	/^(.*?) +(.*)$/;
	if (defined($dict{$2})) { # conflict
#	    print STDERR "'$2 ($1)' is already defined as '$2 ($dict{$2})'!\n";
	    next;
	}
	$dict{$2} = 1;
    }
}

Follow-Ups:
- wakatigaki with perl (Re: HTML detection)
  - From: Satoru Takabayashi

References:
- HTML detection
  - From: SHIOZAKI Takehiko

Prev by Date: Re: <UL> inside <P>
Next by Date: wakatigaki with perl (Re: HTML detection)
Previous by thread: HTML detection
Next by thread: wakatigaki with perl (Re: HTML detection)
Index(es):
- Date
- Thread