Namazu-devel-ja(旧)


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: HTML splitting



Satoru Takabayashi <satoru-t@xxxxxxxxxxxxxxxxxx> wrote:

>だけです。ひとまず「HTMLファイルを処理する際に適切に分割する」
>部分を独立のプログラムとして書いてみます。

さっそく、いい加減なツールを作りました。お試しください。

  % ls
  htmlsplit    manual.html

  % perl htmlsplit < manual.html 

  % ls
  TOP.html              mailutime.html        query-notes.html
  bnamazu.html          manual.html           query-or.html
  cgi.html              mknmz-option.html     query-phrase.html
  components.html       mknmz.html            query-regex.html
  default-index.html    mknmzrc.html          query-substring.html
  doc-filter.html       namazu-option.html    query-term.html
  form-idxname.html     namazu.html           query.html
  form-idxnames.html    namazurc.html         rfnmz.html
  form-lang.html        nmzgrep.html          setting.html
  form-subquery.html    query-and.html        template.html
  form.html             query-field.html      tools.html
  gcnmz.html            query-grouping.html   vfnmz.html
  htmlsplit             query-not.html

  % lynx TOP.html

# manual.html 以外ではテストしていません

-- Satoru Takabayashi

#! /usr/local/bin/perl
use strict;
use FileHandle;

my $Header = << 'EOS';
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>${subject}</title>
</head>
<body>
<h1>${subject}</h1>
<hr>
EOS

    my $Footer = << 'EOS';
<hr>
<address>
    ${author}
</address>
</body>
</html>
EOS

my $cont   = join '', <>;
my $Title  = get_title(\$cont);
my $Author = get_author(\$cont);

my $Name   = "TOP";

$cont =~ s/(<a\s[^>]*href=(["']))#(.+?)(\2[^>]*>)/$1$3.html$4/g;
$cont =~ s/\G(.+?)<a\s[^>]*name=(["'])(.+?)\2[^>]*>/
           write_partial_file($1, $3)/sgex;

write_partial_file($cont, "");

sub get_title ($) {
    my ($contref) = @_;
    my $title = undef;
    
    if ($$contref =~ s!<TITLE[^>]*>([^<]+)</TITLE>!!i) {
	$title = $1;
	$title =~ s/\s+/ /g;
	$title =~ s/^\s+//;
	$title =~ s/\s+$//;
    } else {
	$title = "no title";
    }

    return $title;
}

sub get_author ($) {
    my ($contref) = @_;

    my $author = "unknown";

    # <LINK REV=MADE HREF="mailto:ccsatoru@xxxxxxxxxxxxxxxxxx">

    if ($$contref =~ m!<LINK\s[^>]*?HREF=([\"\'])mailto:(.*?)\1\s*>!i) { #"
	$author = $2;
    } elsif ($$contref =~ m!.*<ADDRESS[^>]*>([^<]*?)</ADDRESS>!i) {
	my $tmp = $1;
#	$tmp =~ s/\s//g;
	if ($tmp =~ /\b([\w\.\-]+\@[\w\.\-]+(?:\.[\w\.\-]+)+)\b/) {
	    $author = $1;
	}
    }
    return $author;
}

sub write_partial_file($$) {
    my ($cont, $name) = @_;
    my $fname = "$Name.html";

#    print STDERR "$fname\n";

    my $fh = new FileHandle;
    $fh->open(">$fname") || die "$fname: $!";

    my $header = $Header;
    $header =~ s/\$\{subject\}/$Title: [$Name]/g;
    print $fh $header;
    print $fh $cont;

    my $footer = $Footer;
    $footer =~ s/\$\{author\}/$Author/g;
    print $fh $footer;

    $Name = $name;
}