namazu-ml(ring)


Re: Comment



Koji Kishi <kis@xxxxxxxxxxxxxx> wrote:

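>I noticed this in Namazu v1.3.0.11 (it may have been this way all along):
>when the HTML contains code like the following,
(snip)
>everything from "=3 ) { return true;}" on line 7 onward seems to be treated as body text.
>Perhaps a comment is being treated as running only from
>	<!-- up to the next >
>
>and no further?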

Ah, sorry about that. HTML is stripped rather sloppily, using
regular expressions. For now, in mknmz, change the function

  sub erase_html_tags ($) {
      my ($contents) = @_;
  
      1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
  }

to

  sub erase_html_tags ($) {
      my ($contents) = @_;

      $$contents =~ s/<!--.*?-->//gs;  # add this line
      1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
  }

and that should work around the problem (though not completely).
Version 2.0, now under development, already does this.
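
As a rough illustration of why the extra substitution helps, here is a
minimal, self-contained sketch. The real tag_to_space_or_null() in mknmz
is not shown in this thread, so the stand-in below (which always returns
a space) is an assumption, and the sample HTML is made up to resemble
the reported case. The names erase_html_tags_old and erase_html_tags_new
are hypothetical labels for the two versions above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for mknmz's tag_to_space_or_null(); the real
# function is not shown in this thread. Here every tag becomes a space.
sub tag_to_space_or_null { return ' ' }

# Original behavior: strip anything that looks like <...>.
sub erase_html_tags_old {
    my ($contents) = @_;
    1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
}

# Workaround: drop <!-- ... --> comments first, then strip tags.
sub erase_html_tags_new {
    my ($contents) = @_;
    $$contents =~ s/<!--.*?-->//gs;
    1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
}

# Made-up sample: a script hidden inside a comment that contains '>'.
my $html = <<'EOT';
<script>
<!--
if (n >=3 ) { return true;}
// -->
</script>
<p>body</p>
EOT

my ($old, $new) = ($html, $html);
erase_html_tags_old(\$old);
erase_html_tags_new(\$new);

# Report whether script text survived the stripping.
print "old: ", ($old =~ /return true/ ? "script text leaks" : "clean"), "\n";
print "new: ", ($new =~ /return true/ ? "script text leaks" : "clean"), "\n";
```

With the old version, the first '>' inside the comment ends the match,
so the rest of the script is left in the text. With the new version the
whole comment is removed before the tags are stripped.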

Incidentally, it is well known that HTML tags cannot be stripped
reliably with regular expressions, as the following post explains.

From: Tom Christiansen <tchrist@xxxxxxxxxxxx>
Newsgroups: comp.lang.perl.misc
Subject: Re: Can't Match Multi-Line Pattern
Date: Fri, 7 Aug 1998 22:38:08 JST
Message-ID: <6qf000$8b4$1@xxxxxxxxxxxxxxxxxxxxxx>

| Question:     Assuming $_ contains HTML, which of
|             the following substitutions will remove all tags in it?
| Type:       Regular Expressions, WWW
| Difficulty: 6/7 (Hard)
| 
| Answer:     You can't do that.
| Correct:    Yes.
| Why:        If it weren't for HTML comments, improperly formatted
|             HTML, and tags with interesting data like <SCRIPT>, you 
|             could do this.  Alas, you cannot.  It takes a lot
|             more smarts, and quite frankly, a real parser.
| 
| 
| Answer:     s/<.*>//g;
| Correct:    No.
| Why:        As written, the dot will not cross newline boundaries, and the 
|             star is being too greedy.  If you add a /s, then yes,
|             it will remove all tags -- and a great deal else besides.
| 
| Answer:     s/<.*?>//gs;
| Correct:    No.
| Why:        It is easy to construct a tag that will cause this to fail,
|             such as the common `<IMG SRC='foo.gif' ALT="> ">' tag.
| 
| Answer:     s/<\/?[A-Z]\w*(?:\s+[A-Z]\w*(?:\s*=\s*(?:(["']).*?\1|[\w-.]+))?)*\s*>//gsix;
| Correct:    No.
| Why:        For a good deal of HTML, this will actually work, but
|             it will fail on cases with annoying comments, poorly formatted
|             HTML, and tags like <SCRIPT> and <STYLE>, which can contain
|             things like `while (<FH>) {}' without those being counted
|             as tags.  Comments that will annoy you include
|                     <!-- <foo bar = "-->">
|             which will remove characters when it shouldn't; it's just
|             a comment followed by `">'.  And even something like this:
|                     <!-- <foo bar = "-->
|             Most browsers will get right, but the substitution will not.
|             And if you have improper HTML, you get into even more
|             trouble, like this:
|                 <foo bar = "bleh" @>
|                 text text text
|                 <foo bar = "bleh">
|             in which case the .*? will gobble up way more than you
|             thought it would.
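
Two of the failure cases quoted above can be reproduced with a short
sketch. strip_nongreedy is the quiz's s/<.*?>//gs wrapped in a function;
strip_quote_aware is an illustrative variant that is not from the quoted
post and is only an assumption about how one might try to patch the
problem. Both function names are hypothetical.

```perl
use strict;
use warnings;

# The quiz's second candidate: non-greedy match up to the first '>'.
sub strip_nongreedy {
    my ($html) = @_;
    $html =~ s/<.*?>//gs;
    return $html;
}

# Illustrative variant (not from the quoted post): skip over quoted
# attribute values so a '>' inside quotes does not end the tag.
sub strip_quote_aware {
    my ($html) = @_;
    $html =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;
    return $html;
}

# Samples taken from the quoted post, with "before"/"after" added
# around the IMG tag to make the leftovers visible.
my $img     = q{before<IMG SRC='foo.gif' ALT="> ">after};
my $comment = q{<!-- <foo bar = "-->">};

for my $sample ($img, $comment) {
    printf "%-40s nongreedy=[%s] quote_aware=[%s]\n",
        $sample, strip_nongreedy($sample), strip_quote_aware($sample);
}
```

On the IMG sample, the non-greedy pattern stops at the '>' inside the
ALT value and leaves ' ">' behind, while the quote-aware variant removes
the whole tag. On the comment sample the situation is reversed: the
quote-aware variant also swallows the trailing '">', which is ordinary
text after the comment. Fixing one case breaks the other, which is the
point of the quiz.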

-- Satoru Takabayashi