

ps/pdf to text


Many of you probably know about this already, but here it is as an FYI.

I was browsing fj.unix for the first time in a while and came across the following.

pstotext is a program that works with Ghostscript (version 3.33 or later)
to extract plain text from PostScript and PDF files (you should have
Ghostscript 3.51 or later for PDF).

There are other programs to do this, but they all fail on numerous common
examples. We've tested pstotext on a wide variety of PostScript files, and
for each of them pstotext could extract the words correctly (though
sometimes the word order was a bit strange). If you find a PostScript file
(generated by a commercial or widely available program) for which pstotext
can't do this, please tell us. The results with PDF are useful, but a
little less reliable.

pstotext works by sending a library, followed by the PostScript file, to
the Ghostscript interpreter. The library intercepts the text rendering
operators and sends information about the text back to pstotext. This
information includes character metrics and encoding vectors, so in most
situations we're able to reconstruct the plain text (converted to ISO
Latin 1 encoding), with correct word breaks and good guesses about line
breaks. It even works for rotated text!
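
To give a feel for how character metrics can be turned back into words,
here is a minimal sketch in Python. It is not pstotext's actual algorithm.
The Glyph record, the rebuild_line function and the gap_ratio threshold are
all hypothetical names and values made up for this illustration. The idea
is simply that a space is inserted wherever the horizontal gap between two
glyphs is large compared with the width of the preceding glyph, and the
result is then forced into ISO Latin 1.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One rendered character (hypothetical record, not pstotext's format)."""
    char: str
    x: float      # left edge of the glyph on the baseline
    width: float  # advance width taken from the font metrics


def rebuild_line(glyphs, gap_ratio=0.3):
    """Join glyphs into text, inserting a space where the gap is large.

    gap_ratio is an assumed tuning knob: a gap wider than this fraction
    of the previous glyph's width is treated as a word break.
    """
    ordered = sorted(glyphs, key=lambda g: g.x)  # left-to-right order
    pieces = []
    prev = None
    for g in ordered:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                pieces.append(" ")
        pieces.append(g.char)
        prev = g
    text = "".join(pieces)
    # Characters outside ISO Latin 1 are replaced with "?".
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    line = [
        Glyph("p", 0.0, 5.0),
        Glyph("s", 5.0, 5.0),
        Glyph("t", 14.0, 5.0),  # 4 units after the end of "s": word break
        Glyph("o", 19.0, 5.0),
    ]
    print(rebuild_line(line))
```

A real extractor would also have to deal with line breaks and rotated
text, which this sketch ignores; it only handles a single horizontal
baseline.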

Here's the pstotext documentation, to give you a better idea of what it does. 


   Best regards,
Ken-ichi Hirose (^^)k!
e-mail: hirose@xxxxxxxxxxxxxxxxxxxx