Monday, July 4, 2011

Extracting text from pdf files

I recently found MuPDF, a free PDF viewer and utility package, at www.mupdf.com.
Binaries can be downloaded for Windows and Linux, and it looks like the Mac version can be built from source.

On Ubuntu Linux it can also be installed with the package manager:
sudo apt-get install mupdf-tools

One of the utilities is called pdfdraw. It can render the pages of a PDF file to a few image formats.
The -t command-line switch dumps the text of the pages to standard output. From there it's simply a matter of redirecting to a file:
pdfdraw -t filename.pdf > filename.txt
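
The same redirect can be repeated over a whole directory of PDFs. Here is a minimal sketch of that, assuming pdfdraw is on the PATH and that -t behaves as described above. The helper names txt_name and convert_all and the EXTRACT_CMD variable are made up for this example and are not part of MuPDF. The command is kept overridable because, as far as I know, later MuPDF releases renamed the tool.

```shell
#!/bin/sh
# Sketch only: batch version of "pdfdraw -t filename.pdf > filename.txt".
# Assumption: pdfdraw is installed and -t writes page text to standard output.

# Hypothetical helper: map "report.pdf" to "report.txt".
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Hypothetical helper: extract text from every *.pdf in directory $1
# (defaults to the current directory). EXTRACT_CMD is an illustrative
# override; when unset, the pdfdraw invocation from the post is used.
convert_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If nothing matched, the loop sees the literal pattern; skip it.
        [ -e "$pdf" ] || continue
        # Left unquoted on purpose so "pdfdraw -t" splits into command and flag.
        ${EXTRACT_CMD:-pdfdraw -t} "$pdf" > "$(txt_name "$pdf")"
    done
}
```

After loading these functions into a shell, running convert_all ~/Documents would write one .txt file next to each .pdf in that directory. Existing .txt files with the same names are overwritten.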