Announcing a command-line PDF to text converter
Posted by: Jamal Mazrui
Date Mailed: Thursday, April 22nd 1999 09:06 PM
I've posted the archive at
http://www.empowermentzone.com/pdf2txt.zip
It is about 3 megabytes in size. The included readme.txt file is below. Feel free to use, share, or improve this work, which is based on open source components.

Regards,
Jamal

----------

PDF2TXT is a command-line system for converting one or more files from portable document format (.pdf) to plain text (.txt). It only works when run from a DOS box under Windows.

Suggested installation is as follows. Create a directory called c:\pdf2txt and unarchive all files from pdf2txt.zip into it. Add this directory to the DOS search path, e.g., within the c:\autoexec.bat file. Run the batch file pdf2txt.bat to convert files from .pdf to .txt format (leaving the PDF input files intact). If pdf2txt.bat is run without any parameters (or with a /h or /? parameter), it presents a help screen that explains command syntax.

Suppose the file manual.pdf is to be converted from portable document format to plain text. Enter the following command at the DOS prompt:

pdf2txt manual

Suppose all .pdf files in the current directory are to be converted to text. Enter the command:

pdf2txt *

Please note that the conversion process is relatively slow, that some .pdf files produce no readable text, and that those that do usually require additional manual formatting to be presentable.

PDF2TXT is a combination of free utilities and batch files, offered without technical support. Feedback is welcome, but a response should not be expected. I hope it makes the task of reading PDF files a little easier.

Jamal Mazrui
Email: empower@smart.net

ADDITIONAL NOTES

If the conversion to text is unsatisfactory with pdf2txt.bat, another conversion option is to email the PDF as a binary attachment to the address pdf2txt@sun.trace.wisc.edu or pdf2html@sun.trace.wisc.edu, according to whether a reply is desired in text or HTML format.
Conversion is also possible using a web-based form at the address http://access.adobe.com where one enters the Internet address of a PDF to be rendered in the active browser.

pdf2txt.bat uses the batcon.exe utility for support with batch file operations. For the actual conversion, it calls pstotext.exe (PostScript to Text), which in turn calls gs.exe (Ghostscript), as well as dos4gw.exe for a DPMI (DOS Protected Mode Interface) environment.

The line length in converted .txt files may be longer than 80 characters, necessitating a document viewing program that wraps lines appropriately.

pdf2txt.bat can be run from the directory in which it is installed if one does not wish to include this directory in the DOS path. There are many files in that directory, however, so it is more difficult to review what files were converted. The only directories required in the DOS path when running pdf2txt.bat from another directory are its installation directory and the directory in which Windows is installed (generally c:\windows).

If pdf2txt.bat is interrupted before it completes processing, one or more of the PDF files may be left with temporary .tmp extensions rather than the original .pdf extensions.

Below is the help screen displayed by running pdf2txt.bat without parameters:

Converts a file in portable document format with a .pdf extension to a text file with the same base name and a .txt extension. Supply the base name of the input file as a parameter, e.g.,

pdf2txt basename

produces the output file basename.txt from the input file basename.pdf

Multiple files may be processed with a single command by using the * and ? wild card characters. For example, the command

pdf2txt *

will convert all .PDF files in the current directory. Remember that a .pdf is automatically added to the file specification provided on the command line, so that *.pdf is actually being processed in this case.
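
For illustration only, and not part of the PDF2TXT package: the file specification rule described in the help screen (append .pdf to whatever is typed, then write a .txt file with the same base name) can be sketched in Python. The function name expand_spec and its directory parameter are hypothetical, and the sketch only computes file names; it does not perform any conversion. Note also that DOS matches file names without regard to case, whereas this sketch follows the rules of whatever system it runs on.

```python
import glob
import os


def expand_spec(spec, directory="."):
    """Return (input_pdf, output_txt) pairs for a pdf2txt-style specification.

    Assumption: this mirrors the rule stated in the help screen, where ".pdf"
    is appended to the specification and the output keeps the same base name
    with a ".txt" extension. It is not the batch file's actual code.
    """
    # "manual" becomes "manual.pdf"; "*" becomes "*.pdf".
    pattern = os.path.join(directory, spec + ".pdf")
    pairs = []
    for pdf_path in sorted(glob.glob(pattern)):
        base, _ext = os.path.splitext(pdf_path)
        pairs.append((pdf_path, base + ".txt"))
    return pairs


if __name__ == "__main__":
    # List what a command such as "pdf2txt *" would process here.
    for source, target in expand_spec("*"):
        print(source, "->", target)
```

Calling expand_spec("manual") in a directory that contains manual.pdf yields a single pair naming manual.pdf and manual.txt; a specification that matches nothing yields an empty list.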
Copyright 1999 by Access Success

----------
End of Document
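
A possible follow-up to the note about interrupted runs, offered as a sketch rather than as anything shipped with PDF2TXT: if pdf2txt.bat leaves PDF files with .tmp extensions, they can be renamed back by script. The function name restore_interrupted is hypothetical. The sketch assumes the leftover file keeps its original base name, and it only renames files whose first bytes are the usual "%PDF-" signature, so unrelated .tmp files are left alone.

```python
import os


def restore_interrupted(directory="."):
    """Rename leftover *.tmp files that look like PDFs back to *.pdf.

    Assumptions (not confirmed by the readme): the leftover file keeps its
    base name, and a genuine PDF starts with the bytes "%PDF-".
    Returns the list of restored .pdf file names.
    """
    restored = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".tmp":
            continue
        tmp_path = os.path.join(directory, name)
        pdf_path = os.path.join(directory, base + ".pdf")
        # Skip directories, and never overwrite an existing .pdf file.
        if not os.path.isfile(tmp_path) or os.path.exists(pdf_path):
            continue
        with open(tmp_path, "rb") as handle:
            header = handle.read(5)
        if header == b"%PDF-":
            os.rename(tmp_path, pdf_path)
            restored.append(base + ".pdf")
    return restored


if __name__ == "__main__":
    for name in restore_interrupted("."):
        print("restored", name)
```

Checking the signature before renaming is a cautious design choice: other programs also create .tmp files, and renaming those to .pdf would only cause confusion on the next conversion run.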

