$ brew install tesseract
$ brew install tesseract-lang
$ convert -density 150 ./textbook.pdf ./textbook.jpg
$ /bin/ls textbook*.jpg | sed 's/^/.\//' | sort -t - -k 2 -g > ./input.txt
$ tesseract ./input.txt hsk2txtocr_jpg -l eng+chi_sim PDF
Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can also be trained to recognize other languages.
$ brew install tesseract
$ tesseract --help
Tesseract only works on images (not on entire documents like PDFs), so we’ll address that in a later step.
By default, Tesseract doesn’t come with any non-English language packs.
$ brew install tesseract-lang
$ brew install imagemagick # i think this will do it!
$ brew install ghostscript # allows convert to handle PDFs, probably
$ convert -density 150 ./book.pdf ./book.jpg
I really don’t know the full gamut of options available in ImageMagick, but you can end up with some truly gigantic PDFs if you don’t spend some time tweaking them (the -density value is the obvious place to start). As a rough guide, the sum of your image sizes gives you an idea of how huge the resulting PDF will be.
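To put a number on that before running the OCR, something like the following should do. It is only a minimal sketch: total_bytes is a made-up helper name, not part of ImageMagick or Tesseract, and the total is a rough guide to the final PDF size rather than an exact prediction.

```shell
# total_bytes FILE...
# Print the combined size, in bytes, of the given files.
# (Hypothetical helper; it reads every file, which is fine for a one-off check.)
total_bytes() {
  cat "$@" | wc -c | tr -d ' '
}

# Example, once convert has produced the page images:
#   total_bytes book*.jpg
```

If the number looks alarming, lowering -density and converting again is probably the quickest thing to try.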
$ /bin/ls book*.jpg | sed 's/^/.\//' | sort -t - -k 2 -g > ./input.txt
Get a listing of the images to stuff into your PDF. My version of ls gave me xargs-style output and I didn’t want to read the manpages, so I just formatted it the way I wanted. The sort command is there so that the JPGs (which convert spits out as book-0.jpg, book-1.jpg, etc.) are in the correct page order; a plain alphabetical sort would put book-10.jpg ahead of book-2.jpg.
Spit the list into a text file, one image path per line, because that’s what tesseract expects if you’re outputting a multi-page PDF (as far as I know!).
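Here is that sort in isolation, as a small sketch with made-up page names. sort_pages is a hypothetical wrapper around the same sort call used in the command above, nothing more.

```shell
# sort_pages: read image paths on stdin, print them in page order.
# Splits each path on "-" and compares the second field numerically,
# so "./book-2.jpg" sorts before "./book-10.jpg".
sort_pages() {
  sort -t - -k 2 -g
}

# Sample names only; the real ones come from convert.
printf '%s\n' ./book-10.jpg ./book-2.jpg ./book-1.jpg | sort_pages
```

This relies on the page number being the second hyphen-separated field, so a hyphen elsewhere in the path (say, a directory called my-scans) would throw it off. If a shell glob is enough for you, printf '%s\n' ./book-*.jpg | sort_pages > ./input.txt should produce the same list without the ls and sed steps.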
$ tesseract ./input.txt hsk2txtocr_jpg -l eng+chi_sim PDF
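Specify the language with -l LANG[+LANG]. For a listing of all the available languages, use tesseract --list-langs (but the tesseract-lang package has to be installed for the non-English ones to work!). The trailing PDF argument asks for a searchable PDF instead of plain text, written to hsk2txtocr_jpg.pdf.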
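A missing language pack is an easy thing to trip over, so it can be worth checking before a long OCR run. The sketch below assumes tesseract --list-langs prints one language code per line (any header line is harmlessly ignored); missing_langs is a hypothetical helper, not a Tesseract command.

```shell
# missing_langs SPEC
# Read installed language codes on stdin (one per line) and print every
# language in SPEC (e.g. "eng+chi_sim") that is not among them.
missing_langs() {
  installed=$(cat)
  printf '%s\n' "$1" | tr '+' '\n' | while read -r lang; do
    printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
  done
}

# Example:
#   tesseract --list-langs | missing_langs eng+chi_sim
```

No output means every requested language is installed; anything printed is a language that still needs installing.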