tl;dr:

$ brew install tesseract
$ brew install tesseract-lang
$ convert -density 150 ./textbook.pdf ./textbook.jpg
$ /bin/ls textbook*.jpg | sed 's|^|./|' | sort -t - -k 2 -g > ./input.txt
$ tesseract ./input.txt hsk2txtocr_jpg -l eng+chi_sim pdf

ntl;wr:

Install Tesseract

Tesseract is an OCR engine with Unicode (UTF-8) support. Trained data is available for more than 100 languages, and it can be trained to recognize others.

$ brew install tesseract
$ tesseract --help

It only works on images (not on entire documents like PDFs), so we’ll address that in a later step.

Install Tesseract language packs

By default, Tesseract doesn’t come with any non-English language packs.

$ brew install tesseract-lang

Use ImageMagick to extract your images

$ brew install imagemagick # i think this will do it!
$ brew install ghostscript # allows convert to handle PDFs, probably
$ convert -density 150 ./book.pdf ./book.jpg

I really don’t know the full gamut of options available in ImageMagick, but you can end up with some truly gigantic PDFs if you don’t spend some time tweaking settings like -density. As a rough guide, the sum of your image sizes gives you an idea of how huge the resulting PDF will be.
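
If you want a number rather than a guess, here’s a minimal sketch for totalling the image sizes. total_bytes is a made-up helper name, not part of ImageMagick or Tesseract, and the result is only a rough estimate of the final PDF size.

```shell
# Hypothetical helper: print the combined size, in bytes, of the files passed in.
total_bytes() {
  # cat streams every file; wc -c counts the bytes; tr strips the padding
  # that some wc implementations (e.g. on macOS) put around the number.
  cat -- "$@" | wc -c | tr -d ' '
}
```

Something like this should then work once convert has written the page images:

$ total_bytes ./book-*.jpg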

OCR the images and stuff into a PDF

$ /bin/ls book*.jpg | sed 's|^|./|' | sort -t - -k 2 -g > ./input.txt

Get a listing of the images to stuff into your PDF. My version of ls gave me xargs-style output and I didn’t want to read the man pages, so I just formatted it the way I wanted. The sort command is there so that the JPGs (which convert spits out as book-0.jpg, book-1.jpg, etc.) end up in the correct page order.

Spit it into a text file, because that’s what tesseract expects if you’re outputting a PDF (as far as I know!).
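
For what it’s worth, the same list can probably be built without parsing ls output. This is only a sketch: list_pages is a hypothetical helper, and it assumes the images are named PREFIX-N.jpg (the pattern convert produces) with no other hyphen in the prefix.

```shell
# Hypothetical helper: print ./PREFIX-N.jpg paths, one per line, in page order.
# Assumes PREFIX contains no hyphen, since sort splits fields on "-".
list_pages() {
  prefix=$1
  for f in ./"$prefix"-*.jpg; do
    # Skip the unexpanded pattern when nothing matches.
    [ -e "$f" ] || continue
    printf '%s\n' "$f"
  done | sort -t - -k 2 -n   # numeric sort on the part after the hyphen
}
```

With the helper defined in your shell, the listing step would look like:

$ list_pages book > ./input.txt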

Convert!

$ tesseract ./input.txt hsk2txtocr_jpg -l eng+chi_sim pdf

Specify the language with -l LANG[+LANG]. For a listing of the languages you have installed, use tesseract --list-langs (the tesseract-lang package has to be installed for the non-English ones to show up and work!).
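
To check a -l value before kicking off a long OCR run, something like the sketch below might help. missing_langs is a hypothetical helper, not a Tesseract feature. It assumes the installed languages arrive on stdin one per line, which is how recent versions of tesseract --list-langs appear to print them (after a header line, which simply never matches).

```shell
# Hypothetical helper: given a -l spec such as eng+chi_sim, print every
# language in it that is NOT found in the installed list read from stdin.
missing_langs() {
  spec=$1
  installed=$(cat)   # slurp the installed-language list from stdin
  printf '%s\n' "$spec" | tr '+' '\n' | while IFS= read -r lang; do
    # -F: fixed string, -x: must match a whole line
    printf '%s\n' "$installed" | grep -Fx -- "$lang" > /dev/null ||
      printf '%s\n' "$lang"
  done
}
```

No output means every requested language was found:

$ tesseract --list-langs | missing_langs eng+chi_sim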