I just searched Google for how to convert a scanned document (a typescript) into a document whose characters are recognized as text, just like any other Word document. But of course I forgot that I am using Ubuntu and not Windows. Is it still possible to do the same on Ubuntu? I would really appreciate any help.
Thank you.
2 Answers
Tesseract is one option that worked great for me!
I used it as follows:
Install it, if you don't have it, with:
sudo apt-get install tesseract-ocr
Then:
Convert the .JPG scanned file to .tif (this is the format Tesseract requires). This is done with ImageMagick as follows:
convert foo.JPG foo.tif
Now simply let Tesseract do its magic:
tesseract foo.tif foo
(will save output to foo.txt)
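The two steps above can be wrapped into a single helper for one file. This is only a minimal sketch: the function name ocr_image is made up for illustration, and it assumes the convert (ImageMagick) and tesseract commands are installed and on the PATH.

```shell
#!/bin/bash
# ocr_image: hypothetical helper (not part of Tesseract or ImageMagick)
# that runs the convert + tesseract steps for a single scanned image.
# Usage: ocr_image foo.JPG   (writes foo.tif and foo.txt next to the input)
ocr_image() {
    local src="$1"
    local base="${src%.*}"                   # strip the extension: foo.JPG -> foo
    convert "$src" "$base.tif" || return 1   # ImageMagick: JPG -> TIFF
    tesseract "$base.tif" "$base"            # Tesseract appends .txt to the output base
}
```

Quoting the variables keeps the helper usable with file names that contain spaces, for example ocr_image "page 1.JPG".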
I recently had to convert an old manual with multiple (36) pages to something digital. I whipped up a Bash script to do it.
Code here:
#!/bin/bash
# makeDoc.sh
# Turn a set of scanned JPG pages into a single document file.
# Requires the ImageMagick and Tesseract packages.
# Author: Fred Fury
echo "makeDoc.sh"
echo "Convert a set of scanned JPG pages into a single document file."
echo "Starting up..."
for i in {01..36}
do
    echo "converting $i.JPG to $i.tif..."
    convert "$i.JPG" "$i.tif" # Convert the file to a Tesseract-usable format
    tesseract "$i.tif" "$i" &>/dev/null # Convert the tif to txt, discarding Tesseract's console output
done
echo "Merging files into Output.doc"
cat *.txt > Output.doc # Merge all the generated txt files into a single file
echo "Done."
Also check out this page for some other solutions: What's the best, simplest OCR solution? This is where I found Tesseract.
Hope that helps!
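The script above hard-codes 36 pages named 01.JPG to 36.JPG. If your scans are named differently, a variant that takes the output file and the page images as arguments might look like the sketch below. The function name make_doc and its argument order are assumptions made for illustration, and it again assumes convert and tesseract are installed.

```shell
#!/bin/bash
# make_doc: hypothetical generalization of makeDoc.sh.
# Usage: make_doc Output.txt 01.JPG 02.JPG ...
# Pages are appended to the output file in the order they are given.
make_doc() {
    local out="$1"
    shift
    : > "$out"                                   # start with an empty output file
    local page base
    for page in "$@"; do
        base="${page%.*}"                        # 01.JPG -> 01
        echo "converting $page..."
        convert "$page" "$base.tif" || return 1  # ImageMagick: JPG -> TIFF
        tesseract "$base.tif" "$base" >/dev/null 2>&1 || return 1
        cat "$base.txt" >> "$out"                # append this page's recognized text
    done
    echo "Done."
}
```

For example, make_doc manual.txt *.JPG processes every JPG in the current directory in the shell's sorted order. The result is plain text, which a word processor such as LibreOffice Writer can open and save in another format.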
I had a similar problem to this a while ago. Try uploading the file to online-convert.com. It will take a while, but the webapp can handle just about any format. Good luck!