How to translate a scanned document
Having an editable source file to work with is a well-known requirement in the translation industry, but it is rarely discussed.
Editable files can easily be translated while maintaining the original format, and they allow a translator to take advantage of all the quality and productivity tools at their disposal (Translation Memories, Glossaries, Do Not Translate lists, automated QA tools, etc.).
However, sometimes the editable source file just doesn’t exist, and we must deal with that.
There can be many reasons for this. Perhaps the file is a scan of a document that only exists in physical form (e.g., an old paper document).
It could also be that the PDF is an export from a computer system (e.g., accounting software). Last but not least, the source files may exist but not be accessible. For example, a graphic designer has them, but you no longer work with this designer.
In this article we will discuss how to translate a scanned document, go over some of the issues of working with scanned copies, and show three alternatives for translating your content.
Not having an editable source file generates the following problems
It disconnects the file from all reference material and quality tools
All modern CAT tools allow translators to have Translation Memories (with all the completed translations for a specific account so far), Glossaries, and DNT lists connected. The translator can therefore see in real time what would be the best term to use.
There are also QA tools that help make sure the translator renders the same term consistently. These tools cannot be used without an editable source file.
Simply translating directly into a different file (commonly an MS Word file) results in a loss of all these features and has a negative impact on the final quality of the translation.
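To make the idea of an automated consistency check concrete, here is a minimal sketch in Python. It is not how any particular CAT tool works; the function name `check_glossary`, the glossary entry, and the sample segments are all assumptions made up for illustration. It only shows why such a check needs the source and target text as editable, aligned segments.

```python
import re

def check_glossary(segments, glossary):
    """Return (segment_index, source_term, expected_target) for each violation.

    segments: list of (source_text, target_text) pairs (hypothetical data)
    glossary: dict mapping a source term to its approved target term
    """
    issues = []
    for index, (source, target) in enumerate(segments):
        for source_term, target_term in glossary.items():
            # Whole-word, case-insensitive match for the term on each side.
            in_source = re.search(r"\b" + re.escape(source_term) + r"\b", source, re.IGNORECASE)
            in_target = re.search(r"\b" + re.escape(target_term) + r"\b", target, re.IGNORECASE)
            # The source uses a glossary term but the approved translation is missing.
            if in_source and not in_target:
                issues.append((index, source_term, target_term))
    return issues

# Illustrative English-to-Spanish example (invented for this sketch).
glossary = {"invoice": "factura"}
segments = [
    ("Please pay the invoice.", "Por favor, pague la factura."),
    ("The invoice is overdue.", "El recibo está vencido."),
]
print(check_glossary(segments, glossary))  # [(1, 'invoice', 'factura')]
```

The second segment is flagged because it translates "invoice" with a term other than the approved one. With a scanned image there are no text segments to compare, so a check like this cannot run at all.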
It lowers productivity
Having to “alt+tab” between the scan and the translation every couple of seconds dramatically slows down the translation process. This, in turn, means less gets done each day, which negatively impacts both cost and delivery times.
Character corruption
There are many reasons why a character gets corrupted or misread as a different character. For example, some scanned documents have a very low resolution and are difficult to convert. This can cause the OCR software to misinterpret certain characters, such as mistaking a “g” for an “8” or an uppercase “O” for a “0”.
Documents that are old and grainy, or that have blemishes on the paper, can also cause issues. Blemishes tend to be interpreted as characters by the OCR software.
Finally, a poorly made scan (e.g., a scan where the paper was not set straight) can also lead to characters being misinterpreted.
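Many of these misreadings can be caught with a quick automated pass before translation begins. The sketch below is a simple heuristic written for this article, not a feature of any specific OCR product: it flags words in which the digits “8” or “0” sit next to a letter, which matches the two confusions mentioned above. The function name `flag_suspicious_tokens` and the sample sentence are assumptions for illustration.

```python
import re

# A letter directly next to "0" or "8" suggests a digit misread as a letter
# (or the reverse). This is a heuristic and will produce false positives,
# e.g. on product codes or paper sizes such as "A0".
SUSPICIOUS = re.compile(r"[A-Za-z][08]|[08][A-Za-z]")

def flag_suspicious_tokens(text):
    """Return the words in OCR output that mix letters with 0 or 8."""
    flagged = []
    for token in text.split():
        # Drop surrounding punctuation so "2008." is treated as "2008".
        word = token.strip(".,;:!?()\"'")
        if SUSPICIOUS.search(word):
            flagged.append(word)
    return flagged

# Invented OCR output containing two typical misreadings.
sample = "The lan8uage of the c0ntract was signed in 2008."
print(flag_suspicious_tokens(sample))  # ['lan8uage', 'c0ntract']
```

Pure numbers such as “2008” are left alone, since they contain no letters. A human reviewer still has to confirm each flagged word against the scan.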
Format loss during OCR
While character recognition software does its best to match the original format, it is usually not enough, and the final product tends to have many flaws.
Even when the format appears to be flawless, it only remains so provided nothing is changed. Translated text is rarely the same length as the source, so as soon as the document is translated the layout breaks down.
Translate a scanned document: 3 possible solutions
We have covered the main problems that scanned copies cause when used as source files for translation. Below are three ways to tackle them, each with its pros and cons.
Full re-creation
The file is fully recreated from scratch using professional graphic design software such as Adobe InDesign or Illustrator. The end result matches the formatting of the original scan exactly and can be used for professional printing.
Pros: the finished product is suitable for all uses. It can be sent for printing and used to create high-resolution PDFs. You also get to keep the recreated editable file, in case you want to translate it into another language later.
Cons: it is more costly, and it takes more time to recreate. If a high-quality format is not needed, it may not be worth it from a cost/benefit standpoint.
Light re-creation/Simple OCR
The file is converted to MS Word with OCR software and only the major differences are corrected. This results in a file that is similar to the original but not an exact copy.
For this option to work, it is important that the format of the original is relatively simple and that there aren’t many tables in the document.
If it is acceptable to deliver a file that has a similar format to the original, but has small differences, this could be the right option for you.
Pros: Lower cost, faster to produce. The results are good enough to be used electronically and can be published on a website or printed on a home printer.
Cons: there will be slight differences from the original. The result won’t have a high resolution and will not be suitable for a professional printing service.
Translating directly to a Word file
As mentioned before, this method is fairly time-consuming and has all the problems discussed above. The formatting of the original is often completely lost.
Pros: a reasonable option when format is not important at all. It is the only solution for handwritten documents.
Cons: the additional effort increases cost and turnaround time, and the format is lost.
Conclusion
Translating scanned documents isn’t impossible; it just requires a bit more work. With the right type of recreation, time, and budget, even the grainiest of scans can be translated.
Part of our nature at Australis is to provide solutions, even when the initial conditions are not ideal. If you have scanned documents you need to translate, please send us an email at production@australis-localization.com and let’s discuss how we can help you.