Test results comparing Abbyy FineReader and Cognitive ScanPack. Recognized text is highlighted in purple.
With MRC technology, the image goes through a stage called "layer separation" before compression: the page is split into three kinds of regions (text; pictures such as images, diagrams and graphs; and areas filled with a single color), and each layer is then handled independently by its own compression algorithm. Abbyy's solutions also apply ADRT (Adaptive Document Recognition Technology), which allows documents with complex formatting to be processed.
Vladimir Arlazarov, head of the Laboratory of the Moscow Institute of Theoretical and Practical Physics, said that the PDF/A format for compressing images and storing documents is already used by many developers in their products and technologies. In particular, MRC (Mixed Raster Content) technology extends the approach used in the DjVu format. With MRC, geometric segmentation based on recognition technology is performed: the image is separated into graphic layers (pictures and text), which are then compressed with different algorithms.
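The layer separation described above can be illustrated with a minimal sketch. It assumes the common three-layer MRC model (a binary mask plus a foreground and a background layer) and uses a plain luminance threshold to decide which pixels count as text. The function names, the threshold of 128 and the white fill color are illustrative assumptions, not details of ScanPack or FineReader, which rely on recognition rather than a fixed threshold.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels

WHITE: Pixel = (255, 255, 255)


def luminance(pixel: Pixel) -> int:
    """Approximate perceived brightness (0-255) of an RGB pixel."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def split_layers(image: Image, threshold: int = 128):
    """Split an image into (mask, foreground, background).

    mask[y][x] is 1 where the pixel is dark enough to be treated as text.
    The foreground keeps only masked pixels, the background keeps the rest;
    the holes in each layer are filled with white (an assumption).
    """
    mask, foreground, background = [], [], []
    for row in image:
        mask_row = [1 if luminance(p) < threshold else 0 for p in row]
        mask.append(mask_row)
        foreground.append([p if m else WHITE for p, m in zip(row, mask_row)])
        background.append([WHITE if m else p for p, m in zip(row, mask_row)])
    return mask, foreground, background


def merge_layers(mask, foreground, background) -> Image:
    """Recombine the layers: the mask selects foreground or background."""
    return [
        [f if m else b for m, f, b in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


# A tiny 2x2 "scan": dark ink pixels on a yellowish paper background.
page: Image = [
    [(0, 0, 0), (250, 240, 200)],
    [(200, 30, 30), (10, 10, 10)],
]
mask, fg, bg = split_layers(page)
restored = merge_layers(mask, fg, bg)
```

Because the mask records which layer each pixel came from, the page can be rebuilt after the layers have been stored separately. In a real MRC pipeline each layer would be passed to a different codec before storage.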
According to Arlazarov, this approach has a major drawback: if the system cannot recognize an object (text on a picture, a seal or signature over printed text, a poor-quality copy, yellowed books or newspapers), it is processed as an image, and the processed document can no longer be searched by its content.
Arlazarov explains that Cognitive ScanPack applies geometric and color segmentation, which splits a document into several layers of information so that text can be processed even where it is overlapped or overwritten, crossed out, or degraded by copying or stains. Separating a document into independent layers is important when the paper background is as significant as the content, as in passport handling.
Also, according to Arlazarov, "the binarization methods ScanPack uses to recover text increase the image quality of the text in the final document compared to the original document". After that, each layer of information is compressed by the algorithm best suited to it (text in TIFF format, images in JPG format).
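To see why a binarized text layer compresses well on its own, consider the following minimal sketch. It binarizes one row of grayscale pixels with a fixed threshold and then applies simple run-length encoding. Both choices are assumptions made for illustration: the article does not say which binarization method or which TIFF compression scheme ScanPack uses, and run-length encoding stands in here for bilevel codecs in general.

```python
from typing import List, Tuple


def binarize(gray_row: List[int], threshold: int = 128) -> List[int]:
    """Map grayscale values (0-255) to 1 (ink) or 0 (paper)."""
    return [1 if value < threshold else 0 for value in gray_row]


def run_length_encode(bits: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs: List[List[int]] = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([bit, 1])     # start a new run
    return [(bit, length) for bit, length in runs]


def run_length_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (bit, run_length) pairs back into the original bit row."""
    bits: List[int] = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


# One scan line: light paper, a dark stroke, then light paper again.
scan_line = [250, 245, 30, 20, 25, 240, 250, 250]
bits = binarize(scan_line)
runs = run_length_encode(bits)
```

Eight noisy grayscale values reduce to three runs, and decoding the runs reproduces the binary row exactly. A photographic layer has no such long uniform runs, which is why it is better served by a lossy codec such as JPEG.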
Nikolai Nikolsky, vice president of marketing at Cognitive Technologies, asserts that ScanPack-based products will not compete directly with Abbyy's solutions. Vladimir Arlazarov added that ScanPack uses the Cuneiform recognition engine by default, but users can connect Abbyy FineReader instead if they wish.
Interestingly, because ScanPack can identify and separate images of seals and signatures, it inadvertently "abets" the forging of paper documents. Vladimir Arlazarov admits that once ScanPack-based products reach the mass market, falsifying documents will become easy. However, he also noted that anyone proficient in Photoshop can already do this.
Arlazarov said the developers are trying to remove the risk of the technology being abused, either by adding recognizable marks to the output documents or by reducing the quality of reproduced signatures and seals.
Cognitive says the ScanPack system is currently in use at two insurance companies, "Zurich Insurance" and "Renessans Insurance", at the Magnitogorsk Metallurgical Plant, and possibly in the armed forces.
Nikolai Nikolsky said that solutions based on the Cognitive ScanPack platform will go on mass sale in 2011. Nikolsky values the Russian market for "document processing systems" at USD 1 billion (VND 20,833 billion). He added that because there are no equivalent systems, Cognitive ScanPack can capture a significant share of the world market.
Another interesting point is that ScanPack is based mainly on open technologies: the Cuneiform recognition engine, developed by Cognitive and released in 2008 under the free BSD license, and PDF/A, a subset of PDF standardized by ISO. The image recognition and processing components, according to Cognitive, remain under license.