DocsCorp today announced it is launching pdfDocs Content
Crawler OCR, a module in its new integrated analysis, reporting and
processing framework. The release will
integrate with Autonomy iManage 8.2 or higher and
OpenText eDOCS 5.1.05 or higher DMS platforms. Further
DMS and content repository integrations will follow.

Documents often get profiled in the DMS through a variety of workflow loopholes: faxes,
scanners and users profiling email attachments. These image-based
document workflows bypass the OCR processing that would make them
text-searchable. Once in the DMS, these documents become completely invisible to search engines. “Businesses have made considerable investments in DM and search technologies but it is estimated that
10-20% of documents in a DMS are non-searchable. This figure represents
a significant risk to any business. Its reputation and financial
well-being could be impacted simply by failing to produce a specific
document on demand,” says David Woolstencroft, DocsCorp President
Marketing, Sales & Strategy.

pdfDocs Content Crawler provides a framework
for searching an entire DMS database or a subset of documents based on
specific DMS queries. The Content Crawler OCR module identifies
non-searchable content in image files and PDF files, and even looks inside
email attachments. The files are converted to text-searchable PDFs
using DocsCorp’s OCR technology and saved back into the DMS. Content
Crawler can search and convert backlogs of legacy documents as well as
actively monitor newly profiled documents. Woolstencroft adds: “If you don’t know the
extent of the problem, or you are not sure if you have a problem,
DocsCorp invites you to use Content Crawler (trial version mode) to
provide an audit report of your DMS documents.”

Ben Mitchell, the recently appointed DocsCorp V-P EMEA, added: “The new product being released is called pdfDocs Content Crawler and is designed to address the problem of firms holding documents within their document management systems that do not contain searchable text and are therefore not discoverable. The risk implications of holding such documents are considerable for obvious reasons. In most cases the non-searchable content consists of PDF or TIFF documents that have been scanned or otherwise generated without a text layer. These documents often originate outside the firm and are received as email attachments which then get filed into the DMS. We have also found fax systems to be another major source of non-searchable documents.
 
“pdfDocs Content Crawler contains technology to analyse documents and establish whether they are searchable or not. Documents that are found not to be searchable are then run through an OCR engine and the resulting searchable rendition of the document is added back into the firm’s DMS. We also believe the product will be of use in a litigation support context where firms are analysing electronic discovery bundles but do not know if all of the documents in the bundle are searchable. Many firms have invested heavily in search technology and complex litigation support systems but if the documents these systems are pointed at do not contain searchable text their effectiveness is reduced.”
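
For readers who want a concrete picture of the audit step described above, the following is a minimal illustrative sketch in Python. It is not DocsCorp's implementation and does not use any pdfDocs, iManage or eDOCS API. The `Document` record, the `OCR_CANDIDATE_TYPES` set, and the `is_searchable` and `audit` functions are all hypothetical names, and the sketch assumes some separate extractor has already recovered whatever text each file contains.

```python
from dataclasses import dataclass


# Hypothetical record for illustration only; not a DocsCorp or DMS data structure.
@dataclass
class Document:
    doc_id: str
    file_type: str       # e.g. "pdf", "tiff", "docx"
    extracted_text: str  # text an assumed extractor recovered; "" if none


# Assumption: only image-capable formats are treated as candidates for OCR.
OCR_CANDIDATE_TYPES = {"pdf", "tif", "tiff", "jpg", "jpeg", "png"}


def is_searchable(doc: Document, min_chars: int = 1) -> bool:
    """Treat a document as searchable if it holds at least min_chars non-whitespace characters."""
    visible = "".join(doc.extracted_text.split())
    return len(visible) >= min_chars


def audit(docs: list[Document]) -> dict:
    """Report which candidate documents lack searchable text and what share they represent."""
    needs_ocr = [
        d.doc_id
        for d in docs
        if d.file_type.lower() in OCR_CANDIDATE_TYPES and not is_searchable(d)
    ]
    percent = 100.0 * len(needs_ocr) / len(docs) if docs else 0.0
    return {
        "total": len(docs),
        "needs_ocr": needs_ocr,
        "percent_needing_ocr": percent,
    }
```

Calling `audit` on a list of such records returns the IDs that would be queued for OCR along with their share of the total, which is roughly the kind of figure an audit report would surface. A real crawler would also need to open email attachments and write the converted rendition back to the DMS, which this sketch leaves out.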
 
+ see attached PDF data sheet