Ediscovery: OrcaTec granted patent on language modeling concept search
OrcaTec’s Herbert Roitblat has been granted another patent, this one with OrcaTec, for the use of language modeling as a basis for concept search. Concept search looks for ideas in context across large data collections, rather than relying on the more common but notoriously less effective keyword search.
Roitblat, CTO & Chief Scientist for OrcaTec and the architect for the OrcaTec Document Decisioning Suite, built the Suite using the language modeling on which US Patent No. 8,401,841 is based. “OrcaTec’s predictive coding, advanced analytics, review – all the pieces of the Document Decisioning Suite are based on language modeling and using words in context,” said Dr. Roitblat. “The ideas embodied in this patent are the building blocks that make our Suite work so well. We are gratified that the US Patent Office finds it to be markedly unique.”
What is concept search?
One of OrcaTec’s commonly used examples of concept search in context is the word “court.” “If you hear the word ‘court,’ you don’t know what that means,” says Roitblat. “If you hear ‘court, blah, blah, basketball’ or ‘court, blah, blah, judge,’ then you know what ‘court’ means. We use the same kind of process to understand what words mean in the documents that we’ve indexed. This eliminates the ambiguity you have in keyword search.”
Language modeling also uses a fill-in-the-blank probabilistic approach. In the patent application, Dr. Roitblat noted that individual documents contain only a limited and fixed set of words, but there are often many other words that could have appeared in the same place. People write sentences like “The boy skateboarded down the street.” People do not write sentences like “The [boy, child, youth, young person, kid] skateboarded down the [street, alley, road, parkway, boulevard, blacktop].” The language modeling method recognizes that any of the words in brackets could have been used, but in any particular sentence only one of the words is used.
A given query and a given document typically use only one of the alternatives from the distribution of words, but it is difficult to predict which one. As a result, OrcaTec’s language modeling method searches for a distribution of words, rather than just the specific words in the query.
Additionally, the language modeling method of concept search that Dr. Roitblat has patented does not use dictionaries, ontologies or thesauri. Instead, it analyzes how words are used across the entire document set being reviewed, using the probability of word co-occurrence within a paragraph. As a result, it is language agnostic and also works with languages written in non-alphabetic scripts. “OrcaTec has successfully used our concept search and predictive coding in many languages, including Arabic, German, Japanese and, of course, English,” adds Roitblat.
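To make the idea of searching for a distribution of words more concrete, here is a minimal Python sketch. It is not OrcaTec's patented method, only a simplified illustration under stated assumptions: paragraphs are split on whitespace, a tiny hand-picked stop-word list is applied, and the weight of a related word is estimated as the share of paragraphs containing the query term that also contain that word. All names (`tokenize`, `build_cooccurrence`, `expand_query`, `score`, `STOPWORDS`) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Assumption: a tiny stop-word list chosen for this illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to"}


def tokenize(text):
    """Return the set of distinct lowercase words in a text, minus stop words."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return {w for w in words if w and w not in STOPWORDS}


def build_cooccurrence(paragraphs):
    """Count how many paragraphs each word appears in, alone and with others."""
    pair_counts = defaultdict(Counter)
    word_counts = Counter()
    for paragraph in paragraphs:
        words = tokenize(paragraph)
        word_counts.update(words)
        for a, b in permutations(words, 2):
            pair_counts[a][b] += 1
    return pair_counts, word_counts


def expand_query(query, pair_counts, word_counts, top_n=3):
    """Expand a query into weighted words.

    Each query term gets weight 1.0; each of its top_n co-occurring words
    gets weight (paragraphs with both) / (paragraphs with the query term).
    """
    weights = Counter()
    for term in tokenize(query):
        if term not in word_counts:
            continue  # term never seen in the collection
        weights[term] += 1.0
        for other, both in pair_counts[term].most_common(top_n):
            weights[other] += both / word_counts[term]
    return dict(weights)


def score(paragraph, weights):
    """Score a paragraph by summing the weights of the words it contains."""
    return sum(weights.get(w, 0.0) for w in tokenize(paragraph))


PARAGRAPHS = [
    "The judge entered the court.",
    "The court heard the judge and the lawyer.",
    "Players ran across the basketball court.",
]
```

With this toy collection, a query for “judge” also gives weight to “court”, so a paragraph that mentions only the court can still be matched, though it scores lower than one that mentions both. No dictionary or thesaurus is involved: the related words come entirely from the indexed paragraphs, which is why this kind of approach does not depend on any particular language.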
COMMENT: Ediscovery has always been a hotbed of patent claims and controversies. Last week (16 March) saw section 3 of the America Invents Act come into effect. This moved the US patent system from first-to-invent to first-inventor-to-file.