
Ediscovery: OrcaTec granted patent on language modeling concept search

OrcaTec’s Herbert Roitblat has been granted another patent, this one with OrcaTec for the use of language modeling as a basis for concept search. Concept search seeks out ideas in context in large data collections rather than using the more common but notoriously less effective keyword search method.

Roitblat, CTO & Chief Scientist for OrcaTec and the architect of the OrcaTec Document Decisioning Suite, built the Suite using the language modeling on which US Patent No. 8,401,841 is based. “OrcaTec’s predictive coding, advanced analytics, review – all the pieces of the Document Decisioning Suite are based on language modeling and using words in context,” said Dr. Roitblat. “The ideas embodied in this patent are the building blocks that make our Suite work so well. We are gratified that the US Patent Office finds it to be markedly unique.”

What is concept search?

One of OrcaTec’s commonly used examples of concept search in context is the word “court.”  “If you hear the word ‘court,’ you don’t know what that means,” says Roitblat.  “If you hear ‘court, blah, blah, basketball’ or ‘court, blah, blah, judge,’ then you know what ‘court’ means. We use the same kind of process to understand what words mean in the documents that we’ve indexed. This eliminates the ambiguity you have in keyword search.”

Language modeling also uses a fill-in-the-blank probabilistic approach. In the patent application, Dr. Roitblat noted that individual documents contain only a limited and fixed set of words, but there are often many other words that could have appeared in the same place. People write sentences like “The boy skateboarded down the street.” People do not write sentences like “The [boy, child, youth, young person, kid] skateboarded down the street [alley, road, parkway, boulevard, blacktop].” The language modeling method recognizes that any of the words in brackets could have been used, but in any particular sentence only one of them is.

A given query and a given document typically use only one of the alternatives from the distribution of words, but it is difficult to predict which one. As a result, OrcaTec’s language modeling method searches for a distribution of words, rather than just the specific words in the query.
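The idea of matching documents against a distribution of context words, rather than the literal query terms, can be sketched in a few lines. This is a toy illustration only: the corpus, function names, and scoring are invented for the example and are not OrcaTec's patented implementation.

```python
# Toy sketch of searching with a distribution of words rather than exact
# keywords. Everything here (corpus, scoring) is invented for illustration.
from collections import Counter

documents = {
    "doc1": "the judge entered the court and the trial began",
    "doc2": "the team ran onto the court before the basketball game",
    "doc3": "the judge ruled and the court adjourned the trial",
}

def context_distribution(term, docs):
    """Count the words that co-occur with `term` across the document set."""
    counts = Counter()
    for text in docs.values():
        words = text.split()
        if term in words:
            counts.update(w for w in words if w != term)
    return counts

def score(doc_text, distribution):
    """Score a document by how much of the query's context it shares."""
    words = set(doc_text.split())
    return sum(n for w, n in distribution.items() if w in words)

# Expand the query term "judge" into its context distribution, then rank
# documents against that distribution instead of the bare keyword.
dist = context_distribution("judge", documents)
ranking = sorted(documents, key=lambda d: score(documents[d], dist), reverse=True)
```

Even this crude version ranks the basketball document last for a judicial query, because the ranking is driven by shared context words rather than the presence of any single term.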

Additionally, the language modeling method of concept search that Dr. Roitblat has patented does not use dictionaries, ontologies or thesauri. Instead it analyzes the way words are used in the context of the entire document set being reviewed by using the probability of word co-occurrence within a paragraph. Thus, it is language agnostic, including non-alphabetic languages. “OrcaTec has successfully used our concept search and predictive coding in many languages, including Arabic, German, Japanese and, of course, English,” adds Roitblat.
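The paragraph-level co-occurrence probability described above can be sketched as a simple conditional probability, P(w2 | w1) = count(paragraphs containing both) / count(paragraphs containing w1). This is a minimal assumption-laden sketch, not the patented method; the only language-specific step is tokenization, which is why such an approach can be language agnostic.

```python
# Minimal sketch of paragraph-level word co-occurrence probability.
# The corpus and function names are invented for illustration; only
# whitespace tokenization is assumed, so any tokenizable language works.
from collections import Counter
from itertools import permutations

paragraphs = [
    "court judge trial verdict",
    "court basketball game score",
    "judge trial court verdict",
]

word_count = Counter()   # paragraphs containing each word
pair_count = Counter()   # paragraphs containing each ordered word pair
for para in paragraphs:
    tokens = set(para.split())          # unique words in this paragraph
    word_count.update(tokens)
    pair_count.update(permutations(tokens, 2))

def p_cooccur(w1, w2):
    """P(w2 appears in a paragraph | w1 appears in that paragraph)."""
    return pair_count[(w1, w2)] / word_count[w1]
```

Here “court” co-occurs with “judge” in two of its three paragraphs but with “basketball” in only one, so the model infers the judicial sense dominates this particular collection, with no dictionary or thesaurus involved.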

COMMENT: Ediscovery has always been a hotbed of patent claims and controversies. Last week (16 March) saw section 3 of the America Invents Act come into effect. This has changed the focus on patent registration from first-to-invent to first-inventor-to-file.

3 replies on “Ediscovery: OrcaTec granted patent on language modeling concept search”

If it does not use dictionaries, ontologies or thesauri, how does it come up with the words to substitute in the brackets? Or is that not really the point? Is the magic that it does not need to know what words to substitute?

Hi John – I’m trying to get an answer for you on this point (I thought ontology was something to do with dentists!)

OrcaTec have now replied…

That is the magic, Charles. OrcaTec learns words in context within the individual document set through indexing. Every document set is a new set of relationships. That’s why it works in Japanese and Arabic and whatever else we put in front of it. Completely language agnostic. It would work in a numeric language, too, apparently.

It sounds unlikely, but it’s very mathematical, and when engineers see the formulas/algorithms there’s a lot of “aha!”

It works on the same principles as the dolphin biosonar that Herb used to study. He just morphed those concepts into the word game.

And that unique approach is why he got the patent!

Our predictive coding works the same way. The system randomly selects documents and learns “he wants them to look like this, not like that.” Basketball court documents out, judicial court documents in. It doesn’t have a definition of the word per se; it just learns what the word means in the context in which it is presented.
