Data Access Security Text Search

The Data Classification engine uses Lucene as its primary database, which is optimized for text. The Lucene database provides term-based search capabilities, based on the textual content extracted and analyzed from indexed files.

To extract textual content from various file types and formats, the classification engine uses a proprietary text extraction library, which extracts file content according to the file type. The Lucene indexing service then parses and analyzes the extracted content into an index of searchable terms. The full content of the files is not saved as part of the index, which keeps the index relatively small, efficient, and optimized for term-based textual searches.
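As a rough illustration of a term index that does not keep the full file content, the following Python sketch maps each term to the paths of the files that contain it. It is a simplified model and not Lucene's API; the function names and sample paths are hypothetical, and lowercasing the terms is an assumption of this sketch.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of file paths whose text contains it.

    Only terms and paths are stored; the original text is discarded.
    """
    index = defaultdict(set)
    for path, text in documents.items():
        # Assumption: terms are lowercased runs of letters and digits.
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(path)
    return index

def search(index, term):
    """Return the sorted paths of files that contain the given term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical sample files (path -> extracted text).
docs = {
    "/share/hr/salaries.txt": "Confidential salary report",
    "/share/eng/readme.txt": "Build instructions for the report tool",
}
index = build_index(docs)
print(search(index, "report"))
```

A lookup returns file paths only, which mirrors the point above: the index can answer term queries without holding the files' full content.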

When the system compares a Content Classification request with the textual index, it parses and translates the various policy rules into term-based search queries. Query results, which represent the files that meet the rules’ requirements, consist of file names, extensions, and full path locations, along with other attributes. In certain cases, results may include the actual term or phrase that matched the rule-based query, rather than the full content of the file.


Regular-expression-based rules work differently: the engine matches the regular-expression pattern against the file content while it reads the file, rather than comparing the pattern with the term-based index.
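The difference can be sketched in Python: a regular-expression rule is applied to the content as it is read, with no index lookup involved. This is an illustrative model only; the function name and the sample pattern are hypothetical and are not taken from the product.

```python
import io
import re

def stream_matches(pattern, stream):
    """Return True as soon as any line read from the stream matches the pattern."""
    regex = re.compile(pattern)
    for line in stream:  # content is scanned while it is being read
        if regex.search(line):
            return True
    return False

# Hypothetical rule pattern: three digits, two digits, four digits, separated by hyphens.
ID_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

sample = io.StringIO("name: Jane\nid: 123-45-6789\n")
print(stream_matches(ID_PATTERN, sample))
```

Because the pattern is evaluated against raw content, it can match punctuation and spacing that a term index would have discarded. Scanning line by line is a simplification; patterns that span lines would need a different reading strategy.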

Lucene's Indexing Process

While it parses and analyzes the content data, the Lucene index analyzer eliminates white spaces, certain punctuation characters, and “stop-words” from the content. Stop-words are a predetermined set of frequently used words with diminished semantic significance, such as pronouns and prepositions. Lucene filters stop-words to keep the index manageable, to eliminate “white noise,” and to improve search heuristics. Lucene tokenizes file content into searchable terms, splitting the original text at the white spaces and stop-words that it omits. This tokenizing behavior affects how Data Classification policy rules match content.
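The analysis step described above can be approximated with a few lines of Python. This is a simplified model, not Lucene's analyzer: the stop-word list is a small illustrative set, and lowercasing the terms is an assumption of the sketch.

```python
import re

# Illustrative stop-word list only; the engine's actual list is not given here.
STOP_WORDS = {"a", "an", "and", "in", "it", "of", "the", "to", "was"}

def analyze(text):
    """Split text at white space and punctuation, then drop stop-words."""
    # \w+ keeps runs of letters and digits; lowercasing is an assumption.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

terms = analyze("It was the best of times, it was the worst of times.")
print(terms)  # ['best', 'times', 'worst', 'times']
```

Only four terms survive from the twelve-word sentence, which shows how much of a phrase can disappear before it reaches the index.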

Multi-term Phrase-Based Rules

The Data Classification engine allows both single-term keyword searches and multi-term phrase searches. Lucene omits any stop-word contained in a multi-term search phrase. For example, a rule containing the phrase “It was the best of times, it was the worst of times” will classify any file containing the entire sentence, as well as any file containing the remaining terms as a contiguous phrase, such as “best times worst times.” To avoid possible false-positive classification, it is best to restrict multi-term phrase searches to meaningful, contiguous terms.
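The false-positive risk can be reproduced with a small Python model of the behavior described above. It is a sketch under stated assumptions: the stop-word list is illustrative, the helper names are hypothetical, and the positions of removed stop-words are ignored, which follows this section's description rather than any particular Lucene configuration.

```python
import re

# Illustrative subset of stop-words; not the engine's actual list.
STOP_WORDS = {"it", "of", "the", "was"}

def terms_of(text):
    """Lowercase, split at non-word characters, and drop stop-words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]

def phrase_rule_matches(rule_phrase, file_text):
    """True if the rule's remaining terms appear contiguously in the file's terms."""
    wanted = terms_of(rule_phrase)
    found = terms_of(file_text)
    if not wanted:
        return False  # a phrase made only of stop-words matches nothing in this model
    width = len(wanted)
    return any(found[i:i + width] == wanted
               for i in range(len(found) - width + 1))

RULE = "It was the best of times, it was the worst of times"
print(phrase_rule_matches(RULE, "best times worst times"))
```

In this model the rule matches both the full sentence and the bare sequence “best times worst times,” while a file containing only “the worst of times” does not match.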

Chinese and Logogrammatic Languages

Some scripts, such as Chinese, represent words by symbols (logograms), and a single word may consist of one or more logograms. Furthermore, while most languages use white spaces to separate words, Chinese and other logogrammatic scripts often do not. The combination of these two phenomena, along with Lucene’s omission of white spaces, causes phrase searches in Chinese (with multiple logograms separated by spaces) to return positive matches for files containing the same sequence of logograms, regardless of the spaces between them. Thus, a rule containing the phrase “莦 莚 虙贄 蹝 轈” will classify files containing phrases that consist of these logograms, regardless of spaces. The phrases “莦莚虙贄蹝轈”, “莦莚 虙贄 蹝轈”, and “莦 莚 虙贄 蹝 轈” will therefore all be classified by that rule.

However, single-term keyword searches for words consisting of multiple logograms, and for phrases not separated by spaces, return correct, exact-match results: a rule containing the term “莦莚虙” will only classify files containing that exact term. If more complex phrases are required, a rule containing multiple phrases with the “Contains All” operator will give the desired results.
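The space-insensitive phrase behavior can be modeled in Python by removing white space from both the rule phrase and the file text before comparing them. This is only an approximation of the outcome described above, not of how Lucene tokenizes Chinese text. The function names are hypothetical, and treating “Contains All” as “every phrase must match” is an assumption based on the operator's name.

```python
def strip_spaces(text):
    """Remove all white space, including line breaks and ideographic spaces."""
    return "".join(text.split())

def spaceless_phrase_matches(rule_phrase, file_text):
    """True if the rule's logogram sequence occurs in the file, ignoring spaces."""
    needle = strip_spaces(rule_phrase)
    return bool(needle) and needle in strip_spaces(file_text)

def contains_all(rule_phrases, file_text):
    """Assumed 'Contains All' semantics: every phrase must match the file."""
    return all(spaceless_phrase_matches(p, file_text) for p in rule_phrases)

RULE = "莦 莚 虙贄 蹝 轈"
for candidate in ("莦莚虙贄蹝轈", "莦莚 虙贄 蹝轈", "莦 莚 虙贄 蹝 轈"):
    print(candidate, spaceless_phrase_matches(RULE, candidate))
```

All three spacing variants match the rule in this model, which is why a phrase rule cannot be used to distinguish between different word groupings of the same logograms.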