Data Access Security Text Search

Data Access Security Data Classification Collectors use Lucene as their primary engine for optimized text-based search and analysis. The Lucene engine provides term-based search capabilities over the textual content extracted and analyzed from scanned files.

To extract textual content from various file types and formats, the Data Access Security Data Classification Data Collectors use a proprietary text extraction library that extracts file content according to the file's type. The Lucene engine then parses and analyzes the extracted content into searchable terms. The full content of the file is discarded and never persisted. All textual evaluations are performed in memory on highly efficient text-analytics structures optimized for term-based and phrase-based searches.
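The idea of keeping only searchable terms, and not the file content itself, can be sketched as follows. This is a simplified illustration, not Lucene's actual data structure: the function name `build_term_index`, the lowercasing step, and the regular expression used to split terms are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_term_index(extracted_text):
    """Map each term to the positions where it occurs (hypothetical helper).

    Only the term-to-positions mapping is returned; the caller is expected
    to drop the extracted text afterwards, so the content is not persisted.
    """
    index = defaultdict(list)
    # Assumption: terms are runs of word characters, compared case-insensitively.
    terms = re.findall(r"\w+", extracted_text.lower())
    for position, term in enumerate(terms):
        index[term].append(position)
    return dict(index)


index = build_term_index("Payroll data, payroll report")
print(index)
```

Recording positions as well as terms is what makes phrase searches possible on such an index, because a phrase is a run of terms at consecutive positions.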

When Data Access Security evaluates the scanned and analyzed content against the classification policies and rules, it parses and translates the policy rules into term-based search queries. Query results, which represent the files that satisfy the rules’ requirements, consist of file names, extensions, and full path locations, along with other attributes. In certain cases, results may include a masked text snippet that serves as evidence to orient admins and stakeholders reviewing the results. Strict masking requirements apply to all evidence snippets.

Regular Expressions

Regular-expression-based rules match regular-expression patterns against a file's content while the content is being read, rather than comparing the pattern with a term-based index.
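A minimal sketch of this approach, using Python's standard `re` module: the pattern is applied to the content as it is read, with no term index involved. The function name `regex_rule_matches`, the line-by-line reading strategy, and the sample pattern are assumptions for illustration; the product's actual reading strategy is not described here, and a line-by-line scan such as this one would not match patterns that span line breaks.

```python
import io
import re


def regex_rule_matches(pattern, stream):
    """Return True as soon as any line read from the stream matches the pattern."""
    compiled = re.compile(pattern)
    for line in stream:
        # The raw text is matched directly; no tokenizing or stop-word removal.
        if compiled.search(line):
            return True
    return False


# Hypothetical rule: a number formatted as three, two, and four digits.
sample = io.StringIO("name: example\nid: 123-45-6789\n")
print(regex_rule_matches(r"\b\d{3}-\d{2}-\d{4}\b", sample))
```

Because the pattern sees the raw text, punctuation and stop-words remain available to it, unlike in term-based rules.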

Lucene's Analysis Process

While parsing and analyzing content, the Lucene analyzer eliminates white spaces, certain punctuation characters, and “stop-words.” Stop-words are a predetermined set of frequently used words with diminished semantic significance, such as pronouns and prepositions. Lucene filters stop-words to keep the analysis manageable, to eliminate “white noise,” and to improve search heuristics. Lucene tokenizes file content into searchable terms by splitting on white spaces and omitting stop-words from the original text. The tokenizing algorithm affects Data Classification policy rules, as described in the following sections.
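The analysis steps described above can be approximated in a few lines. This is a simplified sketch and not Lucene's analyzer: the stop-word set is a small assumed sample (the product's actual list is not given here), and lowercasing is an added assumption.

```python
import re

# Assumed sample of stop-words, for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "in", "it", "of", "the", "to", "was"})


def analyze(text):
    """Split text into terms, dropping white space, punctuation, and stop-words."""
    # \w+ keeps runs of word characters, so spaces and punctuation are discarded.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(analyze("It was the best of times, it was the worst of times."))
# ['best', 'times', 'worst', 'times']
```

Only the terms that survive this step are searchable, which is why stop-words in a rule cannot contribute to a term-based match.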

Multi-term Phrase-Based Rules

The Data Classification process supports both single-term keyword searches and multi-term phrase searches. Lucene omits any stop-word contained in a multi-term search phrase. For example, a rule containing the phrase “It was the best of times, it was the worst of times” will classify a file containing the entire sentence, as well as any file containing the contiguous phrase that remains once the stop-words are omitted, such as “best times worst times.” To avoid false-positive classifications, restrict multi-term phrase searches to meaningful, contiguous terms.
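The false-positive risk in the example above can be reproduced with a small sketch that applies the same analysis to both the rule phrase and the file text, then looks for the rule's remaining terms as a contiguous run. The function names and the stop-word set are hypothetical, and the matching logic only models the behavior described in this section, not Lucene's phrase-query implementation.

```python
import re

# Assumed stop-words, limited to those needed for this example.
STOP_WORDS = frozenset({"it", "was", "the", "of"})


def analyze(text):
    """Lowercase, split into word terms, and drop stop-words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def phrase_matches(rule_phrase, file_text):
    """True if the rule's analyzed terms appear contiguously in the analyzed file text."""
    needle = analyze(rule_phrase)
    haystack = analyze(file_text)
    if not needle:
        return False  # a phrase made only of stop-words has nothing to match
    width = len(needle)
    return any(
        haystack[start:start + width] == needle
        for start in range(len(haystack) - width + 1)
    )


rule = "It was the best of times, it was the worst of times"
print(phrase_matches(rule, "Notes: best times worst times."))  # True
print(phrase_matches(rule, "These were the best times."))      # False
```

The first call returns a match even though the file never contains the original sentence, which is the kind of result a phrase built from meaningful, contiguous terms avoids.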

Chinese and Logogrammatic Languages

Some scripts, such as Chinese, represent words by symbols (logograms), and a single word may consist of one or more logograms. Furthermore, while most languages use white spaces to separate words, Chinese and other logogrammatic scripts often do not. Combined with Lucene’s omission of white spaces, this causes phrase searches in Chinese (with multiple logograms separated by spaces) to return positive matches for files containing the same sequence of logograms, regardless of the spaces between them. Thus, a rule containing the phrase “莦 莚 虙贄 蹝 轈” will classify files containing phrases that consist of these logograms, regardless of spaces: the phrases “莦莚虙贄蹝轈,” “莦莚 虙贄 蹝轈,” and “莦 莚 虙贄 蹝 轈” will all be classified by that rule. However, single-term keyword searches for words consisting of multiple logograms, and for phrases not separated by spaces, return correct, exact-match results: a rule containing the term “莦莚虙” will only classify files containing that exact term. If more complex phrases are required, a rule containing multiple phrases with the “Contains All” operator will give the desired results.
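The space-insensitive behavior of multi-logogram phrase rules can be modeled with a short sketch. It is only an approximation of the outcome described above, under the assumption that ignoring white space on both sides is equivalent to what the analyzer does; the function name `phrase_rule_matches` is hypothetical, and single-term rules and the “Contains All” operator are not modeled.

```python
def phrase_rule_matches(rule_phrase, file_text):
    """True if the rule's logogram sequence occurs in the text, ignoring white space."""
    # str.split() with no arguments splits on any Unicode white space,
    # so joining the pieces removes every space from the string.
    rule_sequence = "".join(rule_phrase.split())
    text_sequence = "".join(file_text.split())
    return bool(rule_sequence) and rule_sequence in text_sequence


rule = "莦 莚 虙贄 蹝 轈"
for candidate in ["莦莚虙贄蹝轈", "莦莚 虙贄 蹝轈", "莦 莚 虙贄 蹝 轈"]:
    print(candidate, phrase_rule_matches(rule, candidate))
```

All three candidates match the rule because they reduce to the same logogram sequence once spaces are ignored.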