
Data Classification

Data Classification categorizes and tags business resources (BRs) based on the following:

  • Content
  • Behavior
  • Imported designation

Classification is done by identifying resources with specific data or resources accessed by specific user types, according to standard and user-defined policies.

This section describes the data classification feature in File Access Manager and the operations available on the web application, which can be found by navigating to Compliance > Data Classification.

Overview

File Access Manager's Data Classification engine is a mechanism to classify organizational data and apply categories based on both content and behavioral analysis of files and Business Resources (Data Assets) residing on various applications.

You can use Data Classification to create policies and rules that address well-known or widely used regulatory compliance requirements such as GDPR, CCPA, HIPAA, ICD, LGPD, and more.


Content-Based Classification parses and indexes the files’ textual content and searches for specific patterns according to predefined sets of rules. These patterns can consist of sensitive keywords or keyword lists, complex regular expressions representing patterns such as Social Security Numbers (SSNs) and credit card numbers, and other user-defined parameters.
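The pattern-matching idea can be sketched as follows. The patterns and category names below are simplified illustrations, not File Access Manager's predefined rule definitions:

```python
import re

# Simplified example patterns -- real classification policies are far
# stricter (context keywords, validation checks, etc.). Illustration only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Credit Card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def classify_content(text):
    """Return the set of category names whose pattern appears in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

print(classify_content("Employee SSN: 123-45-6789, card 4111 1111 1111 1111"))
```

A real policy would combine such patterns with keyword lists and other user-defined parameters, as described above.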

Behavioral-Based Classification analyzes the activity information gathered by File Access Manager and can be used to classify business resources (BR) based on the type of users who access the files frequently.

  • Content-Based Classification - searches files for specific content of interest, such as SSNs, credit card numbers, and health records.
  • Behavioral-Based Classification - analyzes BRs according to properties of the users who access the data, for example, whether members of the board of directors or of the finance department use these BRs regularly.
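The behavioral idea can be sketched as follows. The function, field names, and threshold below are illustrative assumptions; the real engine's criteria are user-configurable:

```python
from collections import Counter

def classify_by_behavior(access_events, sensitive_departments, threshold=0.5):
    """Tag a business resource when a large share of its access activity
    comes from users in a sensitive department (simplified illustration)."""
    departments = Counter(dept for _user, dept in access_events)
    total = sum(departments.values())
    if total == 0:
        return set()
    return {dept for dept in sensitive_departments
            if departments[dept] / total >= threshold}

events = [("alice", "Finance"), ("bob", "Finance"), ("carol", "HR")]
print(classify_by_behavior(events, {"Finance", "Board"}))  # → {'Finance'}
```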

The classification of both content and behavioral data depends on user-configurable criteria. Classification results can serve as a data source on their own and can form the basis of queries on the forensics screens (See the chapter on forensics). However, classification results also serve as an additional information layer, associated with activities and permission data.

The classification results layer thus ties the other layers together, linking content categories with activity and permissions data.

The Data Classification module supports external classification of files using either of the following methods:

  • DC Import - importing a spreadsheet into File Access Manager that lists files and directories assigned to categories.
  • File properties - writing classification values to file properties, and creating rules in File Access Manager that assign categories to files containing those properties.

These methods can even be used for encrypted files without File Access Manager reading the file content.
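A DC Import spreadsheet might be processed along these lines. The column names and paths below are hypothetical, not the product's actual import template:

```python
import csv, io

# Hypothetical layout -- the actual DC Import template's columns may differ.
spreadsheet = """path,category
\\\\server\\share\\hr\\salaries.xlsx,Payroll
\\\\server\\share\\legal,Contracts
"""

def load_external_classification(csv_text):
    """Map each imported file or directory path to its assigned category."""
    return {row["path"]: row["category"]
            for row in csv.DictReader(io.StringIO(csv_text))}

assignments = load_external_classification(spreadsheet)
print(assignments["\\\\server\\share\\legal"])  # → Contracts
```

Because the categories come from the spreadsheet rather than from file content, this works even for encrypted files.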

Classification Architecture and Flow

The Data Classification content indexing is performed by the Central Data Classification services and their associated Collectors. See the Architecture section for additional information on the possible deployment models and on scaling the Data Classification Collectors for greater speed and performance.

The Central Classification service reads the BRs eligible for indexing and sends them to the Collectors. The Collectors index the files in the received BRs according to the defined data classification policy, and send the results back to the Central Service to be saved in the database.

The Collectors no longer keep a persistent full-text index on disk; all processing is done in memory.

Content Classification Process

The classification process consists of the following stages, which run concurrently and independently:

  • Classification Policy Management and Update
  • Running a Content Indexing task
  • Querying and Retrieving Results

Classification Policy Management and Updates

When a Content Indexing task is issued, the Data Classification engine reads the current policy definition, which then persists for the duration of the task. Changes made to the policy definition after the task has started are not reflected in the running classification process.
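The snapshot semantics can be sketched as follows (a minimal illustration of the described behavior, not the product's implementation):

```python
import copy

class ContentIndexingTask:
    """The task copies the policy definition when it starts, so later
    edits do not affect a run already in progress."""
    def __init__(self, policy):
        self.policy = copy.deepcopy(policy)  # snapshot taken at task start

policy = {"rules": ["SSN"]}
task = ContentIndexingTask(policy)
policy["rules"].append("Credit Card")  # change made after the task started
print(task.policy["rules"])            # → ['SSN']
```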

Indexing Flow

The classification engine's Content Indexing task proceeds as follows:

  1. The central service retrieves the BRs to be indexed from the File Access Manager database. A business resource is indexed only when:

      • This is the first indexing run of the business resource, or
      • The last modified date of the business resource is more recent than its last indexing date.

     In addition, the business resource must be included in the Scope of the Application and must not be contained in a de-duplicated share. If the data classification policy has changed since the last indexing task, all BRs are re-indexed.

  2. The central service sends the BRs to the Collectors.

  3. Each Collector retrieves the list of files in each business resource.

  4. The Collector reads the content of each file.

  5. The Collector indexes and classifies the file content and sends the results to the Central Data Classification service to be saved in the database.
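The eligibility check in step 1 can be sketched as follows. The field names are illustrative assumptions, not File Access Manager's schema:

```python
def needs_indexing(br, policy_changed):
    """Decide whether a business resource (BR) is sent for indexing,
    following the criteria above. Field names are illustrative."""
    if not br["in_application_scope"] or br["in_deduplicated_share"]:
        return False
    if policy_changed:                  # a policy change re-indexes everything
        return True
    if br["last_indexed"] is None:      # first indexing run of this BR
        return True
    return br["last_modified"] > br["last_indexed"]  # modified since last index

br = {"in_application_scope": True, "in_deduplicated_share": False,
      "last_indexed": 100, "last_modified": 250}
print(needs_indexing(br, policy_changed=False))  # → True
```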

Data Classification Deduplication Scan

In CIFS systems, multiple shares can point to the same physical address; such shares are considered “duplicate shares.”

To minimize the running time of the Data Classification task, these duplicate shares are identified, and shared data is scanned only once.

When a user queries the Forensics tab of Data Classification, the classification results are reflected through all duplicate shares.

The following scenario involves four shares in a Windows server:

  • Share1 points to D:\
  • Share2 points to D:\folder1
  • Share3 points to D:\
  • Share4 points to E:\

The results of the deduplication scan will be:

  • Share1 will be scanned completely.
  • Share2 will be skipped, since Share1 contains Share2.
  • Share3 will be skipped, since Share1 is equal to Share3.
  • Share4 will be scanned completely.

When a user queries the Forensics tab of Data Classification, the user will receive the results of all shares.
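The selection logic in the scenario above can be sketched as follows (a simplified model of the described behavior, not the product's implementation):

```python
from pathlib import PureWindowsPath

def deduplicate_shares(shares):
    """Return the shares to scan: skip a share whose physical path equals,
    or is contained in, the path of a share already selected."""
    selected = {}
    # Visit shallower paths first so containing shares are selected first.
    for name, path in sorted(shares.items(),
                             key=lambda kv: len(PureWindowsPath(kv[1]).parts)):
        p = PureWindowsPath(path)
        if not any(p == q or q in p.parents for q in selected.values()):
            selected[name] = p
    return set(selected)

shares = {"Share1": "D:\\", "Share2": "D:\\folder1",
          "Share3": "D:\\", "Share4": "E:\\"}
print(sorted(deduplicate_shares(shares)))  # → ['Share1', 'Share4']
```

This reproduces the result above: Share2 and Share3 are skipped because Share1 already covers their data, while Share4 is scanned as a separate physical location.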

Limitations and Known Issues

If the Crawler excludes BRs in contained shares, Data Classification will not classify those BRs.

Re-Indexing Scenarios

Every data classification policy change causes all BRs to be re-indexed on the next indexing task. The assumption is that the policy remains static once the implementation and testing phase is completed. To test policy changes faster, File Access Manager provides features that limit the scope of the indexed BRs, such as Scoping and the Run a Specific Resource Classification task.

The Data Classification engine uses Lucene as its primary text-search index. Lucene provides term-based search capabilities over the textual content extracted and analyzed from indexed files.

To extract textual content from various file types and formats, the classification engine uses a proprietary text extraction library, which extracts the file content based on its type. From the extracted content, the Lucene indexing service parses and analyzes file content into an index of searchable terms. The full content of the files themselves is not saved as part of the index, which allows the index to remain relatively small, highly efficient, and optimized for term-based textual searches.

When the system compares a Content Classification request with the textual index, it parses and translates the various policy rules into term-based search queries. Query results, representing files that correspond to the rules’ requirements, consist of file names, extensions, and full path locations, along with other attributes. In certain cases, results may include an actual term or phrase that matches the rule-based query, rather than the full content of the file.

Regular Expressions - Regular-expression-based rules are evaluated by matching patterns against the file while its content is being read, rather than by comparing the pattern with the term-based index.

Lucene’s Indexing Process - While it parses and analyzes the content data, the Lucene index analyzer eliminates white spaces, certain punctuation characters, and “stop-words” from the content. Stop-words are a predetermined set of frequently used words with diminished semantic significance, such as pronouns and prepositions. Lucene filters stop-words to keep the index manageable, to eliminate “white noise,” and to improve search heuristics. Lucene analyzes and tokenizes file content into searchable terms based on the white spaces and stop-words omitted from the original text. The tokenizing algorithm affects Data Classification policy rules.

Multi-term Phrase-Based Rules - The Data Classification engine allows both single-term keyword searches and multi-term phrase searches. Lucene omits any “stop-word” contained in a multi-term search phrase. For example, a rule containing the phrase “It was the best of times, it was the worst of times,” will classify the file containing the entire sentence, as well as any file containing a contiguous phrase, such as “best times worst times.” To avoid possible false-positive classification, it is best to restrict multi-term phrase searches to meaningful, contiguous terms.
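The analysis and phrase-matching behavior described above can be sketched with a toy tokenizer. The stop-word list here is a tiny illustrative subset, not Lucene's actual list:

```python
import re

STOP_WORDS = {"it", "was", "the", "of", "a", "an", "to", "in"}  # tiny illustrative set

def tokenize(text):
    """Lowercase, split on non-word characters, drop stop-words --
    a toy version of Lucene's analysis chain."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]

def phrase_matches(rule_phrase, file_text):
    """True if the rule's tokens appear contiguously in the file's tokens."""
    rule, doc = tokenize(rule_phrase), tokenize(file_text)
    return any(doc[i:i + len(rule)] == rule
               for i in range(len(doc) - len(rule) + 1))

rule = "It was the best of times, it was the worst of times"
print(tokenize(rule))                                  # → ['best', 'times', 'worst', 'times']
print(phrase_matches(rule, "best times worst times"))  # → True
```

This shows why the full sentence and the contiguous phrase “best times worst times” both match: after stop-word removal, they reduce to the same token sequence.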

Chinese and Logogrammatic Languages

Some scripts, such as Chinese, represent words by symbols (logograms), and a single word may consist of one or more logograms. Furthermore, while most languages use white spaces to separate words, Chinese and other logogrammatic scripts often do not. The combination of these two phenomena, along with Lucene’s omission of white spaces, causes phrase searches in Chinese (with multiple logograms separated by spaces) to return positive matches for files containing the same sequence of logograms, regardless of the spaces between them. Thus, a rule containing the phrase “莦 莚 虙贄 蹝 轈” will classify files containing phrases that consist of these logograms, regardless of spacing: the phrases “莦莚虙贄蹝轈,” “莦莚 虙贄 蹝轈,” and “莦 莚 虙贄 蹝 轈” will all be classified by this rule. However, single-term keyword searches of words consisting of multiple logograms, with no spaces, return correct, exact-match results: a rule containing the term “莦莚虙” will only classify files containing that exact term. If more complex phrases are required, a rule containing multiple phrases with the “Contains All” operator will give the desired results.
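The space-insensitive matching described above can be approximated as follows. This is a rough sketch of the observable behavior, not Lucene's actual CJK tokenization:

```python
def cjk_phrase_matches(rule_phrase, file_text):
    """Approximate the behavior for logogrammatic scripts: spaces are not
    significant, so compare only the sequences of logograms."""
    strip = lambda s: "".join(s.split())
    return strip(rule_phrase) in strip(file_text)

rule = "莦 莚 虙贄 蹝 轈"
print(cjk_phrase_matches(rule, "莦莚虙贄蹝轈"))    # → True
print(cjk_phrase_matches(rule, "莦莚 虙贄 蹝轈"))  # → True
```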

Optical Character Recognition (OCR)

File Access Manager can identify text within image files, either directly or embedded in other files, such as scanned documents or a collection of scans stored in a zip file. Images less than 1000 pixels across are not scanned, to avoid the less reliable results produced by low-resolution images.

The data privacy engine can analyze files containing sensitive data in image form.

Note

The optical character recognition process is resource-intensive and should be configured carefully, taking run time into consideration. It is disabled by default.

OCR capability can be added to the scope selected in the DSAR Scope screen.

Enabling Optical Character Recognition

Enabling OCR causes the next Data Classification task to re-index. Disabling OCR does not initiate re-indexing. This means that once files are marked as sensitive, you can turn off the resource-intensive optical character recognition process without removing that indication, until any other filtering setting is changed.

By default, optical character recognition is disabled on the entire scope of the DSAR. To enable optical character recognition on a resource, edit the application scope line.

  1. Find the desired application from the DSAR Scope screen.
  2. Select Edit.
  3. Select Optical Character Recognition (OCR) to enable OCR analysis for this application.