Content Classification Process
Content classification consists of the following processes, which run concurrently and independently:
- Classification Policy Management and Updates
- Running a Content Indexing task
- Querying and Retrieving Results
Classification Policy Management and Updates
When a Content Indexing task is issued, the Data Classification Engine reads the most recent policy definition. That definition remains in effect for the duration of the task. Changes made to the policy definition after the task has started are not reflected in the classification run already in progress.
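The snapshot behavior described above can be illustrated with a minimal Python sketch. This is not the product's implementation: `PolicyStore`, `ContentIndexingTask`, and the keyword-based rules are hypothetical names and structures chosen only to show that a task keeps the policy it read when it was issued.

```python
import copy


class PolicyStore:
    """Hypothetical store holding the current, editable policy definition."""

    def __init__(self, rules):
        # rules maps a category name to a trigger keyword (illustrative only).
        self.rules = dict(rules)


class ContentIndexingTask:
    """Hypothetical task that freezes the policy at the moment it is issued."""

    def __init__(self, store):
        # Deep copy, so later edits to the store cannot reach this task.
        self.policy = copy.deepcopy(store.rules)

    def classify(self, text):
        # Return the categories whose keyword appears in the text.
        return sorted(name for name, keyword in self.policy.items() if keyword in text)


store = PolicyStore({"PII": "ssn"})
running_task = ContentIndexingTask(store)  # snapshot taken here
store.rules["Finance"] = "invoice"         # policy edited while the task runs
next_task = ContentIndexingTask(store)     # a task issued later sees the edit
```

In this sketch, `running_task` still classifies with the original single rule, while `next_task` picks up the added one.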
Indexing Flow
The Content Indexing task of the classification engine runs as follows:
- The central service retrieves the business resources (BRs) to be indexed from the File Access Manager database, but only when:
  - This is the first indexing run of the business resource
  - The last modified date of the business resource is more recent than its last indexing date
  - The business resource is included in the Scope of the Application
  - The business resource is not contained in a de-duplicated share
  - If the data classification policy has changed since the last indexing task, all the BRs are re-indexed
- The central service sends the BRs to the Collectors.
- The Collector retrieves the list of files in each business resource.
- The Collector reads the content of each file.
- The Collector indexes and classifies the file content and sends the results to the Central Data Classification to be saved in the database.
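The retrieval conditions in the first step above can be sketched as a single predicate. The text does not state exactly how the conditions combine, so the following is one possible reading, labeled as an assumption: scope membership and the de-duplication check always apply, and a BR is then indexed if the policy changed, if it has never been indexed, or if it was modified after its last indexing. `BusinessResource` and `needs_indexing` are hypothetical names, not part of File Access Manager.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BusinessResource:
    """Hypothetical record of the fields the conditions refer to."""
    path: str
    last_modified: datetime
    last_indexed: Optional[datetime]  # None means the BR was never indexed
    in_scope: bool
    in_deduplicated_share: bool


def needs_indexing(br: BusinessResource, policy_changed: bool) -> bool:
    # Assumption: these two checks apply regardless of the other conditions.
    if not br.in_scope or br.in_deduplicated_share:
        return False
    # A policy change since the last task re-indexes every eligible BR.
    if policy_changed:
        return True
    # First indexing run of this BR.
    if br.last_indexed is None:
        return True
    # Otherwise, index only if the BR changed after it was last indexed.
    return br.last_modified > br.last_indexed
```

Under this reading, an unchanged BR is skipped on a routine run but picked up again as soon as the policy is edited.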
Data Classification Deduplication Scan
In CIFS systems, multiple shares can point to the same physical address; such shares are considered “duplicate shares”.
To minimize the running time of the Data Classification task, these duplicate shares are identified, and shared data is scanned only once.
When a user queries the Forensics tab of Data Classification, the classification results are reflected through all duplicate shares.
The following scenario involves four shares on a Windows server:
- Share1 points to D:\
- Share2 points to D:\folder1
- Share3 points to D:\
- Share4 points to E:\
The results of the deduplication scan will be:
- Share1 will be scanned completely.
- Share2 will be skipped, since Share1 contains Share2.
- Share3 will be skipped, since Share1 is equal to Share3.
- Share4 will be scanned completely.
When a user queries the Forensics tab of Data Classification in this scenario, the results cover all four shares, including the skipped Share2 and Share3.
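The scenario above can be reproduced with a short Python sketch. The product's actual deduplication algorithm is not described here, so this is only an illustration under stated assumptions: shares are compared by their local Windows paths, shallower paths are considered first, and a share is skipped when an already-scanned share has the same path or a parent path. The function name `plan_dedup_scan` is hypothetical.

```python
from pathlib import PureWindowsPath


def plan_dedup_scan(shares):
    """Map each share name to None (scan it) or to the name of the share that covers it."""
    # Visit shallower paths first so a containing share is always seen
    # before the shares it contains; sorted() is stable, so ties keep input order.
    ordered = sorted(shares, key=lambda name: len(PureWindowsPath(shares[name]).parts))
    scanned = {}  # share name -> path of a share that will be scanned
    plan = {}
    for name in ordered:
        path = PureWindowsPath(shares[name])
        # A share is covered if a scanned share has the same path or a parent path.
        covering = next(
            (other for other, p in scanned.items() if p == path or p in path.parents),
            None,
        )
        plan[name] = covering
        if covering is None:
            scanned[name] = path
    # Report in the caller's original order.
    return {name: plan[name] for name in shares}


shares = {
    "Share1": "D:\\",
    "Share2": "D:\\folder1",
    "Share3": "D:\\",
    "Share4": "E:\\",
}
scan_plan = plan_dedup_scan(shares)
```

Here `scan_plan` marks Share1 and Share4 for a full scan and records Share1 as the share covering both Share2 and Share3, matching the results listed above.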
Limitations and Known Issues:
If the Crawler excludes BRs in contained shares, Data Classification will not classify those BRs.