General Information
Classification and Flow Architecture
Data Classification content indexing is performed by the Central Data Classification service and its associated collectors.
The Central Classification service reads the Business Resources eligible for indexing and sends them to the collectors. The collectors index the files in the received Business Resources according to the defined data classification policy, and send the results back to the Central Service to be saved in the database.
The collectors no longer keep a persisted full-text index on disk; all processing is done in memory.
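A minimal sketch of this round trip, assuming hypothetical CentralService and Collector classes and an in-memory substitute for the database (the real component APIs are not documented here):

```python
from dataclasses import dataclass

@dataclass
class ClassificationResult:
    resource: str
    file_path: str
    categories: list[str]

class Collector:
    """Classifies file content entirely in memory; nothing is persisted to disk."""
    def index(self, resource, files, policy):
        results = []
        for path, content in files.items():
            matched = [cat for cat, term in policy.items() if term in content]
            if matched:
                results.append(ClassificationResult(resource, path, matched))
        return results

class CentralService:
    def __init__(self, collector, policy):
        self.collector = collector
        self.policy = policy
        self.database = []  # stands in for the Data Access Security database

    def run_indexing_task(self, resources):
        for resource, files in resources.items():
            # Send each eligible Business Resource to a collector and save the results.
            self.database.extend(self.collector.index(resource, files, self.policy))

service = CentralService(Collector(), policy={"PII": "ssn"})
service.run_indexing_task({r"\\server\share1": {"a.txt": "ssn 123-45-6789", "b.txt": "nothing"}})
print(service.database)
```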
Content Classification Process
The classification process consists of the following sub-processes, which run concurrently and independently:
- Classification Policy Management and Updates
- Running a Content Indexing Task
- Querying and Retrieving Results
Classification Policy Management and Updates
Once a Content Indexing task is issued, the Data Classification engine reads the most recent policy definition. That policy definition persists for the duration of the Content Indexing task; any changes made to the policy definition after the task has started are not reflected in the current classification process.
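A minimal sketch of this snapshot behavior, assuming a hypothetical in-memory policy structure (the names below are illustrative, not the product's API):

```python
import copy

current_policy = {"rules": ["credit-card", "ssn"]}

def start_indexing_task():
    # Take a private copy of the most recent policy when the task starts.
    policy_snapshot = copy.deepcopy(current_policy)
    def classify(text):
        return [rule for rule in policy_snapshot["rules"] if rule in text]
    return classify

classify = start_indexing_task()
current_policy["rules"].append("passport")  # policy edited after the task started
print(classify("ssn passport"))             # ['ssn'] -- the new rule is not applied
```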
Indexing Flow
- The central service retrieves the Business Resources to be indexed from the Data Access Security database, but only when all of the following conditions hold (a sketch of these checks follows this list):
  - This is the first indexing run of the business resource, or the business resource's last-modified date is more recent than its last indexing date
  - The business resource is included in the Scope of the application
  - The business resource is not contained in a de-duplicated share
- If the data classification policy has changed since the last indexing task, all Business Resources will be re-indexed
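A minimal sketch of the eligibility checks above; the field names and the in-scope and de-duplicated-share flags are assumptions, since the actual database schema is not documented here:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BusinessResource:
    path: str
    last_modified: datetime
    last_indexed: Optional[datetime]  # None means never indexed

def needs_indexing(res, in_scope, in_deduplicated_share, policy_changed):
    if not in_scope or in_deduplicated_share:
        return False
    if policy_changed:            # a policy change forces a full re-index
        return True
    if res.last_indexed is None:  # first indexing run
        return True
    return res.last_modified > res.last_indexed

res = BusinessResource(r"\\server\share1\folder", datetime(2024, 5, 2), datetime(2024, 5, 1))
print(needs_indexing(res, in_scope=True, in_deduplicated_share=False, policy_changed=False))  # True
```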
- The central service sends the Business Resources to the collectors.
- For each business resource, the collector (see the sketch after this list):
  - Retrieves the list of files in the business resource
  - Reads the content of each file
  - Indexes and classifies the file content, and sends the results to the Central Data Classification service to be saved in the database
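A minimal sketch of that collector loop over a local path, assuming a simple substring-matching rule format for illustration (the real classification rules are richer than this):

```python
import os

def classify(text, policy):
    return [category for category, term in policy.items() if term in text]

def index_business_resource(root, policy):
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):   # retrieve the file list
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    content = f.read()                    # read the file content
            except OSError:
                continue                                  # skip unreadable files
            categories = classify(content, policy)        # index and classify
            if categories:
                results.append({"file": path, "categories": categories})
    return results  # in the real flow, these are sent back to the central service

# Example: scan a local folder for a single "PII" rule.
# print(index_business_resource(r"C:\data", {"PII": "ssn"}))
```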
Data Classification Deduplication Scan
On CIFS systems, multiple shares can point to the same physical location; such shares are considered duplicate shares.
To minimize the running time of the Data Classification task, these duplicate shares are identified and shared data is scanned only once.
When a user queries the Forensics tab of Data Classification, the classification results are reflected across all duplicate shares.
The following scenario involves four shares on a Windows server:
- Share1 points to D:\
- Share2 points to D:\folder1
- Share3 points to D:\
- Share4 points to E:\
The results of the deduplication scan will be:
- Share1 will be scanned completely
- Share2 will be skipped, since Share1 contains Share2
- Share3 will be skipped, since Share1 is equal to Share3
- Share4 will be scanned completely
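A minimal sketch of this containment check, reproducing the scenario above; the function name and decision format are hypothetical:

```python
from pathlib import PureWindowsPath

def plan_dedup_scan(shares):
    """Decide, per share, whether to scan it or skip it as a duplicate."""
    decisions, scanned = {}, []
    # Visit shorter (parent) paths first so contained shares are caught.
    for name, raw in sorted(shares.items(), key=lambda kv: len(PureWindowsPath(kv[1]).parts)):
        path = PureWindowsPath(raw)
        covered = next((p for p in scanned if path == p or p in path.parents), None)
        if covered is not None:
            decisions[name] = f"skipped (covered by {covered})"
        else:
            scanned.append(path)
            decisions[name] = "scanned completely"
    return decisions

shares = {"Share1": "D:\\", "Share2": "D:\\folder1", "Share3": "D:\\", "Share4": "E:\\"}
for name, decision in sorted(plan_dedup_scan(shares).items()):
    print(name, "->", decision)
# Share1 -> scanned completely
# Share2 -> skipped (covered by D:\)
# Share3 -> skipped (covered by D:\)
# Share4 -> scanned completely
```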
Limitations and Known Issues
If the Crawler excludes Business Resources in contained shares, Data Classification will not classify those Business Resources.