Data Classification Components

The Data Classification process assigns categories to business resources according to rules.

Rules are composed of one or more rule criteria. Each rule criterion looks for a match within files to one or more strings or patterns.

A string or pattern can be defined as free text, as a regular expression, or as a stored policy object.

A regular expression in a policy object may be accompanied by a verification algorithm to further narrow down the search.

Note

Policy objects and verification algorithms for standard searches are available out of the box, or you can create your own to fit your needs.

The classification rule is the main data classification component. Rules also contain sub-components that complete the rule structure, simplify the rule management task, and provide extended functions.

File properties can be used to classify files that the customer has already tagged, either manually or with a third-party application. File Access Manager reads the metadata on the files and can use it in data classification rules. This includes reading metadata from encrypted files.

Data Categories

The data category (the basic component of data classification) is the tag used when a classification rule is satisfied.

To define a data category, open the Manage Categories panel from any of the Data Classification screens.

For example:

Navigate to Compliance > Data Classification > Policies > Actions > Manage Categories or Compliance > Data Classification > Rules > Actions > Manage Categories.

In the Manage Categories window, type the category name in the Add New Category section.
Select Add.

The system adds the new data category to the Current Categories list. Users can edit and delete existing user-defined categories from the Current Categories list. Users can also search categories by name, or filter the list by selecting the Show user defined categories only checkbox.

Data Classification Policy

The Data Classification Policy is a logical container for data classification rules. For example, all the rules that belong to HIPAA should be located under the HIPAA policy. The system already contains several predefined policies, and users can create additional user-defined policies.

Rules

Policies set the rules for detecting sensitive data to be protected by compliance regulation or by organizational procedure.

File Properties

File Access Manager indexes standard attributes, including extension, size, and file name, and also indexes attributes for office files. All the file properties are discovered and created during the indexing process.

In the web client, navigate to Compliance > Data Classification > Rules > Actions > Manage File Properties or Compliance > Data Classification > Policies > Actions > Manage File Properties to open the Manage File Properties window.

Type in the file property details.
Check the Custom Properties checkbox, if relevant.
Select Add.

Encrypted files

To classify encrypted files without File Access Manager reading the file contents, you can tag the files locally according to your classification rules, and then use these tags in classification rules (see Local Classification).

Note

If you choose to tag the file using the Tags property, the property is called Keywords after it is uploaded to File Access Manager.

Local Classification

You can classify files locally by tagging them with relevant tags. The file metadata is uploaded to the File Access Manager database as file properties during the scanning process. These properties can then be used to create classification rules manually.

The file properties found will be added automatically to the list of available properties for filtering after the first iteration. In order to have these properties available in the initial run of the Data Classification, add the properties to the property list, as described in File Properties.

Policy Objects

Policy objects are searches, saved for use in rules.

For example, predefined policy objects can search for credit cards.

Navigate to Compliance > Data Classification > Policy Objects to open the Policy Objects page.

Select New Policy Object.

Data classification policy object fields include:

Policy Object Name - Name of the policy object.

Description - Free text description.

Type - The type of search the policy object performs:

  - **Keyword**  
  A keyword may be one or more words. If multiple words are involved, the entire phrase will be searched.  
  Note that stop words such as "a" or "and" are stripped from the search keywords. If you want to include stop words in the phrase, use a regular expression instead. (For a nerd-level description of ignoring stop words, see [stopwords](https://www.elastic.co/guide/en/elasticsearch/guide/current/stopwords.html).)

  - **Wildcard**  
  Supports the following special characters:
    - `*` any number of characters
    - `?` exactly one character

  - **Regular Expression**  
  Using standard regex for defining policies.

Values - Values to search for:

  - **Single Value**
  - **List** – A list of matching values.
  - **Mask Values** (Regular Expression policy objects only)  
  Masks portions of the matched values.
    - **Display the first characters** – number of characters from the left displayed in the matched value.
    - **Display the last characters** – number of characters from the right displayed in the matched value.

Verification Algorithm - A code-based algorithm to enable more complex filtering. See Data Classification Verification Algorithms for further details.

Policy objects are a good way to reuse searches containing complex definitions.

Select Save to complete the New Policy Object process.
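
The wildcard characters listed under Type can be illustrated with a short Python sketch. This is not File Access Manager code: the helper `wildcard_matches` is hypothetical, and it assumes that a wildcard pattern is compared against one whole token and that matching is case sensitive. Neither assumption is confirmed by this page.

```python
import re

def wildcard_matches(pattern: str, token: str) -> bool:
    """Hypothetical helper: treat '*' as any number of characters
    and '?' as exactly one character; everything else is literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any number of characters (including none)
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    # Assumption: the pattern must cover the whole token.
    return re.fullmatch("".join(parts), token, flags=re.DOTALL) is not None

print(wildcard_matches("pass*", "password"))  # True
print(wildcard_matches("te?t", "test"))       # True
print(wildcard_matches("te?t", "teest"))      # False
```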

Classification Types

Regular Expressions Within Policy Objects

Regular expressions form the basis for many content pattern searches. File Access Manager uses the .NET regular expression engine as its underlying engine for regular expression searches. All regular-expression definitions and searches must conform to the engine’s restrictions, limitations, and standards.

When you select a policy object of type Regular Expression, the New Policy Object panel displays the following additional fields.

Verification Algorithm - A standard, out-of-the-box example is the Luhn verification algorithm. This algorithm ensures that all phrases classified as credit cards are, indeed, valid credit card numbers (as far as an algorithm can validate without contacting the bank, of course). When selected, this verification is run only on strings that conform to the credit card regular expression entered, for example:

^3[47][0-9]{13}$

See Data Classification Verification Algorithms for a full description on creating verification algorithms.
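
To make the two-stage check concrete, here is a minimal Python sketch of a regex match followed by a Luhn checksum. It only illustrates the idea and is not the product's verification algorithm; the function names are hypothetical. The sample value is a widely published test card number, not a real account.

```python
import re

# The example pattern from the text: 15 digits starting with 34 or 37.
CARD_PATTERN = re.compile(r"^3[47][0-9]{13}$")

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over a string of digits."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0

def is_verified_card(candidate: str) -> bool:
    """Run the checksum only on strings that already match the regex."""
    return bool(CARD_PATTERN.match(candidate)) and luhn_valid(candidate)

print(is_verified_card("378282246310005"))  # True: matches and passes Luhn
print(is_verified_card("378282246310006"))  # False: matches but fails Luhn
```

Running the checksum only after the regex matches mirrors the order described above: the pattern narrows the candidates, and the verification step discards those that merely look like card numbers.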

Mask Values - By default, the regular-expression matches are saved as part of the results. It is recommended to mask the values of the matches to avoid exposing sensitive data in the File Access Manager database.
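
As an illustration of what the Display the first characters and Display the last characters settings describe, the following Python sketch masks the middle of a matched value. The function `mask_match`, the `*` mask character, and the handling of values too short to mask are all assumptions made for this example; the product's stored format may differ.

```python
def mask_match(value: str, show_first: int, show_last: int, mask_char: str = "*") -> str:
    """Hypothetical helper: keep show_first characters from the left and
    show_last characters from the right, and mask everything in between."""
    hidden = len(value) - show_first - show_last
    if hidden <= 0:
        # Assumption: nothing is left to mask, so return the value unchanged.
        return value
    # Slice from an explicit index so that show_last == 0 yields an empty tail.
    tail = value[len(value) - show_last:]
    return value[:show_first] + mask_char * hidden + tail

print(mask_match("378282246310005", 4, 4))  # 3782*******0005
print(mask_match("378282246310005", 0, 4))  # ***********0005
```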

Regex Matching and Case

Regex matching is case sensitive by default. To make a regex ignore case, use the prefix “(?i)”.

For example: “home” will find “home”, but ignore “Home”.

The regex “(?i)home” will find “home”, “Home”, “HOME” and “HoMe”.
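
The inline option `(?i)` is the standard way to turn on case-insensitive matching in the .NET engine, and it behaves the same way in Python's `re` module, which is used here only as a convenient way to try it out. The sample strings are illustrative.

```python
import re

samples = ["home", "Home", "HOME", "HoMe"]

# Default behavior: case sensitive, so only the exact-case string matches.
case_sensitive = [s for s in samples if re.search(r"home", s)]

# With the (?i) prefix, every casing of the word matches.
case_insensitive = [s for s in samples if re.search(r"(?i)home", s)]

print(case_sensitive)    # ['home']
print(case_insensitive)  # ['home', 'Home', 'HOME', 'HoMe']
```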

Identifying Line Breaks using Regex in File Access Manager

For parsed files, line breaks are represented by a single CR (\r) instead of (\r\n) or (\n), and are therefore not identified by the regex line boundaries ^ and $.

If we take this regex: (?m)(^|\s)up($|\s) and try to match it with the following text (assuming the line breaks are \r) going, up, up, and away!, it will not match anything since the line breaks are not \n as expected by the regex.

In order to identify the start and end of a line, we have to check for the CR explicitly. The issue is that once we identify an end of line character, the cursor has moved past this character, and we can't use this to identify the start of the next line.

If we change the regex to (\r|\s)up(\r|\s), it matches only the first “up”, since the \r character that follows it becomes part of the match and is therefore no longer available when the next “up” is evaluated.

We need to check the previous and next characters, without moving the cursor.

If we try this regex (?<=(\r|\s))up(?=\r|\s), both “up” strings will be matched. This is because of two modifications:

(?<=...) positive lookbehind - When there’s a match, it moves back to assert whether the regex that replaces “...” is matched, but then discards the match and moves forward to where it was to continue matching.
(?=...) positive lookahead - When there’s a match, it moves forward to assert whether the regex that replaces “...” is matched, but then discards the match and moves back to where it was to continue matching.

Combining those two means the match contains only “up” without the preceding or following \r, so they can be used for more matches.

These non-capturing matches are known as zero-length assertions. For more information on lookahead and lookbehind assertions (collectively called lookaround) see https://www.regular-expressions.info/lookaround.html.
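
The difference between the consuming pattern and the lookaround pattern can be reproduced with a short script. The sketch below uses Python's `re` module rather than the .NET engine, and the sample text (one word per line, separated by a bare CR) is an assumption; for these two patterns the engines behave the same way.

```python
import re

# Assumed sample: lines separated by a single CR, as in parsed files.
text = "going\rup\rup\rand away!"

# Consuming version: the CR after the first "up" is swallowed by the match,
# so it cannot serve as the separator in front of the second "up".
consuming = [m.group(0) for m in re.finditer(r"(\r|\s)up(\r|\s)", text)]

# Lookaround version: the separators are asserted but not consumed,
# so both occurrences are found and each match is just "up".
lookaround = [m.group(0) for m in re.finditer(r"(?<=(\r|\s))up(?=\r|\s)", text)]

print(consuming)   # ['\rup\r']
print(lookaround)  # ['up', 'up']
```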

Examples - To look for rows starting with "John," you could use: (?<=\r|^)John.*(?=\r|$)

To look for rows ending in "Doe," you could use: (?<=\r|^).*Doe(?=\r|$)
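
A Python sketch of the same two row searches follows, with two adaptations that are assumptions of this example and not part of the patterns above. Python's `re` accepts only fixed-width lookbehind, so the start-of-row check is written as `(?:^|(?<=\r))`. The sketch also uses `[^\r]*` in place of `.*`, because `.` matches a CR in both Python and .NET (it excludes only \n), and the negated class keeps each match inside a single CR-delimited row. The sample text is invented.

```python
import re

# Assumed sample: three rows separated by bare CR characters.
text = "John Smith\rJane Doe\rJohn Doe"

# Start of a row: either the start of the text or just after a CR.
ROW_START = r"(?:^|(?<=\r))"

# Rows starting with "John": match up to, but not including, the next CR.
starts_with_john = re.findall(ROW_START + r"John[^\r]*", text)

# Rows ending in "Doe": "Doe" must be followed by a CR or the end of the text.
ends_with_doe = re.findall(ROW_START + r"[^\r]*Doe(?=\r|$)", text)

print(starts_with_john)  # ['John Smith', 'John Doe']
print(ends_with_doe)     # ['Jane Doe', 'John Doe']
```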