Configuring and Scheduling the Crawler

To set or edit the Crawler configuration and scheduling, open the edit screen of the required application.

  1. Go to Admin > Applications.

  2. Scroll through the list or use the filter to find the application.

  3. Select the Edit icon on the line of the application.

  4. Press Next until you reach the Crawler & Permissions Collection settings page.

The actual entry fields vary according to the application type.

Create a Schedule

Select Create a Schedule to open the schedule panel. See Scheduling a Task.

Setting Crawl Scope

There are several options to set the crawl scope:

  • Setting an explicit list of resources to include in or exclude from the scan.

  • Creating a regex to define resources to exclude.

Including and Excluding Paths by List

To set the paths to include or exclude in the crawl process for an application, open the edit screen of the required application.

  1. Go to Admin > Applications.

  2. Scroll through the list or use the filter to find the application.

  3. Select the Edit icon on the line of the application.

  4. Press Next until you reach the Crawler & Permissions Collection settings page.

    The actual entry fields vary according to the application type.

  5. Scroll down to the Crawl configuration settings.

  6. Select Advanced Crawl Scope Configuration to open the scope configuration panel.

  7. Select Include / Exclude Resources to open the input fields.

  8. To add a resource to a list, type the full path to include or exclude in the top field and select + to add it to the list.

  9. To remove a resource from a list, find the resource in the list and select the x icon on the resource row.

Note

When creating exclusion lists, excludes take precedence over includes.
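
The precedence rule in the note can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names, the case-sensitive comparison, and the treatment of an empty include list as "everything is in scope" are assumptions made for illustration only.

```python
from typing import Iterable


def _is_under(path: str, root: str, sep: str = "\\") -> bool:
    """Return True if path equals root or lies below it in the hierarchy."""
    root = root.rstrip(sep)
    return path == root or path.startswith(root + sep)


def in_crawl_scope(path: str, includes: Iterable[str], excludes: Iterable[str]) -> bool:
    """Decide whether a path is crawled, with excludes winning over includes."""
    # Excludes are checked first, so an excluded path is never crawled,
    # even if it also falls under an included path.
    if any(_is_under(path, e) for e in excludes):
        return False
    includes = list(includes)
    # Assumption: an empty include list means everything is in scope.
    return not includes or any(_is_under(path, i) for i in includes)


includes = [r"\\srv\share"]
excludes = [r"\\srv\share\private"]
print(in_crawl_scope(r"\\srv\share\docs", includes, excludes))
print(in_crawl_scope(r"\\srv\share\private\x", includes, excludes))
```

In this sketch, \\srv\share\private\x is out of scope even though it sits under the included \\srv\share, because the exclude entry is evaluated first.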

Excluding Paths by Regex

To set filters of paths to exclude in the crawl process for an application using regex, open the edit screen of the required application.

  1. Go to Admin > Applications.

  2. Scroll through the list or use the filter to find the application.

  3. Select the Edit icon on the line of the application.

  4. Press Next until you reach the Crawler & Permissions Collection settings page.

    The actual entry fields vary according to the application type.

  5. Select Exclude Paths by Regex to open the configuration panel.

  6. Type the regex that matches the paths to exclude. Because the system does not collect business resources (BRs) that match this regex, it also does not analyze them for permissions.

Notes

  • To match a literal backslash (\) or dollar sign ($), add a backslash before it as an escape character.
  • To combine several alternatives in a single expression, separate them with a pipe character (|).
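
As a rough illustration of these two notes, the following Python sketch escapes literal paths and joins them with a pipe. The helper name build_exclusion_regex is hypothetical, and Python's re.escape is used here as a stand-in; the regex engine used by the product may treat a few characters differently, so check any generated expression before using it.

```python
import re


def build_exclusion_regex(paths):
    """Join literal paths into one alternation, escaping regex metacharacters."""
    # re.escape puts a backslash before characters such as \ and $,
    # which is the escaping rule described in the first note.
    escaped = [re.escape(p) for p in paths]
    # The pipe separates the alternatives; the trailing $ anchors the
    # match at the end of the path.
    return "(" + "|".join(escaped) + ")$"


pattern = build_exclusion_regex([r"\\server_name\shareName", r"\\server_name\C$"])
print(pattern)
print(bool(re.search(pattern, r"\\server_name\C$")))
```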

Crawler Regex Exclusion Examples - General

The following are examples of crawler Regex exclusions.

Exclude all shares which start with one or more share names

  • Exclude all shares starting with a specific name
    Example: \\server_name\shareName
    Regex: \\\\server_name\\shareName$

  • Exclude all shares starting with multiple names
    Example: \\server_name\shareName or \\server_name\OtherShareName
    Regex: \\\\server_name\\(shareName|OtherShareName)$

Include ONLY shares which start with one or more share names

  • Include ONLY shares starting with a specific name
    Example: \\server_name\shareName
    Regex: ^(?!\\\\server_name\\shareName($|\\.*)).*

  • Include ONLY shares starting with multiple names
    Example: \\server_name\shareName or \\server_name\OtherShareName
    Regex: ^(?!\\\\server_name\\(shareName|OtherShareName)($|\\.*)).*
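
The "include only" expressions work by negation: the negative lookahead (?! ... ) makes the exclusion regex match every path except the ones to keep. The Python sketch below checks the single-share expression against a few sample paths. It assumes the crawler applies the expression with search semantics similar to Python's re.search, which is not stated here; the sample paths and the is_excluded helper are illustrative only.

```python
import re

# Expression copied from the "specific name" example above: it matches
# (and therefore excludes) everything except \\server_name\shareName
# and the resources below it.
INCLUDE_ONLY = r"^(?!\\\\server_name\\shareName($|\\.*)).*"


def is_excluded(path: str, pattern: str = INCLUDE_ONLY) -> bool:
    """Return True if the exclusion expression matches the path."""
    return re.search(pattern, path) is not None


for sample in (
    r"\\server_name\shareName",
    r"\\server_name\shareName\docs",
    r"\\server_name\otherShare",
    r"\\server_name\shareNameX",
):
    print(sample, "excluded" if is_excluded(sample) else "kept")
```

The ($|\\.*) part is what keeps \\server_name\shareNameX from being treated as part of the share: after the share name, the path must either end or continue with a backslash.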

Narrow down the selection

  • Include ONLY the C$ drive shares
    Example: \\server_name\C$
    Regex: ^(?!\\\\server_name\\C\$($|\\.*)).*

  • Include ONLY one folder under a share
    Example: \\server\share\folderA
    Regex: ^(?!\\\\server_name\\share\$($|\\folderA$|\\folderA\\.*)).*

  • Include ONLY all administrative shares
    Example: -
    Regex: ^(?!\\\\server_name\\[a-zA-Z]\$($|)).*

Crawler Regex Exclusion Examples - Linux

  • Exclude a path
    Example: The path /root
    Regex: ^\/root($|\/.*)

  • Exclude multiple paths
    Example: The paths /root and /media
    Regex: ^(\/root|\/media)($|\/.*)

  • Include only a path
    Example: The path /home (parent directories like / must also be added)
    Regex: ^(?!(\/|\/home)($|\/.*)).*

  • Include multiple paths
    Example: The paths /home and /boot (parent directories like / must also be added)
    Regex: ^(?!(\/|\/home|\/boot)($|\/.*)).*
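
The requirement to add parent directories can be automated. The following Python sketch builds an "include only" expression from a list of absolute Linux paths and adds every parent directory for you. The helper name include_only_regex is hypothetical, and the generated expression is shaped slightly differently from the examples above: parent directories are matched exactly (with $) so that their other children are still excluded when the included path is nested more than one level deep. Treat it as a starting point under those assumptions, not as the product's own logic.

```python
import re
from pathlib import PurePosixPath


def include_only_regex(paths):
    """Build an exclusion regex that keeps only the given paths,
    their descendants, and their parent directories."""
    targets, parents = set(), set()
    for p in paths:
        path = PurePosixPath(p)
        if not path.is_absolute() or str(path) == "/":
            raise ValueError(f"expected an absolute path below /: {p!r}")
        targets.add(str(path))
        parents.update(str(parent) for parent in path.parents)
    parents -= targets
    # Parents are kept only as exact matches; targets are kept with
    # everything below them.
    exact = "|".join(re.escape(x) for x in sorted(parents))
    subtree = "|".join(re.escape(x) for x in sorted(targets))
    return rf"^(?!(({exact})$|({subtree})($|/.*))).*"


pattern = include_only_regex(["/home", "/boot"])
for sample in ("/", "/home/user", "/boot/grub", "/root", "/homework"):
    print(sample, "excluded" if re.search(pattern, sample) else "kept")
```

Python's re.escape leaves the forward slash unescaped, so the output contains / where the examples above write \/; both forms match a literal slash.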

Crawler Regex Exclusion Examples - Google Drive

Exclude all drives that start with one or more user names:

  • Exclude all drives starting with a specific user name
    Example: Starting with John.Doe
    Regex: ^Users\\John\.Doe@.*

  • Exclude all drives starting with multiple user names
    Example: Starting with John.Doe or Jane.Doe
    Regex: ^Users\\(John|Jane)\.Doe@.*

Include ONLY drives that start with one or more user names:

  • Include ONLY drives starting with a specific user name
    Example: Starting with John.Doe
    Regex: ^(?!Users\\John\.Doe@.*).*

  • Include ONLY drives starting with multiple user names
    Example: Starting with John.Doe or Jane.Doe
    Regex: ^(?!Users\\(John|Jane)\.Doe@.*).*
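
The dot in a user name must be escaped, otherwise it matches any character. The Python sketch below generates an exclusion expression from a list of user names and shows the difference. The helper name exclude_drives_regex and the sample addresses are hypothetical, and the path layout (Users, a backslash, then the user's address) is inferred from the expressions above rather than documented here.

```python
import re


def exclude_drives_regex(user_names):
    """Build an exclusion regex for drives that start with the given user names."""
    # re.escape turns "John.Doe" into "John\.Doe" so the dot is literal.
    alternatives = "|".join(re.escape(name) for name in user_names)
    return rf"^Users\\({alternatives})@.*"


pattern = exclude_drives_regex(["John.Doe", "Jane.Doe"])
print(pattern)
print(bool(re.search(pattern, r"Users\John.Doe@example.com")))
print(bool(re.search(pattern, r"Users\JohnXDoe@example.com")))
```

With an unescaped dot, the second sample path (JohnXDoe) would also match and be excluded.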

The AWS Path Structure in File Access Manager

File Access Manager uses a path name in the following structure:

  • Path Structure: Root/[OU]/[Account]/[Bucket Path]/[Folder]/[Filename]
  • Component structure: Root/[OU]/[OU2]/[Account name](#[Account ID])/s3.[region].[bucket name]/[folder]/[file name]
  • Example: Root/Example-OU/Example-Account(#420269343516)/s3.north-east-17.HR3InputDataBucket/Prospects/CVs/SueSmithPM.Docx

Root

All paths start with Root/

OU

The organizational unit. This could be empty, or include a string of one or more OUs, according to the BR hierarchical structure.

Account

Since account names are not unique under an organization, this string includes both the account name and the account ID:

[Account name](#[Account ID])

Bucket Path

The bucket section of the path starts with "s3." and includes the region:

s3.[region].[bucket]
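
To make the structure concrete, here is a minimal Python sketch that splits such a path into its components. It is based only on the structure described above; the class and function names are hypothetical, and it assumes that OU and account names contain no forward slash and that the first segment shaped like name(#id) is the account.

```python
import re
from dataclasses import dataclass
from typing import List

# An account segment looks like: Account name(#Account ID)
ACCOUNT_RE = re.compile(r"^(?P<name>.+)\(#(?P<id>\d+)\)$")


@dataclass
class AwsResourcePath:
    ous: List[str]
    account_name: str
    account_id: str
    region: str
    bucket: str
    key_parts: List[str]


def parse_aws_path(path: str) -> AwsResourcePath:
    parts = path.split("/")
    if parts[0] != "Root":
        raise ValueError("path must start with Root/")
    # Everything between Root and the account segment is the OU chain.
    for i, part in enumerate(parts[1:], start=1):
        m = ACCOUNT_RE.match(part)
        if m:
            break
    else:
        raise ValueError("no account segment found")
    bucket_segment = parts[i + 1] if i + 1 < len(parts) else ""
    if not bucket_segment.startswith("s3."):
        raise ValueError("bucket segment must start with 's3.'")
    # Split on the first two dots only, since bucket names may contain dots.
    _, region, bucket = bucket_segment.split(".", 2)
    return AwsResourcePath(parts[1:i], m["name"], m["id"], region, bucket, parts[i + 2:])


example = ("Root/Example-OU/Example-Account(#420269343516)/"
           "s3.north-east-17.HR3InputDataBucket/Prospects/CVs/SueSmithPM.Docx")
print(parse_aws_path(example))
```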

Crawler Regex Exclusion Examples - AWS S3 Buckets

Exclude all Folders Which Start With One or More Folder Names:

  • Starting with bucket_name/folderName
    Regex: bucket_name/folderName$

  • Starting with bucket_name/folderName or bucket_name/OtherFolderName
    Regex: bucket_name/(folderName|OtherFolderName)$

Include ONLY Folders Which Start With One or More Folder Names:

  • Starting with bucket_name/shareName
    Regex: ^(?!bucket_name/shareName($|/.*)).*

  • Starting with bucket_name/folderName or bucket_name/OtherFolderName
    Regex: ^(?!bucket_name/(folderName|OtherFolderName)($|/.*)).*

Excluding Top Level Resources

Use the top level exclusion screen to select top level roots to exclude from the crawl. This setting is done per application.

To exclude top level resources from the crawl process:

  1. Go to Admin > Applications.

  2. Find the application to configure and select the drop-down menu on the application line. Select Exclude Top Level Resources to open the configuration panel.

  3. Select Run Task to trigger a short detection scan of the current top level resources. If the top level resource list has changed in the application while you are on this screen, select Run Task again to retrieve the updated structure.

    Once triggered, you can see the task status in Settings > Task Management > Tasks.

    Note

    You can view the task status only if your user has access to the task page.

    When the task has completed, press Refresh to update the page with the list of top level resources.

  4. Select the top level resource list, and select top level resources to exclude.

  5. Select Save to save the change.

  6. To refresh the list of top level resources, run the task again. Running the task will not clear the list of top level resources to exclude.

Special Consideration for Long File Paths in Crawl

If your environment contains file paths longer than 4,000 characters and they cause problems in the crawl, set the flag excludeVeryLongResourcePaths in the Permission Collection Engine App.config file to true.

By default, this key is commented out and its value is false.

When enabled, this key excludes paths longer than 4,000 characters from the applications' resource discovery (crawl), which avoids issues while storing them in the SQL Server database.

When enabled, business resources with full paths longer than 4,000 characters, and everything included in the hierarchical structure below them, will be excluded from the crawl, and will not be collected by File Access Manager. This scenario is extremely rare.
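
To estimate how many resources this setting would affect, you could check a list of paths exported from the file system. The Python sketch below is a hypothetical helper, not part of File Access Manager. It relies on one observation: any resource below an over-long path has an even longer path, so a plain length check also catches everything in the hierarchy beneath it.

```python
MAX_PATH_CHARS = 4000  # limit described in the text


def split_by_length(paths, limit=MAX_PATH_CHARS):
    """Separate paths the crawl would keep from paths it would skip."""
    kept = [p for p in paths if len(p) <= limit]
    # Descendants of an over-long path are longer still, so they land
    # in the skipped list without any extra hierarchy check.
    skipped = [p for p in paths if len(p) > limit]
    return kept, skipped


short = "\\\\srv\\share\\doc.txt"
long_dir = "\\\\srv\\share\\" + "a" * 4000
child = long_dir + "\\file.txt"
kept, skipped = split_by_length([short, long_dir, child])
print(len(kept), len(skipped))
```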

Note

Do not enable the exclusion of long paths unless you experience the issue described below.

Background

File Access Manager uses a hashing mechanism to create a unique identifier for each business resource stored in the File Access Manager database. The hashing mechanism in SQL Server versions 2014 and earlier is unable to process (hash) values with 4,000 or more characters.

Though resources with paths of 4,000 characters or longer are extremely rare, File Access Manager is designed to handle that limitation.

Identifying the Problem

When you use SQL Server 2014 or earlier and the crawl encounters such a path, the following error message appears in the Permission Collection Engine log file:

System.Data.SqlClient.SqlException (0x80131904): String or binary data would be truncated.

In all other cases, this feature should not be enabled.

Setting the Long Resource Path Key

The Permission Collection Engine App.config file is RoleAnalyticsServiceHost.exe.config, and can be found in the folder: %SailPoint_Home%\FileAccessManager\[Permission Collection instance]\

Search for the key excludeVeryLongResourcePaths, uncomment it, and set its value to true.