Configuring and Scheduling the Crawler
To set or edit the Crawler configuration and scheduling, open the edit screen of the required application.
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the line of the application.
- Press Next until you reach the Crawler & Permissions Collection settings page.
  The actual entry fields vary according to the application type.
Create a Schedule
Select Create a Schedule to open the schedule panel. See Scheduling a Task.
Setting Crawl Scope
There are several options to set the crawl scope:
- Setting an explicit list of resources to include in or exclude from the scan.
- Creating a regex to define resources to exclude.
Including and Excluding Paths by List
To set the paths to include or exclude in the crawl process for an application, open the edit screen of the required application.
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the line of the application.
- Press Next until you reach the Crawler & Permissions Collection settings page.
  The actual entry fields vary according to the application type.
- Scroll down to the Crawl configuration settings.
- Select Advanced Crawl Scope Configuration to open the scope configuration panel.
- Select Include / Exclude Resources to open the input fields.
- To add a resource to a list, type the full path to include or exclude in the top field and select + to add it to the list.
- To remove a resource from a list, find the resource in the list and select the x icon on the resource row.
Note
When creating exclusion lists, excludes take precedence over includes.
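The precedence rule in the note can be pictured with a small Python sketch. This is an illustration only, not the product's implementation: the function name `in_crawl_scope`, the prefix-style matching, and the treatment of an empty include list as "everything is in scope" are all assumptions made for the example.

```python
def in_crawl_scope(path, includes, excludes):
    """Return True if `path` would be crawled under the assumed rules."""
    def covers(entry, candidate):
        # Assumption: an entry covers itself and everything beneath it.
        entry = entry.rstrip("\\")
        return candidate == entry or candidate.startswith(entry + "\\")

    # Excludes are evaluated first, so they win over any include.
    if any(covers(e, path) for e in excludes):
        return False
    # Assumption: an empty include list means nothing is filtered out.
    if not includes:
        return True
    return any(covers(i, path) for i in includes)


# Hypothetical lists for illustration.
includes = [r"\\server_name\share"]
excludes = [r"\\server_name\share\private"]

for p in (r"\\server_name\share\docs", r"\\server_name\share\private\x"):
    print(p, in_crawl_scope(p, includes, excludes))
```

Here the second path sits under an included share but is still dropped, because the exclude entry is checked before the include entry.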
Excluding Paths by Regex
To set filters of paths to exclude in the crawl process for an application using regex, open the edit screen of the required application.
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the line of the application.
- Press Next until you reach the Crawler & Permissions Collection settings page.
  The actual entry fields vary according to the application type.
- Select Exclude Paths by Regex to open the configuration panel.
- Type in the regex for the paths to exclude. Since the system does not collect business resources (BRs) that match this regex, it also does not analyze them for permissions.
Notes
- To write a backslash (\) or a dollar sign ($), add a backslash before it as an escape character.
- To add a condition in a single command, use a pipe character (|).
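As a rough illustration of the two notes above, the following Python sketch builds an escaped pattern from literal names. Python's `re.escape` is used here only as a convenient stand-in for escaping by hand; the regex engine used by the product is not stated in this section, and the server and share names are placeholders.

```python
import re

# A literal UNC path containing backslashes and a dollar sign.
literal = r"\\server_name\C$"

# re.escape puts a backslash before each special character.
escaped = re.escape(literal)
print(escaped)

# Several alternatives are combined into one pattern with the pipe character.
shares = ["shareName", "OtherShareName"]
alternation = "(" + "|".join(re.escape(s) for s in shares) + ")"
pattern = re.escape("\\\\server_name\\") + alternation + "$"
print(pattern)
```

Each backslash in the literal path becomes two in the pattern, and the dollar sign in `C$` becomes `\$`, while the unescaped `$` at the end of the second pattern keeps its meaning as an end-of-string anchor.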
Crawler Regex Exclusion Examples - General
The following are examples of crawler Regex exclusions.
Exclude all shares which start with one or more share names
Action | Example | Regex |
---|---|---|
Exclude all shares starting with a specific name | \\server_name\shareName | \\\\server_name\\shareName$ |
Exclude all shares starting with multiple names | \\server_name\shareName or \\server_name\OtherShareName | \\\\server_name\\(shareName|OtherShareName)$ |
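Before saving an exclusion pattern, it can help to try it against a few sample paths. The sketch below does this in Python, assuming that a path is excluded when the pattern is found anywhere in it (search semantics); that matching mode, the helper name `is_excluded`, and the sample paths are assumptions for illustration.

```python
import re

# Pattern copied from the table above.
EXCLUDE = r"\\\\server_name\\(shareName|OtherShareName)$"


def is_excluded(path, pattern=EXCLUDE):
    # Assumption: excluded when the pattern is found anywhere in the path.
    return re.search(pattern, path) is not None


samples = [
    r"\\server_name\shareName",
    r"\\server_name\OtherShareName",
    r"\\server_name\shareNameX",
    r"\\server_name\docs",
]
for path in samples:
    print(path, "excluded" if is_excluded(path) else "kept")
```

Because the pattern ends with `$`, it matches only the share path itself. How the crawler treats resources below an excluded share is outside the scope of this sketch.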
Include ONLY shares which start with one or more share names
Action | Example | Regex |
---|---|---|
Include ONLY shares starting with a specific name | \\server_name\shareName | ^(?!\\\\server_name\\shareName($|\\.*)).* |
Include ONLY shares starting with multiple names | \\server_name\shareName or \\server_name\OtherShareName | ^(?!\\\\server_name\\(shareName|OtherShareName)($|\\.*)).* |
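The "include only" patterns work by excluding every path that does not begin with one of the listed shares, using a negative lookahead. The following Python sketch generates patterns of the same shape from a list of share names. The helper `include_only_pattern` is hypothetical, and search-style matching is assumed.

```python
import re


def include_only_pattern(server, shares):
    """Build an exclusion regex that leaves only the given shares in scope."""
    names = "|".join(re.escape(s) for s in shares)
    # A single name needs no grouping; several names are wrapped in ( ).
    group = names if len(shares) == 1 else "(" + names + ")"
    prefix = re.escape("\\\\" + server + "\\")
    # Negative lookahead: match (exclude) anything NOT starting with the shares.
    return "^(?!" + prefix + group + r"($|\\.*)).*"


pattern = include_only_pattern("server_name", ["shareName", "OtherShareName"])
print(pattern)

for path in (r"\\server_name\shareName\sub", r"\\server_name\docs"):
    excluded = re.search(pattern, path) is not None
    print(path, "excluded" if excluded else "kept")
```

With one or two names, the generated strings are identical to the two patterns in the table above.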
Narrow down the selection
Action | Example | Regex |
---|---|---|
Include ONLY the C$ drive shares | \\server_name\C$ | ^(?!\\\\server_name\\C\$($|\\.*)).* |
Include ONLY one folder under a share | \\server\share\folderA | ^(?!\\\\server_name\\share\$($|\\folderA$|\\folderA\\.*)).* |
Include ONLY all administrative shares | - | ^(?!\\\\server_name\\[a-zA-Z]\$($|)).* |
Crawler Regex Exclusion Examples - Linux
Action | Example | Regex |
---|---|---|
Exclude a path | The path /root | ^\/root($|\\.*) |
Exclude multiple paths | The paths /root and /media | ^(\/root|\/media)($|\\.*) |
Include only a path | The path /home (parent directories like / must also be added) | ^(?!(\/|\/home)($|\/.*)).* |
Include multiple paths | The paths /home and /boot (parent directories like / must also be added) | ^(?!(\/|\/home|\/boot)($|\/.*)).* |
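The remark that parent directories must also be added can be checked with a short Python sketch. It compares the "include only /home" pattern from the table with a hypothetical variant that omits the root directory. Search-style matching and the helper name `is_excluded` are assumptions.

```python
import re

# Pattern from the table: keeps / and /home, excludes everything else.
WITH_PARENT = r"^(?!(\/|\/home)($|\/.*)).*"
# Hypothetical variant that forgets to list the parent directory /.
WITHOUT_PARENT = r"^(?!(\/home)($|\/.*)).*"


def is_excluded(pattern, path):
    # Assumption: a path is excluded when the pattern matches it.
    return re.search(pattern, path) is not None


for path in ("/", "/home", "/home/alice", "/root", "/homework"):
    print(
        path,
        "with parent:", is_excluded(WITH_PARENT, path),
        "without parent:", is_excluded(WITHOUT_PARENT, path),
    )
```

With the variant pattern, the root directory / itself matches the exclusion, which is presumably why the table asks for parent directories to be listed alongside the target path.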
Crawler Regex Exclusion Examples - Google Drive
Exclude all drives that start with one or more user names:
Action | Example | Regex |
---|---|---|
Exclude all drives starting with a specific user name | Starting with John.Doe | ^Users\\John\.Doe@.* |
Exclude all drives starting with multiple user names | Starting with John.Doe or Jane.Doe | ^Users\\(John|Jane)\.Doe@.* |
Include ONLY drives that start with one or more user names:
Action | Example | Regex |
---|---|---|
Include ONLY drives starting with a specific user name | Starting with John.Doe | ^(?!Users\\John\.Doe@.*).* |
Include ONLY drives starting with multiple user names | Starting with John.Doe or Jane.Doe | ^(?!Users\\(John|Jane)\.Doe@.*).* |
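The sketch below runs the two multi-user patterns from the tables above against a few made-up drive paths. The path form `Users\<user>@<domain>` is inferred from the patterns, the `example.com` addresses are invented for illustration, and search-style matching is assumed.

```python
import re

# Patterns copied from the tables above.
EXCLUDE_USERS = r"^Users\\(John|Jane)\.Doe@.*"
INCLUDE_ONLY_USERS = r"^(?!Users\\(John|Jane)\.Doe@.*).*"


def matches(pattern, path):
    return re.search(pattern, path) is not None


# Hypothetical drive paths.
drives = [
    r"Users\John.Doe@example.com",
    r"Users\Jane.Doe@example.com\My Drive",
    r"Users\JohnXDoe@example.com",
    r"Users\Jim.Doe@example.com",
]
for drive in drives:
    print(
        drive,
        "| exclude rule matches:", matches(EXCLUDE_USERS, drive),
        "| include-only rule matches:", matches(INCLUDE_ONLY_USERS, drive),
    )
```

The third path shows why the dot is written as `\.`: an unescaped dot would match any character, so `JohnXDoe` would be caught by the exclude rule as well.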
The AWS Path Structure in File Access Manager
File Access Manager uses a path name in the following structure:
- Path structure: Root/[OU]/[Account]/[Bucket Path]/[Folder]/[Filename]
- Component structure: Root/[OU]/[OU2]/[Account name](#[Account ID])/s3.[region].[bucket name]/[folder]/[file name]
- Example: Root/Example-OU/Example-Account(#420269343516)/s3.north-east-17.HR3InputDataBucket/Prospects/CVs/SueSmithPM.Docx
Root
All paths start with Root/
OU
The organizational unit. This could be empty, or include a string of one or more OUs, according to the BR hierarchical structure.
Account
Since account names are not unique under an organization, this string includes both the account name and the account ID:
[Account name](#[Account ID])
Bucket Path
The bucket section of the path starts with "s3." and includes the region:
s3.[region].[bucket]
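To make the component structure concrete, here is a minimal Python sketch that assembles a path from its parts. The function `build_fam_s3_path` and its parameter names are hypothetical; only the resulting layout is taken from the description above.

```python
def build_fam_s3_path(ous, account_name, account_id, region, bucket, key=""):
    """Assemble a path in the Root/[OU]/[Account]/[Bucket Path]/... layout."""
    # Account names are not unique, so the account ID is appended.
    account = f"{account_name}(#{account_id})"
    # The bucket section starts with "s3." and includes the region.
    bucket_part = f"s3.{region}.{bucket}"
    # The OU list may be empty or hold one or more nested OUs.
    parts = ["Root", *ous, account, bucket_part]
    if key:
        parts.extend(key.strip("/").split("/"))
    return "/".join(parts)


path = build_fam_s3_path(
    ous=["Example-OU"],
    account_name="Example-Account",
    account_id="420269343516",
    region="north-east-17",
    bucket="HR3InputDataBucket",
    key="Prospects/CVs/SueSmithPM.Docx",
)
print(path)
```

Called with the values from the example above, the function reproduces that example path.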
Crawler Regex Exclusion Examples - AWS S3 Buckets
Exclude all Folders Which Start With One or More Folder Names:
Action | Regex |
---|---|
Starting with bucket_name/folderName | bucket_name/folderName$ |
Starting with bucket_name/folderName or bucket_name/OtherFolderName | bucket_name/(folderName|OtherFolderName)$ |
Include ONLY Folders Which Start With One or More Folder Names:
Action | Regex |
---|---|
Starting with bucket_name/shareName | ^(?!bucket_name/shareName($|/.*)).* |
Starting with bucket_name/folderName or bucket_name/OtherFolderName | ^(?!bucket_name/(folderName|OtherFolderName)($|/.*)).* |
Excluding Top Level Resources
Use the top level exclusion screen to select top level roots to exclude from the crawl. This setting is done per application.
To exclude top level resources from the crawl process:
- Go to Admin > Applications.
- Find the application to configure and select the drop-down menu on the application line. Select Exclude Top Level Resources to open the configuration panel.
- The Run Task button triggers a task that runs a short detection scan to detect the current top level resources. If the top-level resource list has changed in the application while you are on this screen, press this button to retrieve the updated structure.
  Once triggered, you can see the task status in Settings > Task Management > Tasks.
  Note
  This will only work if the user has access to the task page.
  When the task has completed, press Refresh to update the page with the list of top level resources.
- Select the top level resource list, and select the top level resources to exclude.
- Select Save to save the change.
- To refresh the list of top level resources, run the task again. Running the task will not clear the list of top level resources to exclude.
Special Consideration for Long File Paths in Crawl
If you need to support long file paths above 4,000 characters for the crawl, set the flag excludeVeryLongResourcePaths in the Permission Collection Engine App.config file to true.
By default, this value will be commented out and set to false.
When enabled, this key ensures that paths longer than 4,000 characters are excluded from the applications’ resource discovery (Crawl), to avoid issues when storing them in the SQL Server database.
When enabled, business resources with full paths longer than 4,000 characters, and everything included in the hierarchical structure below them, will be excluded from the crawl, and will not be collected by File Access Manager. This scenario is extremely rare.
Note
You should not enable exclusion of long paths, unless you experience an issue.
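To gauge whether an environment contains such paths at all, a list of resource paths can be checked against the limit. The Python sketch below is an illustration only: the helper `split_by_length` is hypothetical, and it uses "longer than 4,000 characters" as the threshold, as stated for the flag above.

```python
MAX_PATH_CHARS = 4000


def split_by_length(paths, limit=MAX_PATH_CHARS):
    """Separate paths into those within the limit and those exceeding it."""
    kept, too_long = [], []
    for path in paths:
        (too_long if len(path) > limit else kept).append(path)
    return kept, too_long


# Hypothetical sample: one normal path, one very long folder, and its child.
normal = r"\\server_name\share\docs"
long_folder = "\\\\server_name\\share\\" + "a" * 4000
child = long_folder + "\\file.txt"

kept, too_long = split_by_length([normal, long_folder, child])
print(len(kept), "kept;", len(too_long), "over the limit")
```

A child's full path always contains its parent's path, so anything below an over-long folder is over the limit as well. This matches the description that the whole hierarchy below such a resource is left out of the crawl.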
Background
File Access Manager uses a hashing mechanism to create a unique identifier for each business resource stored in the File Access Manager database. The hashing mechanism in SQL Server versions 2014 and earlier is unable to process (hash) values with 4,000 or more characters.
Though resources with paths of 4,000 characters or longer are extremely rare, File Access Manager is designed to handle that limitation.
Identifying the Problem
When using a SQL Server database of version 2014 or earlier, the following error message will appear in the Permission Collection Engine log file:
System.Data.SqlClient.SqlException (0x80131904): String or binary data would be truncated.
In all other cases, this feature should not be enabled.
Setting the Long Resource Path Key
The Permission Collection Engine App.config file is RoleAnalyticsServiceHost.exe.config, and can be found in the folder: %SailPoint_Home%\FileAccessManager\[Permission Collection instance]\
Search for the key excludeVeryLongResourcePaths and correct it as described above.