Configuring and Scheduling the OneDrive Crawler
Permissions can be analyzed to determine the application permissions of an out-of-the-box application, provided you have defined an identity store for File Access Manager to use in its analysis, and you have run a crawl for the application.
Configuring the Permission Collection
The permission collector is a software component responsible for analyzing the permissions in an application. The Central Permission Collector Service is responsible for running the Permission Collector and Crawler tasks.
Note
If using a proxy in your File Access Manager environment, see How to Use Proxy in a File Access Manager Environment in the Azure File Guide.
To configure the permission collector:
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the application row.
- Select Next until you reach the Permissions Collector settings page.
Note
The entry fields vary by application type.
- Select Central Permissions Collection to create permissions collection services as part of the service installation process.
- Select Skip Identities Sync during Permissions Collection to skip identity synchronization before the permission collection tasks run. Use this when the same identity collector is shared by several connectors. This option is enabled by default.
You can now schedule a task.
Scheduling a Task
To create a schedule:
- Select Create a Schedule.
- The system suggests a Schedule Name in the format {appName} - {type} Scheduler. Keep or override this suggestion.
- Select a scheduling frequency from the dropdown list.
Schedule Frequency Options
- Run After - Create a dependency between tasks. The task starts only after the task it depends on completes successfully.
- Hourly - Set the start time.
- Daily - Set the start date and time.
- Weekly - Set the day(s) of the week on which to run.
- Monthly - Set the day of the month on which to run a task.
- Quarterly - Set a monthly schedule with an interval of 3 months.
- Half Yearly - Set a monthly schedule with an interval of 6 months.
- Yearly - Set a monthly schedule with an interval of 12 months.
- Fill in the Date and Time fields with the scheduling times. These fields differ depending on the scheduling frequency selected.
- Select the Active checkbox to activate the schedule.
- Select Next.
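The naming format and the month-based frequencies above can be sketched in a few lines of Python. This is only an illustration, not the product's scheduler: the function names, the "Crawler" type value, and the choice to count the first run from the start date are assumptions.

```python
from datetime import date

# Month intervals for the frequencies described above as monthly
# schedules with a fixed interval (Monthly is assumed to be every month).
MONTH_INTERVALS = {"Monthly": 1, "Quarterly": 3, "Half Yearly": 6, "Yearly": 12}


def default_schedule_name(app_name: str, task_type: str) -> str:
    """Build the suggested name in the '{appName} - {type} Scheduler' format."""
    return f"{app_name} - {task_type} Scheduler"


def next_runs(start: date, frequency: str, count: int) -> list[date]:
    """Return the first `count` run dates for a month-based frequency.

    Simplification: the day of the month is reused as-is, so pick a day
    that exists in every month (28 or lower).
    """
    step = MONTH_INTERVALS[frequency]
    runs = []
    for i in range(count):
        months = start.month - 1 + i * step  # zero-based month offset
        runs.append(date(start.year + months // 12, months % 12 + 1, start.day))
    return runs


# "Crawler" is a hypothetical value for {type}.
print(default_schedule_name("OneDrive", "Crawler"))
print(next_runs(date(2024, 1, 15), "Quarterly", 4))
```

Treating Quarterly, Half Yearly, and Yearly as one monthly rule with a different interval mirrors how the list above describes them.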
Configuring and Scheduling the Crawler
To configure the crawler:
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the application row.
- Select Next until you reach the Crawler & Permissions Collector settings page.
Note
The entry fields vary by application type.
Calculate Resource Size - Determine when, or at what frequency, File Access Manager calculates the resources' size. Select one of the following:
- Never
- Always
- Second crawl and on (this is the default)
You can now schedule a task.
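One way to read the three Calculate Resource Size options is as a rule keyed on the crawl number. The sketch below assumes that "Second crawl and on" means sizes are skipped on the first crawl and calculated on every later one; the enum and function names are hypothetical and not part of File Access Manager.

```python
from enum import Enum


class SizeCalculation(Enum):
    NEVER = "Never"
    ALWAYS = "Always"
    SECOND_CRAWL_AND_ON = "Second crawl and on"  # documented default


def should_calculate_size(mode: SizeCalculation, crawl_number: int) -> bool:
    """Decide whether the crawl with this 1-based number calculates sizes."""
    if mode is SizeCalculation.NEVER:
        return False
    if mode is SizeCalculation.ALWAYS:
        return True
    # Assumed reading of "Second crawl and on": skip only the first crawl.
    return crawl_number >= 2


for n in (1, 2, 3):
    print(n, should_calculate_size(SizeCalculation.SECOND_CRAWL_AND_ON, n))
```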
Setting the Crawl Scope
There are several options to set the crawl scope:
- Setting an explicit list of resources to include in or exclude from the scan.
- Creating a regex to define resources to exclude.
Including and Excluding Paths by List
To set the paths to include or exclude in the crawl process for an application:
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the application row.
- Select Next until you reach the Crawler & Permissions Collector settings page.
- Scroll down to the Crawl configuration settings.
- Select Advanced Crawl Scope Configuration to open the scope configuration panel.
- Select Include / Exclude Resources to open the input fields.
- To add a resource to a list, type in the full path to include/exclude in the top field and select + to add it to the list.
- To remove a resource from a list, find the resource in the list and select the x icon on the resource row.
Note
When creating exclusion lists, excludes take precedence over includes.
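The precedence rule can be illustrated with a small Python sketch. It assumes that list entries apply to a resource and everything below it, and that an empty include list means the whole application is in scope; neither assumption is confirmed here, and the site names are hypothetical.

```python
def is_under(path: str, root: str) -> bool:
    """True if `path` is `root` itself or lies below it in the hierarchy."""
    path, root = path.rstrip("/"), root.rstrip("/")
    return path == root or path.startswith(root + "/")


def in_crawl_scope(path: str, includes: list[str], excludes: list[str]) -> bool:
    """Apply the lists with excludes taking precedence over includes."""
    if any(is_under(path, ex) for ex in excludes):
        return False  # an exclude always wins
    # Assumption: an empty include list places everything in scope.
    return not includes or any(is_under(path, inc) for inc in includes)


includes = ["https://www.mysharepoint.com/sites/hr"]
excludes = ["https://www.mysharepoint.com/sites/hr/archive"]

for candidate in (
    "https://www.mysharepoint.com/sites/hr/policies",
    "https://www.mysharepoint.com/sites/hr/archive/2019",
    "https://www.mysharepoint.com/sites/finance",
):
    print(candidate, in_crawl_scope(candidate, includes, excludes))
```

Checking the exclude list first is what makes an excluded subfolder stay out of scope even when its parent is explicitly included.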
Excluding Paths by Regex
- Go to Admin > Applications.
- Scroll through the list or use the filter to find the application.
- Select the Edit icon on the application row.
- Select Next until you reach the Crawler & Permissions Collector settings page.
- Select Exclude Paths by Regex to open the configuration panel.
- Type the regex for the paths to exclude. See the regex examples in the section below. Because the system does not collect business resources (BRs) that match this regex, it also does not analyze them for permissions.
Crawler Regex Examples
The following are examples of crawler Regex exclusions:
Exclude all resources which start with one or more resource names:
- Example: Starting with https://www.mysharepoint.com/resourceName
- Regex: https:\/\/www.mysharepoint.com\/resourceName$
- Example: Starting with https://www.mysharepoint.com/resourceName or https://www.mysharepoint.com/OtherResourceName
- Regex: https:\/\/www.mysharepoint.com\/(resourceName|OtherResourceName)$
- Example: SharePoint resources starting with https://www.mysharepoint.com/sites/mySiteCollection
- Regex: https:\/\/www.mysharepoint.com\/sites\/mySiteCollection$
- Example: SharePoint resources starting with https://www.mysharepoint.com/sites/mySiteCollection or https://www.mysharepoint.com/other_site/Different_Site
- Regex: https:\/\/www.mysharepoint.com\/(sites\/mySiteCollection|other_site\/Different_Site)$
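Before saving an exclusion pattern, it can help to try it against a few sample paths. The Python sketch below does this for the two-resource pattern above. It assumes search-style matching with Python's re module; the regex engine and matching mode that File Access Manager uses are not described here and may differ. The sample paths beyond those in the examples are hypothetical.

```python
import re

# The two-resource exclusion pattern from the examples above.
EXCLUDE = re.compile(
    r"https:\/\/www.mysharepoint.com\/(resourceName|OtherResourceName)$"
)


def is_excluded(path: str) -> bool:
    """True if the exclusion pattern matches anywhere in the path."""
    return EXCLUDE.search(path) is not None


for sample in (
    "https://www.mysharepoint.com/resourceName",
    "https://www.mysharepoint.com/OtherResourceName",
    "https://www.mysharepoint.com/resourceName/documents",
    "https://www.mysharepoint.com/keepMe",
):
    print(sample, is_excluded(sample))
```

Because of the trailing $, the pattern matches only the named resource itself and not a longer path below it. The unescaped dots in the host name match any character, so the pattern is slightly broader than the literal URL.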
Include ONLY resources which start with one or more resource names:
- Example: Starting with https://www.mysharepoint.com/resourceName
- Regex: ^(?!https:\/\/www.mysharepoint.com\/resourceName($|\/.)).
- Example: Starting with https://www.mysharepoint.com/resourceName or https://www.mysharepoint.com/OtherResourceName
- Regex: ^(?!https:\/\/www.mysharepoint.com\/(resourceName|OtherResourceName)($|\/.)).
- Example: SharePoint resources starting with https://www.mysharepoint.com/sites/mySiteCollection
- Regex: ^(?!https:\/\/www.mysharepoint.com\/sites\/mySiteCollection($|\/.)).
- Example: SharePoint resources starting with https://www.mysharepoint.com/sites/mySiteCollection or https://www.mysharepoint.com/other_site/Different_Site
- Regex: ^(?!https:\/\/www.mysharepoint.com\/(sites\/mySiteCollection|other_site\/Different_Site)($|\/.)).
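The include-only patterns work by inversion: the negative lookahead makes the exclusion regex match every path except the ones to keep. The Python sketch below checks the single-resource pattern against a few sample paths, again assuming search-style matching with Python's re module rather than the product's own engine. The sample paths are hypothetical.

```python
import re

# The single-resource "include only" pattern from the examples above.
INCLUDE_ONLY = re.compile(
    r"^(?!https:\/\/www.mysharepoint.com\/resourceName($|\/.))."
)


def excluded_by_include_only(path: str) -> bool:
    """True if the pattern matches, meaning the path would be excluded."""
    return INCLUDE_ONLY.search(path) is not None


for sample in (
    "https://www.mysharepoint.com/resourceName",
    "https://www.mysharepoint.com/resourceName/documents",
    "https://www.mysharepoint.com/resourceNameCopy",
    "https://www.mysharepoint.com/otherResource",
):
    print(sample, excluded_by_include_only(sample))
```

The ($|\/.) group is what keeps a sibling such as resourceNameCopy from being treated as part of the included resource: after the name, the path must either end or continue with a slash and at least one more character.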
Excluding Top-Level Resources
Use the top-level exclusion screen to select top-level roots to exclude from the crawl. This setting is done per application.
To exclude top-level resources from the crawl process:
- Go to Admin > Applications.
- Find the application to configure and select the dropdown list menu on the application line. Select Exclude Top Level Resources to open the configuration panel.
- Select the Run Task button to trigger a task that runs a short detection scan to detect the current top-level resources. If the top-level resource list has changed in the application while you are on this screen, select the Run Task button to retrieve the updated structure.
- Once the task is triggered, you can view its status in Settings > Task Management > Tasks, depending on your access to the task page.
- When the task has completed, select Refresh to update the page with the list of top-level resources.
- Select the top-level resource list and choose top-level resources to exclude.
Note
If all resources are selected and you want to deselect them, select Deselect All. You can also select individual resources.
- Select Save to save the change.
- To refresh the list of top-level resources, run the task again. Running the task will not clear the list of top-level resources to exclude.
Special Consideration for Long File Paths in Crawl
If the crawl must handle file paths longer than 4,000 characters, set the excludeVeryLongResourcePaths flag in the Permission Collection Engine App.config file to true.
By default, this key is commented out and its value is false.
When enabled, this key excludes paths longer than 4,000 characters from the application's resource discovery (crawl), which avoids issues when storing them in the SQL Server database.
When enabled, business resources with full paths longer than 4,000 characters, and everything included in the hierarchical structure below them, will be excluded from the crawl and will not be collected by File Access Manager. This scenario is extremely rare.
Note
You should not enable exclusion of long paths unless you experience an issue.
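The effect of the flag can be pictured as a length filter over full paths. The Python sketch below is an illustration only, not the crawler's implementation; the function and parameter names are hypothetical. It also shows why everything below a long path drops out: a child's full path contains its parent's, so it is at least as long.

```python
MAX_PATH_LENGTH = 4000  # limit named in the text


def filter_paths(paths, exclude_very_long_resource_paths=False):
    """Return the paths a crawl would keep under the given flag value."""
    if not exclude_very_long_resource_paths:
        return list(paths)  # default behavior: nothing is filtered
    # The text says paths "longer than 4,000 characters" are excluded.
    return [p for p in paths if len(p) <= MAX_PATH_LENGTH]


short = "https://www.mysharepoint.com/sites/a"
long_parent = short + "/" + "x" * 4000
child = long_parent + "/file.txt"  # longer than its parent by construction

print(len(filter_paths([short, long_parent, child])))
print(filter_paths([short, long_parent, child], exclude_very_long_resource_paths=True))
```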
Background
File Access Manager uses a hashing mechanism to create a unique identifier for each business resource stored in the File Access Manager database. The hashing mechanism in SQL Server versions 2014 and earlier is unable to process (hash) values with 4,000 or more characters.
Though resources with paths of 4,000 characters or longer are extremely rare, File Access Manager is designed to handle that limitation.
Identifying the Problem
When using SQL Server 2014 or earlier, you may see the following error message in the Permission Collection Engine log file:
System.Data.SqlClient.SqlException (0x80131904): String or binary data would be truncated
In all other cases, this feature should not be enabled.
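A quick way to check for this condition is to scan the log for the message text. The Python sketch below works on any iterable of lines; the function name and the sample log lines are made up for illustration, and the location and layout of the real log file are not assumed.

```python
ERROR_MARKER = "String or binary data would be truncated"


def find_truncation_errors(log_lines):
    """Return (line_number, text) pairs for lines containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if ERROR_MARKER in line
    ]


# Hypothetical log content; only the exception text comes from the document.
sample_log = [
    "INFO  Crawl started\n",
    "ERROR System.Data.SqlClient.SqlException (0x80131904): "
    "String or binary data would be truncated\n",
    "INFO  Crawl finished\n",
]

for number, text in find_truncation_errors(sample_log):
    print(number, text)
```

To scan a real file, pass an open file object instead of the sample list, for example one opened with `open(path, encoding="utf-8", errors="replace")`.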
Setting the Long Resource Path Key
The Permission Collection Engine App.config file is RoleAnalyticsServiceHost.exe.config, and can be found in the folder %SailPoint_Home%\FileAccessManager\[Permission Collection instance]\.
Search for the key excludeVeryLongResourcePaths and set it as described above.
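To confirm the current setting without opening the file by hand, a short Python check can read the key. This sketch assumes the key is stored in the common .NET layout, an add element with key and value attributes under appSettings; the actual structure of RoleAnalyticsServiceHost.exe.config is not shown here and may differ. The sample XML and the someOtherKey entry are hypothetical.

```python
import xml.etree.ElementTree as ET

KEY = "excludeVeryLongResourcePaths"

# Hypothetical config content in the assumed appSettings layout,
# with the key commented out as in the documented default.
SAMPLE_DEFAULT = """<configuration>
  <appSettings>
    <!-- <add key="excludeVeryLongResourcePaths" value="false" /> -->
    <add key="someOtherKey" value="1" />
  </appSettings>
</configuration>"""

SAMPLE_ENABLED = SAMPLE_DEFAULT.replace(
    '<!-- <add key="excludeVeryLongResourcePaths" value="false" /> -->',
    '<add key="excludeVeryLongResourcePaths" value="true" />',
)


def long_path_exclusion_enabled(config_xml: str) -> bool:
    """True only if the key is present, uncommented, and set to true."""
    root = ET.fromstring(config_xml)  # the default parser skips XML comments
    for node in root.iter("add"):
        if node.get("key") == KEY:
            return node.get("value", "").strip().lower() == "true"
    return False  # key absent or commented out


print(long_path_exclusion_enabled(SAMPLE_DEFAULT))
print(long_path_exclusion_enabled(SAMPLE_ENABLED))
```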