Windows Remote File System Crawler

This document describes the Windows Remote File System Crawler for Watson Explorer Foundational Components. Watson Explorer provides the SMB Fileshares Connector to crawl files via the SMB protocol. However, the SMB Fileshares Connector supports only SMB v1, so it cannot crawl files that are shared via SMB v2 or SMB v3. The Windows Remote File System Crawler was introduced to crawl files shared via any version of SMB.

Before you begin

Applies to Watson Explorer 11.0.2.1 and higher. Microsoft Windows only.

To retrieve file ACLs correctly, the Windows system where the Watson Explorer Foundational Components engine runs must be joined to the Windows domain. This is the same requirement as for the existing SMB Fileshares Connector.

Note: The Windows Remote File System Crawler internally calls the Windows API to connect to the remote file system, then crawls the files as if they were local. This is very similar to performing a "Map network drive" operation and then crawling the files on that drive. Consequently, the SMB version and other SMB settings depend on the Windows configuration; for example, you can specify the SMB version in the Windows settings.
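To illustrate the "Map network drive" analogy, the same access the crawler performs can be exercised manually with a drive mapping. A sketch of the equivalent configuration commands; the server, share, drive letter, and account names below are placeholders, not values from this document:

```shell
rem Map the share as drive Z: with domain credentials (placeholders).
net use Z: \\fileserver\share /user:DOMAIN\user1

rem The SMB dialect that was actually negotiated can be checked in
rem PowerShell with Get-SmbConnection (see the Dialect column).

rem Remove the mapping when done.
net use Z: /delete
```

Because the crawler relies on the same Windows mechanism, whatever SMB dialect Windows negotiates for such a mapping is what the crawler uses.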

About this task

In Watson Explorer 11.0.2.3 and higher, the Windows Remote File System Crawler is listed in the Add a new seed dialog. By default, filesystem metadata is not crawled. Set Crawl Filesystem Metadata to true to make the crawler retrieve the available filesystem metadata (creation date, last modified date, file attributes, and so on) for each file. The filesystem metadata must then be added to the virtual document by the Windows Filesystem Metadata converter.
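The filesystem metadata in question corresponds to what the operating system reports for each file. A minimal Python sketch of gathering comparable fields; the dictionary keys here are illustrative only, not the actual field names produced by the Windows Filesystem Metadata converter:

```python
import os
import tempfile
from datetime import datetime, timezone

def filesystem_metadata(path: str) -> dict:
    """Collect per-file metadata of the kind the crawler retrieves when
    Crawl Filesystem Metadata is true. Key names are illustrative."""
    st = os.stat(path)
    return {
        "size": st.st_size,
        # st_ctime is creation time on Windows (metadata-change time on Unix).
        "created": datetime.fromtimestamp(st.st_ctime, tz=timezone.utc),
        "last-modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "read-only": not os.access(path, os.W_OK),
    }

# Demonstrate on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"sample")
meta = filesystem_metadata(f.name)
print(sorted(meta))  # → ['created', 'last-modified', 'read-only', 'size']
```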

The Windows Remote File System Crawler is not enabled by default in Watson Explorer 11.0.2.1 and 11.0.2.2. To enable it, follow the steps below.

Procedure

  1. Access the Watson Explorer Engine administration tool from a web browser.
  2. Open the XML for the vse-crawler-seed-files function.
  3. Duplicate the XML and click the edit button.
  4. Edit the XML to replace everything from the <prototype> tag onward as follows:
    <prototype>
      <label>Windows Remote Files</label>
      <description>
        Crawl folders and their sub-directories on remote Windows file system.
        <p />
      </description>
      <declare name="files" type="separated-set" type-separator="&#10;" required="required">
      </declare>
      <declare required="required" name="username" type="string">
        <label name="label">Username</label>
        <description name="description">
          The username to use to access the Windows file system.
        </description>
      </declare>
      <declare required="required" name="password" type="password">
        <label name="label">Password</label>
        <description name="description">
          The password to use to access the Windows file system.
        </description>
      </declare>
    </prototype>
    <process-xsl>
      <![CDATA[
      <xsl:param name="files"/>
      <xsl:template match="/">
        <xsl:variable name="names" select="str:tokenize($files, '&#10;')"/>
        <xsl:variable name="fixed-urls">
          <xsl:for-each select="$names">
            <xsl:value-of select="viv:if-else(starts-with(., 'file:'), ., viv:url-normalize(viv:url-build('file', '', '', '', -1, viv:if-else(starts-with(., '/'), ., concat('/', .)), '')))"/>
            <xsl:text>&#10;</xsl:text>
          </xsl:for-each>
        </xsl:variable>
        <call-function name="vse-crawler-seed-urls-common">
          <with name="urls" value="{$fixed-urls}" />
          <with name="how" value="path" />
          <with name="username" copy-of="username" />
          <with name="password" copy-of="password" />
        </call-function>
      </xsl:template>
      ]]>
    </process-xsl>
    </function>
  5. Create a collection.
  6. Click Configuration > Crawling > Seeds. Click Add a seed.
  7. Choose Windows Remote Files.
  8. Fill in the file path, username, and password. The username and password must have privileges to mount the file system. Specify the username together with the domain it belongs to, such as domain\user1 instead of just user1. Specify the file path in UNC form, as \\<File Server>\<Shared_Directory>.
  9. Start crawling.
    Note: If the username is changed after crawling, reboot the server; otherwise, crawling continues with the old username.
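The process-xsl in step 4 tokenizes the newline-separated seed list and normalizes each entry to a file: URL. A sketch of the equivalent logic in Python, assuming that viv:url-build with empty host, user, and port fields yields a plain file:// URL:

```python
def normalize_seed(path: str) -> str:
    """Mirror of the seed normalization in the process-xsl: entries already
    in file: URL form pass through unchanged; bare paths get a leading
    slash (if missing) and are wrapped as file: URLs."""
    if path.startswith("file:"):
        return path
    if not path.startswith("/"):
        path = "/" + path
    return "file://" + path

def normalize_seeds(files: str) -> list[str]:
    # The crawler receives the seed list as one newline-separated string.
    return [normalize_seed(p) for p in files.splitlines() if p]

print(normalize_seeds("/data/share\ndocs"))
# → ['file:///data/share', 'file:///docs']
```

This is why both plain paths and full file: URLs are accepted as seeds: everything is funneled into the same URL form before being handed to vse-crawler-seed-urls-common.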