Backing Up Watson Explorer Engine

You can always back up your Watson™ Explorer Engine installation by adding its installation directory to the list of standard directories that you back up. The default installation directory is /opt/ibm/WEX/Engine on Linux systems and C:\Program Files\ibm\WEX\Engine on Microsoft Windows systems, though a different directory can be chosen when installing the software.
Warning: Do not back up the Watson Explorer Engine installation directory while any index updates are taking place. This is primarily a scheduling issue: make sure that index updates are scheduled to occur only after backups have completed. Indices that are updated while they are being backed up may not be usable after being restored. Similarly, if your backup application locks files before backing them up, Watson Explorer Engine might terminate abnormally because it cannot write to the files and directories that it requires.

At a minimum, add the files data/users.xml and data/repository.xml to your daily backups. The users.xml file contains the user accounts for your Watson Explorer Engine installation. The repository.xml file contains the configuration code for all of your search applications.
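
As an illustration only, the following Python sketch copies these two files into a dated backup directory. It assumes that both paths are relative to the installation directory. The function name backup_critical_files, the stamp parameter, and the backup location are hypothetical and are not part of Watson Explorer Engine.

```python
import shutil
from datetime import date
from pathlib import Path

# Files named in this section; assumed to be relative to the installation directory.
CRITICAL_FILES = ("data/users.xml", "data/repository.xml")


def backup_critical_files(install_dir, backup_root, stamp=None):
    """Copy the critical files into backup_root/<stamp>/ and return the copies."""
    stamp = stamp or date.today().isoformat()
    copied = []
    for rel in CRITICAL_FILES:
        src = Path(install_dir) / rel
        if not src.is_file():
            raise FileNotFoundError(f"expected file is missing: {src}")
        dest = Path(backup_root) / stamp / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(dest)
    return copied
```

For example, backup_critical_files("/opt/ibm/WEX/Engine", "/backup/wex") would place the copies under a directory named for the current date, where /backup/wex is a placeholder for your own backup location.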

Note: If you are developing collaborative search applications and have enabled annotation backups as described in the Express Tagging section of the Watson Explorer Engine User Manual, you will also want to add the directory where these automatic backups are stored to your system backups. The backups for search collection annotations are stored in the directory data/tag-backup, relative to the directory where Watson Explorer Engine is installed on your system. The backup file for search collection annotations is updated each time any annotation in that collection is added or modified, so you might want to back up this directory more frequently than you perform standard system backups. This is easily done by creating a shell script or batch file that runs at scheduled intervals and updates remote copies of the files in this directory if they have been modified. For information about restoring annotations from a backup file, see Restoring Annotations from a Backup File.
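
The scheduled script mentioned in the preceding note could look something like the following Python sketch, which copies a file only when the remote copy is missing or older. It assumes that the remote location is reachable as an ordinary mounted directory. The function name sync_modified and both directory arguments are hypothetical, and the script must still be run from your own scheduler (for example, cron or the Windows Task Scheduler).

```python
import shutil
from pathlib import Path


def sync_modified(src_dir, dest_dir):
    """Copy files from src_dir to dest_dir when they are new or modified."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    updated = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        dest = dest_dir / src.relative_to(src_dir)
        # Copy when the destination is missing or older than the source.
        if not dest.exists() or src.stat().st_mtime_ns > dest.stat().st_mtime_ns:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # preserves the modification time
            updated.append(dest)
    return updated
```

Because the copy preserves modification times, a second run immediately after the first copies nothing, provided that the destination file system stores timestamps at the same resolution as the source.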

Beyond these critical files, deciding what to back up depends on how you are using Watson Explorer Engine and the types of search applications that you are creating. For example, if your Watson Explorer Engine applications are primarily meta-search applications and therefore do not themselves crawl or index other sources of online information, you may not want to back up anything beyond the files mentioned earlier in this section.

However, if you are crawling data repositories to create your own search collections, all of the indices, log files, and other data from all sources that you are crawling will also be stored under your Watson Explorer Engine installation directory by default. The crawled and indexed data for a Watson Explorer Engine search collection will be stored in the directory data/search-collections/XYZ/name, where name is the name of the search collection, and XYZ is the first three bytes of an internal hash that was calculated from that name. Storing search collections in distinct directories helps prevent directory size and performance issues.
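
Because the XYZ prefix comes from an internal hash, a backup script cannot easily compute it from the collection name. One workaround, sketched below in Python, is to search every prefix directory for a subdirectory with the collection's name. The function name find_collection_dirs is hypothetical, and the sketch assumes the default data location described above.

```python
from pathlib import Path


def find_collection_dirs(install_dir, name):
    """Return directories matching data/search-collections/<XYZ>/<name>."""
    root = Path(install_dir) / "data" / "search-collections"
    if not root.is_dir():
        return []
    matches = []
    # The hash prefix is internal, so check each prefix directory in turn.
    for prefix in sorted(root.iterdir()):
        candidate = prefix / name
        if candidate.is_dir():
            matches.append(candidate)
    return matches
```

The returned paths can then be added to whatever list of directories your backup tool reads.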

Important: Before Velocity 7.5, all search collection data was stored by default in the data/collections directory of an installation. Any search data for search collections that you created before Velocity 7.5 in an installation that you have upgraded to Velocity 7.5 will still be located in the data/collections directory of the installation. This directory should therefore also be backed up if you want to back up those collections, taking into account the same considerations discussed in this section for backing up any search collection.
Note: Search collection data location is configurable on a per-collection basis in the Directories section of the Configuration > Meta tab for that search collection in the Watson Explorer Engine administration tool. If you manually configure the location where the data for any search collection is stored, you must also back up those directories if you are backing up your search collection data.

The files that make up a search collection can be very large, depending on the amount and type of data that you are crawling. You must make the traditional system administrator's decision, weighing increased backup time and backup storage requirements against the time it would take for Watson Explorer Engine to re-crawl the data and re-create the indices (or to update them if you have restored them from backups).

Note: If you are using Watson Explorer Engine to crawl and index data, the index and log files that it creates and uses are large binary files. It is therefore not practical to make incremental backups of these files at any interval shorter than the interval at which they are updated by new crawls or index updates. Most backup software uses file creation and modification times to determine which files have changed and thus need to be backed up. The timestamps of Watson Explorer Engine log and index files are updated each time that index data is actually updated in response to a re-crawl or refresh request. (A refresh request updates an index only if data has changed on the resource that is being indexed.) Some backup software supports saving only the changes between two versions of a file, but binary files such as these will usually appear to be completely different from their previous versions.

In general, if re-indexing your search collections does not take very long (or takes less time than it takes to restore selected files from backups), you may not want to back up anything other than the files mentioned earlier in this section.

If re-indexing your search collections takes a significant amount of time but is done infrequently, you need to back up your Watson Explorer Engine installation directory only after each time that you re-index your search collections. This meshes nicely with standard incremental backup mechanisms, which only back up files that have changed since a given date or since backups were last run.

If re-indexing your search collections takes a significant amount of time but is done daily, you must determine, on average, how long it takes to re-crawl your information sources and update an existing index. If daily updates take a relatively small amount of time, you might want to back up your Watson Explorer Engine installation directory weekly and, whenever necessary, update the indices after they have been restored. If the time that it takes to restore indices from backups and update them approaches the amount of time it takes to create them, you may not want to back them up at all.
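
The guidance in the preceding three paragraphs can be summarized as a rough decision rule. The following Python sketch is one possible reading of it, not an IBM-supplied formula: the function name, the returned labels, and the margin parameter (how close "approaches" must be) are all assumptions, and the timing estimates must come from your own measurements.

```python
def suggest_backup_policy(reindex_hours, restore_hours, update_hours,
                          reindexed_daily, margin=0.9):
    """Map rough timing estimates to one of the strategies described above."""
    # Re-indexing is no slower than restoring: keep only the critical files.
    if reindex_hours <= restore_hours:
        return "critical files only"
    # Slow but infrequent re-indexing: back up after each re-index.
    if not reindexed_daily:
        return "after each re-index"
    # Restoring plus updating costs nearly as much as rebuilding the index.
    if restore_hours + update_hours >= margin * reindex_hours:
        return "critical files only"
    # Otherwise, back up weekly and update the indices after a restore.
    return "weekly, then update after restore"
```

For instance, with a 10-hour re-index, a 2-hour restore, and a 1-hour daily update, the sketch suggests weekly backups, because restoring and updating is much cheaper than rebuilding the indices.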

IBM strongly recommends installing your Watson Explorer Engine software on RAID storage so that the failure of a single disk will not cause your Watson Explorer Engine applications to fail. If your applications require high availability, you can take advantage of Watson Explorer Engine features such as Distributed Indexing to help protect against system failures. You might also want to consider collocation to protect against local or regional power or network failures.

Tip: For general information about hardware/software requirements and minimum system configurations, see Requirements.