Before you create an indexing task for a new collection, make copies of the relevant default style files to ensure that you have a set of template style files in a known, stable state.
Running multiple simultaneous Verity Spider jobs on the Information Server host can cause performance problems for searches. This does not mean that you should never run indexing jobs when users might be searching, because your collections are available for searching even while indexing jobs are running. To optimize performance, try staggering your indexing jobs to avoid overloading your server.
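One way to stagger jobs is to run them one after another rather than in parallel. The following Python sketch illustrates the idea; the function name `run_staggered`, the `pause_seconds` parameter, and the injectable `runner` and `sleep` arguments are all hypothetical and are not part of Verity Spider.

```python
import subprocess
import time


def run_staggered(jobs, pause_seconds=0.0, runner=subprocess.run, sleep=time.sleep):
    """Run each indexing job to completion before starting the next one.

    jobs is a list of argument lists (one list per command).
    Returns the exit code of each job, in order.
    """
    exit_codes = []
    for index, args in enumerate(jobs):
        # Optionally pause between jobs to give the server time to recover.
        if index > 0 and pause_seconds > 0:
            sleep(pause_seconds)
        completed = runner(args)  # blocks until the job has finished
        exit_codes.append(completed.returncode)
    return exit_codes
```

Because each job finishes before the next one starts, at most one indexing job runs on the host at any time.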
The vspider executable, which starts the vspider application, is located in the cf_root\lib\_nti40\bin directory in Windows, and in the cf_root/lib/platform/bin directory in UNIX.
In these pathnames, cf_root refers to the ColdFusion root directory. In Windows, this is typically C:\CFusionMX; in UNIX, this is typically /opt/coldfusionmx. In UNIX, platform refers to the UNIX version of the server that runs ColdFusion: _solaris, _hpux11, or _ilnx21.
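The directory layout described above can be expressed as a small helper. This is a minimal sketch: the function name `vspider_path` is hypothetical, and the executable is assumed to be named `vspider` with no file extension added.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Platform directory names listed in the text for UNIX installations.
UNIX_PLATFORMS = ("_solaris", "_hpux11", "_ilnx21")


def vspider_path(cf_root, platform=None):
    """Return the expected location of the vspider executable.

    platform=None selects the Windows layout; otherwise platform must be
    one of the UNIX platform directory names.
    """
    if platform is None:
        return str(PureWindowsPath(cf_root) / "lib" / "_nti40" / "bin" / "vspider")
    if platform not in UNIX_PLATFORMS:
        raise ValueError(f"unknown UNIX platform directory: {platform}")
    return str(PurePosixPath(cf_root) / "lib" / platform / "bin" / "vspider")
```

For example, `vspider_path("/opt/coldfusionmx", "_solaris")` yields `/opt/coldfusionmx/lib/_solaris/bin/vspider`.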
At its most basic level, a Verity Spider command consists of the following:
vspider -initialize -collection coll [options]
In this syntax, -initialize is either -start or -refresh (use -refresh when starting points have changed), -collection is required because it provides a target for Verity Spider, and [options] can be a near-limitless combination of the options described later in this chapter. The following example shows a basic command, entered on a single line:
c:\cfusionmx\lib\_nti40\bin\vspider -common c:\cfusionmx\lib\common
-collection c:\new -start http://localhost -indinclude *
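The example command above can also be assembled as an argument list, which is convenient when a script launches Verity Spider. The following Python sketch is illustrative only: the function name `build_vspider_args` is hypothetical, -refresh is assumed to take no parameter, and the order of options is assumed not to matter.

```python
def build_vspider_args(executable, collection, start=(), refresh=False, options=()):
    """Assemble a vspider argument list: collection, initialization, other options."""
    if not start and not refresh:
        raise ValueError("specify at least one starting point or refresh=True")
    args = [executable, "-collection", collection]
    if start:
        # Several starting points are passed as multiple values of one -start.
        args += ["-start", *start]
    if refresh:
        args.append("-refresh")
    args += list(options)
    return args


# The values below are taken from the example command in the text.
args = build_vspider_args(
    r"c:\cfusionmx\lib\_nti40\bin\vspider",
    r"c:\new",
    start=["http://localhost"],
    options=["-common", r"c:\cfusionmx\lib\common", "-indinclude", "*"],
)
```

Passing such a list to `subprocess.run` without a shell means that the `*` value reaches the program unchanged rather than being expanded by the command interpreter.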
Other options have dependencies that vary with the nature of the indexing task. The following are some examples:
If you do not run the Verity Spider executable from its default installation directory, you must include that directory in your path. This is because the Verity Spider executable depends on other files to run properly.
For simpler reuse and archiving of your indexing commands, use the -cmdfile option to store a task's options in an ASCII text file. This avoids the potential problem of using special characters in an option's parameter value on the command line. For example, the -processbif option requires the use of "!*", so any task that uses that option must also use the -cmdfile option.
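A command file can be generated by a script so that the same options are reused for every run. The following Python sketch assumes a layout of one option per line, with the option name followed by its values; that layout and the function name `write_cmdfile` are assumptions, so check the -cmdfile reference for the exact syntax.

```python
from pathlib import Path


def write_cmdfile(path, options):
    """Write one option per line: the option name followed by its values.

    options is a sequence of (name, values) pairs.
    Returns the lines that were written, without line endings.
    """
    lines = [" ".join([name, *values]) for name, values in options]
    # The text describes the command file as an ASCII text file.
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")
    return lines
```

The resulting file would then be passed to Verity Spider with the -cmdfile option, so special characters in the option values never pass through the command line.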
The following sections describe the Verity Spider V3.7 command-line options. Option names are case-sensitive.
The -start option specifies a starting point for an indexing job. You can specify multiple instances of the option, or use multiple values in a single instance.
When you execute an indexing job from a command line and you do not use a command file (with the -cmdfile option), you must URL-escape any special characters in the starting point. To URL-escape a special character, replace it with a percent sign followed by the character's hexadecimal ASCII code. For example, use /time%26/ instead of /time&/. This allows the operating system to properly process the command string.
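The escaping rule above can be applied with a standard library function. The following Python sketch uses `urllib.parse.quote`; the wrapper name `escape_start_path` is hypothetical. It is meant for the path portion of a starting point only, because applying it to a whole URL would also escape the colon after the scheme, and applying it to a path that is already escaped would escape the percent signs again.

```python
from urllib.parse import quote


def escape_start_path(path):
    """Percent-encode special characters in a URL path, leaving '/' intact."""
    return quote(path, safe="/")


escaped = escape_start_path("/time&/")  # "/time%26/"
```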
If an indexing task halts, you can rerun the task as-is. The persistent store for the specified collection is read, and only those candidate URLs that are in the queue but not yet processed are parsed. Candidate URLs are URLs with one of the following statuses, as reported by vsdb:
cand, used, inse, upda, dele, fail
Note: When you use the -start option with the -refresh option, you provide a starting point for Verity Spider, so you are not required to also specify one of the following options: -host, -domain, -nofollow, or -unlimited.
The -refresh option, used for updating a collection, specifies that Verity Spider process only those documents that qualify, as follows:
When you rerun an existing indexing job, Verity Spider automatically refreshes the collection. If you add or remove any of the starting points, however, you must manually specify the -refresh option to refresh existing documents.
Note: You can also use the -start option to provide a starting point for Verity Spider. If you do not use the -start option, use at least one of the following options: -host, -domain, or -nofollow. For further control, also see the -refreshtime option. If you do not use any constraint criteria, Verity Spider operates without limits and will likely index far more than you intended.