Configuring Custom SQL Seeds

The following values can be configured for this type of seed:

  • Host - The hostname or IP address of the database server.
  • Port (optional) - The port on which the database is listening.
  • Username - Username for accessing this database.
  • Password - Password for accessing this database.
  • Database system - The type of database to connect to.
  • Database name - The name of the database to retrieve results from.
  • SQL Statement - A custom SQL statement to execute.
  • Column to save for resuming (optional) -

    Provide the name (or label) of a column here in order to enable resuming a crawl or refresh that was previously stopped. The column should be a required, non-null field and should map to a Java type that implements the Serializable interface. Using a numeric (integer, long, decimal, and so on), timestamp, or date column is recommended.

    The provided SQL Statement requires one or two parameters and should be ordered first by the column. The stored value will be injected up to two times - when starting a fresh crawl, NULL will be injected as the value. For example, if using a last modified date:
                SELECT id, name, updatedAt FROM employees WHERE ? IS NULL OR updatedAt >= ? ORDER BY updatedAt ASC, name

    Note: For most databases, if a label is defined for a column, you should use that value because it will have precedence.

  • Key Column (optional) - The name of a column to use as a unique key. This is used to create unique URLs at crawl time, and to update those same URLs later.
  • Timestamp Column (optional) - The name of a column that contains a timestamp denoting when the row was last updated.
  • Fetch Size (optional) -

    The number of rows to retrieve from the database in each interaction. A large value will minimize network overhead, but will require more memory to store the results while they are being processed.

    This value will also be used to determine how frequently the connector saves its place in order to support resuming a refresh.

  • Prefer Column Labels (optional) -

    When set to true, column labels (usually defined using the AS keyword), if defined, will be used as property names for results instead of the actual column name.

    Note: Even when set to false, defined column labels may be used, depending on the database and JDBC driver version. For example, versions of the MySQL JDBC driver earlier than 5.1.6 will always return the column label, while later versions will return the column name. This is due to a change in the default behavior of the driver in order to comply with the JDBC 4.0 specification.

  • Maximum converted size (optional) - Maximum size of the data for a document. This is the largest block of memory that will be loaded at one time. Only increase this limit if you have sufficient memory on your computer.
Note: You must specify the name of a column or associated label for the Column to save for resuming field in order to be able to resume a crawl or do a partial refresh. The Database (Custom SQL) seed saves the name of this column at regular intervals as it proceeds with the crawl, and saves it again once the last row of your database has been crawled. When resuming the crawl or refreshing it, the crawl begins with the row that is identified in the saved value for this field.

The SELECT statement that is provided in the description of the Column to save for resuming field is an example of the statement that you should supply as the SQL Statement when you want to resume or refresh a crawl using the Database (Custom SQL) seed. Both of the question marks ('?') in this statement would be replaced with the saved value for the column specified in the Column to save for resuming field.