Specifying Ranked Conditions for a Merge
A Ranked Condition merge can be thought of as a left outer join by condition: the left side of the merge is the primary data set, in which each record is an event. For example, in a model that is used to find patterns in crime data, each record in the primary data set would be a crime and its associated information (location, type, and so on). In this example, the right side might contain the relevant geospatial data sets.
The merge uses both a merge condition and a ranking expression. The merge condition can use a geospatial function such as within or close_to. During the merge, all of the fields in the right side data sets are added to the left side data set; where a record has multiple matches, they are returned in a list field. For example:
- Left side: Crime data
- Right side: Counties data set and roads data set
- Merge conditions: Crime data within counties and close_to roads, along with a definition of what counts as close_to.
In this example, if a crime occurred within the required close_to distance of three roads (and the number of matches to be returned is set to at least three), then all three roads are returned together in a list field.
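The behavior described above can be approximated outside the product with a short Python sketch. This is not the Merge node's implementation: the record layout, the ranked_merge helper, the CLOSE_TO threshold, and the use of planar Euclidean distance in place of the geospatial close_to predicate are all assumptions made for illustration.

```python
import math

def distance(a, b):
    """Planar Euclidean distance between two (x, y) points.
    Assumption: stands in for a real geospatial distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def ranked_merge(primary, secondary, condition, rank, n_matches, field):
    """Left-outer-join style merge: every primary record is kept, and the
    top-ranked matching secondary records are attached as a list field."""
    merged = []
    for left in primary:
        matches = [right for right in secondary if condition(left, right)]
        matches.sort(key=lambda right: rank(left, right))  # rank low to high
        merged.append({**left, field: matches[:n_matches]})
    return merged

crimes = [
    {"id": 1, "loc": (0.0, 0.0)},
    {"id": 2, "loc": (50.0, 50.0)},  # no road nearby
]
roads = [
    {"road": "A", "loc": (0.0, 3.0)},
    {"road": "B", "loc": (1.0, 0.0)},
    {"road": "C", "loc": (2.0, 0.0)},
    {"road": "D", "loc": (0.0, 4.0)},
    {"road": "E", "loc": (0.0, 20.0)},
]

CLOSE_TO = 5.0  # hypothetical definition of "close"

result = ranked_merge(
    crimes,
    roads,
    condition=lambda c, r: distance(c["loc"], r["loc"]) <= CLOSE_TO,
    rank=lambda c, r: distance(c["loc"], r["loc"]),
    n_matches=3,
    field="roads",
)
for row in result:
    print(row["id"], [r["road"] for r in row["roads"]])
```

Crime 1 has four roads within the threshold, so only the three closest are kept, in order of increasing distance. Crime 2 has no matches and is still kept; in this sketch it simply gets an empty list.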
By setting the merge method to Ranked condition, you can specify one or more conditions to be satisfied for the merge to take place.
Primary dataset: Select the primary data set for the merge; the fields from all other data sets are added to the data set you select. This can be considered as the left side of an outer join merge.
When you select a primary data set, all the other input data sets that are connected to the Merge node are automatically listed in the Merges table.
Add tags to duplicate field names to avoid merge conflicts: If two or more of the data sets to be merged contain the same field names, select this check box to add a different prefix tag to the start of the field column headers. For example, if there are two fields that are called Name, the result of the merge would contain 1_Name and 2_Name. If the tag is renamed in the data source, the new name is used instead of the numbered prefix tag. If you do not select this check box, and there are duplicate names in the data, a warning is displayed to the right of the check box.
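As a rough illustration of the tagging rule, the following Python sketch prefixes field names that appear in more than one data set. The tag_duplicate_fields helper is hypothetical, and the sketch assumes that only the duplicated names are tagged, which the description above does not state explicitly.

```python
from collections import Counter

def tag_duplicate_fields(datasets, tags=None):
    """Prefix field names that occur in more than one data set.

    datasets: list of field-name lists, one per input data set.
    tags: optional prefix per data set; defaults to "1", "2", ...
    """
    if tags is None:
        tags = [str(i) for i in range(1, len(datasets) + 1)]
    # Count in how many data sets each field name appears.
    counts = Counter(name for fields in datasets for name in set(fields))
    return [
        [f"{tag}_{name}" if counts[name] > 1 else name for name in fields]
        for tag, fields in zip(tags, datasets)
    ]

crime_fields = ["Name", "Type", "Location"]
road_fields = ["Name", "Length"]

tagged = tag_duplicate_fields([crime_fields, road_fields])
print(tagged)

# Renamed tags replace the numbered prefixes.
renamed = tag_duplicate_fields([crime_fields, road_fields], tags=["crime", "road"])
print(renamed)
```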
Merges
- Dataset
- Shows the names of the secondary data sets that are connected as inputs to the Merge node. By default, where there is more than one secondary data set, they are listed in the order in which they were connected to the Merge node.
- Merge Condition
Enter the unique conditions for merging each of the data sets in the table with the primary data set. You can either type the conditions directly into the cell, or build them with the aid of the Expression Builder by clicking the calculator symbol to the right of the cell. For example, you might use geospatial predicates to create a merge condition that places crime data from one data set within the county data of another data set. The default merge condition depends on the geospatial measurement level, as shown in the list below.
- Point, LineString, MultiPoint, MultiLineString - default condition of close_to.
- Polygon, MultiPolygon - default condition of within.
For more information about these levels, see Geospatial measurement sublevels.
If a data set contains multiple geospatial fields of different types, the default condition that is used depends on the first measurement level that is found in the data, in the following descending order.
- Point
- LineString
- Polygon
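The default-selection rule can be sketched in Python as a lookup plus a priority scan. The function names are hypothetical, and one point is an assumption rather than something stated above: the priority list names only Point, LineString, and Polygon, so the sketch treats each Multi variant as equivalent to its single counterpart.

```python
# Default merge condition per base measurement level, as listed above.
DEFAULT_CONDITION = {
    "Point": "close_to",
    "LineString": "close_to",
    "Polygon": "within",
}

# Order in which measurement levels are considered when a data set
# contains several geospatial fields of different types.
PRIORITY = ["Point", "LineString", "Polygon"]

def base_level(level):
    """Map MultiPoint -> Point and so on (an assumption, see lead-in)."""
    return level[len("Multi"):] if level.startswith("Multi") else level

def default_merge_condition(field_levels):
    """Return the default condition for a secondary data set, or None
    when it has no geospatial field (no default is available)."""
    present = {base_level(level) for level in field_levels}
    for level in PRIORITY:
        if level in present:
            return DEFAULT_CONDITION[level]
    return None

print(default_merge_condition(["Polygon", "Point"]))
print(default_merge_condition(["MultiPolygon"]))
print(default_merge_condition([]))
```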
Note: Defaults are only available when there is a geospatial data field in the secondary data set.
- Ranking Expression
Specify an expression to rank the merging of the data sets; this expression is used to sort multiple matches into an order that is based on the ranking criteria. You can either type the expression directly into the cell, or build it with the aid of the Expression Builder by clicking the calculator symbol to the right of the cell.
Default ranking expressions of distance and area are provided in the Expression Builder and both rank low to high, meaning that, for example, the top match for distance is the smallest value. An example of ranking by distance is when the primary data set contains crimes and their associated location and each other data set contains objects with locations; in this case the distance between the crimes and the objects can be used as a ranking criterion. The default ranking expression depends on the geospatial measurement level, as shown in the list below.
- Point, LineString, MultiPoint, MultiLineString - the default expression is distance.
- Polygon, MultiPolygon - the default expression is area.
Note: Defaults are only available when there is a geospatial data field in the secondary data set.
- Number of Matches
- Specify the number of matches that are returned, based on the condition and ranking expressions.
The default number of matches depends on the geospatial measurement level in the secondary data set, as shown in the list below; however, you can double-click in the cell to enter your own value, up to a maximum of 100.
- Point, LineString, MultiPoint, MultiLineString - default value of 3.
- Polygon, MultiPolygon - default value of 1.
- Data set that contains no geospatial fields - default value of 1.
As an example, if you set up a merge that is based on a Merge Condition of close_to and a Ranking Expression of distance, the top three (closest) matches from the secondary data sets to each record in the primary data set are returned as the values in the resultant list field.
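The defaults for the ranking expression and the number of matches can be summarized in a small Python sketch. The function names are hypothetical, passing None to mean "no geospatial field" is a convention of this sketch, and the lower bound of 1 on the number of matches is an assumption; only the upper limit of 100 comes from the description above.

```python
MAX_MATCHES = 100  # upper limit stated above

POINT_OR_LINE = {"Point", "LineString", "MultiPoint", "MultiLineString"}
AREA = {"Polygon", "MultiPolygon"}

def ranking_and_match_defaults(level):
    """Return (default ranking expression, default number of matches).

    level is a geospatial measurement level, or None for a data set
    that contains no geospatial fields.
    """
    if level in POINT_OR_LINE:
        return ("distance", 3)
    if level in AREA:
        return ("area", 1)
    if level is None:
        return (None, 1)  # no default ranking expression is available
    raise ValueError(f"unknown measurement level: {level!r}")

def check_number_of_matches(value):
    """Validate a user-entered number of matches.
    Assumption: values below 1 are rejected."""
    if not 1 <= value <= MAX_MATCHES:
        raise ValueError(f"number of matches must be between 1 and {MAX_MATCHES}")
    return value

print(ranking_and_match_defaults("LineString"))
print(ranking_and_match_defaults("Polygon"))
print(check_number_of_matches(10))
```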