Business requirements frequently demand that we simplify data by grouping it. For example, a health care company may have a list of individual health care claims, but only really care about the total deductible paid by each cardholder. While it’s possible to group data by using map rules, smartly designed input trees can greatly increase performance while decreasing complexity. In this article we examine the control break technique, along with the performance of the 1950s and 1960s Yankees.
For our example, consider a baseball historian who wants to compare the performance of Yankee managers. She has a list of Yankee results from the 1950s and 1960s, as seen below:
We begin by defining a basic tree to describe the data. We create a group called IncomingRecordGroup that contains any number of Seasons. Within each season, we create types for the Year, Manager, Wins and Losses.
Instead, we introduce a new group in the tree (ManagerEra) and create a component rule on it. This technique is commonly referred to as a control break.
Manager Type:$ = Manager Type:$[LAST]
means that every season contained within a ManagerEra must have the same manager as the previous season. Otherwise, the parser creates a new ManagerEra. For example, as WTX reads through the Yankee season data, it notes that the manager in 1960 was the same as in 1959 (Casey Stengel). Thus 1959 and 1960 are included in the same ManagerEra group. But when WTX reaches 1961, it recognizes that Ralph Houk (Manager Type:$) is not the same as Casey Stengel (Manager Type:$[LAST]). As a result, a new ManagerEra is started, and it continues until 1964, when Yogi Berra begins his brief tenure.
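For readers more at home in a general-purpose language, the same control break logic can be sketched in Python. This is only an analogy and not WTX syntax: the function name `group_into_eras` and the tuple layout are assumptions made for illustration, and the season list covers only the years discussed above.

```python
def group_into_eras(seasons):
    """Split (year, manager) pairs into runs of consecutive seasons
    that share the same manager."""
    eras = []
    for year, manager in seasons:
        # Control break: open a new era when there is no era yet, or when
        # this manager differs from the manager of the last season seen.
        if not eras or eras[-1][-1][1] != manager:
            eras.append([])
        eras[-1].append((year, manager))
    return eras


# Seasons taken from the walkthrough above (hypothetical in-memory layout).
seasons = [
    (1959, "Casey Stengel"),
    (1960, "Casey Stengel"),
    (1961, "Ralph Houk"),
    (1962, "Ralph Houk"),
    (1963, "Ralph Houk"),
    (1964, "Yogi Berra"),
]

eras = group_into_eras(seasons)
for era in eras:
    print(era[0][1], [year for year, _ in era])
```

Like the component rule, the comparison looks only at the immediately preceding record, so the input must already be in season order for the grouping to be meaningful.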
The payoff for all this upfront design work comes when mapping. Rather than writing map rules to pick apart managers (or patients or teachers or any number of other pieces of data), we find the records already grouped together for easy processing.
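To see why pre-grouped input simplifies the downstream work, here is a rough Python sketch of a per-era summary. It is an analogy rather than a WTX map: `era_totals` and the record layout are hypothetical, and the win and loss figures are the Yankees' regular-season records for those years, included only as sample input. `itertools.groupby` merges adjacent records with equal keys, which mirrors the control break behavior.

```python
from itertools import groupby

# (year, manager, wins, losses) -- sample input for illustration.
seasons = [
    (1959, "Casey Stengel", 79, 75),
    (1960, "Casey Stengel", 97, 57),
    (1961, "Ralph Houk", 109, 53),
    (1962, "Ralph Houk", 96, 66),
    (1963, "Ralph Houk", 104, 57),
    (1964, "Yogi Berra", 99, 63),
]


def era_totals(seasons):
    """Summarize each run of consecutive seasons under one manager."""
    totals = []
    # groupby only merges neighbouring records, so a manager who returns
    # after a gap would start a separate era, as in a control break.
    for manager, group in groupby(seasons, key=lambda s: s[1]):
        group = list(group)
        totals.append({
            "manager": manager,
            "first_year": group[0][0],
            "last_year": group[-1][0],
            "wins": sum(s[2] for s in group),
            "losses": sum(s[3] for s in group),
        })
    return totals


for era in era_totals(seasons):
    print(era)
```

Once the eras exist as groups, each summary is a simple aggregate over one group, with no logic needed to work out where one manager's tenure ends and the next begins.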
Here is a simple map using our finished tree: