How SuperC matches input files

When SuperC compares files, it determines matching and missing lines or words based only on the data content of the input files. It does not use any synchronization data, such as column or sequence numbers, to find matching file sections. Nor does it use the common “start at the top” method, looking ahead or back from there, to find large sections of matching data. It also does not sort the data before comparing.

SuperC is unique in that, except for files that are identical, it does not determine matching sections until it has completely read both files. A data unit reported as missing may therefore be one that is out of sequence, rather than one that has actually been deleted from a file. During a comparison, SuperC finds all matches, locates the largest set of matching data units, and uses this compare set to divide each file into partitioned subsections. Each new pair of corresponding subsections is then processed for matches in the same way, recursively. The process ends when no more matches can be found within corresponding partitioned subsections of the new and old files. Sections classified as inserted or deleted are the corresponding areas for which SuperC could not find a match.
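The partitioning described above can be sketched in a few lines of Python. This is only an illustration, not SuperC's actual implementation: it assumes that the "largest set" is the longest contiguous run of equal data units, and the function names longest_common_run and partition_matches are hypothetical.

```python
def longest_common_run(new, old, n_lo, n_hi, o_lo, o_hi):
    """Return (i, j, k) such that new[i:i+k] == old[j:j+k] is the longest
    contiguous run inside new[n_lo:n_hi] and old[o_lo:o_hi] (k == 0 if none)."""
    best_i, best_j, best_k = n_lo, o_lo, 0
    prev = {}  # prev[j] = length of the run ending at new[i-1], old[j]
    for i in range(n_lo, n_hi):
        cur = {}
        for j in range(o_lo, o_hi):
            if new[i] == old[j]:
                k = prev.get(j - 1, 0) + 1
                cur[j] = k
                if k > best_k:
                    best_i, best_j, best_k = i - k + 1, j - k + 1, k
        prev = cur
    return best_i, best_j, best_k


def partition_matches(new, old, n_lo=0, n_hi=None, o_lo=0, o_hi=None):
    """Return matched runs as (new_index, old_index, length), in file order."""
    if n_hi is None:
        n_hi = len(new)
    if o_hi is None:
        o_hi = len(old)
    i, j, k = longest_common_run(new, old, n_lo, n_hi, o_lo, o_hi)
    if k == 0:
        return []  # nothing matches here: the area is inserted and/or deleted
    # The largest run divides both files; recurse on the parts above and below it.
    above = partition_matches(new, old, n_lo, i, o_lo, j)
    below = partition_matches(new, old, i + k, n_hi, j + k, o_hi)
    return above + [(i, j, k)] + below


# "a" appears in both files, but on opposite sides of the "b c" run,
# so it is out of sequence and is not matched.
print(partition_matches(list("abcd"), list("xbca")))  # [(1, 1, 2)]
```

In this sketch, any unit that does not fall inside a returned run is what the comparison would report as inserted (new file) or deleted (old file).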

Figure 1 shows an example of a comparison of two files whose lines are represented by letters (A, B, C, and so on). The SuperC algorithm attempts to find the best match set from the input lines. Notice how the match set requires consideration of duplicate lines.

Figure 1. Find match example

      New File Lines                              Old File Lines

         ───A───  ────────────Matched Line─────────  ───A───

Inserted ───B───                                     ───I─── Deleted

Inserted ───C───    ┌─────────Matched Line─────────  ───D─── Largest  ──┐
                    │                                                   │
         ───D───  ──┘  ┌──────Matched Line─────────  ───E─── Set        │
                       │                                                │
         ───E───  ─────┘  ┌───Matched Line─────────  ───F─── Unchanged──┘
                          │
         ───F───  ────────┘                          ───B─── Deleted

Inserted ───A───                                     ───C─── Deleted

         ───H───  ────────────Matched Line─────────  ───H───

         ───A───  ────────────Matched Line─────────  ───A───

                                                     ───A─── Deleted

       Sequence                 Left Side               Right Side

       Largest Set           ─    D E F    Divides Set    D E F
       Top Set               ─    A        Matches        A
       Leftover Top Set      ─    B C      Mismatches     I
       Largest Bottom Match  ─    H A      Matches        H A
       Leftover Bottom Set   ─    A        Mismatches     B C A

  Note:  The inserted “A” on the lower left cannot connect with the
         deleted “A” on the bottom right due to the H A barrier.
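
The classification in Figure 1 can be reproduced approximately with Python's standard difflib module, which also works by finding the longest matching block and then recursing on the pieces to either side of it. This is a stand-in for illustration, not SuperC itself; the variable names and the classify helper are hypothetical, and the line sequences are read from the figure.

```python
from difflib import SequenceMatcher

# Line sequences read from Figure 1, top to bottom.
new_lines = list("ABCDEFAHA")
old_lines = list("AIDEFBCHAA")


def classify(new, old):
    """Return (inserted, deleted) as lists of (index, line) pairs."""
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    matched_new, matched_old = set(), set()
    for block in matcher.get_matching_blocks():
        matched_new.update(range(block.a, block.a + block.size))
        matched_old.update(range(block.b, block.b + block.size))
    # Anything outside a matching block is an unmatched area.
    inserted = [(i, line) for i, line in enumerate(new) if i not in matched_new]
    deleted = [(j, line) for j, line in enumerate(old) if j not in matched_old]
    return inserted, deleted


inserted, deleted = classify(new_lines, old_lines)
print("Inserted:", inserted)
print("Deleted: ", deleted)
```

With these inputs, the unmatched new-file lines are B, C, and the lower A, and the unmatched old-file lines are I, B, C, and the final A, which agrees with the Inserted and Deleted labels in the figure.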