Main Idea
The main idea of the CMPS method is that:
- we take the first signature as the comparison signature (`x`, or bullet signature of “2-3”) and cut it into consecutive and non-overlapping basis segments of the same length. In this case, we set the length of a basis segment to be 50 units, and we have 22 basis segments in total for bullet signature `x`.
- for each basis segment, we compute the cross-correlation function (ccf) between the basis segment and the reference signature (`y`, or bullet signature of “1-2”)
- for the ccf curve, the `position` represents the shift of the segment. A negative value means a shift to the left, a positive value means a shift to the right, and 0 means no shift (the segment stays at its original position in the reference signature);
- we are interested in the peaks in the ccf curve and the positions of those peaks (as indicated by the red vertical line in the plot above). In other words, if we shift the segment, which position would give us the “best fit”?
- If two signatures are from a known-match (KM) comparison, most of the basis segments should agree with each other on the position of the best fit. The segments that agree are called “Congruent Matching Profile Segments (CMPS)”.
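The first steps above can be sketched in base R. This is only an illustration under simplifying assumptions, not the cmpsR implementation: `cut_segments` and `segment_ccf` are hypothetical helpers, the signatures are simulated, and the correlation is computed only where the segment fully overlaps the reference.

```r
# Hypothetical helper: cut a signature into consecutive, non-overlapping
# segments of `seg_length` units (the last segment may be shorter).
cut_segments <- function(sig, seg_length = 50) {
  starts <- seq(1, length(sig), by = seg_length)
  lapply(starts, function(s) {
    idx <- s:min(s + seg_length - 1, length(sig))
    list(start = s, values = sig[idx])
  })
}

# Hypothetical helper: correlation between one segment and the reference at
# every shift with full overlap. Position 0 means the segment stays at its
# original index; negative = shift left, positive = shift right.
segment_ccf <- function(seg, ref) {
  n <- length(seg$values)
  ref_starts <- seq_len(length(ref) - n + 1)
  ccf <- vapply(ref_starts, function(r) {
    cor(seg$values, ref[r:(r + n - 1)])
  }, numeric(1))
  data.frame(position = ref_starts - seg$start, ccf = ccf)
}

# Simulated data (assumption): the reference is the comparison signature
# shifted 4 units to the right.
set.seed(1)
x <- cumsum(rnorm(300))
y <- c(rnorm(4), x[1:296])

segs  <- cut_segments(x, seg_length = 50)
curve <- segment_ccf(segs[[3]], y)
best  <- curve$position[which.max(curve$ccf)]
```

With this toy shift, the highest peak of the third segment's ccf curve sits at position 4, which is the shift built into the simulated reference.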
Ideally, if two signatures are identical, we expect the position of the highest peak to be the same across all ccf curves (we only show 7 segments here);
In real data the basis segments might not reach full agreement, but a majority of them still agree;
We mark the 5 highest peaks for each ccf curve because the position of the “highest peak” might not be the best one.
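Marking the five highest peaks of one ccf curve could look like the sketch below. It assumes that a peak is a plain local maximum; `top_peaks` and the toy curve are hypothetical and are not taken from cmpsR.

```r
# Hypothetical helper: return the `npeaks` highest local maxima of a ccf
# curve (needs at least three points; end points are never counted as peaks).
top_peaks <- function(position, ccf, npeaks = 5) {
  inner <- 2:(length(ccf) - 1)
  is_peak <- ccf[inner] > ccf[inner - 1] & ccf[inner] >= ccf[inner + 1]
  peak_idx <- inner[is_peak]
  # Sort the peaks from highest to lowest and keep the first `npeaks`
  peak_idx <- peak_idx[order(ccf[peak_idx], decreasing = TRUE)]
  peak_idx <- head(peak_idx, npeaks)
  data.frame(position = position[peak_idx], height = ccf[peak_idx])
}

# Toy ccf curve with six local maxima (made-up numbers)
position <- -7:7
ccf <- c(0.1, 0.3, 0.2, 0.5, 0.4, 0.2, 0.6, 0.3, 0.1, 0.8, 0.2, 0.7, 0.1, 0.4, 0.3)
peaks <- top_peaks(position, ccf, npeaks = 5)
```

Here the lowest of the six local maxima (height 0.3 at position -6) is dropped, and the remaining five become the candidate positions this segment votes for.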
- each ccf curve votes for 5 candidate positions, then we ask two questions in order to obtain the CMPS number/score:
  - which position receives the most votes? -> the best position (indicated by the red vertical line)
  - how many segments have voted for the best position? -> CMPS score
If we focus on these 7 segments only, and use a very narrow tolerance zone, the CMPS number is 6.
(If we consider all 22 segments, and use the default tolerance zone (+/- 25 units), the CMPS number is 20.)
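The voting step can be sketched as follows. The counting rule is an assumption made for illustration (a segment supports a candidate position if any of its votes lies within the tolerance zone); `cmps_vote` and the votes are hypothetical, and the package's own rule may differ in its details.

```r
# Hypothetical helper. `votes` holds one numeric vector of candidate
# positions per segment; `tol` is the half-width of the tolerance zone.
cmps_vote <- function(votes, tol = 25) {
  candidates <- sort(unique(unlist(votes)))
  # For every candidate position, count the segments with a vote within `tol`
  support <- vapply(candidates, function(p) {
    sum(vapply(votes, function(v) any(abs(v - p) <= tol), logical(1)))
  }, integer(1))
  best <- which.max(support)  # ties go to the left-most candidate
  list(best_position = candidates[best], cmps_score = support[best])
}

# Made-up votes for 7 segments, 5 candidate positions each
votes <- list(
  c(-6, 30, -52, 14, 70),
  c(-5, 41, -20, 88, 3),
  c(-6, -33, 60, 19, -75),
  c(-7, 25, 52, -40, 96),
  c(45, -18, 77, -61, 10),   # no vote close to -6
  c(-6, 12, -29, 66, -90),
  c(-5, 38, -47, 81, 22)
)

narrow <- cmps_vote(votes, tol = 1)
wide   <- cmps_vote(votes, tol = 25)
```

With the narrow zone, six of the seven toy segments support position -6. With a zone of +/- 25 units every segment has some vote in range, which shows how strongly the score depends on the tolerance.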
- false positive: how can the segments vote more wisely? -> Multi Segment Lengths Strategy
by increasing the segment length, one can reduce the number of “false positive” peaks.
the first scale level uses the original length of segment 7 (50 units); for the second scale level, we double the length while keeping the center fixed. That is, we add 25 units on both the left and the right side of segment 7 to obtain a segment of length 100. For the third scale level, we double the segment length again to obtain a segment of length 200.
we choose five peaks at scale level 1, three peaks at scale level 2, and one peak at scale level 3
the peak shared by all three scale levels is a consistent correlation peak (ccp), and the position of the ccp is our best choice. Sometimes a ccp cannot be found. Trying to identify a ccp for each basis segment is called the “multi segment lengths” strategy.
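A rough sketch of the multi segment lengths idea is given below. Both helpers are hypothetical, and the closeness threshold `tol` used to decide that peaks at different scale levels are "shared" is an assumption for illustration only.

```r
# Hypothetical helper: index range of a segment after doubling its length
# (level - 1) times while keeping its center fixed.
scaled_range <- function(start, seg_length, level, sig_length) {
  extra <- seg_length * (2^(level - 1) - 1) / 2   # units added on each side
  c(max(1, start - extra), min(sig_length, start + seg_length - 1 + extra))
}

# Hypothetical helper: `peaks_by_level` lists the peak positions per scale
# level, shortest segment first. The single peak at the longest level is a
# ccp if every other level has a peak within `tol` units of it.
find_ccp <- function(peaks_by_level, tol = 5) {
  top <- peaks_by_level[[length(peaks_by_level)]][1]
  shared <- all(vapply(peaks_by_level,
                       function(p) any(abs(p - top) <= tol),
                       logical(1)))
  if (shared) top else NA_real_
}

# Segment 7 of length 50 starts at index 301 in a signature of 1100 units
level2 <- scaled_range(301, 50, level = 2, sig_length = 1100)
level3 <- scaled_range(301, 50, level = 3, sig_length = 1100)

# Five, three and one peak(s) at scale levels 1, 2 and 3 (made-up positions)
ccp    <- find_ccp(list(c(-6, 30, -52, 14, 70), c(-5, 31, 72), -6))
no_ccp <- find_ccp(list(c(-6, 30, -52, 14, 70), c(-5, 31, 72), 40))
```

In the first toy case all three levels have a peak near -6, so that position is returned; in the second the level 3 peak at 40 has no counterpart at the other levels and the result is `NA`.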
The following plots (generated by `cmpsR::cmps_segment_plot`) summarize the information of the two above plots. It shows that segment 7 finds a consistent correlation peak (ccp) at a position near 0 (position `-6`).

```r
# Run the CMPS algorithm and keep the full result so it can be plotted
cmps <- extract_feature_cmps(x, y, include = "full_result")

# Build the plots for basis segment 7
cmps_plot_list <- cmpsR::cmps_segment_plot(cmps, seg_idx = 7)

# Arrange the plots in a 3 x 2 grid
ggpubr::ggarrange(plotlist = unlist(cmps_plot_list, recursive = FALSE), nrow = 3, ncol = 2)
```