# iGraphMatch

```r
library(iGraphMatch)
```

#### General Description of Graph Matching

Graph matching is an increasingly important problem with applications in a wide variety of fields, including biology, neuroscience, pattern recognition, and machine learning, to name a few. For example, as a fundamental tool for solving pattern recognition problems, graph matching is widely used in computer vision. There, one seeks a correspondence between local features of two images: the features are labeled as nodes of the graphs, and relational aspects between features are modeled by the edges.

While packages such as igraph and GraphM also provide graph matching functionality, our goal is to provide a single centralized repository for graph matching that confronts many of the additional pathologies of the problem. The igraph package offers versatile options for descriptive network analysis and graph visualization based on igraph objects in R, Python, and C/C++, but it does not focus on implementations of the most commonly used and cutting-edge graph matching algorithms. In contrast to igraph, the iGraphMatch package is also more flexible in handling different types of graph objects, ranging from igraph objects to matrix objects. GraphM provides tools for approximately solving large-scale graph matching problems: it implements in C/C++ a variety of methods proposed between 1999 and 2009, including the Umeyama algorithm, a linear programming approach, the rank algorithm, the QCV (quadratic convex relaxation) algorithm, and the PATH (path following) algorithm. Its corresponding R package, RGraphM, has not been published yet.

```r
g2 <- cgnp_pair$graph2
(seeds <- 1:10 <= 3)
#>  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

Then just randomly pick two pairs of vertices as soft seeds. Note that soft seeds may not be correct, but soft seeds may improve during the matching procedure. Let's first pick the soft seeds {4,4} and {8,6}, a combination of a good soft seed and a bad soft seed:

```r
bad_soft_seeds <- rbind(c(4, 4), c(8, 6))
```

Initializing different start matrices is very convenient with the iGraphMatch package. The function `init_start` returns an $$n_{ns}$$-by-$$n_{ns}$$ matrix, where $$n_{ns}$$ denotes the number of nonseed vertices.

```r
nns <- 7
ns <- 3
(start_bari <- init_start(start = "bari", nns = nns, ns = ns, soft_seeds = bad_soft_seeds))
#> Sparse part
#> 7 x 7 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 . . . . . .
#> [2,] . . . . . . .
#> [3,] . . . . . . .
#> [4,] . . . . . . .
#> [5,] . . 1 . . . .
#> [6,] . . . . . . .
#> [7,] . . . . . . .
#> plus left factor
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>
#> [1,] .
#> [2,] 1
#> [3,] 1
#> [4,] 1
#> [5,] .
#> [6,] 1
#> [7,] 1
#> times right factor transpose
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>
#> [1,] .
#> [2,] 0.2
#> [3,] .
#> [4,] 0.2
#> [5,] 0.2
#> [6,] 0.2
#> [7,] 0.2

set.seed(123)
(start_rds <- init_start(start = "rds", nns = nns, ns = ns, soft_seeds = bad_soft_seeds))
#> 7 x 7 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 .          . .          .          .          .
#> [2,] . 0.08224823 . 0.01678766 0.32382493 0.35940770 0.2177315
#> [3,] . 0.26796914 . 0.23130193 0.18235230 0.11682563 0.2015510
#> [4,] . 0.13781874 . 0.38747748 0.27018824 0.01979391 0.1847216
#> [5,] . .          1 .          .          .          .
#> [6,] . 0.24665273 . 0.19846281 0.18927607 0.12792141 0.2376870
#> [7,] . 0.26531116 . 0.16597012 0.03435846 0.37605135 0.1583089

(start_convex <- init_start(start = "convex", nns = nns, ns = ns,
                            soft_seeds = bad_soft_seeds, A = g1, B = g2, seeds = seeds))
#> Sparse part
#> 7 x 7 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 0.29659736 0.16740579 .         0.12411063 0.20217656 0.07348296 0.13622670
#> [2,] 0.05777703 0.14300415 .         .          0.07796460 0.63939292 0.05364411
#> [3,] 0.23157496 0.41937506 .         .          0.12471556 0.13051394 0.06560330
#> [4,] 0.09105435 0.05728915 0.2497145 0.49848492 0.05364411 .          0.02159578
#> [5,] .          0.07032994 0.3898809 0.05364411 0.36359585 .          0.12254920
#> [6,] 0.29561797 0.06519647 0.1334635 0.29554317 .          0.09343773 0.08852395
#> [7,] 0.02737833 0.04918226 0.2269410 .          0.14968614 0.03495526 0.48363979
#> plus left factor
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>
#> [1,] .
#> [2,] 0.02821718
#> [3,] 0.02821718
#> [4,] 0.02821718
#> [5,] .
#> [6,] 0.02821718
#> [7,] 0.02821718
#> times right factor transpose
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>
#> [1,] .
#> [2,] 0.2
#> [3,] .
#> [4,] 0.2
#> [5,] 0.2
#> [6,] 0.2
#> [7,] 0.2
```

Then implement seeded graph matching with the FAQ algorithm via the `graph_match_FW` function. To incorporate soft seeds, we specify the `start` argument of `graph_match_FW` accordingly. Note that in this example we assume the true permutation to be the identity matrix.

```r
set.seed(123)
match_bari <- graph_match_FW(g1, g2, seeds, start = start_bari)
match_rds <- graph_match_FW(g1, g2, seeds, start = start_rds)
match_convex <- graph_match_FW(g1, g2, seeds, start = start_convex)
err_bari_bad <- mean(match_bari$corr$corr_A[!seeds] != match_bari$corr$corr_B[!seeds])
err_rds_bad <- mean(match_rds$corr$corr_A[!seeds] != match_rds$corr$corr_B[!seeds])
err_convex_bad <- mean(match_convex$corr$corr_A[!seeds] != match_convex$corr$corr_B[!seeds])
```

Since in this example some of the soft seeds are incorrect, intuitively not incorporating such incorrect information might yield better matching results. This motivates a comparative experiment on whether or not to incorporate such incorrect soft seeds.
The R code for performing graph matching with only the hard seeds is similar to soft seeding, except for the specification of the `start` argument.

```r
set.seed(123)
match_nss_bari <- graph_match_FW(g1, g2, seeds, start = "bari")
match_nss_rds <- graph_match_FW(g1, g2, seeds, start = "rds")
match_nss_convex <- graph_match_FW(g1, g2, seeds, start = "convex")
```

For completeness, we also include an example of good soft seeds, say {4,4} and {8,8}. We then follow the same procedure as in the bad soft seeds case and obtain the matching results for good soft seeds on the same graphs. Since the code for the soft seeding implementation is the same, we skip it and only show the results comparing the three cases.

```r
good_soft_seeds <- rbind(c(4, 4), c(8, 8))
```

Table 2: Matching Errors With Various Initialization Methods

|                 | bari      | rds       | convex    |
|-----------------|-----------|-----------|-----------|
| good soft seeds | 0.7142857 | 0.5714286 | 0.7142857 |
| bad soft seeds  | 0.7142857 | 0.7142857 | 0.7142857 |
| non soft seeds  | 0.8571429 | 0.8571429 | 0.8571429 |

Table 2 presents the matching error on the nonseed vertex set for the various initialization methods. Compared with using a mixture of good and bad soft seeds, or matching without soft seeds, incorporating good soft seeds decreases, or maintains, the matching error for all initialization methods. Note that for the convex initialization method, good soft seeds allow us to recover the true correspondence between the two graphs. Using partially good soft seeds can also improve matching performance, except in the convex case. In general, the random doubly stochastic initialization yields the most stable matching results, while the convex initialization with good soft seeds achieves the highest matching accuracy among all the methods. We further illustrate the differences between initialization methods with experiments on larger graphs, under more parameter settings, and with more Monte Carlo replicates.
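Runs like the ones above can be organized with a small helper. This is a minimal sketch, assuming `g1`, `g2`, the logical `seeds` vector, and an identity true permutation as in the running example; `match_error` is a name chosen here, not a function in the package.

```r
# Sketch: compute the nonseed matching error for a given start method,
# assuming the true permutation is the identity (as in this example).
match_error <- function(g1, g2, seeds, start) {
  m <- graph_match_FW(g1, g2, seeds, start = start)
  mean(m$corr$corr_A[!seeds] != m$corr$corr_B[!seeds])
}

# Usage idea (not run here): errors for each initialization without soft seeds.
# sapply(c("bari", "rds", "convex"),
#        function(s) match_error(g1, g2, seeds, start = s))
```

Wrapping the error computation once makes it easy to sweep over start methods and soft seed choices with `sapply` instead of repeating the three-line pattern above.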
The experiment is relatively time-consuming for larger graphs, so we recommend running larger simulations on a server, or using statistical computing techniques, e.g. divide and conquer, to speed up the experiments. Here we just show the resulting figure and skip the R code. Figure 1 shows the average performance over 300 graph pairs sampled from the Erdős-Rényi graph model with the settings $$p\in\{0.1,0.2,0.5\}$$, $$\rho=0.5$$, $$n_v=250$$ and $$n_s=10$$. Wrong soft seeds are randomly sampled from the nonseed vertices. We observe that with a moderate number of incorrect soft seeds, the convex start outperforms the other two start methods. Switching from a random doubly stochastic start to a convex start can reduce the matching error by 0.1 to 0.3. As a result, when implementing soft-seeded graph matching, if we know the proportion of incorrect soft seeds is not too high (approximately $$\le 70\%$$), a convex start is recommended.

## Identification of Core Vertices

### Graph Matching with Junk Vertices Setting

The previous discussions all assumed that for each vertex in $$G_1$$ there is always a corresponding vertex in $$G_2$$. However, this is not always the case. Social networks offer a compelling example, where matching across different social platforms (or within a single time-varying social network) requires the understanding that not all users will be participants in both networks. Junk vertices refer to vertices that have no true alignment in the other graph. This can be modeled by correlated heterogeneous Erdős-Rényi random graphs.

**Definition.** For $$R$$ and $$\Lambda$$ symmetric, hollow matrices in $$[0,1]^{n\times n}$$, we say $$A$$, $$B$$ are $$R$$-correlated heterogeneous Erdős-Rényi($$\Lambda$$) random graphs (abbreviated $$CorrER(\Lambda, R)$$) if:

1. $$A$$ and $$B$$ are marginally $$ER(\Lambda)$$; i.e., for all $$u$$, $$v\in [n]$$, $$u<v$$, $$A_{uv} \stackrel{\text{iid}}{\sim} Bern(\Lambda_{uv})$$ and $$B_{uv} \stackrel{\text{iid}}{\sim} Bern(\Lambda_{uv})$$, with $$A_{uv}=A_{vu}$$ and $$B_{uv}=B_{vu}$$.
2. For all $$u$$, $$v$$, $$w$$, $$r\in [n]$$, $$u<v$$, $$w<r$$, the entries $$A_{uv}$$ and $$B_{wr}$$ are independent unless $$u=w$$ and $$v=r$$, in which case the correlation between $$A_{uv}$$ and $$B_{uv}$$ is $$R_{u,v}\ge 0$$.

At one extreme, if $$R=[0]_n$$ then the graphs are independent $$ER(\Lambda)$$; at the other extreme, if $$R=J_n$$ then $$A$$ and $$B$$ are isomorphic almost surely. This model is a generalization of the homogeneous correlated $$ER$$ model discussed earlier, and similarly allows for the addition of "junk" vertices—those without a probabilistic match across graphs—by setting $$R=R_k\oplus[0]_{n-k}$$ for some $$k\le n$$. We say $$(A,B) \sim CorrER(\Lambda, R)$$ are matchable if argmin$$_{P\in \Pi(n)} \Vert AP-PB \Vert _F={I_n}$$, where $$I_n$$ denotes the identity matrix. Our goals are to uncover the alignment between core vertices, i.e. those where $$R_{u,v}>0$$ for some $$u,v \in [n]$$, and to detect which vertices are core versus junk.

### Different Measures for Goodness of Matching

Having a measure of the goodness of a matching is important. Such a measure can be used to find core vertices in the setting with junk vertices. Since it lets us rank all vertices by goodness of matching, it is also useful for choosing the best matched vertices. The best matched core vertices can then be used as additional seeds in the next iteration of graph matching; we call this process adaptive seeding, and it is introduced in the next section. Here we list three measures of goodness of matching: the row permutation statistic, the row difference, and the row correlation. The row permutation statistic is based on a graph matching variant of the permutation test.
In terms of a vertex $$v$$, test the hypotheses

$$H_0^{(v)}: \forall P\in\mathcal{P}, u\neq v,\ corr(A_{vu},(PBP^T)_{vu})=0$$

$$H_A^{(v)}: \exists P\in\mathcal{P}, u\neq v,\ corr(A_{vu},(PBP^T)_{vu})>0$$

To test these, we make use of the following relationship between edge-wise correlation and the number of errors induced between $$A$$ and $$P^*B{P^*}^T$$. Namely, if $$v$$ is a core vertex (or $$v$$ is correctly matched), then the number of errors induced by $$P^*$$ across the neighborhoods of $$v$$ in $$A$$ and $$B$$ (i.e., $$\Vert(AP^*-P^*B)_{v,\bullet}\Vert_1$$) should be significantly smaller than the number of errors induced by a randomly chosen permutation $$P$$ (i.e., $$\Vert(AP-PB)_{v,\bullet}\Vert_1$$). With this in mind, we define $$\Delta_v(P)=\Vert(AP-PB)_{v,\bullet}\Vert_1$$ and let $$\mathbb{E}_P$$ and $$\mathrm{Var}_P$$ denote the conditional expectation and variance of these quantities with respect to uniform sampling of $$P$$ over all permutation matrices. Inspired by the permutation test, we define the row permutation statistic as

$$T_p(v,P^*):=\frac{\Delta_v(P^*)-\mathbb{E}_P\Delta_v(P)}{\sqrt{\mathrm{Var}_P\Delta_v(P)}}$$

Intuitively, the smaller $$T_p(v,P^*)$$, the more likely $$v$$ is to be a core vertex (or the more likely we have found the true alignment of $$v$$).

The row difference is defined as the $$L_1$$ norm of the difference between the rows of $$A$$ and $$P^*B{P^*}^T$$:

$$T_d(v,P^*):=\Vert A_{v\bullet}-(P^*B{P^*}^T)_{v\bullet}\Vert_1$$

Intuitively, a correctly matched vertex $$v$$ (or core vertex $$v$$) should induce a smaller $$T_d(v,P^*)$$.

The row correlation is a statistic based on the correlation between $$A$$ and $$P^*B{P^*}^T$$:

$$T_c(v,P^*):=1-corr(A_{v\bullet},(P^*B{P^*}^T)_{v\bullet})$$

If $$v$$ is correctly matched, the correlation between the neighborhoods of $$v$$ in $$A$$ and $$B$$ should be larger. Thus, a smaller $$T_c(v,P^*)$$ value indicates a higher chance that vertex $$v$$ is correctly matched (or that $$v$$ is a core vertex).
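The three measures can be computed directly in base R. The following toy sketch takes $$P^*$$ to be the identity and replaces the exact permutation moments of $$\Delta_v(P)$$ with a small Monte Carlo sample of random permutations—an assumption made here for simplicity, not the package's implementation. Since every vertex in this toy pair is correctly matched, all $$T_p$$ values come out clearly negative.

```r
# Toy pair: B equals A except for one flipped edge, so the identity is
# the true correspondence and every vertex is correctly matched.
set.seed(1)
n <- 8
A <- matrix(rbinom(n * n, 1, 0.4), n, n)
A <- 1 * (A | t(A)); diag(A) <- 0           # symmetric, hollow
B <- A
B[1, 2] <- B[2, 1] <- 1 - B[1, 2]           # perturb edge {1,2}

# Row difference: T_d(v) = ||A_{v.} - (P* B P*^T)_{v.}||_1
T_d <- rowSums(abs(A - B))

# Row correlation: T_c(v) = 1 - corr(A_{v.}, (P* B P*^T)_{v.})
T_c <- 1 - sapply(seq_len(n), function(v) cor(A[v, ], B[v, ]))

# Row permutation statistic: standardize Delta_v(P*) against Delta_v(P)
# for randomly sampled permutation matrices P.
delta <- function(P) rowSums(abs(A %*% P - P %*% B))
d_star <- delta(diag(n))                    # P* = identity
rand <- replicate(200, delta(diag(n)[sample(n), ]))
T_p <- (d_star - rowMeans(rand)) / apply(rand, 1, sd)

T_d
#> [1] 1 1 0 0 0 0 0 0
```

Only rows 1 and 2 differ (by the single flipped edge), so their row differences are 1 and all others are 0, matching the intuition that well-matched vertices induce small values of each statistic.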
### Algorithm for Finding Core Vertices

We next develop an approach to correctly identify core vertices. The main tool we utilize is a graph matching variant of the permutation test: we use the test statistic $$T(\bullet)$$ to rank vertices by the likelihood that they are core vertices. Suppose $$A,B\in\{0,1\}^{(n_c+n_j)\times (n_c+n_j)}$$ are the adjacency matrices of graphs $$G_1$$ and $$G_2$$, where $$n_c$$ denotes the number of core vertices and $$n_j$$ the number of junk vertices, and denote the available seeded vertices by $$S$$. The algorithm for finding core vertices consists of the following steps. First, use the available seeds $$S$$ to match $$A$$ and $$B$$, yielding the optimal permutation $$P^*$$. Then plug $$P^*$$ into $$T(v,P^*)$$ and compute its value for each vertex $$v$$. Finally, rank all vertices by increasing value of $$T(v,P^*)$$, so that the likelihood of being a core vertex decreases down the ranking; the last $$n_j$$ vertices are identified as junk.

### Simulations

Now we implement the algorithm for finding core vertices in R with the iGraphMatch package. First we sample a pair of graphs with 50 vertices from the correlated Erdős-Rényi graph model, letting 5 of the vertices be junk, using the function `sample_correlated_gnp_pair_w_junk`. We also take the first 10 vertices to be seeds. Then we apply the FAQ algorithm to match the two graphs.

```r
set.seed(5)
cgnp_pair <- sample_correlated_gnp_pair_w_junk(n = 50, corr = .5, p = .5, ncore = 45)
g1 <- cgnp_pair$graph1
```
```r
g2 <- cgnp_pair$graph2
seeds <- 1:50 <= 10
core <- 1:50 <= 45
junk <- !core
non_seeds <- !seeds
match <- graph_match_FW(g1, g2, seeds, start = "rds")
```

Then we implement the algorithm for finding core vertices, using the row permutation statistic as our measure in this example. The main steps of the algorithm are to compute the row permutation statistic for each nonseed vertex and to rank the vertices in increasing order of the statistic. These steps can be carried out with the `best_matches` function, which finds the best matched vertices and identifies core vertices. Since we want to rank all the nonseed vertices by their row permutation statistics, the vertices of interest are the nonseeds, and the argument `num`, which denotes the number of top-ranked vertices to return, is set to the number of nonseeds. By extracting `A_best` from the result, we get the indices of vertices in $$G_1$$ in order of decreasing likelihood of being core vertices.

```r
r <- best_matches(A = g1, B = g2, match = match, measure = "row_perm_stat",
                  num = sum(non_seeds))$A_best
r
#>  [1] 18 31 32 24 19 13 27 28 30 17 36 20 37 33 42 44 34 21 29 15 48 12 49 14 39
#> [26] 11 38 46 35 16 43 50 40 41 25 45 22 26 47 23
```
Table 3: Core Identification Precisions

|                               | k = 1 | k = 5 | k = 10 | k = 25    | k = 35 |
|-------------------------------|-------|-------|--------|-----------|--------|
| core identification precision | 1     | 1     | 1      | 0.8687371 | 0.8125 |

Table 4: Junk Identification Precisions

|                               | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 |
|-------------------------------|-------|-------|-------|-------|-------|
| junk identification precision | 0.25  | 0     | 0.08  | 0.21  | 0     |

Since there are 5 junk vertices, the last 5 vertices in the output are identified as junk. Note that seeded vertices are not ranked. To evaluate the performance of the core identification algorithm in this example, we calculate the classification precision for each vertex and summarize the precisions in Table 3 and Table 4. Table 3 shows the identification precision for $$k=1,5,10,25,35$$, where $$k$$ denotes the rank of identified core vertices; a higher rank indicates a higher probability of being a core vertex. From Table 3, the core identification algorithm achieves high precision for identifying the core vertices. In contrast, the identification precision for junk vertices drops quickly as we average over more highly ranked junk vertices.
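As an illustration of how such precisions can be computed from a ranking, here is a base-R sketch using the printed ranking `r` and the known core indicator. Note that the tabulated values above come from a fuller evaluation, so the single-run numbers below need not coincide with them; `precision_at` is a name chosen here, not a package function.

```r
# Ranking from the best_matches output above; vertices 1:45 are core.
r <- c(18, 31, 32, 24, 19, 13, 27, 28, 30, 17, 36, 20, 37, 33, 42, 44, 34,
       21, 29, 15, 48, 12, 49, 14, 39, 11, 38, 46, 35, 16, 43, 50, 40, 41,
       25, 45, 22, 26, 47, 23)
core <- 1:50 <= 45

# Core identification precision at rank k: fraction of the k top-ranked
# vertices that are truly core.
precision_at <- function(k) mean(core[r[1:k]])
sapply(c(1, 5, 10, 25), precision_at)
#> [1] 1.00 1.00 1.00 0.92
```

The first junk vertex (48) appears at rank 21, so the precision stays at 1 for small $$k$$ and only starts to decay once the ranking reaches the junk vertices.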

The figure shows the simulation results for two random graphs sampled from CorrER$$(0.2,0.5J_{n_c}\oplus\mathrm{0}_{n_j})$$ with $$n_j \in\{20,50\}$$ and $$n_s\in\{5,10,20\}$$. The figure averages the results of 1000 Monte Carlo replicates by plotting the mean precision at each rank, with lower ranks indicating core vertices and higher ranks near $$n$$ indicating junk vertices. When there is a small number of seeded vertices, the core identification algorithm does not outperform the random algorithm (indicated by the horizontal lines) by much. But with 20 seeds, the core identification algorithm achieves substantially higher precision than the random algorithm, especially for junk vertices. As expected, when there is a smaller number of junk vertices, the algorithm performs better.

## Graph Matching with Adaptive Seeds

In this part, we show how to apply the FW graph matching method while adding adaptive seeds iteratively. The idea is motivated by the fact that in many realistic applications we only know a small number of seeds while the number of vertices to be matched is large. Since acquiring seeds is costly and difficult, the following algorithm provides a feasible approach to acquiring additional seeds and using them in graph matching to achieve higher matching accuracy.

The inputs are the adjacency matrices of two graphs $$A$$ and $$B$$, both composed of $$n_c$$ matchable core vertices and $$n_j$$ junk vertices (vertices with no bijective counterpart in the other graph), together with some available seeds $$S$$. The algorithm iterates: match the graphs using the current seeds, identify the best matched vertices, promote them to additional seeds, and repeat.
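The iteration can be sketched as an R function. This is an illustrative outline, not the package's implementation: the number of new seeds per iteration and the iteration count are choices made here, and it assumes `best_matches` also reports the matched indices in $$G_2$$ via `$B_best` and that `graph_match_FW` accepts seeds as a two-column matrix of vertex pairs.

```r
# Sketch of adaptive seeding: repeatedly match, promote the best matched
# vertex pairs to seeds, and re-match with the enlarged seed set.
adaptive_seed_match <- function(g1, g2, seeds, n_new = 5, n_iter = 3) {
  for (i in seq_len(n_iter)) {
    match <- graph_match_FW(g1, g2, seeds, start = "rds")
    best <- best_matches(A = g1, B = g2, match = match,
                         measure = "row_perm_stat", num = n_new)
    # promote the top-ranked matched pairs to seeds for the next round
    seeds <- rbind(seeds, cbind(best$A_best, best$B_best))
  }
  match
}
```

In practice one would also guard against promoting pairs that are already seeds, and stop early once the matching stabilizes; both refinements are omitted here for brevity.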

## Discussion

The primary goal of this package is to facilitate research related to graph matching by providing a variety of useful tools for several aspects of the graph matching problem, including graph matching algorithms, quality measures, random graph models, etc. In the iGraphMatch package we implement two graph matching algorithms: the FAQ algorithm and the convex relaxation algorithm. Both algorithms are applicable to the more general setting with junk vertices. Moreover, the package can handle weighted graphs and graphs of different orders (by padding). In future work, we aim to implement more graph matching algorithms that enhance computational efficiency, which is especially important when matching two large-scale graphs.