# backbone_introduction

library(backbone)

# Weighted Graphs and Backbones

## Introduction

In a graph $$G$$, edges are either present (i.e. $$G_{ij}=1$$) or absent (i.e. $$G_{ij}=0$$). However in a weighted or valued graph, edges can take a range of values that may capture such properties as the strength or capacity of the edge. Although weighted graphs contain a large amount of information, there are some cases (e.g. visualization, application of statistical models not developed for weighted graphs) where it is useful to reduce this information by focusing on an unweighted subgraph that contains only the most important edges. We call this subgraph the backbone of $$G$$, which we denote as $$G’$$. Extracting $$G’$$ from $$G$$ requires deciding which edges to preserve. This usually involves selecting a threshold $$T_{ij}$$ such that edges are preserved if they are above the threshold (i.e. $$G_{ij}’=1$$ if $$G_{ij} > T_{ij}$$), and omitted if they are below the threshold (i.e. $$G_{ij}’=0$$ if $$G_{ij} < T_{ij}$$). It is also possible to extract a signed backbone by selecting upper $$T^+_{ij}$$ and lower $$T^-_{ij}$$ thresholds such that $$G_{ij}’=1$$ if $$G_{ij} > T^+_{ij}$$, $$G_{ij}’=-1$$ if $$G_{ij} < T^-_{ij}$$, and $$G_{ij}’=0$$ if $$G_{ij} > T^-_{ij}$$ and $$G_{ij} < T^+_{ij}$$. The key to all backbone extraction methods lies in the selection of $$T$$. The backbone package provides several different methods for selecting $$T$$ and thus extracting $$G’$$ from $$G$$.

## Example data

We outline the use of the backbone package with Davis, Gardner, and Gardner’s Southern Women Dataset (Davis, Gardner, and Gardner 1941), which can be accessed via (Repository 2006). This data takes the form of a bipartite graph $$B$$ containing 18 women (rows) and 14 social events (columns) taking place over a nine month period. In $$B$$, $$B_{ij} = 1$$ if women $$i$$ attended event $$j$$, and otherwise is 0. Let’s take a look at the Davis dataset included in this package to see that it is bipartite.

We see that our two sets of vertices are women and events attended.

A weighted graph $$G$$ can be constructed from $$B$$ via bipartite projection, where $$G = BB^T$$ and $$G_{ij}$$ contains the number of events that both woman $$i$$ and woman $$j$$ attended. Looking at the matrix of southern women and events attended above, we see that Evelyn and Charlotte have attended three of the same events. This means that $$G_{15} = 3$$ in the projection, shown below.

In this vignette, we demonstrate using the backbone package to extract the backbone of $$G$$, which involves deciding whether to preserve an edge between Evelyn and Charlotte in $$G’$$, and similarly for all other edges in $$G$$.

# General Backbone Methods

In this section, we will describe backbone methods that can be applied to any weighted graph, whether the weights are present in a natively unipartite graph, or are the product of a bipartite projection (as is the case in our example data).

## Universal Backbone: universal( )

The simplest approach to backbone extraction applies a single threshold $$T$$ to all edges, and is achieved using the universal() function. The universal() function allows the user to extract a binary backbone by selecting a single threshold $$T$$, or extract a signed backbone by selecting upper and lower thresholds $$T^+$$ and $$T^-$$.

The universal( ) function has four parameters,

• M, Matrix: a weighted adjacency matrix or a bipartite adjacency matrix
• upper, Real or FUN: upper threshold value or function to be applied to the edge weights. Default is 0.
• lower, Real or FUN: lower threshold value or function to be applied to the edge weights. Default is NULL.
• bipartite Boolean: TRUE if bipartite matrix, FALSE if weighted matrix. Default is FALSE.

## The Stochastic Degree Sequence Model: sdsm( )

The stochastic degree sequence model compares an edge’s observed weight, $$G_{ij}$$ to the distribution of weights expected in a projection obtained from a random bipartite network where both the row vertex degrees and column vertex degrees are approximately fixed. This method of backbone extraction was developed in (Neal 2014). The construction of $$B^*$$ involves a series of steps:

1. The $$\beta$$ parameters in $$Pr(B_{ij}=1) = \beta_0 +\beta_1 B_i + \beta_2 B_j +\beta_3 (B_i \times B_j)$$ are estimated using a binomial regression (e.g. logit, probit, complementary log-log, etc.), where $$B_i$$ and $$B_j$$ are the row vertex and column vertex degrees in $$B$$, respectively.
2. The fitted parameters are used to compute the predicted probability that $$B_{ij}=1$$.
3. Either $$B^*$$ is constructed via the Poisson-Binomial distribution, or $$B^*$$ is constructed such that $$B_{ij}^*$$ is the outcome of a single Bernouilli trial with $$Pr(B_{ij}=1)$$ probability of success.

The sdsm( ) function has nine parameters,

• B, Matrix: Bipartite adjacency matrix
• trials, Integer: Number of random bipartite graphs generated. Default is 0.
• model, String: A generalized linear model (glm) used to generate random bipartite graphs. The model parameter can take in a link function, as described by the stats package under stats::glm and stats::family. This can be one of c('logit', 'probit', 'cauchit', 'log', 'cloglog'). Default is ‘logit’.
• sparse, Boolean: If sparse matrix manipulations should be used. Default is TRUE.
• maxiter, Integer: Maximum number of iterations if “model” is a glm. Default is 25.
• dyad, vector length 2: two row entries i,j. Saves each value of $$G^*_{ij}$$, which is useful for visualizing an example of the empirical null edge weight distribution generated by the model. These correspond to the row and column indices of a cell in the projected matrix , and can be written as their string row names or as numeric values. Default is NULL. Only included if trials > 0.
• alpha Real: proposed alpha threshold to be used for determining statistical significance of edges. Default is 0.05.
• tolerance Real: tolerance for p-value computation using RNA poisson-binomial approximation. Default is 0.
• progress, Boolean: If utils::txtProgressBar should be used to measure progress. Default is FALSE.

If trials > 0, the function uses repeat Bernoulli trials to compute the proportions, using the following steps: During each iteration, sdsm computes a new B* matrix using probabilities computed in the glm. This is a random bipartite matrix with about the same row and column sums as the original matrix B. If the dyad_parameter is indicated to be used in the parameters, when the B* matrix is projected, the projected value for the corresponding row and column will be saved. This allows the user to see the distribution of the edge weights for desired row and column.

If trials = 0, the proportion of edges above or below the observed values are computed using the Poisson Binomial distribution. These values are approximated using either a Discrete Fourier Transform (DFT method) or a Refined Normal Approximation (RNA method). These functions are described by . The RNA method is used by default, unless the computed value is within the margin of ‘alpha’-‘tolerance’ and ‘alpha’+‘tolerance’, the DFT method is used.

The sdsm() function returns a list of the following:

• positive: matrix of the proportion of times $$G_{ij}$$ is above the corresponding entry in $$G^*$$
• negative: matrix of the proportion of times $$G_{ij}$$ is below the corresponding entry in $$G^*$$
• dyad_values: list of edge weights for $$i,j$$ in each $$G^*$$
• summary: a data frame summary of the inputted matrix and the model used including: model name, number of rows, skew of row sums, number of columns, skew of column sums, and running time.

Below are three different ways to compute the backbone using the stochastic degree sequence model. The first computes each proportion using the RNA Poisson Binomial method. The second computes each proportion using the DFT Poisson Binomial method. The third computes each proportion using repeat Bernoulli trials.

The sdsm_bt\$dyad_values output is a list of the $$G_{Evelyn,Charlotte}^*$$ values for each of the 100 trials, which in these data corresponds to the number of parties Evelyn and Charlotte would be expected to simultaneously attend if: (a) the number of parties attended by Evelyn was approximately fixed, (b) the number of parties attended by Charlotte was approximately fixed, and (c) the number of attendees at each party was approximately fixed. Because we have provided only a positive matrix, backbone.extract() returns a binary backbone matrix by conducting a one-tailed significance test in which alpha is $$0.1$$.

# Future

The backbone package will be updated to contain additional backbone extraction methods that are used in the current literature.

# References

Carstens, C. J. 2015. “Proof of Uniform Sampling of Binary Matrices with Fixed Row Sums and Column Sums for the Fast Curveball Algorithm.” Physical Review E 91 (4). https://doi.org/10.1103/PhysRevE.91.042812.

Davis, Allison, Burleigh B Gardner, and Mary R Gardner. 1941. Deep South: A Social Anthropological Study of Caste and Class. University of Chicago Press. https://doi.org/10.1177/0002716242220001105.

Neal, Zachary. 2013. “Identifying Statistically Significant Edges in One-Mode Projections.” Social Network Analysis and Mining 3 (4): 915–24. https://doi.org/10.1007/s13278-013-0107-y.

———. 2014. “The Backbone of Bipartite Projections: Inferring Relationships from Co-Authorship, Co-Sponsorship, Co-Attendance and Other Co-Behaviors.” Social Networks 39 (October): 84–97. https://doi.org/10.1016/j.socnet.2014.06.001.

Repository, UCI Network Data. 2006. “Southern Women Data Set.” https://networkdata.ics.uci.edu/netdata/html/davis.html.

Strona, Giovanni, Domenico Nappo, Francesco Boccacci, Simone Fattorini, and Jesus San-Miguel-Ayanz. 2014. “A Fast and Unbiased Procedure to Randomize Ecological Binary Matrices with Fixed Row and Column Totals.” Nature Communications 5 (June): 4114. https://doi.org/10.1038/ncomms5114.

Strona, Giovanni, Werner Ulrich, and Nicholas J. Gotelli. 2018. “Bi-Dimensional Null Model Analysis of Presence-Absence Binary Matrices.” Ecology 99 (1): 103–15. https://doi.org/10.1002/ecy.2043.

Zweig, Katharina Anna, and Michael Kaufmann. 2011. “A Systematic Approach to the One-Mode Projection of Bipartite Graphs.” Social Network Analysis and Mining 1 (3): 187–218. https://doi.org/10.1007/s13278-011-0021-0.