help button home button Biophys. J.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

Originally published as Biophys J. BioFAST on June 30, 2006.
doi:10.1529/biophysj.106.082560
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
biophysj.106.082560v1
91/8/2749    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Brynildsen, M. P.
Right arrow Articles by Liao, J. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Brynildsen, M. P.
Right arrow Articles by Liao, J. C.
Biophysical Journal 91:2749-2759 (2006)
© 2006 The Biophysical Society

Versatility and Connectivity Efficiency of Bipartite Transcription Networks

Mark P. Brynildsen, Linh M. Tran and James C. Liao

Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California

Correspondence: Address reprint requests to J. C. Liao, Tel.: 310-825-1656; E-mail: liaoj{at}ucla.edu.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
The modulation of promoter activity by DNA-binding transcription regulators forms a bipartite network between the regulators and genes, in which a smaller number of regulators control a much lager number of genes. To facilitate representation of gene expression data with the simplest possible network structure, we have characterized the ability of bipartite networks to describe data. This has led to the classification of two types of bipartite networks, versatile and nonversatile. Versatile networks can describe any data of the same rank, and are indistinguishable from one another. Nonversatile networks require constraints to be present in data they describe, which may be used to distinguish between different network topologies. By quantifying the ability of bipartite networks to represent data we were able to define connectivity efficiency, which is a measure of how economic the use of connections is within a network with respect to data representation and generation. We postulated that it may be desirable for an organism to maximize its gene expression range per network edge, since development of a regulatory connection may have some evolutionary cost. We found that the transcriptional regulatory networks of both Saccharomyces cerevisiae and Escherichia coli lie close to their respective connectivity efficiency maxima, suggesting that connectivity efficiency may have some evolutionary influence.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Bipartite networks have been used to represent many biological systems and engineering tasks, including gene expression regulation (1Go–6Go), signal processing (7Go,8Go), image processing (9Go–11Go), and spectrum analysis (12Go,13Go). These networks consist of a layer of sources connected to a layer of outputs, where every connection (edge) represents the influence of a source on an output (Fig. 1 A). In some cases, the output nodes are fully connected to the sources, for example, microphones recording simultaneous speeches in the same location. In others, the outputs are sparsely connected to the source signals, such as in transcriptional regulatory networks.


Figure 1
View larger version (14K):
[in this window]
[in a new window]
 
FIGURE 1  (A) Bipartite network depicting a hypothetical transcriptional regulatory network. (B) ZA corresponding to network in panel A. (C) Formula created from ZA in panel B. (D) Table of Formula nz, and Erj from ZA in panel B.

 
In general, it is advantageous to describe data with the simplest structure possible, both for interpretation and mechanistic reasons (14Go,15Go). However, conventional bipartite network analyses such as principal component analysis (PCA) and independent component analysis, assume that networks are fully connected. For systems governed by sparsely connected networks, this assumption could lead to the deduction of unrealistic source signals (4Go,14Go,16Go,17Go). A variation of PCA, called Sparse PCA, has been developed that acknowledges this issue and attempts to alleviate it (16Go,17Go). However, Sparse PCA like its precursor, PCA, requires deduced source signals to be mutually orthogonal. Such a mathematical constraint without any phenomenological justification may hinder the ability to provide simple representation, especially if the simplest structure may require oblique source signals (14Go,15Go). A complementary approach, network component analysis (NCA), takes into account known network connectivity in deducing source signals and allows for orthogonal and oblique source signals (4Go). However, if the a priori network connectivity has some degree of uncertainty, as in the case of ChIP-chip data being used to analyze DNA-microarray data, there may be simpler connectivities capable of describing the same data. Alternatively, exploratory factor analysis attempts to simplify structure by performing orthogonal or oblique rotations on a factorization. While the goal of this technique is to achieve simplicity of structure, the implementation has had difficulty with situations where the complexity of the simplest network exceeds that of maximal sparsity (one connection per output to the source layer) (15Go). To facilitate data representation with the simplest structure possible, we have characterized the ability of bipartite networks to describe data.

The ability of bipartite networks to describe data may be limited by network connectivity. In some cases, such as fully connected networks, any data within the span of the network can be described, while in other cases, such as sparsely connected networks, certain elements of the data may be required to lie on a single line or hyperplane. This leads to the classification of two types of bipartite networks, those networks whose output range is not limited by their connectivity, which we will term "versatile", and those networks whose output range is hindered by their connectivity, which we will term "nonversatile". Intuitively, one might think that any missing edge from a network might compromise its ability to describe data, and therefore any network besides a fully connected network will be nonversatile. However, this is not true, and there are networks that are not fully connected that can represent data equally as well as fully connected networks. These networks are also versatile and are not limited by their connectivity. The very existence of these networks demonstrates that there is no justification from data alone to conclude more than minimal versatile connectivity. Thereby, the most complex structure ever needed to describe data is the minimal versatile connectivity. Nonversatile networks, on the other hand, have their own utility, since their constraints are often present in datasets. Since nonversatile networks are often sparser than versatile networks they would provide the simplest representation under many circumstances.

In this article we define the minimal connectivity to achieve versatility, define the constraints present in nonversatile networks, discuss the implications of versatile and nonversatile networks, and suggest possible applications for their use. To demonstrate the utility of these concepts we examined the transcriptional regulatory networks of Saccharomyces cerevisiae and Escherichia coli. We recognized that for bipartite networks the ability to represent data is equivalent to the ability to generate data. With this in mind, we defined connectivity efficiency, which is a measure of how economic the use of connections is within a network with respect to data representation/generation ability. We then analyzed the connectivity efficiencies of the transcriptional regulatory networks of S. cerevisiae and E. coli. We postulated that it may be biologically desirable for organisms to maximize their gene expression range (breadth of possible gene expression profiles) per network edge, since development of a regulatory connection may have some evolutionary cost. Subsequently, we found that both networks lay close to their respective connectivity efficiency maxima, suggesting that connectivity efficiency may have some evolutionary influence.


    BACKGROUND
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
We are interested in the ability of bipartite networks to represent data. A bipartite network represents an output ei(t) by the linear mixing of sources, pj(t), through a mixing rule described by

Formula 1(1)
where aij values are the connectivity strengths. The mixing rule can be written in a matrix form,

Formula 2(2)
where E is the output data (N x M), A is the matrix of network connectivity strengths (N x L), and P is the collection of source signals (L x M). Bipartite network representation can further be generalized by considering only the connectivity pattern of matrix A,

Formula 3(3)
where the values of the nonzero aij are left unconstrained and can take on any value—positive, negative, or zero. For the purpose of this article, networks with varying connectivity strengths but the same connectivity pattern, ZA, will be discussed identically.

Versatile networks
Ideally, we would prefer to represent data with the simplest network connectivity possible. In the context of this work the simplicity and sparsity of networks will be synonymous. Thus, we seek to find the sparsest network connectivity that can reliably represent data. Naturally, we begin by considering networks that can represent any data. These networks are termed versatile, and are characterized by the following theorem.

Theorem 1
A linear bipartite network with connectivity pattern ZA (N x L) can describe any data within Formula 3 if all reduced forms of ZA, Formula 3 are full row rank.

Here, Formula 3 is defined as the rows of ZA which contain zeros in the ith column of ZA, where zi is the number of zeros in the ith column of ZA. To test this, consider the nonzero entries of Formula 3 as nonzero random values that cannot combine on their own to produce a rank deficiency.

To demonstrate use of Theorem 1 we have provided a hypothetical transcriptional regulatory network in Fig. 1 A, transformed the network into ZA form (Fig. 1 B), and determined all Formula 3 (Fig. 1 C). Both Formula 3 and Formula 3 are full row rank, but Formula 3 is not, and therefore the network in Fig. 1 A does not satisfy Theorem 1. For a network that would satisfy Theorem 1, simply connect TF3 to the first, second, or fourth gene. The proof of this theorem along with examples is presented in Appendix A.

A consequence of the versatility theorem is that all connectivity patterns that satisfy the required criterion will represent data equally. This means that there may exist a minimal connectivity that satisfies the criterion, which may be used to represent data created from denser network structures. To determine the minimal connectivity (sparsest network) to achieve versatility we must find the limit of the criterion. To do so we recognize that Formula 3 can only be full row rank if zi < L for every column of ZA. Therefore, the minimal connectivity to achieve versatility contains L(L – 1) missing edges, specifically (L – 1) per column of ZA. However, not all network connectivities with (L – 1) missing edges per column are versatile. Any network must still be in compliance with the above criterion to be versatile, even if it has the same number of, or a lesser number of missing edges than the minimal connectivity to achieve versatility.

Minimal connectivity for versatility is maximal connectivity for NCA-compliance
Interestingly, there exists a relationship between the minimal connectivity to achieve versatility and NCA. To guarantee the uniqueness of NCA solutions there are three criteria that must be satisfied. The second criterion in Liao et al. (4Go) deals with the connectivity pattern, ZA, and the ranks of its reduced forms, which are essentially identical to the reduced forms described here. It states that the rank of every reduced form, Formula 3 must be (L – 1). The maximum rank for any Formula 3 is (L – 1), and can only be achieved if zi ≥ (L 1). Therefore, a necessary condition for NCA-compliance is that a network must have a minimum of (L – 1) zeros per column.

Versatility requires that all Formula 3 be full row rank. The maximum row rank of Formula 3 is (L – 1), and can only be achieved if zi = (L – 1). This corresponds to the minimal connectivity to achieve versatility described previously. Thus, the minimum connectivity to achieve versatility requires all Formula 3 to have zi = (L – 1) and be of rank (L – 1), and the maximum connectivity (largest number of nonzero connections) to be NCA-compliant requires all Formula 3 to have zi = (L – 1) and be of rank (L – 1). Therefore, the minimum connectivity to achieve versatility is equivalent to the maximum NCA-compliant connectivity. To illustrate, examples have been provided in Appendix A.

Nonversatile networks
Nonversatile networks are those connectivity patterns that do not satisfy the versatility criterion. These networks have a reduced ability to represent data compared to that of versatile networks. Fig. 2 illustrates this concept, where the y axis is a measure of data representation capability (versatility index) that will be defined in the next section, and the x axis is the number of edges in the network. The reduced ability of nonversatile networks to represent data is due to connectivity constraints that dictate the type of data the network is able to describe. However, these constraints are often present in datasets, leading to the possibility of data representation with simpler structures than versatile networks. Therefore, we have characterized these constraints in the following theorem, such that they may aid in simplifying network structures used to represent data.


Figure 2
View larger version (9K):
[in this window]
[in a new window]
 
FIGURE 2  Plot of versatility index versus number of edges, for >10,000 networks with 50 outputs and 10 sources.

 
Definitions
  1. A zero pattern is a 1 x L vector that indicates, by the position of zero entries, which transcription factors (TF) do not control expression of a gene. The number of zero entries in a zero pattern is designated by nz. A system with three TFs has seven possible zero patterns, which are shown in Fig. 1 D. The zero pattern Formula 3 indicates that a gene is not controlled by TF3.
  2. Any gene that satisfies the definition of a zero pattern is a member of that zero pattern. For instance, the zero pattern Formula 3 requires genes to not be regulated by TF3, therefore, gene1,2,4 are all members.
  3. An informative zero pattern, Formula 3 is any zero pattern with > Lnzj members, where nzj is equal to the number of zeros in Formula 3 Fig. 1 A has two Formula 3 and Formula 3
  4. Erj is a matrix composed of the genes (rows of E) that are members of Formula 3 From Fig. 1, Er1 = E(rows 1, 2, 4).

Theorem 2
Any dataset, E (N x L), may be represented by a linear bipartite network characterized by connectivity pattern ZA (N x L) if every Erj has rank ≤ (Lnzj).

Fig. 1 D summarizes all items needed to evaluate Theorem 2. As one can see, only two Formula 3 (Formula 3) exist and need to be evaluated by Erj. For a dataset to be represented by the network in Fig. 1 A, Er1 must have a rank ≤ 2, and Er2 must have rank ≤ 3. The proof of this theorem along with examples is presented in Appendix B. The theorem identifies bipartite connectivity constraints from ZA that must be present within E for ZA to represent it. Theorem 2 may be used to check whether a dataset can be represented by a network. The procedure to use Theorem 2 is presented in Table A1 of Appendix B along with an example to illustrate its use.


View this table:
[in this window]
[in a new window]
 
TABLE A1  Procedure used to determine whether a dataset, E, contains the connectivity constraints dictated by ZA

 
It should be noted that Theorem 2 is general and can be applied to any bipartite network. In fact, if one were to check whether a dataset could be represented by a versatile network, ZA (N x L), only one Formula 3 would be found that had >(Lnzj) members. This Formula 3 would not have any zero entries and would check whether the dataset was contained within Formula 3 a condition present in Theorem 1.

Implications of nonversatile networks
Although a dataset may satisfy Theorem 2 for a particular nonversatile network, the dataset may still contain additional constraints. This is due to the fact that constraints from nonversatile networks are nonunique. In fact, any network that can be created from another network by edge deletion (which we call the offspring networks) will have the same set of constraints or a larger set that contains the previous network's constraints. This means that the nonversatility criterion does not identify the minimal nonversatile connectivity to represent data, but simply identifies whether a dataset may be represented by a particular nonversatile network. To deduce the minimal nonversatile connectivity to represent data a method must be developed that can efficiently search for constraints in data, rather than see if data fits the constraints of a nonversatile network. This leads to the question of network reconstruction from constraints embedded in the data, which we will leave for the Discussion.

Connectivity efficiency
For bipartite networks the ability to represent data is equivalent to the ability to generate data. For transcriptional regulatory networks the ability to generate data would be the ability to generate gene expression. Knowing that transcriptional regulatory networks are generally sparse and that versatile networks of the same size would be fairly dense, we knew that transcriptional regulatory networks would not be versatile, and thus not have the maximal capability. With this in mind, we postulated that it may be desirable for organisms to maximize their gene expression ability per connection of the network, since it is safe to assume that there could be an evolutionary cost associated with the development of every regulatory interaction in the network.

First, we needed to define an index which could give us an indication of how close a network is to being versatile. We wanted the index to range from 0 to 1, where any network with a value of 1 would be versatile and any network with a value of 0 would be the most nonversatile (one connection per output to the source layer). We also required that if a network failed Theorem 1 for every Formula 3 any edge deletion within the network would decrease its index. We require that the nonversatile network fail every Formula 3 because those Formula 3 that comply with Theorem 1 correspond to columns of ZA that are versatile in nature, and thus edge deletion may not change that if they have <(L – 1) zeros. With these conditions in mind we defined the versatility index,

Formula 4(4)
where VI(ZA) is the versatility index of ZA, Formula 4 are the constraints imposed by ZA, max(Zc) are the constraints from the most nonversatile network the same size as ZA, N is equal to the number of outputs, and L the number of regulators. The method to determine Formula 4 and max(Zc) can be found within Appendix D. Both are based off of the principles detailed in Theorem 2. Subsequently, we can define the connectivity efficiency (CE(ZA)), as

Formula 5(5)

Connectivity efficiency (CE) is an average measure of how much each edge in a network contributes to the ability of that network to represent/generate data. We calculated the connectivity efficiency for the transcriptional regulatory networks of S. cerevisiae (CE = 4.7e-5) and E. coli (CE = 1.9e-4). While this might not appear significant, when plots of the versatile efficiencies from networks of the same size (same number of genes and regulators) and edge distribution are created, the versatile efficiency for S. cerevisiae is 87% of the maximum, and that of E. coli is 55% of the maximum, and both lie on the same shoulder of their respective maxima as depicted in Fig. 3. That shoulder represents networks that are sparser than the maximum, and its significance will be address in detail within the Discussion.


Figure 3
View larger version (9K):
[in this window]
[in a new window]
 
FIGURE 3  (A) Connectivity efficiency plot for the transcriptional regulatory network of S. cerevisiae (circle) plotted against networks of the same size (same number of regulators and genes), sampled from the same edge distribution, with a varying degree of edge density (line). (B) Connectivity efficiency plot for the transcriptional regulatory network of E. coli (circle) plotted against networks of the same size (same number of regulators and genes), sampled from the same edge distribution, with a varying degree of edge density (line).

 

    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Generally, it is desirable to describe data in the simplest possible manner. For systems governed by bipartite networks, this translates into describing data with the simplest possible structure. It has long been argued that simplicity of structure has more physical meaning than other considerations, such as orthogonality, during data representation (14Go). In fact, it has been shown that such abstract constraints yield erred results (4Go). In this work we have characterized the ability of bipartite networks to describe data, so as to facilitate data representation with the simplest possible structure. As we have shown, the ability of bipartite networks to describe data is dependent upon the network connectivity. Here we have classified bipartite networks into two categories based on their connectivity, versatile networks that do not have any restrictions imposed by their connectivity on the type of data they can describe, and nonversatile networks that do. This distinction gives rise to exclusive properties of each class that have implications for data representation, data compression, and network and source signal reconstruction.

Versatile networks can describe any data, and do not need to be fully connected. Therefore, the maximal connectivity necessary to describe any data would be the minimal versatile connectivity. This signifies the ability of some versatile networks to explain output generated from denser network structures. Theoretically, this would provide data compression capability superior to that of PCA. However, this capability comes at a cost. Since versatile networks are equally capable there is no way to discern the true network and source signals from data generated by versatile networks. Even if one were to assume that the true network was the minimal versatile connectivity, this would identify a whole class of networks that satisfy Theorem 1. The connections within the network would have no physical meaning since they could be rearranged in many different ways without impacting the system. This would be undesirable for situations where the actual arrangement of connections was of importance, such as in transcriptional regulatory networks. However, nonversatile networks do not share this deficiency.

Nonversatile networks are capable of describing a limited set of data. Restrictions that match those dictated by their network connectivity must be present in datasets for representation by them. This limitation, however, has its utility—since output created from nonversatile networks carry the connectivity restrictions derived from the original network. This enables network and source signal reconstruction on their outputs, and lends credibility to physical meanings attributed to their connections. Though reconstruction remains possible and seems plausible, efficient search algorithms must be designed to probe for connectivity restrictions from nonversatile networks. Whether these concepts will be incorporated into current techniques or form the basis of novel approaches, the additional complication of noise must be hurdled. While versatile networks can describe any data, including data riddled with noise, the restrictions left by nonversatile networks may be obscured by noise and more difficult to locate. This however, is an unavoidable complication when attempting to decipher underlying mechanisms, and does not change the basic principles of versatile and nonversatile network representation.

In addition, the concept of network versatility has been applied to the transcriptional regulatory network of S. cerevisiae and E. coli. Connectivity efficiency, which is an economic measure of connection usage, was calculated for the transcriptional regulatory networks of S. cerevisiae and E. coli and plotted against the connectivity efficiencies of other networks of the same size and sampled from the same distribution. It was found that the connectivity efficiencies of S. cerevisiae and E. coli were 87% and 55% of the maximum of their respective plots, and that both were found on the same shoulder of their maxima. That shoulder represents networks that have fewer edges than the maximal efficient network. This is an important feature because the transcriptional networks of S. cerevisiae and E. coli are more likely to be missing connections than containing erred edges. Therefore, the true transcriptional networks of these organisms should approach the maximal versatile efficiency. In fact, Harbison et al. (18Go) claimed that the 203 transcription factors they performed genome-wide location analysis on is most likely to comprise all of the DNA-binding transcriptional regulators in S. cerevisiae, and that the false-positive rate of their analysis should be ~96% while the false-negative rate should be ~24%. Combined with the fact that the majority of open reading frames in S. cerevisiae have been found after its genome sequencing the size of the transcriptional network should not change much. Therefore, any addition of edges to the transcriptional network of yeast will invariably push the network toward the maximal versatile efficiency. For E. coli, since an analogous genome-wide location analysis has never been done, the likelihood for missing connections over erred connections seems to be even higher. These findings suggest that connectivity efficiency may be a quantity that transcriptional networks evolve to maximize.

In conclusion, we have characterized the ability of bipartite networks to represent data, which has led to the concepts of versatility and nonversatility. Both of these concepts have been derived, described, and discussed in detail. Lastly, we demonstrated the utility of these concepts by analyzing the connectivity efficiencies of S. cerevisiae and E. coli, which suggested that measures derived from these concepts, may have some biological or evolutionary importance.


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Transcriptional networks
S. cerevisiae: Using a p-value threshold of 1 x 10–3, transcriptional regulatory networks were obtained from the ChIP-chip data of Lee et al. (19Go) and Harbison et al. (18Go) (YPD and all conditions). The networks were then merged to obtain a network comprised of all transcription factor-promoter binding relationships known through ChIP-chip experimentation.

Escherichia coli: The network was obtained by combining information from RegulonDB version 4 (20Go), Ver. 1.1 of Shen-Orr et al. (21Go), and Pernestig et al. (22Go). CsrA was included as a transcriptional regulator since small regulatory RNAs can be incorporated into bipartite networks without a loss of generality.

Network processing
Due to the size of the transcriptional networks of S. cerevisiae and E. coli, it was necessary to use the versatility index shortcut calculation described in Appendix D. To utilize this calculation, every regulator in the system must have a gene it solely controls. Not all regulators in the transcriptional networks of S. cerevisiae and E. coli have this attribute. Therefore, those regulators without this attribute along with all of the genes they participate in controlling were removed from the networks. The remaining networks (S. cerevisiae: 3630 genes, 147 regulators; E. coli: 680 genes, 71 regulators) were then analyzed as described in Appendix D.

Versatility index plot
Networks were created from an algorithm whose initial N x L network had one edge per output and the same number of edges per regulator. For every iteration an edge was randomly added to the network of the previous step. The algorithm concluded when the network was fully connected. A versatility index was calculated at every iteration for the network of that step. To ensure use of the versatility index shortcut calculation, an output for every regulator was required to contain a single edge, until the remaining NL outputs were fully connected. Then edges were added at random to the remaining L outputs until the network was fully connected.


    APPENDIX A
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Proof of Theorem 1
Definition
The connectivity pattern, ZA, can be defined as

Formula A1(A1)
where the values of the nonzero aij are left unconstrained and can take on any value, positive, negative, or zero. ZA characterizes a class of networks that all have the same zero pattern, but varying connectivity strengths (nonzero aij).

Theorem 1
A linear bipartite network with connectivity pattern ZA (N x L) can describe any data within Formula A1 if all reduced forms of ZA, Formula A1(zi x L), are full row rank.

Here, Formula A1 is defined as the rows of ZA which contain zeros in the ith column of ZA, where zi is the number of zeros in the ith column of ZA.

Proof
If a connectivity pattern, ZA (N x L), can linearly describe any data, E (N x M), within Formula A1 there exists a matrix, A (N x L), characterized by ZA, that can provide an exact decomposition of the data, which is equivalent to the singular value decomposition

Formula A2(A2)
where E is the output data (N x M), A is the matrix (N x L) defined by the zero pattern ZA, P is the linear system solution (L x M) to E and A, S is the diagonal matrix (L x L) of the first L singular values of E oriented in decreasing order, and U (N x L) and V (L x M) are unitary matrices of the right and left singular vectors of the elements in S. It follows that the component matrices of the decompositions will be related as follows:

Formula A3(A3)

Formula A4(A4)
For X to be invertible it must be full rank. The ranks of a matrix and matrix multiplication are governed by

Formula A5(A5)

Formula A6(A6)
Since X is full rank and

Formula A7(A7)

Formula A8(A8)
the ranks of A and P must be allowed to be

Formula A9(A9)

Formula A10(A10)
While P is unrestricted, A is restricted by ZA and may not be allowed to satisfy Eq. A10. The positioning of zeros in ZA may lead to rank deficiencies in A. To check we can consider the nonzero entries of ZA as nonzero random values that cannot combine on their own to produce a rank deficiency. We can then check the rank of ZA directly. However, allowing A to satisfy Eq. A10 is a necessary but insufficient condition to satisfy Eq. A3. For a necessary and sufficient condition, one can break up Eq. A3 as

Formula A11(A11)
where a (j x L) is a collection of j rows from A where j can be any number of rows from 1 to N, Formula A11 (j x L) is the collection of rows from U corresponding to a, and ZA (j x L) is the collection of rows from ZA corresponding to a. X must still be invertible, so to satisfy Eq. A11, a must be allowed to satisfy

Formula A12(A12)
To satisfy Eq. A3, all possible a must be allowed by Za to satisfy Eq. A12. One can now see that Eq. A10 is a special case of Eq. A12, where i = N. Here

Formula A13(A13)
Since u can be full rank, min(j,L), a must be allowed by Za to be full rank. Analogous to A and ZA, the positioning of zeros in Za may lead to rank deficiencies in a. However, it is unnecessary to check all possible Za for rank deficiencies.

We notice that rank deficiencies appear when rows of a contain zeros in the same column/columns. To capture all possible rank deficiencies, we define Formula A13(zi x L), as the rows of ZA, which contain zeros in the ith column of ZA, where zi is the number of zeros in the ith column of ZA. If we consider Formula A13 for every column of ZA, all rank deficiencies can be accounted for. If all Formula A13 are full rank (same check process as ZA above), then a will be allowed to satisfy Eq. A12, and thus Eq. A3 will be satisfied. However, since Formula A13 will always have a zero column by definition, Formula A13 can only be full rank if it is full row rank.

Examples
To illustrate the criterion for network versatility, consider the Network A and B shown in Fig. A1, which can be represented by the connectivity pattern

Formula A13
where the reduced matrices of Formula A13 and Formula A13 are

Formula A13
The rank of these structurally specified matrices can be determined by allowing random nonzero values to occupy the nonzero positions. Here Formula A13 and Formula A13are full row rank, while Formula A13is not. Therefore, Network A is not versatile. On the other hand, all Formula A13 are full row rank, and therefore Network B is versatile.


Figure 4
View larger version (18K):
[in this window]
[in a new window]
 
FIGURE A1  Diagram of two bipartite networks (A and B) that have (L – 1) missing edges per regulator. Network A is nonversatile, and Network B is versatile.

 
Minimal connectivity for versatility is maximal connectivity for NCA-compliance
To illustrate this boundary, consider the networks shown in Fig. A2. Network A is versatile since every Formula A13 is full row rank, but not NCA-compliant, since Formula A13 has a rank of (L–2). Network B is both versatile and NCA-compliant since every Formula A13 is full row rank and of rank (L–1). Network C is nonversatile since Formula A13 is not full row rank, but it is NCA-compliant since all Formula A13 are of rank (L–1). This example illustrates that networks that are versatile can also be NCA-compliant, but that this can only happen if the network is of the minimum connectivity to achieve versatility.


Figure 5
View larger version (21K):
[in this window]
[in a new window]
 
FIGURE A2  Diagram of three bipartite networks (AC) that demonstrate the relationship between NCA and versatility. Network A is versatile but not NCA-compliant, Network B is both versatile and NCA-compliant, and Network C is NCA-compliant but not versatile.

 

    APPENDIX B
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Proof of Theorem 2
Definitions
  1. A zero pattern is a 1 x L vector that indicates, by the position of zero entries, which transcription factors (TF) do not control expression of a gene. The number of zero entries in a zero pattern is designated by nz. A system with three TFs has seven zero patterns, which are shown in Fig. 1 D. The zero pattern Formula A13 indicates that a gene is not controlled by TF3.
  2. Any gene that satisfies the definition of a zero pattern is a member of that zero pattern. For instance, the zero pattern Formula A13 requires genes to not be regulated by TF3, therefore, gene1,2,4 are all members
  3. An informative zero pattern, Formula A13 is any zero pattern with >Lnzj members, where nzj is equal to the number of zeros in Formula A13 Fig. 1 A has two Formula A13 and Formula A13
  4. Erj is a matrix composed of the genes (rows of E) that are members of Formula A13 From Fig. 1, Formula A13

Theorem 2
Any dataset, E (N x L), may be represented by a linear bipartite network characterized by connectivity pattern ZA (N x L) if every Erj has rank ≤ (Lnzj).

Proof
If a connectivity pattern, ZA (N x L), can linearly describe a dataset, E (N x M), within Formula A13 there exists a matrix, A (N x L), characterized by ZA, such that

Formula 1(B1)
It follows that E may be described implicitly as a function of A, if A has full rank. If A has full rank, A can be partitioned into A1 ((NL) x L) and A2 (L x L) such that A2 is invertible. After partitioning and substituting for P, one obtains

Formula 2(B2)
Note that multiple partitions of A exist, since the only requirement of Eq. B2 is that A2 be invertible. Consequences of this point will be discussed shortly.To determine whether restrictions originate from ZA that are required to exist in E for Eq. B1 to be satisfied, we make the following transformation:

Formula 3(B3)
For generalization purposes, we substitute Formula 3 for Formula 3 This notation, which will be used for the remainder of the text, signifies that we only know that A (N x L) is characterized by ZA and that all nonzero values are considered unknown. Analogous to Eq. B2, for Eq. B3 to hold true, Formula 3 has to have full rank, and thus ZA has to have full rank. The rank can be calculated by considering the nonzero entries of ZA as nonzero random values that cannot combine on their own to produce a rank deficiency. Eq. B3 represents a relationship solely between the data, E, and network, ZA.Formula 3 is defined analogous to ZA, where the positions of the zero entries are known and the nonzero entries are left unconstrained. However, unlike ZA, where a zero entry indicates the absence of a connection between a source and output, zeros within Formula 3 indicate constraints on how outputs may be related to one another. For instance, if the following Formula 3 were obtained,

Formula 3
then the first row of E1 would need to be a multiple of the first row of E2, while the other rows of E1 could be any linear combination of all of the rows of E2.

Zeros within Formula 3 dictate how the outputs of E may be related, and thus represent connectivity constraints from Formula 3 However, the partition from Eqs. B2 and B3 is inherently nonunique, since there can be multiple partitions of ZA that meet the full rank requirement of Formula 3 As one might expect, different selections of Formula 3 generate different zero patterns in Formula 3 Thus, separate connectivity constraints are identified by different network partitions. To properly define the output limits of a network, all of the constraints must be considered. This requires an understanding of how zeros propagate from ZA to Formula 3

Since the nonzero elements in ZA are left unconstrained and can take on any value, the determination of Formula 3 is not a case of simple linear algebra. Therefore, we have derived a set of rules that that describe how zeros propagate through structural multiplication (Formula 3) and structural inverse (Formula 3) operations. These operations are analogous to their linear algebra counterparts, except that instead of being defined for fully specified matrices, their operations are designed for networks defined analogous to ZA.

Rule 1
Zeros can only be created by multiplication (ZAZB), if a row of ZA is structurally perpendicular to a column of ZB. For a row to be structurally perpendicular to a column, they must have zeros in complementary positions.

Rule 2
The number of zeros that propagate through a structural multiplication, ZAZB, where ZB is invertible, is limited by: Formula 3 where Formula 3 is the number of zeros in ZAZB, and Formula 3 is the number of zeros in ZA.

Rule 3
Zeros can only exist in Formula 3 if singular minors can be created from ZA.

Rule 4
The number of zeros that propagate through a structural inverse is limited by Formula 3 where Formula 3and Formula 3is the number of zeros in Formula 3

Rule 5
Zeros in Formula 3 are created from members of Formula 3 If exactly (L nzj) Formula 3 members are partitioned into Formula 3 the Formula 3 members in Formula 3will be structurally perpendicular to zeros in Formula 3 created from Formula 3 members in Formula 3

Proofs for these Rules can be found in Appendix C.

According to Rule 5, zeros within Formula 3 occur when ≥ (Lnzj) members of Formula 3are in Formula 3 and exactly (Lnzj) members are partitioned into Formula 3 We recognize that it does not matter which members of Formula 3 are partitioned into Formula 3 and that the zeros in the remaining members of Formula 3 will be conserved in Formula 3 Finally, by rearranging Eq. B3, we obtain

Formula 4(B4)
To satisfy Eq. B4, for every rowi of Formula 4 the matrix composed of those rows of Formula 4 that multiply against nonzero entries in rowi of Formula 4 should be of rank ≤ (Lnzj). Therefore, Erj created from collecting all of the rows of Formula 4 that correspond to members of Formula 4 should have rank ≤ (Lnzj) if ZA can represent E. This will be a requirement for all possible Formula 4

Example
To illustrate the use of Theorem 2 and Table A1, consider the network in Fig. A3 and corresponding connectivity pattern:

Formula 4
For this ZA we can construct Table A2. Only two zero patterns are informative, Formula 4 and Formula 4 To demonstrate zero generation in Formula 4 we partition (Lnzj) Formula 4 members into Formula 4

Formula 4
It follows that

Formula 4
After rearranging Eq. B3 and substituting for the current example, we obtain

Formula 5(B5)


Figure 6
View larger version (8K):
[in this window]
[in a new window]
 
FIGURE A3  Diagram of a bipartite network used for deduction of connectivity constraints from Table A1.

 

View this table:
[in this window]
[in a new window]
 
TABLE A2  Example of using Theorem 2 and the procedure from Table A1 for bipartite network in Fig. A3

 
Eq. B5 states that:
  1. The matrix formed by rows 1, 3, 6 of E must have rank ≤ 2.
  2. The matrix formed by rows 2, 3, 6 of E must have rank ≤ 2.
  3. The matrix formed by rows 5, 3, 4, 6 of E must have rank ≤ 3.

Note that Statement 3 simply checks if E is in Formula 5 and that Statements 1 and 2 require that

Formula 5
has a rank ≤ (Lnzj) = 2. For a dataset to be represented by the network in Fig. A3, Formula 5 must have a rank ≤ 2, and Formula 5 must have a rank ≤ 3.


    APPENDIX C
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Structural linear algebra proofs
The first two rules deal with properties of structural multiplications. The first rule states that for a zero to be created in ZAZB a row of ZA must be structurally perpendicular to a column of ZB. Be reminded that the nonzero entries of both ZA and ZB are left unconstrained, and thus may take on any value. Therefore, we must allow the product of any two nonzero entries to also be left unconstrained. So for any two vectors to be perpendicular, when both vectors are structurally defined, every nonzero entry of one vector must multiply against a zero entry of the other vector. This is the definition of structurally perpendicular.

As a consequence of Rule 1, a vector Formula 5 (1xL) can be structurally perpendicular to only as many vectors of an L basis as Formula 5 has zero entries. To illustrate, consider a vector Formula 5 (1 x L) and an invertible matrix B (L x L), where both are structurally defined and B is a basis of Formula 5 space:

Formula 1(C1)
To produce a zero within Formula 1 (1xL), Formula 1 must be structurally perpendicular to a column vector of B, a vector of an Formula 1 basis. The sparsest possible structurally defined basis, B, is diagonal or a permutation thereof. In that case, Formula 1 would be structurally perpendicular to as many vectors of B as Formula 1 has zero entries. However, if B is not diagonal or a permutation thereof, Formula 1 can be structurally perpendicular to only as many vectors of B as Formula 1 has zero entries, but may be less, depending on the structure of Formula 1 and B. To demonstrate, consider the following Formula 1 and basis, B:

Formula 1
Both bases have the same number of nonzero entries; however, the structure of the first basis allows the number of zeros in Formula 1 to propagate to Formula 1 while the second basis does not. Therefore, the number of zeros in Formula 1 will always be less than or equal to the number of zeros in Formula 1 Since ZA is a collection of stacked Formula 1 vectors the same holds true for all the rows of ZA.

Rules 3 and 4 deal with the properties of structural inverses. Inverses are defined as

Formula 2(C2)

Formula 3(C3)
for i,j = 2,3

Formula 4(C4)
When Mij is singular you will see a zero at position (j,i) of A–1. However, to reiterate, the nonzero entries of ZA are left unconstrained. This means that assumptions cannot be made about the values of the nonzero entries, and thus zeros within Formula 4 must come from minors that are singular irrespective of the nonzero entries. As it turns out, any possible row zero pattern that may be found in ZA, can create singular minors in Formula 4 if there are exactly (L zj) members in ZA. Rule 4 states that the number of zeros in Formula 4 will always be less than or equal to the number of zeros in ZA. To explain, consider the following linear algebra operation:

Formula 5(C5)
An analogous operation can be defined for structural linear algebra operations,

Formula 6(C6)
only for those invertible ZA that have zero entries that all contribute to singular minors for Formula 6 Otherwise,

Formula 7(C7)
and the number of zeros in Formula 7 is less than that in ZA. To illustrate, consider the following:

Formula 7
Both ZA and ZB have the same number of zero entries, except those in ZB all contribute to singular minors whereas those in ZA do not. Therefore, the number of zeros in ZA equals the number in Formula 7 and the number in ZA is less than the number in Formula 7

The final rule is a combination of the knowledge from the first four rules. By following structural linear algebra Rules 1–4 we realize that zeros within Formula 7 are generated when there are >(Lnzj) members of Formula 7j in ZA, and (Lnzj) are contained within Formula 7 This is because any member of Formula 7 will be structurally perpendicular to zeros in Formula 7created from its fellow members.


    APPENDIX D
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS
 REFERENCES
 
Determining Formula 7 and max(Zc)
Both Formula 7 and max(Zc) can be determined from Theorem 2. The number of constraints (Formula 7) imposed by ZA on a dataset E, is

Formula 1(D1)
where Formula 1 and nzj are defined from Theorem 2, and n is equal to the total number of Formula 1 that have >(Lnzj) members.The most nonversatile network that is the same size of ZA will be the sparsest network, and thus have the largest number of missing edges. The network that has the largest number of missing edges and is the same size as ZA will be a network that has one edge per row. However, the one edge per row criteria classifies a large number of networks that all have the same sparsity. So the question arises, which one contains max(Zc)? The answer is that all ZA (N x L) that have N edges, have one edge per row, and are of the same size have the same number of constraints, and thus may be used to calculate max(Zc). A shortcut calculation can be derived from the above equation by realizing that there cannot be any zero columns of ZA. Therefore, one can calculate max(Zc), from the following formula with only knowledge of the network size, N x L, and not the structure:

Formula 2(D2)
The first term of Eq. D2 is the equivalent to the first term in the summation of Eq. D1 when all rows only have one edge, and analogously the second term of Eq. D2 is equivalent to the second term in the summation of Eq. D1 when all rows only have one edge.

Since

Formula 2
we can make the following substitution:

Formula 3(D3)
A similar formula to calculate Formula 3 can be obtained under situations when there is at least one row per column that is controlled by only that column,

Formula 4(D4)
where ni is equal to the number of rows with i nonzero entries.It should be noted that Zc may be different for different ZA even though they may have the same number of edges. This is because even though two rows with three edges each, have the same number of edges as three rows with two edges each, there is a difference between 2 x 23 = 16 and 3 x 22 = 12.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 INTRODUCTION
 BACKGROUND
 DISCUSSION
 METHODS
 APPENDIX A
 APPENDIX B
 APPENDIX C
 APPENDIX D
 ACKNOWLEDGEMENTS