Bioinformatic Analysis of Coronary Disease Associated SNPs and Genes to Identify Proteins Potentially Involved in the Pathogenesis of Atherosclerosis

Factors that contribute to the onset of atherosclerosis may be elucidated by bioinformatic techniques applied to multiple sources of genomic and proteomic data. The results of genome wide association studies, such as the CARDIoGRAMplusC4D study, expression data, such as that available from expression quantitative trait loci (eQTL) databases, along with protein interaction and pathway data available in Ingenuity Pathway Analysis (IPA), constitute a substantial set of data amenable to bioinformatics analysis. This study used bioinformatic analyses of recent genome wide association data to identify a seed set of genes likely associated with atherosclerosis. The set was expanded to include protein interaction candidates to create a network of proteins possibly influencing the onset and progression of atherosclerosis. Local average connectivity (LAC), eigenvector centrality, and betweenness metrics were calculated for the interaction network to identify top gene and protein candidates for a better understanding of the atherosclerotic disease process. The top ranking genes included some known to be involved with cardiovascular disease (APOA1, APOA5, APOB, APOC1, APOC2, APOE, CDKN1A, CXCL12, SCARB1, SMARCA4 and TERT), and others that are less obvious and require further investigation (TP53, MYC, PPARG, YWHAQ, RB1, AR, ESR1, EGFR, UBC and YWHAZ). Collectively these data help define a more focused set of genes that likely play a pivotal role in the pathogenesis of atherosclerosis and are therefore natural targets for novel therapeutic interventions.
DOI: 10.14302/issn.2326-0793.jpgr-17-1447
Corresponding author: Timothy D. Howard, Center for Genomics & Personalized Medicine Research, Wake Forest School of Medicine, Medical Center Blvd, Winston-Salem, NC 27157, Phone: (336) 713-7509, Fax: (336) 713-7566, e-mail: tdhoward@wakehealth.edu
Running Title: Bioinformatic Analysis of Coronary Disease Associated SNPs

Graph theory and pathway analysis of protein interactions have proven useful for identifying essential proteins in complex protein networks 5;6 and elucidating physiologic mechanisms for complex traits, such as familial combined hyperlipidemia

Selection and Curation of CAD Associated Genes.
We included the genes assigned to the SNPs in the original CARDIoGRAM publication ("positional candidates"), as well as any genes linked to these SNPs in previously published expression quantitative trait loci (eQTL) analyses. The initial set of target genes was based on 162 unique SNPs identified by the CARDIoGRAM GWAS meta-analysis 2. These included the "known CAD susceptibility loci" (Table 1 shown along with the proxy SNP (Supplemental Table S1).
Construction of Gene Interaction Networks.
The selected CAD associated genes from above were used as the initial set of genes to construct gene

Availability of data and materials
Additional data used in this study is available in Supplemental Tables 1 through 5.

Ethics and Consent to participate
The original data used in this manuscript was obtained from published material, and no additional human subjects were included.

Results
CAD Associated Gene Prioritization. The 162 CARDIoGRAMplusC4D SNPs were associated with 160 unique genes, based on proximity alone. eQTLs were prioritized by selecting cis SNPs with a minimum eQTL score of 6 (p = 10^-6 in their respective, original study).
eQTL analysis with the 162 SNPs and their LD proxies identified an additional 34 unique genes that were not included in the previous publication. Seventeen of the original positional candidates were also eQTLs (Supplemental Table S1). Twelve SNPs were associated with expression of at least two nearby genes, with a maximum of four genes for rs602633 (CELSR2, SORT1, PSRC1, and PSMA5). The strongest overall eQTL was with rs1412444, a proxy for the original SNP rs2246833 (r^2 = 1.0) and LIPA expression in monocytes (eQTL score = 163.21). The original 160 positional genes and the 34 unique eQTL genes were combined for all downstream analyses, for a total of 194 unique genes.
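The prioritization and merging steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the eQTL score is taken to be -log10(p), which is consistent with a score of 6 corresponding to p = 10^-6 but is not spelled out in the text, and every record, SNP identifier and gene name below is a hypothetical placeholder rather than study data.

```python
import math

def eqtl_score(p_value):
    """Assumed definition: score = -log10(p), so p = 1e-6 gives a score of 6."""
    return -math.log10(p_value)

def select_eqtl_genes(records, min_score=6.0):
    """Return genes from cis-eQTL records whose score meets the threshold."""
    return {
        r["gene"]
        for r in records
        if r["cis"] and eqtl_score(r["p"]) >= min_score
    }

# Hypothetical toy records (not taken from the study).
records = [
    {"snp": "rs_demo1", "gene": "GENE_A", "p": 1e-20, "cis": True},
    {"snp": "rs_demo1", "gene": "GENE_B", "p": 1e-12, "cis": True},
    {"snp": "rs_demo2", "gene": "GENE_C", "p": 1e-4, "cis": True},   # score below 6
    {"snp": "rs_demo3", "gene": "GENE_D", "p": 1e-9, "cis": False},  # not cis
]

# Hypothetical positional candidates (nearest genes).
positional = {"GENE_A", "GENE_E"}

eqtl_genes = select_eqtl_genes(records)
seed_genes = positional | eqtl_genes        # union used for downstream analyses
new_from_eqtl = eqtl_genes - positional     # genes contributed only by eQTLs
overlap = eqtl_genes & positional           # positional candidates that are also eQTLs
```

Applied to the counts reported above, the same set arithmetic gives 160 positional genes plus 34 eQTL-only genes, for 194 unique seed genes.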
Construction of the Gene Interaction Network.
Of the 194 unique, CAD-associated genes curated from the CARDIoGRAMplusC4D study and the eQTL analysis combined, 185 were found and mapped in the IPA database. These genes were used as seeds for the network construction. IPA network construction identified four major networks (Supplemental Table S2).
These four networks were then merged into one large network, which included 422 connected nodes (molecules) with 1890 edges (relationships) (Supplemental Table S3).
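As a rough illustration of how a merged network of this kind can be prepared and scored, the sketch below builds an undirected simple graph (direction ignored, parallel edges collapsed, self-loops dropped) and computes local average connectivity. The LAC definition used here, the mean degree of a node's neighbours within the subgraph induced by those neighbours, is the commonly cited one and is assumed rather than taken from this paper; the study itself used CytoNCA, and the node labels and edges are hypothetical.

```python
def build_simple_graph(edges):
    """Build undirected adjacency sets from an edge list.

    Parallel edges collapse automatically (sets), direction is ignored,
    and self-loops are skipped while the node itself is still registered.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u == v:
            continue  # drop self-loop
        adj[u].add(v)
        adj[v].add(u)
    return adj

def local_average_connectivity(adj):
    """LAC(v): average, over neighbours w of v, of the degree of w
    inside the subgraph induced by the neighbours of v (assumed definition)."""
    lac = {}
    for node, nbrs in adj.items():
        if not nbrs:
            lac[node] = 0.0
            continue
        local_degrees = [len(adj[w] & nbrs) for w in nbrs]
        lac[node] = sum(local_degrees) / len(nbrs)
    return lac

# Hypothetical toy edge list with a reversed duplicate, a repeat and a self-loop.
edges = [
    ("A", "B"), ("B", "A"), ("A", "B"),
    ("A", "C"), ("B", "C"), ("C", "D"),
    ("D", "D"),
]

adj = build_simple_graph(edges)
n_nodes = len(adj)
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
lac = local_average_connectivity(adj)
```

In this toy graph the triangle members A and B score highest, because their neighbours are themselves connected, while the pendant node D scores zero.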
APOA1 (red text and underlined) were the common, top-ranked genes identified by all three methods (LAC, eigenvector centrality, and betweenness), indicating the importance of these genes in the network. In addition to these three common seed genes, ten genes not in the original seed set were also identified by all three methods. These 10 new genes are TP53, MYC, PPARG, YWHAQ, RB1, AR, ESR1, EGFR, UBC and YWHAZ.
Combining the LAC, eigenvector centrality, and betweenness lists in Table 1, a total of 10 genes (CDKN1A, APOE, SMARCA4, APOA1, APOC2, TERT, APOB, APOC1, APOA5 and SCARB1) are from the original seed set, which suggests that these CAD associated genes are important in the gene interaction network. Figure 1 shows the interactions between these 10 genes (in red) and their interacting genes (in blue) and chemicals (in green) in the gene interaction network. Most of these top genes are highly connected in the sub-network.
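The idea of ranking nodes by several centrality measures and keeping those that recur across lists can be sketched as follows. This is a minimal, standard-library illustration and not the CytoNCA implementation used in the study: betweenness follows Brandes' algorithm for unweighted undirected graphs, eigenvector centrality uses power iteration on the adjacency matrix shifted by the identity (an assumption made here to avoid oscillation), and the toy graph and the `top_k` helper are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness (Brandes) for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}

def eigenvector_centrality(adj, iterations=500, tol=1e-12):
    """Power iteration on (A + I); the shift leaves the eigenvectors unchanged."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = sum(val * val for val in new.values()) ** 0.5
        new = {v: val / norm for v, val in new.items()}
        converged = max(abs(new[v] - x[v]) for v in adj) < tol
        x = new
        if converged:
            break
    return x

def top_k(scores, k):
    """Hypothetical helper: the k highest-scoring node names as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# Hypothetical toy graph: a triangle A-B-C with a tail C-D-E.
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}

bc = betweenness(adj)
ev = eigenvector_centrality(adj)
consensus = top_k(bc, 2) & top_k(ev, 2)   # nodes ranked highly by both measures
```

Here only node C ranks in the top two by both measures: D bridges the tail and so scores well on betweenness, but it is poorly connected to other well-connected nodes and falls behind A and B on eigenvector centrality. Intersecting ranked lists in this way mirrors how genes common to all three methods are singled out above.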
Pathway Analysis. The top-ranked proteins from Table 1 were selected to perform metabolic and signaling canonical pathways analysis using IPA. The result is shown in Supplemental (The genes from the original seed set are highlighted in red. The common seed genes identified by all three methods are in red text and underlined approach. These included TP53, MYC, PPARG, YWHAQ, RB1, AR, ESR1, EGFR, UBC and YWHAZ, which were identified by all three analysis methods, but do not have the same level of prior literature evidence supporting a known association with cardiovascular disease. These proteins also rank highly by betweenness scores, indicating they may be involved in multiple pathways, and fewer proteins may perform their function within pathways. In our study, each of these novel proteins interacted with at least three of our seed proteins (Figure 1), supporting the plausible importance of their role in the biology of coronary artery disease and atherosclerosis progression.
Four of these 10 highly-connected novel genes (TP53, MYC, YWHAQ, and YWHAZ) were also identified recently in an independent publication as "Predicted CVD genes" using a different pathway-based approach 22. To summarize, there are numerous biological connections between the top ranked proteins identified in this expanded network analysis of coronary artery disease genes, and these connections support the inclusion of these molecules as candidates for follow-up analysis in the GPAA project. Furthermore, these discoveries support the utility of this expanded approach to the analysis of genomic scale datasets for the identification of candidate disease proteins. The validity of our approach can be illustrated by the APOA1 node in our predicted network. Mutations that alter the functioning of APOA1 could adversely impact the functioning of several interacting proteins, as indicated by the high hub score of the APOA1 node. In addition, APOA1 interacts strongly with other apolipoproteins (e.g., APOB, APOE) that also have high node scores.
LDLR interacts with all three of these proteins (Figure 1), and exome sequencing recently identified a markedly increased risk of myocardial infarction in individuals with rare mutations in LDLR 3, further highlighting the utility of evaluating proteins targeted within the biological hub.
As further validation of biological relevance, our pathway analysis of the top ranked proteins in the network analysis identified a list of pathways that are known to influence atherosclerosis ( estimates in the betweenness scores. Finally, our approach used the genes nearest to the associated SNPs when eQTLs were not identified. More distal genes may be regulated by these SNPs, but without additional functional data these loci were difficult to identify and we used the most likely genes to be involved in each region.

Conclusion
Using a protein-protein interaction network approach, we have identified the most likely genes involved in CAD-related phenotypes using the CARDIoGRAM GWAS meta-analysis as a starting point 2.
In addition to the well-known candidates, we identified a subset of genes that interact with these likely contributors, but have not otherwise been associated

Introduction
Atherosclerosis is a multifactorial disease with a strong genetic component. Genome wide association studies for coronary artery disease (CAD) related phenotypes have identified at least 56 susceptibility loci at genome wide significance 1;2, and a study into the role of low-frequency (frequency 1%-5%) and rare (frequency < 1%) DNA sequence variants in early onset myocardial infarction (MI) identified additional candidate genes 3. Investigation of proteins encoded by genes in close proximity to the susceptibility loci or implicated in the analysis of rare variants may lead to an enhanced understanding of the molecular mechanisms of atherosclerosis, and thereby facilitate the identification of novel candidates for targeted therapeutic interventions. As part of the Genomic and Proteomic Architecture of Atherosclerosis (GPAA) project, we plan to utilize sensitive and highly accurate targeted mass spectrometry to quantify and thereby validate proteins identified as putative pathogenic candidates driving coronary artery disease. Multiple reaction monitoring (MRM) experiments will be performed on arterial tissue samples from individuals with and without extensive premature atherosclerosis collected as part of the Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study 4.
The PDAY study measured the extent and prevalence of atherosclerosis in 2,876 subjects between the ages of 15 and 34 who died of non-cardiac related causes. In order to utilize this precious resource to its full potential, we must first identify candidate proteins for assay development, and we seek to identify these candidates by combining discovery proteomics with bioinformatic data mining of network and pathway analysis of SNPs and genes associated with coronary disease from previous GWAS and rare variant association studies. Our goal is to expand the list of candidate proteins beyond the handful of well-known atherosclerosis proteins to include additional and novel proteins that represent the full spectrum of pathogenic molecular events underlying atherosclerosis development. Within the context of the GPAA project, the purpose of the current analysis is to identify relevant proteins, encoded by genes near susceptibility loci, to define an expanded set of candidate proteins hypothesized to contribute to the onset or development of atherosclerosis.

centrality and betweenness analysis to identify the key players in the network. The experimentally observed relationships, such as protein-protein interactions, protein-DNA interactions, protein-RNA interactions, co-expression, translocation, activation, inhibition, molecular cleavage, membership, and phosphorylation were used to bring in other interacting molecules from the Ingenuity Knowledge Base to the network, and the additional molecules were used to specifically connect two or more smaller networks by merging them into a larger one. The resulting multiple networks were then merged into one network. The following parameters were used in the network construction: 1) All genes and chemicals in the Ingenuity Knowledge Base were used as the reference set and the species was set to human; 2) Only the direct relationships were considered; 3) The confidence level was set to be "Experimentally Observed" to retrieve the relationships that have been experimentally observed; 4) The number of molecules per network and the number of networks were set to the maximum allowed, 140 and 25, respectively.
Freely Available Online www.openaccesspub.org | JPGR CC-license Vol-2 Issue 1 Pg.no.-4
Gene Interaction Network Analysis.
Network analysis was performed using Cytoscape (www.cytoscape.org, version 3.1.1) and the CytoNCA plugin 16. Local average connectivity (LAC), eigenvector centrality and betweenness scores were calculated for each gene in the gene interaction network using CytoNCA. The direction of the edges is not considered in the network analysis. Parallel edges between two gene nodes represent different types of relationships that were observed between those two nodes. To reduce redundancy, these parallel edges and self-loops were removed in the network analysis.
Pathway Analysis Methods.
Candidate genes selected from the network analysis were again analyzed with IPA for biological functions, cellular locations, signaling and metabolic canonical pathways, and associated diseases. The p-values for the identified canonical pathways, disease associations and functions were calculated using Fisher's exact test. The Benjamini-Hochberg method was used to estimate the false discovery rate (FDR), and an FDR-corrected p-value of 0.05 was used to select significantly enriched pathways.
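The enrichment statistics described here can be illustrated with a short standard-library sketch. It assumes a right-tailed Fisher's exact test (the probability of observing at least the seen overlap under a hypergeometric model), which is a common choice for pathway enrichment but is an assumption about how the test was configured; the function names and the example p-values are hypothetical, and this is not the IPA implementation.

```python
from math import comb

def fisher_right_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric X.

    N: genes in the reference set, K: reference genes in the pathway,
    n: selected genes, k: selected genes that fall in the pathway.
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 3 of 3 selected genes fall in a 3-gene pathway,
# with a reference set of 10 genes.
p_example = fisher_right_tail(k=3, n=3, K=3, N=10)

# Hypothetical raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adj_p) if p <= 0.05]
```

With the 0.05 cut-off stated above, a pathway would be retained when its adjusted value, not its raw p-value, is at or below the threshold.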

Both TP53 and MYC are well-known for their role in cancer and may also be involved in the regulation of smooth muscle cell proliferation during neointima formation in coronary artery disease 23;24. Much less is known about YWHAQ and YWHAZ, which are highly conserved scaffolding proteins of the 14-3-3 family, involved in multiple signal transduction pathways including those linked to p53 apoptosis signaling 25 and Epidermal Growth Factor Receptor (EGFR) signaling 26. The EGFR protein was another of the 10 novel top proteins identified in this analysis, and is a well-known activator of ERK/MAPK signaling which was among the top canonical pathways from the IPA analysis of these data. While EGFR is known to be expressed in atherosclerotic plaques 27;28, its mechanistic role in coronary artery disease pathogenesis is as yet unclear. Interestingly, another cell-signaling scaffold protein, Growth Factor Receptor Binding Protein 2 (GRB2), was also detected among our top 49 candidate proteins, and together with YWHAZ, has been shown to be involved in the clathrin-endocytosis mediated internalization of EGFR 29. Furthermore, GRB2 has been identified as a critical protein for neointima and atherosclerotic lesion formation in ApoE-/- mouse models of coronary artery disease 30;31. These connections become rather interesting in light of our observation of "clathrin-mediated endocytosis" as a top pathway in the IPA analysis (

Figure 1. The interactions between 10 top ranking genes (red nodes) and their interacting genes (blue nodes) and chemicals (green nodes) in the sub-network. The graph was generated with Cytoscape 35.

Table 1. Top network nodes ranked by LAC, eigenvector centrality and betweenness scores.

Table 2. Top pathway hits of the selected network genes

Table 2
analyses, was limited by the current state of knowledge of protein interactions. The lack of evidence for interactions between proteins should not be interpreted as evidence for lack of such an interaction. Proteins