Utilization of bioinformatics tools for understanding proteomics data
Department of Biomedical Engineering, Kocaeli University Technology Faculty, Kocaeli
Department of Biology, Muğla Sıtkı Koçman University, Muğla
ABSTRACT • Scientific discoveries in the life sciences rely on organizing biological knowledge and enabling its rapid use. In fact, in most cases the interpretation of genomics, functional genomics, transcriptomics, metabolomics and proteomics data is much more important than producing it. On the World Wide Web, there are many freely available databases and relevant bioinformatics software for the interpretation of proteomics data. Some of these databases and software focus on analysis of the core data. However, interpreting the output files from those databases and software can sometimes be challenging. In this study, we aimed to use several commonly used, freely available databases and relevant software to interpret proteomics data. Our aim was to show in what order to consult these databases and how to use the information they provide to reduce data complexity. Our experimental setup contained eight proteins picked from a study performed in our laboratory. Our analysis revealed the presence of a major and a minor interactome affected by the changes at the protein level. We also described how to easily reach relevant biological information for the affected proteins and sort out what is needed for data interpretation.
In a simple experimental setting, a researcher can measure the activity of a protein, determine its concentration in a solution or assess its stability under various experimental conditions. Such information is useful in understanding the individual properties of that protein. However, without knowing how proteins behave in a living cell, it is very difficult to assess their physiological functions, because proteins interact with each other and with other biological molecules to carry out their tasks. These interactions determine when proteins ought to be active, how long they should stay active, what type of post-translational modifications they are subjected to and in which molecular pathways they take part. Several experimental approaches have been developed to determine the nature of interactions among proteins. Co-immunoprecipitation (Co-IP) is considered to be the gold standard for determining protein-protein interactions (1). In Co-IP, the protein of interest is isolated with a specific antibody while its interaction partners are still bound to it. A subsequent mass spectrometry analysis is then used to identify the interaction partners. To assess whether two proteins interact with each other, the yeast two-hybrid system is often used. It offers some advantages over Co-IP, such as easy scale-up, no need for purification steps or for optimization of binding or washing conditions, and possible automation using robotic platforms (2). Fluorescence resonance energy transfer (FRET) is another experimental approach; it detects the proximity of fluorescently labeled molecules over distances of less than about 100 Å (3). FRET can map protein-protein interactions in vivo and offers the opportunity to study the complex behavior of key regulatory proteins in their natural environment. Tandem affinity purification (TAP) is a less preferred experimental approach for studying protein-protein interactions (4).
The TAP tag allows quantitative determination of protein partners in vivo without prior knowledge of complex composition. Other methods to investigate protein-protein interactions include bimolecular fluorescence complementation (BiFC) (5), phage display (6), in vivo crosslinking of protein complexes using photo-reactive amino acid analogs (7), chemical cross-linking followed by high-mass matrix-assisted laser desorption/ionization (MALDI) mass spectrometry (8) and the proximity ligation assay (PLA) (9). Using these or similar experimental approaches, researchers have collected a vast amount of data on protein-protein interactions. To provide a critical assessment and integration of protein-protein interactions from these huge data sets, a number of databases and online resources dedicated to protein networks have been created. In this communication, we wish to demonstrate how to make sense of a protein data set obtained from a proteomics study by using several commonly available bioinformatics tools.
Selection of the Differentially Regulated Proteins
Preliminary data from two-dimensional gel electrophoresis experiments carried out in our laboratory to study changes in protein expression levels in breast cancer subtypes were used, and eight differentially regulated proteins out of 32 were selected for analysis. Their regulation levels were the highest among all, indicating that the selected proteins might be good candidates for a functional analysis. To keep the focus on the bioinformatics analysis of the selected data, we do not give any details about the experimental procedures used to collect the proteomics data. They will be published elsewhere.
A PubMed search (https://www.ncbi.nlm.nih.gov/pubmed/) was carried out using appropriate keywords and the names of the regulated proteins, both in abbreviated and spelled-out formats. The keywords were function, sub-cellular localization, distribution of tissue expression levels, interaction partners, post translational modification, structure, disease and pathology. In some cases, the Boolean operators AND, OR, and NOT were used to limit, widen or define the searches. The retrieved data were manually analyzed and interpreted.
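The Boolean combination of protein names and keywords can be sketched in code. The following is a minimal illustration, not the procedure used in this study (the searches here were run manually through the PubMed web page). The helper names are hypothetical, the gene symbol HSP90B1 stands in as an assumed abbreviation for endoplasmin, and the URL targets the NCBI E-utilities ESearch endpoint; the request is only constructed, never sent.

```python
from urllib.parse import urlencode

# Keywords as listed in the text; everything else below is illustrative.
KEYWORDS = [
    "function", "sub-cellular localization",
    "distribution of tissue expression levels", "interaction partners",
    "post translational modification", "structure", "disease", "pathology",
]

def build_pubmed_term(names, keywords, exclude=()):
    """Join name variants with OR, keywords with OR, and link both with AND.

    Words in `exclude` are appended with NOT to narrow the search.
    """
    name_part = " OR ".join(f'"{n}"' for n in names)
    keyword_part = " OR ".join(f'"{k}"' for k in keywords)
    term = f"({name_part}) AND ({keyword_part})"
    for word in exclude:
        term += f' NOT "{word}"'
    return term

def build_esearch_url(term, retmax=20):
    """Build (but do not send) an NCBI E-utilities ESearch request URL."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})

# Spelled-out name plus an assumed abbreviation, restricted to two keywords.
term = build_pubmed_term(["endoplasmin", "HSP90B1"], KEYWORDS[:2])
url = build_esearch_url(term)
print(term)
print(url)
```

Keeping the name variants inside one OR group and the keywords inside another makes it easy to widen or narrow a search by editing a single list.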
A UniProt search (http://www.uniprot.org/) was carried out using the names of the regulated proteins in abbreviated or spelled-out formats. To limit the retrieved data, the name of the preferred organism (Homo sapiens) was entered into the search engine along with the name of the protein. Unreviewed entries were neither retrieved nor used in any of the analyses. Only entries with high annotation scores (4 to 5) were considered.
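The selection rules above (reviewed entries only, Homo sapiens, annotation score of 4 or 5) amount to a simple filter. The sketch below applies them to records that are assumed to have been downloaded already; the `Entry` class, its field names and the accession strings are hypothetical placeholders, not UniProt's export format.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    accession: str         # placeholder identifiers, not real accessions
    organism: str
    reviewed: bool         # True for manually reviewed entries
    annotation_score: int  # 1 (lowest) to 5 (highest)

def select_entries(entries, organism="Homo sapiens", min_score=4):
    """Keep reviewed entries from the chosen organism with a high score."""
    return [
        e for e in entries
        if e.reviewed and e.organism == organism and e.annotation_score >= min_score
    ]

# Illustrative records only; they do not describe real database entries.
candidates = [
    Entry("ACC0001", "Homo sapiens", True, 5),
    Entry("ACC0002", "Homo sapiens", False, 5),  # unreviewed: dropped
    Entry("ACC0003", "Mus musculus", True, 5),   # other organism: dropped
    Entry("ACC0004", "Homo sapiens", True, 3),   # low score: dropped
]
selected = select_entries(candidates)
print([e.accession for e in selected])
```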
The PANTHER Analysis
The PANTHER analysis (Protein Analysis THrough Evolutionary Relationships, http://PANTHERdb.org/) was carried out using the UniProt accession numbers of the regulated proteins. The organism was specified as Homo sapiens and functional classifications were viewed as pie charts. Selected ontologies included molecular function, biological process, cellular function, cellular component, protein class and pathway. For each ontology, manual analysis was performed by listing the selected proteins and cross-checking their assigned properties against the UniProt entries.
The STRING Analysis
The STRING analysis (https://string-db.org/) was carried out using the UniProt accession numbers of the regulated proteins. The search option was set to “multiple proteins by names/identifiers” and the organism was specified as Homo sapiens. The retrieved proteins were manually checked to ensure that they had all been correctly retrieved from the database. Whole genome analysis was the preferred choice. The settings tab was used to change the stringency of the analysis. The results were downloaded as bitmap images and were redrawn in Adobe Illustrator Version 6.
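The same settings can also be expressed as a programmatic request. The sketch below only builds a request URL for the STRING REST `network` method and does not contact the server; the analysis in this study was done through the web interface. The confidence labels map to STRING's combined-score cut-offs (0.15, 0.4, 0.7 and 0.9, scaled to 0 to 1000). Treating `add_nodes` as a rough counterpart of the first-shell setting is an assumption, and P08670 is used as the UniProt accession for vimentin alongside Q16543 from the text.

```python
from urllib.parse import urlencode

# STRING confidence labels expressed on the API's 0-1000 score scale.
CONFIDENCE = {"low": 150, "medium": 400, "high": 700, "highest": 900}

def build_string_network_url(identifiers, confidence="high",
                             species=9606, add_nodes=0):
    """Build (but do not send) a STRING API 'network' request URL.

    species=9606 is the NCBI taxonomy ID for Homo sapiens. Identifiers are
    joined with carriage returns, which urlencode turns into %0D.
    """
    params = {
        "identifiers": "\r".join(identifiers),
        "species": species,
        "required_score": CONFIDENCE[confidence],
        "add_nodes": add_nodes,  # assumed analogue of extra first-shell interactors
    }
    return "https://string-db.org/api/tsv/network?" + urlencode(params)

string_url = build_string_network_url(["Q16543", "P08670"], confidence="high")
print(string_url)
```

Lowering `confidence` or raising `add_nodes` relaxes the stringency, which mirrors the adjustments made through the settings tab.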
The BioGrid Analysis
The BioGrid analysis (https://thebiogrid.org/) was carried out using the UniProt accession numbers of the regulated proteins. The organism was specified as Homo sapiens and the protein accession numbers were entered into the search engine one at a time. The retrieved data were copied and pasted into Microsoft Excel for cross-comparative analysis. The minimum evidence level was set using a pull-down menu under the network tab. The networks were downloaded as bitmap images and were redrawn in Adobe Illustrator Version 6.
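The effect of a minimum evidence level can be illustrated with a small filter. This sketch assumes that the evidence level of an interaction is the number of distinct supporting records (for example, publications) for a protein pair; that reading, the record layout and all names below are assumptions made for illustration, not a description of BioGrid's internals.

```python
from collections import defaultdict

def filter_by_evidence(records, min_evidence=2):
    """Keep protein pairs supported by at least `min_evidence` distinct sources.

    `records` is an iterable of (protein_a, protein_b, source_id) tuples.
    Pairs are treated as unordered, so (A, B) and (B, A) count together.
    """
    evidence = defaultdict(set)
    for a, b, source in records:
        evidence[frozenset((a, b))].add(source)
    return {pair: len(src) for pair, src in evidence.items()
            if len(src) >= min_evidence}

# Placeholder records; names and sources are illustrative only.
records = [
    ("BAIT", "P1", "pub1"),
    ("P1", "BAIT", "pub2"),   # same pair, second independent source
    ("BAIT", "P2", "pub1"),   # single source: removed at level 2
]
kept = filter_by_evidence(records, min_evidence=2)
print(kept)
```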
Experimental Scheme and Collection of the Data
Using two-dimensional gel electrophoresis, a researcher determined changes in the expression profiles of eight proteins in human breast tumor tissues in comparison with control tissues. During the follow-up experiments, mass spectrometry analysis was performed and the differentially regulated proteins were identified. The identified proteins were Hsp90, endoplasmin, adenosine triphosphate (ATP) synthase subunit alpha, malate dehydrogenase, serotransferrin, vimentin, 14-3-3 protein theta and glial fibrillary acidic protein. To evaluate the data, the researcher needed to gather information about each differentially regulated protein to determine its function, sub-cellular localization, distribution of tissue expression levels, interaction partners, post-translational modifications, structure and any association with a known pathology.
The first database the researcher searched was PubMed. PubMed is a server established by the National Center for Biotechnology Information (NCBI) to advance science and health by providing access to biomedical and genomic information. PubMed comprises more than 28 million citations for biomedical literature from MEDLINE, life science journals, and online books. The researcher performed a PubMed search using the protein name and several keywords for each protein. For example, for endoplasmin alone, 585 publications were retrieved using the keywords function, subcellular localization, expression, interaction, PTM, structure and pathology. For all the identified proteins together, there were more than 2000 publications to sort through for the information the researcher needed. Therefore, a more practical approach was used to gather information about all the identified proteins.
UniProt Database Search
The second database the researcher searched was UniProt. UniProt is a comprehensive resource for protein sequences and annotation data, formed by a collaboration among the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) (10). To create the UniProt database, non-curated data retrieved from selected databases, e.g., Gene Ontology (for functional information), PRIDE (for protein identification), InterPro (to determine protein families and domains), IntAct (to determine protein-protein interactions) and IntEnz (for enzyme information), are curated and annotated. In addition, signal sequences, transmembrane domains and, where not otherwise available, domains and motifs are predicted by UniProt to provide detailed information. Rather than sorting through all the publications retrieved from PubMed, the researcher used UniProt as the source to collect the information summarized in Table 1. Analysis of the information retrieved from the UniProt database showed that four of the proteins were cytoplasmic, two were mitochondrial, one was endoplasmic reticulum (ER)-associated and one was a secreted protein. Two of the cytoplasmic proteins, vimentin and glial fibrillary acidic protein, were structural proteins belonging to the class-III intermediate filaments. The researcher knew that intermediate filaments provide mechanical strength to cells and tissues. However, recent studies demonstrated that intermediate filaments can also act as powerful modulators of cell motility and migration and may play crucial roles in wound healing and tissue regeneration, as well as in inflammatory and immune responses (11). Malfunctioning of intermediate filaments has been shown to contribute to a diverse range of pathologies, placing them at the interface between health and disease.
One of the differentially regulated proteins, ATP synthase alpha, was a mitochondrial protein that directly participates in ATP synthesis driven by the electron transport chain. Another mitochondrial protein that was subject to regulation was malate dehydrogenase, which contributes to the production of reduced nicotinamide adenine dinucleotide (NADH) in the citric acid cycle. Malate dehydrogenase also helps transport cytoplasmic NADH into mitochondria via the malate-aspartate shuttle and thereby increases the yield of cellular energy production. The researcher observed an increase in ATP synthase levels, while a dramatic decrease was observed in malate dehydrogenase levels. Based on prior knowledge about tumor cells and their high energy requirement for fast growth, the researcher was expecting to observe an increase in the levels of both proteins. However, mitochondria in cancer cells show active oxidative phosphorylation although the citric acid cycle is stalled (12). This may explain why the levels of the ATP synthase alpha subunit increased while the levels of malate dehydrogenase decreased. Further examination of Table 1 by the researcher revealed that the levels of two cytoplasmic proteins had changed. One of those proteins was Hsp90, a co-chaperone that binds numerous kinases and helps their proper folding. Hsp90 is involved in many cellular processes by facilitating protein folding, the binding of ligands to their receptors and the assembly of multi-protein complexes (13). The other cytoplasmic protein the researcher identified was 14-3-3. 14-3-3 proteins have the ability to bind a multitude of functionally diverse signaling proteins, including kinases, phosphatases, and transmembrane receptors. This plethora of interacting proteins allows 14-3-3 to play important roles in a wide range of vital regulatory processes (14). The researcher was also surprised to see that an ER-associated protein, endoplasmin, was subject to regulation in tumor cells.
Endoplasmin is also a molecular chaperone and functions in the processing and transport of secreted proteins. It is especially associated with the ER-associated degradation (ERAD) pathway, and its levels were found to be higher in breast cancer patients with decreased survival (15). Along the same lines, the researcher noticed a change in the level of a secreted protein, serotransferrin. Serotransferrin is an iron-transporting protein that is highly abundant in serum (16), yet it is also found in other tissues. This raised the question of whether there is any association between the changes observed in endoplasmin levels and those observed in serotransferrin levels.
The PANTHER Analysis
To associate protein functions with metabolic pathways, the researcher analyzed the data using the PANTHER server (http://pantherdb.org/). The PANTHER classification system combines protein function, ontology, pathways and statistical analysis tools to enable researchers to analyze large-scale, proteome-wide data. PANTHER allows immediate classification of proteome data based on molecular function, biological process, cellular component, protein class and pathway. When the researcher performed the PANTHER analysis, the following conclusions were drawn from the data: (1) changes in protein levels mainly affected metabolic processes; (2) the differentially regulated proteins mostly displayed binding activity; (3) there was no consensus on a single affected pathway (Figure 1). Unfortunately, the information gathered was not sufficient to establish a link among the regulated proteins, which might have been caused by the small number of entries the researcher used. A larger set of proteomics data might have been needed to establish such a link.
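The pie-chart view of a classification boils down to counting how often each category is assigned and expressing the counts as shares of all assignments. The sketch below shows that arithmetic on placeholder annotations; the protein labels and category assignments are invented for illustration and are not the PANTHER output of this study.

```python
from collections import Counter

def category_shares(annotations):
    """Return each category's percentage of all category assignments.

    `annotations` maps a protein label to the list of categories assigned
    to it; a category is counted once per protein.
    """
    counts = Counter(cat for cats in annotations.values() for cat in set(cats))
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}

# Placeholder assignments for four hypothetical proteins.
annotations = {
    "protein_1": ["binding"],
    "protein_2": ["binding", "catalytic activity"],
    "protein_3": ["catalytic activity"],
    "protein_4": ["binding"],
}
shares = category_shares(annotations)
print(shares)
```

With only a handful of input proteins, a single assignment shifts the shares considerably, which is one reason a small data set gives little support for conclusions at the pathway level.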
The BioGrid Analysis
BioGrid is an interaction repository that draws on 64,583 publications covering 1,546,181 protein and genetic interactions from 66 model organisms. The researcher used the BioGrid database to elucidate the interaction partners of the identified proteins. The results are summarized in Table 2. When a minimum evidence level of 2 was used, vast numbers of interactions were detected for each protein, indicating that the proteins were behaving as hubs for balancing any change in metabolic activities. The highest number of interactions belonged to Hsp90 (429 interactions), while the lowest number belonged to serotransferrin. BioGrid can provide an interaction network for each protein. The stringency of a network can be set by restricting the level of evidence. For example, for Hsp90 (Q16543), if a minimum evidence level of 2 is set, a highly complex network with many interaction partners is created (Figure 2). On the other hand, if a network is created with more stringent criteria, such as a minimum evidence level of 5 and consideration of physical interactions only, then only interactions supported by strong evidence are displayed (Figure 3). BioGrid provides information about the interaction partners of each protein but does not show whether a group of proteins interact with each other or share a common pathway. To determine the common interaction partners, the interaction partners of each protein were downloaded into an Excel file and cross-comparisons were made (Supplementary Material 1, Table 3).
Three of the identified proteins, endoplasmin, ATP synthase subunit alpha and 14-3-3, shared many interaction partners, while two of the identified proteins, serotransferrin and glial fibrillary acidic protein, shared few, implying that these two proteins may not be helpful in understanding the observed changes occurring at the proteome level. Malate dehydrogenase and vimentin shared a moderate number of interaction partners, indicating that they are valuable in providing clues about the changes occurring in cell metabolism. However, further bioinformatics analysis is still required to draw more conclusive interpretations.
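The cross-comparison of partner lists, done here in a spreadsheet, is equivalent to intersecting sets of interaction partners for every pair of query proteins. The sketch below shows one way to do this; the function name and the partner lists are hypothetical placeholders rather than the downloaded BioGrid data.

```python
from itertools import combinations

def shared_partner_counts(partners):
    """Count the interaction partners shared by each pair of query proteins.

    `partners` maps a query protein to the set of its interaction partners.
    """
    return {
        (a, b): len(partners[a] & partners[b])
        for a, b in combinations(sorted(partners), 2)
    }

# Placeholder partner sets; real lists would come from the downloaded tables.
partners = {
    "query_A": {"x1", "x2", "x3", "x4"},
    "query_B": {"x2", "x3", "x4", "x5"},
    "query_C": {"x9"},
}
overlap = shared_partner_counts(partners)
for pair, n in sorted(overlap.items(), key=lambda item: -item[1]):
    print(pair, n)
```

Pairs with a large overlap are candidates for belonging to the same functional neighbourhood, while proteins with no overlap contribute little to a common interpretation.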
The STRING Analysis
The STRING database integrates protein-protein interactions using direct (physical) and indirect (functional) associations. The database collects and reassesses available experimental data on protein-protein interactions, and imports known pathways and protein complexes from curated databases. An advantage of STRING is its flexibility, which allows researchers to set their own stringency conditions. The initial STRING analysis, run with high confidence, without additional interactors in the first and second shells, and with text mining, experiments and co-expression data as the active interaction sources, indicated that three different interactomes could be detected (Figure 4). The first interactome contains glial fibrillary acidic protein, transferrin and vimentin, the second contains malate dehydrogenase and ATP synthase subunit alpha, and the last contains Hsp90 and CDC37. However, with these settings it was not possible to obtain a statistically reliable analysis that would point to a biological process or a cellular event. When the researcher gradually decreased the stringency of the settings by increasing the maximum number of interactions to 5 in the first and second shells and considering all the interaction sources provided by the database, two disconnected interactomes were generated (Figure 5). The analysis of the results revealed several statistically significant biological processes. The highest-scoring relevant biological process was mitochondrial ATP synthesis coupled proton transport, with a false discovery rate of 9.4e-21.
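A false discovery rate such as the one quoted above comes from correcting enrichment p-values for multiple testing. The sketch below implements the Benjamini-Hochberg adjustment in plain Python to show the arithmetic; it assumes that this is the correction applied by the enrichment tool, and the p-values in the example are arbitrary placeholders, not values from this analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    The p-value with rank k (1 = smallest) among m tests is scaled to
    p * m / k; a running minimum taken from the largest rank downwards
    keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Arbitrary example p-values for four hypothetical enrichment terms.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
print(fdr)
```

A term is usually reported as significant when its adjusted value falls below a chosen cut-off, which is stricter than applying the same cut-off to the raw p-value.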
When little stringency was applied to the STRING analysis of the data, such as a medium confidence level with a maximum of 50 interactions in the first and second shells and the use of all active interaction sources, a huge interactome covering all the detected proteins was generated (Figure 6). However, the complexity of this interactome prevented the researcher from elucidating the significance of any of the interactions displayed on the interactome map. Therefore, the researcher preferred to use high confidence with all active interaction sources and a maximum of 5 interactions in the first and second shells (Figure 7). The researcher found that the detected proteins are part of an interactome associated with cellular respiration. The researcher also detected a minor interactome formed by three proteins, including glial fibrillary acidic protein and vimentin. The researcher decided to analyze these three proteins separately, using high confidence and no more than 10 interactions in the first and second shells, to elucidate the associated biological processes. The separate analysis detected an interactome associated with the phosphatidylinositol-mediated signaling pathway.
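Separating a network into disconnected interactomes corresponds to finding the connected components of a graph. The sketch below does this with a breadth-first search. The node names follow the grouping described for the first STRING run, but the edges are illustrative placeholders chosen to reproduce that grouping, not STRING output.

```python
from collections import deque

def interactomes(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

nodes = ["glial fibrillary acidic protein", "transferrin", "vimentin",
         "malate dehydrogenase", "ATP synthase subunit alpha",
         "Hsp90", "CDC37"]
# Placeholder edges (assumption): only the grouping is taken from the text.
edges = [
    ("glial fibrillary acidic protein", "vimentin"),
    ("vimentin", "transferrin"),
    ("malate dehydrogenase", "ATP synthase subunit alpha"),
    ("Hsp90", "CDC37"),
]
groups = interactomes(nodes, edges)
for group in groups:
    print(sorted(group))
```

Adding edges, as happens when the confidence threshold is lowered, merges components, which is why a relaxed setting ends in one large interactome.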
Gaining insight into protein functions and their interaction partners is of paramount importance for unlocking the mysteries of the cell. New insights into the mechanics of large protein assemblies have been helping us to understand how proteins communicate with each other and work in harmony. One way to deepen our understanding of the protein world is to use the tools of bioinformatics. For inexperienced researchers, however, it is rather difficult to decide where to start when interpreting their experimental data. In this communication, we described how to easily reach relevant biological information for the analysis of proteome data and how to restructure the information needed for interpretation. The order in which the databases and the relevant software were used can be extended to the analysis of more complicated experimental data sets. Other software such as GeneMANIA (https://genemania.org/), integrative multispecies prediction (http://imp.princeton.edu/), HumanNet (http://www.functionalnet.org), FunCoup (http://funcoup.sbc.su.se), HINT (http://hint.yulab.org/), iRefWeb (http://wodaklab.org/iRefWeb/), APID (http://cicblade.dep.usal.es:8080/APID/init.action) and Cytoscape (http://www.cytoscape.org/) may be included in the analysis, but for newcomers to the field we kept our analysis as simple as possible. Here, even with a small number of input proteins, we demonstrated that with the right bioinformatics tools a protein network can be created that sheds light on the observed changes occurring at the proteome level. Using the created network, novel hypotheses can be generated for future experimental investigation.
1. Yaciuk P (2007). Co-immunoprecipitation of protein complexes. Methods Mol Med 131:103-111
2. Bruckner A, Polge C, Lentze N, Auerbach D, Schlattner U (2009). Yeast two-hybrid, a powerful tool for systems biology. Int J Mol Sci 10:2763-2788
3. Ma L, Yang F, Zheng J (2014). Application of fluorescence resonance energy transfer in protein studies. J Mol Struct 1077:87-100
4. Li Y (2011). The tandem affinity purification technology: an overview. Biotechnol Lett 33:1487-1499
5. Kerppola TK (2008). Bimolecular fluorescence complementation (BiFC) analysis as a probe of protein interactions in living cells. Annu Rev Biophys 37:465-487
6. Sundell GN, Ivarsson Y (2014). Interaction analysis through proteomic phage display. Biomed Res Int 2014:176172
7. Pham ND, Parker RB, Kohler JJ (2012). Photocrosslinking approaches to interactome mapping. Curr Opin Chem Biol 17:90-101
8. Gasilova N, Nazabal A (2011). Monitoring ligand modulation of protein-protein interactions by chemical cross-linking and High Mass MALDI mass spectrometry. Methods Mol Biol 803:219-229
9. Koos B, Andersson L, Clausson CM, Grannas K, Klaesson A, Cane G, Soderberg O (2013). Analysis of protein interactions in situ by proximity ligation assays. Curr Top Microbiol Immunol 377:111-126
10. UniProt Consortium T (2018). UniProt: the universal protein knowledgebase. Nucleic Acids Res 46:2699
11. Cheng F, Eriksson JE (2017). Intermediate Filaments and the Regulation of Cell Motility during Regeneration and Wound Healing. Cold Spring Harb Perspect Biol 9:
12. Wallace DC (2012). Mitochondria and cancer. Nat Rev Cancer 12:685-698
13. Zuehlke AD, Neckers L (2016). Clients Place Unique Functional Constraints on Hsp90. Trends Biochem Sci 41:562-564
14. Yang X, Lee WH, Sobott F, Papagrigoriou E, Robinson CV, Grossmann JG, Sundstrom M, Doyle DA, Elkins JM (2006). Structural basis for protein-protein interactions in the 14-3-3 protein family. Proc Natl Acad Sci U S A 103:17237-17242
15. Nishikawa S, Brodsky JL, Nakatsukasa K (2005). Roles of molecular chaperones in endoplasmic reticulum (ER) quality control and ER-associated degradation (ERAD). J Biochem 137:551-555
16. Congiu Castellano A, Barteri M, Castagnola M, Bianconi A, Borghi E, Della Longa S (1994). Structure-function relationship in the serotransferrin: the role of the pH on the conformational change and the metal ions release. Biochem Biophys Res Commun 198:646-652