Utilization of bioinformatics tools for understanding proteomics data
Department of Biomedical Engineering, Kocaeli University Technology Faculty, Kocaeli
Department of Biology, Muğla Sıtkı Koçman University, Muğla
ABSTRACT
Scientific discoveries in life sciences rely on organizing biological knowledge and enabling its rapid utilization. In fact, in most cases, interpretation of genomics, functional genomics, transcriptomics, metabolomics and proteomics data is much more important than producing it. On the World Wide Web, there are many freely available databases and relevant bioinformatics software for interpretation of proteomics data. Some of these databases and relevant software focus on analysis of the core data. However, interpretation of the output files from those databases and relevant software can sometimes be challenging. In this study, we aimed at using several often-used, freely available databases and relevant software to interpret proteomics data. Our aim was to show how to reach these databases in order and utilize the provided information to reduce data complexity. Our experimental setup contained eight proteins picked from a study performed in our laboratory. Our analysis revealed the presence of a major and a minor interactome affected by the changes at the protein levels. We also described how to easily reach relevant biological information for the affected proteins and sort out what is needed for data interpretation.
INTRODUCTION
In a simple experimental setting, a researcher can measure the activity of a protein, determine its concentration in a solution or assess its stability under various experimental conditions. Such information is useful in understanding the individual properties of that protein. However, without knowing how proteins behave in a living cell, it is highly difficult to assess their physiological functions because proteins interact with each other and with other biological molecules to carry out their tasks. These interactions determine when proteins ought to be active, how long they should stay active, what type of post-translational modifications they are subjected to and which molecular pathways they are involved in. Several experimental approaches have been developed to determine the nature of interactions among proteins. Co-immunoprecipitation (Co-IP) is considered to be the gold standard for determining protein–protein interactions (1). In Co-IP, the protein of interest is isolated with a specific antibody while its interaction partners are in contact with it. A subsequent mass spectrometry analysis is then used for identification of the interaction partners. To assess whether two proteins interact with each other, the yeast two-hybrid system is often used and offers some advantages over Co-IP, such as easy scale-up, no need for purification steps or for optimization of binding or washing conditions, and possible automation using robotic platforms (2). Fluorescence resonance energy transfer (FRET) is another experimental approach that detects the proximity of fluorescently labeled molecules over distances of <100 Å (3). FRET can map protein-protein interactions in vivo and offers the opportunity to study the complex behavior of key regulatory proteins in their natural environment. Tandem affinity purification (TAP) is a less preferred experimental approach for studying protein–protein interactions (4).
The TAP tag allows determination of protein partners quantitatively in vivo without prior knowledge of complex composition. Other methods to investigate protein-protein interactions include bimolecular fluorescence complementation (BiFC) (5), phage display (6), in vivo crosslinking of protein complexes using photo-reactive amino acid analogs (7), chemical cross-linking followed by high-mass matrix-assisted laser desorption/ionization (MALDI) mass spectrometry (8) and proximity ligation assay (PLA) (9). Using these or similar experimental approaches, researchers have collected vast amounts of data regarding protein-protein interactions. To provide a critical assessment and integration of protein-protein interactions using these huge data sets, a number of databases and online resources dedicated to protein networks were created. In this communication, we wish to demonstrate how to make sense of a protein data set obtained from a proteomics study by using several commonly available bioinformatics tools.
MATERIALS
Selection of the Differentially Regulated Proteins
Preliminary data from two-dimensional gel electrophoresis experiments that were carried out in our laboratory to study changes in protein expression levels of breast cancer subtypes were used, and eight differentially regulated proteins out of 32 were selected for analysis. Their regulation levels were the highest among all, indicating that the selected proteins might be good candidates for a functional analysis. To keep the focus on bioinformatics analysis of the selected data, we did not give any details about the experimental procedures regarding collection of the proteomics data. They will be published elsewhere.
PubMed Search
PubMed search (https://www.ncbi.nlm.nih.gov/pubmed/) was carried out using appropriate keywords and the names of the regulated proteins both in abbreviated and spelled out formats. The key words were function, sub-cellular localization, distribution of tissue expression levels, interaction partners, post translational modification, structure, disease and pathology. In some cases, to limit, widen or define the searches, the Boolean operators AND, OR, and NOT were used. The retrieved data were manually analyzed and interpreted.
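As an illustration of how protein names and keywords can be combined with Boolean operators, the following is a minimal sketch, not the procedure used in this study. The function names, the exact grouping of terms, the gene symbol HSP90B1 used as the abbreviated name of endoplasmin, and the excluded term are assumptions made for the example. The URL points to the NCBI E-utilities search service; the code only builds the address and does not send a request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (only used to build a URL; no request is sent).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(protein_names, keywords, excluded=()):
    """Combine protein names and keywords into one Boolean query string.

    Names are joined with OR, keywords are joined with OR, the two groups
    are joined with AND, and every excluded term is appended with NOT.
    """
    names = " OR ".join(f'"{name}"' for name in protein_names)
    topics = " OR ".join(f'"{keyword}"' for keyword in keywords)
    query = f"({names}) AND ({topics})"
    for term in excluded:
        query += f' NOT "{term}"'
    return query


def build_esearch_url(query, retmax=20):
    """Return the E-utilities URL that would run the query against PubMed."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


query = build_pubmed_query(
    ["endoplasmin", "HSP90B1"],  # spelled-out and abbreviated names (assumed)
    ["function", "subcellular localization", "pathology"],
    excluded=["review"],  # hypothetical exclusion term
)
print(query)
print(build_esearch_url(query))
```

Widening a search corresponds to adding names or keywords to the OR groups, while limiting it corresponds to adding excluded terms or further AND groups.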
UniProt Search
UniProt search (http://www.uniprot.org/) was carried out using the names of the regulated proteins in abbreviated or spelled out formats. To limit the retrieved data, the name of the preferred organism (Homo sapiens) was entered into the search engine along with the name of the protein. Unreviewed entries were not retrieved and not used for any of the analyses. Entries with high annotation scores (4 to 5) were considered.
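The selection rules above (human entries only, reviewed entries only, annotation score of 4 or 5) can be expressed as a simple filter. The sketch below is only illustrative: the Entry record, its field names and all listed values are assumptions, and every accession except Q16543 (quoted later in the text for Hsp90) is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str
    protein: str
    organism: str
    reviewed: bool
    annotation_score: int  # 1 (lowest) to 5 (highest)


def select_entries(entries, organism="Homo sapiens", min_score=4):
    """Keep reviewed entries of the chosen organism with a high annotation score."""
    return [
        entry
        for entry in entries
        if entry.reviewed
        and entry.organism == organism
        and entry.annotation_score >= min_score
    ]


# Illustrative records only; scores and placeholder accessions are invented.
entries = [
    Entry("Q16543", "Hsp90", "Homo sapiens", True, 5),
    Entry("X00001", "Hsp90", "Mus musculus", True, 5),      # wrong organism
    Entry("X00002", "vimentin", "Homo sapiens", False, 2),  # unreviewed
    Entry("X00003", "vimentin", "Homo sapiens", True, 3),   # score too low
    Entry("X00004", "vimentin", "Homo sapiens", True, 4),
]

selected = select_entries(entries)
print([entry.accession for entry in selected])
```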
The PANTHER Analysis
The PANTHER analysis (Protein Analysis THrough Evolutionary Relationships, http://PANTHERdb.org/) was carried out using the UniProt accession numbers of the regulated proteins. The organism was specified as Homo sapiens and functional classifications were viewed as pie charts. Selected ontologies included molecular function, biological process, cellular component, protein class and pathway. For each ontology, manual analysis was performed by listing the selected proteins and cross-checking their given properties with UniProt entries.
The STRING Analysis
STRING analysis (https://string-db.org/) was carried out using the UniProt accession numbers of the regulated proteins. The search engine option was set to “multiple proteins by names/identifiers” and the organism was specified as Homo sapiens. The retrieved proteins were manually checked to ensure that they were all correctly retrieved from the database. Whole genome analysis was the preferred choice. The settings tab was used to change the stringency of the analysis. The results were downloaded as bitmap images and were recreated in Adobe Illustrator Version 6.
The BioGrid Analysis
The BioGrid analysis (https://thebiogrid.org/) was carried out using the UniProt accession numbers of the regulated proteins. The organism was specified as Homo sapiens and the protein accession numbers were entered into the search engine one at a time. The retrieved data were copied and pasted into Microsoft Excel for cross-comparative analysis. The minimum evidence level was set using a pull-down menu under the network tab. The networks were downloaded as bitmap images and were recreated in Adobe Illustrator Version 6.
RESULTS
Experimental Scheme and Collection of the Data
A researcher using two-dimensional gel electrophoresis determined changes in expression profiles
of eight proteins in human breast tumor tissues
in comparison to the control tissues. During the
follow up experiments, mass spectrometry analysis was performed and differentially regulated
proteins were identified. The identified proteins
were Hsp90, endoplasmin, adenosine triphosphate (ATP) synthase subunit alpha, malate
dehydrogenase, serotransferrin, vimentin, 14-3-3
protein theta and glial fibrillary acidic protein. To
evaluate the data, the researcher needed to gather information about each differentially regulated
protein to determine its function, sub-cellular localization, distribution of tissue expression levels, its
interaction partners, the presence of post translational modifications, its structure and any association with a known pathology.
PubMed Search
The first database the researcher searched was
PubMed. PubMed is a server established by The
National Center for Biotechnology Information
(NCBI) to advance science and health by
providing access to biomedical and genomic information. PubMed comprises more than 28 million
citations for biomedical literature from MEDLINE, life science journals, and online books. The
researcher performed a PubMed search using
the protein name and several keywords for each
protein. For example, for endoplasmin alone, 585
publications were retrieved using the key words
function, subcellular localization, expression, interaction, PTM, structure and pathology. For all
the identified proteins, there were more than 2000
publications to sort out for the information the researcher needed. Therefore, a more practical approach was used to gather information about all
the identified proteins.
UniProt Database Search
The second database the researcher searched was
UniProt. UniProt is a comprehensive resource for
protein sequence and annotation data, formed
by a collaboration among the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of
Bioinformatics (SIB) and the Protein Information
Resource (PIR) (10). To create the UniProt database, non-curated data retrieved from selected databases, e.g., Gene Ontology (for functional information), PRIDE (for protein identification), InterPro
(to determine protein families and domains), IntAct
(to determine protein-protein interactions) and IntEnz (for enzyme information), are curated and annotated. In addition, signal sequences, transmembrane domains and, if not available, domains and
motifs are predicted by UniProt to provide detailed
information. Rather than sorting through all the
publications retrieved from PubMed, the researcher used UniProt as the source to collect the
information summarized in Table 1. Analysis of the information retrieved from the UniProt database
showed that four of the proteins were cytoplasmic,
two were mitochondrial, one was Endoplasmic Reticulum (ER)-associated and one was a secreted
protein. Two of the cytoplasmic proteins, vimentin and glial fibrillary acidic protein, were structural proteins belonging to class-III intermediate
filaments. The researcher knew that intermediate filaments provide mechanical strength to the
cells and tissues. However, recent studies demonstrated that intermediate filaments can also act as
powerful modulators of cell motility and migration
and may play crucial roles in wound healing and
tissue regeneration, as well as inflammatory and
immune responses (11). It was demonstrated that
malfunctioning of intermediate filaments contributes to a diverse range of pathologies, placing them
at the interface between health and disease.
One of the differentially regulated proteins, ATP
synthase alpha, was a mitochondrial protein that
directly participates in ATP synthesis via the electron transport chain. Another mitochondrial protein that was subjected to regulation was
malate dehydrogenase, which helps produce
nicotinamide adenine dinucleotide (NADH) via
the citric acid cycle. Malate dehydrogenase also helps
transport cytoplasmic NADH to mitochondria
via the malate-aspartate shuttle and increases the
yield of cellular energy production. The researcher
observed an increase in ATP synthase levels while
a dramatic decrease was observed in malate dehydrogenase levels. Based on prior knowledge
about tumor cells and their high energy requirement for fast growth, the researcher was expecting
to observe an increase in the levels of both proteins.
However, mitochondria in cancer cells show active
oxidative phosphorylation although the citric acid cycle is stalled (12). This may explain the
increase in the levels of the
ATP synthase alpha subunit contrary to the decrease in
the levels of malate dehydrogenase.
Further examination of Table 1 by the researcher
revealed that the levels of two cytoplasmic proteins
had changed. One of those proteins was Hsp90,
a co-chaperone that binds numerous kinases and
helps their proper folding. Hsp90 is involved in many
cellular processes by facilitating protein folding,
the binding of ligands to their receptors and the
assembly of multi-protein complexes (13).
The other cytoplasmic protein the researcher identified was 14-3-3. 14-3-3 proteins have the ability
to bind a multitude of functionally diverse signaling proteins, including kinases, phosphatases, and
transmembrane receptors. This plethora of interacting proteins allows 14-3-3 to play important
roles in a wide range of vital regulatory processes
(14). The researcher was also surprised to see that
an ER-associated protein, endoplasmin, was subjected to regulation in tumor cells. Endoplasmin
is also a molecular chaperone and functions in the
processing and transport of secreted proteins. It is
especially associated with the ER-associated degradation (ERAD) pathway and its levels were found to
be higher in breast cancer patients with decreased
survival (15). Along the same line, the researcher
noticed a change in the level of a secreted protein,
serotransferrin. Serotransferrin is an iron-transporting protein that is abundant in serum
(16). Yet, it is also found in other tissues. This raised the question of
whether there is any association between the changes observed in endoplasmin levels and serotransferrin
levels.
The PANTHER Analysis
To associate protein functions with metabolic
pathways, the researcher analyzed the data using the PANTHER server (http://pantherdb.org/). The
PANTHER classification system combines protein
function, ontology, pathways and statistical analysis tools to enable researchers to analyze large-scale,
proteome-wide data. PANTHER allows immediate classification of the proteome data based on molecular function, biological process, cellular
component, protein class and pathway. When the
researcher performed the PANTHER analysis,
the following conclusions were drawn from
the data: (1) changes in protein levels mainly
affected metabolic processes; (2) the differentially
regulated proteins mostly displayed binding activity; and (3) there was no consensus on a single affected
pathway (Figure 1). Unfortunately, the information gathered was not sufficient to establish
a link among the regulated proteins, which might
have been caused by the small number of entries
the researcher used. A larger set of proteomics data
might have been needed to establish a
link among the differentially regulated proteins.
The BioGrid Analysis
BioGrid is an interaction repository database that
searches 64,583 publications for 1,546,181 protein
and genetic interactions from 66 model organisms.
The researcher used BioGrid database to elucidate
the interaction partners of the identified proteins.
The results generated by the researcher were summarized in Table 2. When a minimum evidence
level of 2 was used, there were vast numbers of interactions detected for each protein indicating that
the proteins were behaving as hubs for balancing
any change in the metabolic activities. The highest
number of interactions belonged to Hsp90 (429 interactions) while the lowest number of interactions
belonged to serotransferrin. BioGrid, as a database,
can provide interaction networks for each protein. The stringency of a network can be determined by
restricting the level of evidence. For example, for
Hsp90 (Q16543), if a minimum evidence level of
2 is set, then a highly complex network with
many interaction partners can be created (Figure 2). On the other hand, if a network is created
with more stringent criteria, such as a minimum
evidence level of 5 with sole consideration of physical interactions, then only interactions supported
by strong evidence are displayed (Figure 3). BioGrid also provides information about the interaction partners for each protein but fails to determine whether a group of proteins interact with
each other or share a common pathway. To determine the common interaction partners, the interaction partners for each protein were downloaded
into an Excel file and cross comparisons were made
(Supplementary Material 1, Table 3). Three of the identified proteins, endoplasmin, ATP
synthase subunit alpha and 14-3-3, shared many
interaction partners, while two of
the identified proteins, serotransferrin and glial
fibrillary acidic protein, shared few
interaction partners, implying that these two
proteins may not be helpful in understanding the
observed changes occurring at the proteome level. Malate dehydrogenase and vimentin
shared a moderate number of interaction partners, indicating that they are valuable
in providing clues about the changes occurring in
cell metabolism. However, further bioinformatics
analysis is still required to draw more conclusive
interpretations.
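The cross comparison of interaction partners, carried out here in Excel, can also be sketched in a few lines of code. The example below is a hedged illustration rather than the workflow used in this study. It assumes that the evidence level of an interaction is the number of supporting records, and the partner labels (P1 to P5) and evidence counts are invented for the example.

```python
from itertools import combinations


def filter_partners(interactions, min_evidence=2):
    """Keep partners supported by at least `min_evidence` records.

    `interactions` maps a partner name to its number of supporting records.
    """
    return {
        partner
        for partner, evidence in interactions.items()
        if evidence >= min_evidence
    }


def shared_partners(partner_sets):
    """Return the partners shared by every pair of query proteins."""
    return {
        (a, b): sorted(partner_sets[a] & partner_sets[b])
        for a, b in combinations(sorted(partner_sets), 2)
    }


# Made-up partner lists: partner label -> number of supporting records.
raw = {
    "endoplasmin": {"P1": 3, "P2": 2, "P3": 1},
    "14-3-3 theta": {"P1": 5, "P2": 2, "P4": 2},
    "serotransferrin": {"P3": 4, "P5": 2},
}

partner_sets = {protein: filter_partners(hits) for protein, hits in raw.items()}
overlaps = shared_partners(partner_sets)
for pair, partners in overlaps.items():
    print(pair, len(partners), partners)
```

In this toy data set, endoplasmin and 14-3-3 theta share two partners, whereas serotransferrin shares none once weakly supported interactions are removed, which mirrors the kind of comparison summarized in Table 3.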
The STRING Analysis
The STRING database integrates protein-protein
interactions using direct (physical) and indirect
(functional) associations. The database collects
and reassesses available experimental data on
protein–protein interactions, and imports known
pathways and protein complexes from curated databases. The advantage of STRING is
that the database is flexible, allowing researchers
to determine their own stringency conditions.
The initial STRING analysis with high confidence
and no consideration of the maximum number of interactions in the first and second shells, and using text mining, experiments and co-expression data as the
active interaction sources, indicated that three different interactomes could be detected (Figure 4).
The first interactome contained glial fibrillary acidic
protein, serotransferrin and vimentin; the second interactome contained malate dehydrogenase and
ATP synthase subunit alpha; and the last interactome contained Hsp90 and CDC37. However, it was
not possible to produce a statistically reliable analysis that would point to a biological process or a
cellular event using the preferred settings. When
the researcher gradually decreased the stringency
of the settings by increasing the maximum number of interactions to 5 in the first and the second shells and considering all the possible interaction
sources provided by the database, two disconnected interactomes were generated (Figure 5). The
analysis of the results revealed the presence of
several statistically significant biological processes. The highest-scoring relevant biological process
was determined to be mitochondrial ATP synthesis coupled proton transport, with a false discovery rate of 9.4e-21. If little stringency is applied to
the STRING analysis for the data produced by the
researcher, such as a medium confidence level with
a maximum number of interactions of 50 for the first
and the second shells, and the use of all the active interaction sources, then a huge interactome
covering all the detected proteins is generated (Figure 6). However, the complexity of the interactome prevented the researcher from elucidating the
significance of any of the interactions displayed on the interactome map. Therefore, the researcher preferred to use high confidence with all active
interaction sources and a maximum of 5 interactions in the first and the second shells (Figure 7). The researcher found that the detected
proteins were part of an interactome associated with
cellular respiration. The researcher also detected a
minor interactome formed by three proteins, including glial fibrillary acidic protein and vimentin. The
researcher decided to analyze these three proteins
separately to elucidate the associated biological processes, using high confidence and no more than
10 interactions in the first and the second shells.
The separate analysis detected an interactome associated with the phosphatidylinositol-mediated signaling pathway.
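The effect of the confidence setting on the number of interactomes can be illustrated with a small sketch. It keeps only the interactions whose combined score reaches a chosen cutoff and then groups the proteins into connected components, each of which corresponds to one disconnected interactome. The cutoffs 0.4 (medium) and 0.7 (high) follow the STRING confidence levels; the edge list and its scores are invented for the example and are not the results of this study.

```python
from collections import defaultdict

MEDIUM_CONFIDENCE = 0.4
HIGH_CONFIDENCE = 0.7


def filter_edges(edges, min_score):
    """Keep the protein pairs whose combined score reaches the cutoff."""
    return [(a, b) for a, b, score in edges if score >= min_score]


def connected_components(nodes, edges):
    """Group nodes into disconnected interactomes with a depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Made-up scores, chosen only to show how the cutoff changes the picture.
edges = [
    ("GFAP", "vimentin", 0.95),
    ("vimentin", "serotransferrin", 0.72),
    ("malate dehydrogenase", "ATP synthase alpha", 0.80),
    ("CDC37", "Hsp90", 0.99),
    ("ATP synthase alpha", "Hsp90", 0.45),
    ("serotransferrin", "malate dehydrogenase", 0.41),
]
nodes = {name for a, b, _ in edges for name in (a, b)}

high = connected_components(nodes, filter_edges(edges, HIGH_CONFIDENCE))
medium = connected_components(nodes, filter_edges(edges, MEDIUM_CONFIDENCE))
print(len(high), high)
print(len(medium), medium)
```

With these toy scores, the high-confidence cutoff yields three separate interactomes, while the medium-confidence cutoff merges all proteins into a single network, which is the trade-off between interpretability and coverage described above.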
DISCUSSION
Gaining perspectives about protein functions and their interaction partners is of paramount importance for unlocking the mysteries of cells. New insights into the mechanics of large protein assemblies have been helping us to understand how proteins talk with each other and work in harmony. One of the ways to deepen our understanding of the protein world is to use the tools of bioinformatics. For inexperienced researchers, however, it is rather difficult to assess where to start in interpreting their experimental data. In this communication, we described how to easily reach relevant biological information for analysis of proteome data and how to restructure the information needed for interpretation. The order in which the databases and the relevant software were used can be extrapolated to the analysis of more complicated experimental data sets. Other software such as GeneMANIA (https://genemania.org/), Integrative Multi-species Prediction (http://imp.princeton.edu/), HumanNet (http://www.functionalnet.org), FunCoup (http://funcoup.sbc.su.se), HINT (http://hint.yulab.org/), iRefWeb (http://wodaklab.org/iRefWeb/), APID (http://cicblade.dep.usal.es:8080/APID/init.action) and Cytoscape (http://www.cytoscape.org/) may be included in the analysis, but for newcomers to the field we kept our analysis as simple as possible. Here, even with a small number of input data, we demonstrated that, using the right bioinformatics tools, a protein network can be created that sheds light on the observed changes occurring at the proteome level. Using the created network, novel hypotheses can be generated for future experimental investigation.
REFERENCES
1. Yaciuk P (2007). Co-immunoprecipitation of protein complexes. Methods Mol Med 131:103-111
2. Bruckner A, Polge C, Lentze N, Auerbach D, Schlattner U (2009). Yeast two-hybrid, a powerful tool for systems biology. Int J Mol Sci 10:2763-2788
3. Ma L, Yang F, Zheng J (2014). Application of fluorescence resonance energy transfer in protein studies. J Mol Struct 1077:87-100
4. Li Y (2011). The tandem affinity purification technology: an overview. Biotechnol Lett 33:1487-1499
5. Kerppola TK (2008). Bimolecular fluorescence complementation (BiFC) analysis as a probe of protein interactions in living cells. Annu Rev Biophys 37:465-487
6. Sundell GN, Ivarsson Y (2014). Interaction analysis through proteomic phage display. Biomed Res Int 2014:176172
7. Pham ND, Parker RB, Kohler JJ (2012). Photocrosslinking approaches to interactome mapping. Curr Opin Chem Biol 17:90-101
8. Gasilova N, Nazabal A (2011). Monitoring ligand modulation of protein-protein interactions by chemical cross-linking and High Mass MALDI mass spectrometry. Methods Mol Biol 803:219-229
9. Koos B, Andersson L, Clausson CM, Grannas K, Klaesson A, Cane G, Soderberg O (2013). Analysis of protein interactions in situ by proximity ligation assays. Curr Top Microbiol Immunol 377:111-126
10. UniProt Consortium T (2018). UniProt: the universal protein knowledgebase. Nucleic Acids Res 46:2699
11. Cheng F, Eriksson JE (2017). Intermediate Filaments and the Regulation of Cell Motility during Regeneration and Wound Healing. Cold Spring Harb Perspect Biol 9:
12. Wallace DC (2012). Mitochondria and cancer. Nat Rev Cancer 12:685-698
13. Zuehlke AD, Neckers L (2016). Clients Place Unique Functional Constraints on Hsp90. Trends Biochem Sci 41:562-564
14. Yang X, Lee WH, Sobott F, Papagrigoriou E, Robinson CV, Grossmann JG, Sundstrom M, Doyle DA, Elkins JM (2006). Structural basis for protein-protein interactions in the 14-3-3 protein family. Proc Natl Acad Sci U S A 103:17237-17242
15. Nishikawa S, Brodsky JL, Nakatsukasa K (2005). Roles of molecular chaperones in endoplasmic reticulum (ER) quality control and ER-associated degradation (ERAD). J Biochem 137:551-555
16. Congiu Castellano A, Barteri M, Castagnola M, Bianconi A, Borghi E, Della Longa S (1994). Structure-function relationship in the serotransferrin: the role of the pH on the conformational change and the metal ions release. Biochem Biophys Res Commun 198:646-652