Screening and Validation of Independent Predictors of Poor Survival in Pancreatic Cancer

Pancreatic cancer is a digestive system malignant tumor with high mortality and poor prognosis, but the mechanisms of progression remain unclear in pancreatic cancer. It’s necessary to identify the hub genes in pancreatic cancer and explore the novel potential predictors in the prognosis of pancreatic cancer. We downloaded two mRNA expression profiles from Gene Expression Omnibus and The Cancer Genome Atlas Pancreatic Cancer (TCGA-PAAD) datasets to screen the commonly differentially expressed genes in pancreatic cancer by limma package in R. Subsequently, measurement of the functional similarity among the 38 DEGs in common was performed to identify the hub genes using GOSemSim package. Then, survival analysis and Cox regression were applied to explore prognosis-related hub genes using the survival package. Statistics analysis by two-tailed Student’s t-test or one-way based on TCGA-PAAD datasets and qPCR detection in clinical samples were performed to explore the correlations between expression of hub genes in pancreatic cancer tissues and clinical parameters. Based on integrated analysis of TCGA and GEO datasets, we screened 38 DEGs in common, which were all up-regulated. The functional similarity results showed that 10 DEGs including TSPAN1, MSLN, C1orf116, PKP3, CEACAM6, BAIAP2L1, PPL, RAB25, ERBB3, and AP1M2 in the DEGs in common, which had the higher average functional similarity, were considered as the hub genes. Survival analysis results and Cox regression analysis showed that TSPAN1, CEACAM6, as well as ERBB3 were all associated with poor overall survival of PC. qPCR results showed that the expression levels of TSPAN1 and ERBB3 were significantly upregulated in the PC tissues. The statistical analysis results revealed that TSPAN1 expression correlated significantly with histologic grade, T stage, clinical stage, and vital status by two-tailed Student’s t-test or one-way ANOVA; ERBB3 expression correlated significantly with T stage, clinical stage, and vital status by two-tailed Student’s t-test or one-way ANOVA. We found that TSPAN1 and ERBB3 could be independent predictors of poor survival in pancreatic cancer.


INTRODUCTION
Pancreatic cancer (PC) is a common digestive system malignant tumor that is characterized by high mortality and poor prognosis. There are more than 458,918 estimated new cases and 432,242 estimated death cases every year around the world in 2018 [1]. Due to its high malignancy, the 5-years survival rate of PC patients is only 10% [2]. Identification of new independent prognostic biomarkers is still of great significance for patients with PC for improving treatment and prognosis of PC patients.
With the advent of the era of big data, a variety of tumorrelated public databases have sprung up, which provides a large amount of genomic data and its corresponding clinical data for oncology basic medicine and translational medicine. Over the last decades, public databases, including the well-known The Cancer Genome Atlas (TCGA) database, Gene Expression Omnibus (GEO) database, etc., have been widely applied in screening of the molecular mechanisms of PC, which could provide powerful support for identification of effective and accurate prognosis predictors of PC patients [3,4].
In the present study, we identified differentially expressed genes (DEGs) in PC based on a comprehensive analysis of Gene Expression Omnibus (GEO) and The Cancer Genome Atlas Pancreatic Cancer (TCGA-PAAD), and obtained the hub genes through measurement of functional similarity. Subsequently, survival analysis and Cox regression analysis were applied for the identification of prognosis-related hub genes. Statistics analysis in different subgroups and qPCR detection in clinical samples were used to verify the correlation between hub genes and clinical parameters. Therefore, this study was designed to explore novel potential prognosis predictors using bioinformatics and validation in clinical samples (Figure 1).

Data Extraction and Analysis of GEO Datasets
PC and adjacent non-tumor tissue gene expression profiles of GSE16515 [5] and GSE32676 [6] were obtained from the GEO database. The DEGs between PC tissues and adjacent non-tumor tissues from the two cohort profile data sets (GSE16515 and GSE32676) were screened using the limma package in R [7]. The corresponding clinical information is as follows in Supplementary Tables S1, S2.

Validation in TCGA Datasets
The PC gene expression profile and corresponding clinical data for pancreatic cancer were collected from Firebrowse (http:// firebrowse.org). The demographic and clinical characteristics of the corresponding PC patients are as follows in Supplementary  Table S3. The RNA-Seq by Expectation-Maximization (RSEM) expression values were used for statistical analysis. The DEGs between PC tissues and adjacent non-tumor tissues from TCGA datasets were identified using the limma package. Heatmaps of common DEGs among the TCGA-PAAD datasets and 2 GEO datasets were developed by pheatmap package [8].

Identification of Hub Genes Based on the Semantic Similarities of Gene Ontology Terms
After integrated analysis, we performed the measurement of the functional similarity among the DEGs in common to identify the hub genes using the GOSemSim package, which was based on the semantic similarities of Gene Ontology (GO) terms used for gene annotation [9]. The top ten DEGs with the higher average functional similarity were considered as the hub genes.

Functional Analysis of Hub Genes
We performed Kaplan-Meier curves to compare overall survival between the high and low expression group for hub genes using the survival package in R [10], and P values were calculated by the log-rank test. The volcano plots were generated to visualize expression differences for discrete variables by the ggplot2 package in R [11].

Statistical Analysis
Statistical analysis of the results and boxplots was performed using GraphPad Prism 7.0 software. The comparisons of the mean values of the analyzed parameters were performed using two-tailed Student's t-test and one-way ANOVA. The false discovery rate (FDR) controlling was used to correct the p-value with the Benjamini Hochberg algorithm implemented in R 4.0.3 suite (Lucent Technologies). FDR<0.05, P-value<0.05, and |logFC| (fold change) > 1 was considered statistically significant.

Screening and Verification of Differentially Expressed Genes in GEO Datasets and TCGA-PAAD Datasets
PC and adjacent non-tumor tissue gene expression profiles of GSE16515 and GSE32676 were obtained from NCBI-GEO. We used p < 0.05, FDR<0.05 and |logFC|>1 as cut-off criterion. After integrated bioinformatics analysis, a total of 38 upregulated DEGs was identified from the two GEO profile datasets and TCGA-PAAD RNA-Seq expression datasets ( Table 1; Figure 2). We developed heatmaps of the 38 upregulated DEGs in common based on the expression profiles, showing the significantly differential distribution of the 38 DEGs ( Figure 3).

Identification of Hub Genes Based on the Semantic Similarities of Gene Ontology Terms
We performed the measurement of the functional similarity among the common DEGs to identify the hub genes using the GOSemSim package. The results showed that 10 DEGs including TSPAN1, MSLN, C1orf116, PKP3, CEACAM6, BAIAP2L1, PPL, RAB25, ERBB3, and AP1M2, had the higher average functional similarity, which was considered as the hub genes ( Figure 4). The distributions of functional similarities were summarized as Supplementary Table S7.

TSPAN1 and ERBB3 Acts as Independent Prognostic Factors for Poor Survival of Pancreatic Cancer
Survival analysis results showed that high-level expression of TSPAN1 (p 0.008), CEACAM6 (p 0.037), and ERBB3 (p 0.013) was all associated with poor overall survival of PC patients ( Figure 5 and Supplementary Table S6). Moreover, a further subgroup analysis showed that high TSPAN1 expression was associated with poor overall survival of patients with G1 tumors (p 0.035), and N1 stage (p 0.027); high ERBB3 expression was associated with poor overall survival of patients with N1 stage (p 0.011); high CEACAM6 expression was associated with poor overall survival of patients with G1 tumors (p 0.002) (Supplementary Figure 1). Univariate analysis results showed that some clinical characteristics including T stage (p 0.026), N stage (p 0.005), TSPAN1 (p 0.008), CEACAM6 (p 0.039), as well as ERBB3 (p 0.014) expression were associated with poor overall survival ( Table 2). Subsequently, clinical characteristics with a P-value < 0.05 were included for multivariate analysis, and the results showed that TSPAN1 (p 0.04), ERBB3 (p 0.046), and N stage (p 0.012) expression were associated with poor overall survival in PC ( Table 2).

Validation of Hub Gene Expression in Clinical Samples
To validate the bioinformatics results, we analyzed the expression of the 10 hub genes in carcinoma and para-carcinoma tissues by qPCR. The results showed that the expression levels of TSPAN1 (p < 0.01), RAB25 (p < 0.05), and ERBB3 (p < 0.01) were significantly upregulated in the PC tissues ( Figure 6), which is consistent with the bioinformatics analysis results.

DISCUSSION
In recent years, the continuous expansion of public databases including TCGA and GEO databases, has provided new means for tumor research and valuable first-hand data for clinicians' evidence-based practice and clinical research. In the study, we identified 38 DEGs in common in PC, which were all up-regulated after integrated analysis of GEO and TCGA-PAAD datasets. Through assessment of functional similarity for common DEGs, we found that the top ten DEGs had the highest average functional similarity, and were considered as the hub genes of 38 DEGs. Then, survival analysis, Cox univariate and multivariate analysis results indicated that the expression of TSPAN1 and ERBB3 was both negatively correlated with the prognosis of pancreatic cancer, especially those with N1 and/or G1. In addition, the expression level of TSPAN1 and ERBB3 was both correlated with patients' survival status, T stage, clinical stage, and TSPAN1 was correlated with histologic grade.
Detection results in clinical samples by qPCR also demonstrated that TSPAN1 and ERBB3 were upregulated in PC tissues. Therefore, TSPAN1 and ERBB3 could be involved in the prognosis of PC. Tetraspanins expressed on the cell surface, are 4transmembrane spanning proteins with a relative molecular mass of 25 × 10 3 -50×10 3 , which are widely expressed in different species including at least 33 different family members of mammals [12,13]. They interact within the molecule or with other proteins to regulate various basic cellular physiological processes, such as cell differentiation, cell migration, and signal transduction. TSPAN1 (Tetraspanin 1) is a member of the Tetraspanins family of proteins whose important feature is their ability to aggregate with one another or various other transmembrane receptors, to become TSPAN-enriched microdomains (TEMs), which are essential in determining the fundamental biological activities such as cell adhesion, proliferation, and cell motility [12,14]. A few studies have reported that TSPAN1 could play a significant role in the progression of PC [15]. A study by Hou et al. showed that transfection with siRNA-targeting TSPAN1 significantly decreased proliferation, increased apoptosis, and reduced migration and invasion of AsPC-1 and PANC-1 cells, which suggested that TSPAN1 was involved in the PC of migration and invasion [16]. Studies performed by Tian J et al. and Zhang X et al. also reached a similar conclusion in Capan2, BxPC3, and SW1990 cells through silencing TSPAN1 [17,18]. Another study has identified TSPAN1 as one of the potential diagnostic markers and was expressed at a high level in PC tissues and cells, which was consistent with our study results [19]. Ma C et al. identified TSPAN1 could be involved in PC progression and act as a critical biomarker for diagnosing and predicting patient survival with PC through weighted gene co-expression network analysis [20]. Even so, the role of TSPAN1 in the prognosis of PC is not yet clear. In the present study, we found that the expression levels of TSPAN1 were significantly upregulated in PC based on TCGA-PAAD and GEO database, and the result was validated in clinical samples. Moreover, PC patients with high TSPAN1     ERBB3 (Erb-b2 receptor tyrosine kinase 3), as a member of the ErbB family, is highly expressed in a variety of common tumors and can induce resistance to a variety of tumor therapeutic drugs when activated [21,22]. In the previous study, the inhibition of ERBB3 by miR-148a was reported to repress the phosphorylation of ERK1/2 and AKT, thereby inhibiting the proliferation and migration of PC [23]. In vivo and in vitro experiments from another study conducted by Liles JS confirmed that high expression of ERBB3 promotes tumorigenesis of pancreatic adenocarcinoma [24]. We found that ERBB3 was highly expressed in PC tissues and negatively correlative with an overall survival rate of PC patients in our study. Moreover, the expression level of ERBB3 was involved in the prognosis and progression of PC including pathology T stage, clinical stage, and vital status.

CONCLUSION
In conclusion, through the integrated analysis of the TCGA-PAAD and the GEO datasets, as well as the validation in clinical samples, high TSPAN1, and ERBB3 expression could act as independent prognostic factors for poor overall survival in PC. In the future, we will explore the potential mechanism of TSPAN1 and ERBB3 in PC by further validation.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The ethics committee of China-Japan Union Hospital of Jilin University. The ethics committee affiliates with China-Japan Union Hospital of Jilin University. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
JS and XZ conceived, supervised, and guided implementation of the experiments; SL and EC conducted the experiments, collected and organized the data, and analyzed the results; SL, and YC wrote the original draft; all the authors revised the manuscript and approved the final version.