Identification of a Novel Epithelial–Mesenchymal Transition Gene Signature Predicting Survival in Patients With HNSCC

Head and neck squamous cell cancer (HNSCC) is one of the most common types of cancer worldwide. There have been many reports suggesting that biomarkers explored via database mining plays a critical role in predicting HNSCC prognosis. However, a single biomarker for prognostic analysis is not adequate. Additionally, there is growing evidence indicating that gene signature could be a better choice for HNSCC prognosis. We performed a comprehensive analysis of mRNA expression profiles using clinical information of HNSCC patients from The Cancer Genome Atlas (TCGA). Gene Set Enrichment Analysis (GSEA) was performed, and we found that a set of genes involved in epithelial mesenchymal transition (EMT) contributed to HNSCC. Cox proportional regression model was used to identify a four-gene (WIPF1, PPIB, BASP1, PLOD2) signature that were significantly associated with overall survival (OS), and all the four genes were significantly upregulated in tumor tissues. We successfully classified the patients with HNSCC into high-risk and low-risk groups, where in high-risk indicated poorer patient prognosis, indicating that this gene signature might be a novel potential biomarker for the prognosis of HNSCC. The prognostic ability of the gene signature was further validated in an independent cohort from the Gene Expression Omnibus (GEO) database. In conclusion, we identified a four-EMT-based gene signature which provides the potentiality to serve as novel independent biomarkers for predicting survival in HNSCC patients, as well as a new possibility for individualized treatment of HNSCC.


INTRODUCTION
Head and neck cancer is the sixth most common types of cancer, with about half a million new cases annually diagnosed worldwide [1], among which 350,000 individuals die of it [2]. Although local control rate and quality of life have improved for head and neck squamous cell cancer (HNSCC) owing to advances in surgical techniques and comprehensive treatment techniques, overall survival (OS) has not increased significantly in recent decades. Moreover, the 5-years survival rate of patients with this disease is only 40-50% [3]. In recent years, age, clinical stage, and smoking status which are characteristics emerging as important contributors to the clinical outcome might help improve survival prediction for patients afflicted by HNSCC [4][5][6]. However, due to the complex molecular mechanisms underlying cancer regulation, the conventional clinical information has limited predictive ability.
Increasing evidence has revealed the important clinical significance of mRNA expression in various pathological and physiological processes of multiple tumor histotypes, including HNSCC. For instance, decreased calpain six expression has been known to be associated with tumorigenesis and poor prognosis of HNSCC [7]. Furthermore, FcGBP expression was upregulated by HPV infection and correlated with longer survival time of HNSCC patients [8]. However, compared with single-gene biomarkers, tumor signatures including several genes have been identified, which might be better choices to facilitate clinical application, provide insights into cancer progression, as well as reveal potentially new therapeutic targets [9,10]. Thus, it is necessary to establish an expression-based gene signature for predicting survival of HNSCC patients for effective clinical decisions making with respect to optimal treatment regimen.
Metastasis is a complex, highly inefficient, but deadly process that has been under intense investigation in hopes to eliminate distant spread of metastasis and reduce cancer-associated mortality. A key event in promoting stationary tumor cells to migrate and invade is the epithelial mesenchymal transition (EMT) [11]. Specific tumor cell populations with EMT are associated with more aggressive tumor phenotypes and worse outcomes [12,13]. Therefore, understanding the relationship between EMT and tumor is crucial for elucidating the underlying mechanism of tumorigenesis. Gene Set Enrichment Analysis (GSEA) is not concerned with a limited number of different genes that have significantly changed, but with whether the expression of these detected genes exhibit a common expression pattern in the defined functional groups, and interprets biological information to elaborate its biological significance from another perspective. In the present study, we tried to identify the relationship between the metastatic cascade system and HNSCC by GSEA analysis [14]. As expected, EMTrelated risk signature could independently identify patients afflicted by HNSCC with high-risk and poor prognosis.

Data Acquisition
We downloaded the mRNA expression profiles and clinical information of head and neck squamous cell cancer patients from the TCGA database (https://cancergenome.nih.gov/). A total of 546 samples involving tumor tissue samples and adjacent noncancerous tissues participated in the study, and their corresponding clinical data including age, sex, TNM classification, OS status, as well as disease-free survival (DFS) status were also examined. The general clinical features are listed in Table 1. As an external validation cohort, the independent data set GSE27020 based

Gene Set Enrichment Analysis
GSEA (http://www.broadinstitute.org/gsea/index.jsp) was performed to explore whether the gene sets identified between the two groups showed a significant difference [15,16]. We conducted GSEA to investigate the differences of biological pathways associated with tumorigenesis and progression between adjacent noncancerous tissue and HNSCC samples by selecting the hallmark gene sets as the reference gene set file.
Normalized p values (p < 0.05) were set as threshold to determine which functions for further investigation.

Gene Ontology Analysis
We used DAVID (the Database for Annotation, Visualization and Integration Discovery v6.8, https://david.ncifcrf.gov/) to analyze the TCGA-HNSCC cohort, to obtain the enriched pathways for differentially expressed genes between four-mRNA-low-risk and high-risk groups [17].

Statistical Analysis
Using gene expression profiles as the original data, each gene was normalized by log2 transformed values for further analysis. Univariate Cox regression analysis was used to calculate the association between mRNA expression levels and OS rate of patients. mRNAs were considered significant if their p-values were less than 0.05. Candidate genes were fitted in a stepwise multivariate Cox proportional regression model to identify predictive models with optimal interpretation and valuable information. We used the R package "survival" to build a risk score model. The formula for the risk score is described below: Risk score expression of gene 1*β1+ expression of gene 2*β2+. . .+expression of gene n*βn. The filtered mRNAs were classified into risky [hazard ratio (HR) >1] and protective (0 < HR < 1) types. All patients were classified into a high-risk and a low-risk group according to the median risk score. Additionally, the correlations between the risk score and clinical features of HNSCC patients were analyzed by using Chi-square test. Time-dependent receiver-operating characteristic (ROC) curves and area under the ROC curve (AUC) values were calculated to measure prognostic accuracy. Afterward, we applied the R package "caret" to randomly divided TCGA-HNSCC patients into two sets (TCGA validation set 1, n 346 and TCGA validation set 2, n 148) at a ratio of 7:3. The prognostic signature was subsequently validated in both internal validation sets and external independent data set GSE27020. All statistical analyses were performed using SPSS 24.0 and GraphPad Prism8 software.

Using GSEA for Preliminary Screening of Genes
Clinical features of TGCA-HNSCC cohort (n 546), and expression data sets of HNSCC patients were obtained from TCGA. GSEA was applied to explore whether the gene sets identified as EMT, G2/M checkpoint, E2F targets, MYC targets V2 and MYC targets V1 showed statistically significant differences between HNSCC samples and adjacent normal tissues. We found that all the five gene sets were significantly enriched with normalized p-values < 0.05 ( Figure 1; Table 2). We then selected the EMT gene set which was of our interest (p 0.013), containing 94 core genes (Supplementary Table  S1) for further analysis.

Identification of EMT mRNAs Related to HNSCC Patient Survival
We first performed a preliminary screening of 94 genes by univariate Cox regression analysis and obtained 12 genes (Supplementary Table S2) with a p-value <0.05. Next, using multivariate Cox regression analysis, we further examined the association between the 12 mRNA expression profiles and patient survival. Subsequently, we confirmed four-mRNA signature (WIPF1, PPIB, BASP1 and PLOD2), with a p-value < 0.001, as an independent prognostic indicator of HNSCC. Filtered mRNAs were classified as risky type (PPIB, BASP1 and PLOD2) with β(cox) was >0 and shorter survival, and a protective type (WIPF1), with β(cox) was <0 and longer survival ( Table 3). Differential expression of four genes in adjacent normal tissues was also investigated compared to HNSCC tissues. We found that these four genes were up-regulated in tumor tissues with significant differences (p < 0.05, Figure 2).

Four-mRNA Signature Was Constructed to Predict the Prognosis of Patients
Based on the linear combinations of the expression level, the formula for prognostic risk rating was established, and the expression level was weighted by the regression coefficients from the multivariable Cox regression analysis. Risk score 0.19805 * expression of PPIB + 0.09904 * expression of BASP1 + 0.15977 * expression of PLOD2−0.23734 * expression of WIPF1. Every HNSCC patient had only one risk score. We calculated the scores and then ranked the patients in order of increased risk scores. Based on the median point (0.994727), we then classified them into high-risk and low-risk groups ( Figure 3A). The distribution and survival status for each patient is shown in Figure 3B. The median survival time of the highrisk and low-risk group was 998 and 1762 days, respectively. In addition, the mortality rate of the high-risk group was 48.6%, whereas the corresponding rate in the low-risk group was 38.7%. The mortality rate of patients with a high-risk score was higher than that of patients with a low-risk score (p 0.0268). According to the univariate Cox regression analysis of OS, compared to the low-risk group, a 1.573-fold increased risk of death (95% CI 1.200-2.062, p 0.001) was determined for the high-risk group. Further, the heat map displayed the expression profiles of four-mRNAs ( Figure 3C). HNSCC patients with an increased risk score showed significantly upregulated expression of high-risk type mRNA (PPIB, BASP1 and PLOD2); in contrast, the expression of protective type mRNA (WIPF1) was down-regulated.

Risk Scores Generated by Four-mRNA Signature as Independent Prognostic Indicators in HNSCC
The results of Chi-square analysis showed that patients in the lowrisk group had better clinicopathological parameters, including tumor grade (χ 2 10.619, p 0.014), neoplasm cancer status (χ 2 6.948, p 0.008), new tumor event after initial treatment (χ 2 5.535, p 0.018), compared with those in high-risk cohort .302, p < 0.001) were still independent prognostic indicators in the multivariate Cox analysis. To assess the capabilities of the four-EMT-based classifier to predict OS of HNSCC, we plotted the ROC curves and the AUC value for the long-time survival at 3 years of our signature was 0.620 ( Figure 4A). In consideration of the role of classical parameters in clinical practice, we combined the EMT-based model and the classical clinical parameters to predict OS of HNSCC, and the AUC value was 0.716, indicating that this model was more accurate than models enrolled in EMT-related signature or clinical parameters solely.

Kaplan-Meier Curves for Survival Predicted Four-mRNA Signature
Kaplan-Meier curves and the log-rank method showed that patients with high-risk scores had poorer OS time (p 0.0008; Figure 4B). Meanwhile, the signature showed great utility in predicting DFS with p-value of 0.0202 ( Figure 4C). Additionally, in TCGA validation set 1 (n 346) and validation set 2 (n 148), the Kaplan-Meier curves displayed significant differences between high-risk and low-risk patients (p < 0.05, Figures  4D,E). We also extracted HNSCC afflicted patients from the GSE27020 dataset (n 109) and applied the same formula to validate the ability of the four-EMT-based classifier predicting OS of HNSCC. As expected, patients in the low-risk group had a significantly longer survival time compared to the high-risk group (p 0.0034; Figure 4F). Univariate Cox regression analysis of OS showed that several clinicopathological parameters were effective predictors of HNSCC survival, including clinical N, clinical M, margin status, lymphangitic invasion and perineural infiltration. Then Kaplan-Meier curve was used to verify the above conclusion, and the results were self-consistent ( Figures 4G-I). According to the curve, patients with lymphangitic invasions and peripheral nerve invasions had a poor prognosis during follow-up.
Stratified analysis for further data mining. As shown in Kaplan-Meier curve, regardless of age (<61 or ≥61), four-mRNA signature in HNSCC patients are stable prognostic markers, due to the poor prognosis of patients with a highrisk score ( Figure 5A). Additionally, when patients were stratified into different subgroups based on gender and HPV P16 status the four-mRNA risk score remained an independent prognostic indicator for male HPV P16-negative ( Figures 5B,C), suggesting that HNSCC may require further investigation. Similarly, when patients underwent radiation therapy during follow-up (with positive alcohol status and absence of perineural invasion), we could use risk scores to predict patient results, which showed that patients in the high-risk subgroups had poor survival (Figures 5D,E,G,H).

Functional Enrichment Analysis of the Four-mRNA Signature in HNSCC
We performed enrichment analysis to elucidate the biological function of the four-mRNA signature target genes. The GO categories comprised of three structured networks: biological processes (BP), cellular components (CC) and molecular function (MF). The top ten enriched functional analysis is shown in Figure 6.  The top enriched biological process was proteolysis (associated with 10 genes); the top enriched cellular component and molecular function were integral component of membrane part (associated with 53 genes) and calcium ion binding (associated with 14 genes), respectively. Therefore, a total of 12 KEGG pathways were enriched by the four-mRNA signature. The significantly enriched KEGG pathways were the cancer-related pathways (associated with seven genes), cytokine-cytokine receptor interaction pathways (associated with seven genes) and neuroactive ligand-receptor interaction pathways (associated with seven genes).

DISCUSSION
Recent studies have shown that clinicopathological features such as gender, age, tumor margin status, and metastasis are not sufficient for accurately predict the prognosis of patients. As a result, more and more mRNAs might be examined as molecular markers at an increasing rate to predict cancer development and prognosis, indicating that its clinical significance needs to be explored [18]. For example, analysis of the prognostic value of a single lncRNA from qRT-PCR array of 84 gastric cancer patient samples found that higher level of BANCR could predict a poor prognosis for GC patients [19]. The high expression of FAM83H-AS1 is involved in the progression of bladder cancer and serves as a prognostic biomarker and potential therapeutic target for patients with bladder cancer [20]. Compared to single biomarker, integrating multiple biomarkers into the aggregation model could improve the prognostic value [21]. In particular, the expression of a single gene could be controlled by a variety of factors, and hence might not provide a strong predictive effect. Therefore, a statistical model was constructed gene signature contains multiple genes, combined with the effect of each component gene prediction, improve forecasting efficiency. This model has been used widely, and is superior to single biomarkers in predicting disease prognosis [22][23][24].
With the development of high-throughput genetic testing technology, we are entering a new era of big biological data [25]. Currently, RNA sequencing or micro array data for gene mutations and expression levels often use Cox proportional hazard regression models to construct new prognostic features [26,27]. In this study, we attempted to apply GSEA using expression data of 57,072 genes of patients from TCGA-HNSCC cohort, and we found significant differences in five biological functions with p-values < 0.05. As mentioned above, we focused on EMT, using GSEA to select genes to predict patient survival rather than exploring a broad range of genes. Univariate and multivariate Cox regression analyses were performed to identify combinations of four genes with predictive values for HNSCC patients, rather than just one gene [21]. Compared to some known predictive biomarkers, the risk profile of this option may have a more targeted and prognostic ability to support positive clinical results and be an effective classification tool for HNSCC patients. In addition, this study used bioinformatics methods to explore the risk characteristics of mRNA and its clinical significance, providing a novel method for mining potential prognostic markers, which not only supports our previous understanding of HNSCC, but also lays the foundation for future studies. It should be noted that, Kaplan-Meier analysis showed that risk score on survival is small and appears even smaller than lymph vascular invasion or perineural invasion. Therefore, more comprehensive and homogeneous datasets are needed to assess the four-mRNA signature before clinical applications. Notably, the stratification analysis found that the four-mRNA signature could generate superior performances for predicting the survival benefit in patients with P16 negative or alcohol or age <61, indicating that the four-mRNA signature could be helpful in predicting HNSCC prognosis, especially in P16 negative or alcohol or age <61 patients. It is highly significant given that P16 positive tumors typically have a good prognosis regardless of stage or high-risk pathologic features, this prognostic mRNA signature classifier for P16 negative-related HNSCC may help clinicians to pinpoint those HNSCC patients at high risk of unfavorable OS. EMT is recognized to be important in cancer cell migration/metastasis. For example, to metastasize, disseminated mesenchyme-like tumor cells of well-differentiated carcinomas must regain their epithelial function (invasion and metastasis) [28,29]. The disruption of epithelial-cell homeostasis leading to aggressive cancer progression is correlated with the loss of epithelial characteristics and the acquisition of a migratory phenotype, known to EMT, and is considered to be a crucial event in malignancy [30,31]. EMT and mesenchymal-related gene expression are associated with aggressive breast cancer subtypes and poor clinical outcome in breast cancer patients [32,33]. Recently, some studies have reported the occurrence and development of gene expression induced EMT and metastasis in HNSCC [34,35]. Additionally, several studies have been conducted which have predicted that in HNSCC patients, survival is associated with EMT. For instance, TEAD4 might act as a putative oncogenic gene by enhancing the proliferation, migration, and invasion of HNSCC cancer cells [36]. However, the EMT gene signature used to predict HNSCC prognosis have not been established. In the present study, we reported an EMT gene signature of four genes (PPIB, BASP1, WIPF1 and PLOD2), identified using bioinformatics methods which demonstrate the prognostic value of HNSCC. Of the four candidate genes, three have positively correlated coefficients in the prognostic model and correlated with poor survival. PPIB, called cyclophilin B, regulates the protein conformation of its substrate through prolyl cis-trans-isomerization in endoplasmic reticulum (ER) lumen and nucleus, and possesses multiple functions, including chemotaxis and prolactin signaling [37][38][39]. Additionally, it is a novel wild-type p53 (p53WT)-inducible gene [40]. In cancer biology, PPIB is associated with malignant progression and regulation of genes involved in the pathogenesis of gastric cancer [41],hepatocellular carcinoma [42], pancreatic cancer [43] and is considered as a candidate biomarker for these cancers. However, the exact molecular mechanism by which PPIB leads to cancer cell survival is unclear. BASP1 (brain acid soluble protein 1), was initially isolated from 23,000 brain extracts. BASP1 comprises an effector domain which dynamically couples to the plasma membrane and is also involved in neuronal sprouting process [44]. Recently, some reports have identified BASP1 as a transcriptional co-suppressor for WT1 protein [45]. WT1 can promote tumor cell proliferation, inhibition of apoptosis, and interacts with cytoskeletal proteins to promote migration, invasion and angiogenesis [46]. WT1 gene is expressed highly abnormally in a variety of tumors, including breast [47], thyroid [48,49], non-small cell lung cancers [50], and HNSCC [51], and is considered to have the characteristics of an oncogene. PLOD2, also called LH2, is the key enzyme mediating the formation of stabilized collagen cross-links [52]. Gain and loss of function studies show that LH2 hydroxylated telopeptide lysine residues on collagen, shifted the tumor stroma toward higher levels of hydroxylysine aldehyde-derived collagen cross-links (HLCCs), lower levels of lysine aldehyde-derived cross-links (LCCs), increased tumor stiffness, and enhanced tumor cell invasion and metastasis [53,54]. PLOD2 is overexpressed in different cancers such as bladder cancer [55], renal cell carcinoma [56], and oral carcinoma [57], and is closely associated with poor prognosis [58]. All these three genes are associated with EMT in cancer. In contrast, WIPF1 confers an onco-protective effect. WIPF1, also known as WIP, the WASP-interacting protein (WIP) drives the oncogenic activity of mutant p53. WIPF1 knockdown in glioblastoma and breast cancer cells expressing mtp53 greatly reduced the proliferation and growth capacity of cancer stem cell-like cells and reduced the expression of cancer stem cell-like markers (CD44, CD133 or TAZ/YAP). WIPF1 knockdown inhibits the growth of glioblastoma tumor cells and breast cancer cells in vivo [59]. Since EMT is affected in HNSCC, the targeted genes might be useful in controlling cancer. Finally, the prognostic value of the four genes in HNSCC still needs further experimental studies to elucidate biological function.

CONCLUSION
In conclusion, we identified and verified a four-gene risk signature related to EMT that can predict the survival of HNSCC patients, wherein a high-risk score shows poor patient prognosis, suggesting that this gene signature could be a potential new biomarker for HNSCC prognosis. Further investigation of these genes will provide theoretical guidance for basic research and for the clinical treatment of HNSCC.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
WX, CZ, LJ, DP, LZ, CZ Writing-original draft preparation: WX; Methodology: CZ and LJ. Formal analysis and investigation: DP; Writing-review and editing: LZ and CZ; All authors critically reviewed the manuscript in its entirety and approved the final content.