Analysis of Clinical Prognostic Variables for Triple Negative Breast Cancer Histological Grading and Lymph Node Metastasis

Background: Triple Negative Breast Cancer (TNBC) is a type of breast cancer with very bad prognosis. Predicting the histological grade (HG) and the lymph nodes metastasis is crucial for developing more suitable treatment strategies. Methods: We present the main clinical and pathological variables to predict the histological grade and lymph nodes metastasis via novel machine learning techniques. These variables are currently being used for prognosis and treatment in medical practice. This analysis was performed using a database of 102 Caucasian women diagnosed with TNBC. The results were cross - validated using random simulations of this dataset. Results: HG was predicted with an accuracy of 93.8% using a list of 6 prognostic variables with significant implications: Ki67 expression, use of Oral contraceptives, Col11A1 expression, Col11A1 score, E - cad truncated and Tumor size. The lymph nodes metastasis was predicted with an accuracy of almost 85% using only 6 prognostic variables: Vascular invasion, Tumor size, Perineural invasion, Age at diagnosis, Ki67 expression, and Col11A1 score. This analysis also served to establish the median signatures of the groups with and without lymph node metastasis, and proved the existence of a kind of small - size tumors (around 2.15 cm) with lymph node metastasis but not showing vascular and perineural invasions and higher protein Col11A1 score. Besides, these signatures proved to be very stable. Conclusions: The additional information conveyed by the prognostic variables found in these two classification problems provides new insight about the genesis and progression of this disease and can be used in medical practice to improve decisions in patient diagnosis and further treatment.


Background
Breast cancer is a very heterogeneous disease. This term includes a variety of entities with distinct morphological features and clinical behaviors. For a long time, breast tumors have been classified according to their morphological features (histological type and grade) to ascertain prognostic outcome in patients.
Subsequently, molecular markers were used to provide additional predictive power.
Triple Negative Breast Cancers (TNBC) refers to any breast cancer characterized by the absence of Estrogen Receptors (ER), Progesterone Receptors (PR) and Human Epidermal Growth factor 2 receptors (HER2). A correct classification of TNBC samples is important from a clinical and therapeutic point of view for deciding treatment strategies, since TNBC are resistant to targeted therapies [1,2]. Besides, statistical analyses have shown that TNBC accounts for approximately 15%-25%of all breast cancers [3].
Recently, a molecular classification of tumors based on gene expression profiles was proposed [4] and served to define five different subtypes of breast cancer that were not previously detected using traditional histo-pathological methods [5]. This classification includes the basal-like tumors group which are defined by one of the following conditions [6]: (1) the lack of ER, PR, and HER2 expression; (2) the expression of one or more high-molecular-weight/basal cytokeratins (CK5/6, CK14); (3) the lack of expression of ER and HER2, in conjunction with the expression of CK5/6; and (4) the lack of expression of ER, PR, and HER2 in conjunction with the expression of CK5/6. Among these four cases two match with definition of TNBC. Also, from a morphological point of view basal-like and triple negative breast cancers share a predominance of high histological grades. The analysis of gene expression profiles showed a 77% overlap between TNBC and the intrinsic basal-like subtype, but TNBC also includes some special histological types such as medullary and adenoidcystic carcinoma with low risks of distant recurrence [2,4,7,8].
The treatment options for TNBC are adjuvant chemotherapy and radiotherapy.
Unfortunately, response to chemotherapy does not correlate with overall survival. In addition, recurrences are observed in TNBC during the first and third years after treatment, and most deaths take place in the first five years. The survival decreases after the first distant metastatic event [9]. Therefore, in this heterogeneous group of tumors, new identification and classification techniques are necessary to better predict diagnosis and prognosis in order to establish appropriate therapies and improve patient survival [10].
The histological grade (HG) of the TNBC samples is used to decide the treatment, and it is commonly established according to the Nottingham Histological Score system (the Elston-Ellis modification of Scarff-Bloom-Richardson grading system) [11][12][13]. This system is based on the ability of the tumor to form structures similar to the ducts where the tumor is originated, on the similarity between the cancer cells and the original benign cells, and finally on their proliferating activity. The cells and tissue structure of breast cancer are histopathologically examined to determine how aggressive the cancer is. Lower grade tumors with a better prognosis can be treated less aggressively and have a better survival rate. Higher grade tumors are treated more aggressively causing adverse effects due to more aggressive medications. Therefore, the histological grade assignment plays important role in deciding treatment options of TNBC and also in prognosis. The main variables involved in this grading system are Mitotic count, Nuclear Pleomorphism, and Tubule Formation. The Mitotic count score depends on the field diameter of the microscope used by the pathologist. In the present case, it was established by counting how many mitotic figures are seen in 10 high power fields [14,15]. The Nuclear Pleomorphism score increases with the variation of size and shape of cells, from small nuclei to larger cells with vesicular nuclei [16]. Finally, the Tubule formation decreases with the percentage of tumor area forming glandular/tubular structures [17].
The aim of this research is to provide the main pathological and immuno-histochemical variables that have the greatest predictive accuracy for the most aggressive TNBC histological grades 2 and 3 (named HG2 and HG3) and of lymph nodes metastasis. For that we have used a cohort of 102 Caucasian women diagnosed at Hospital Universitario Central de Asturias (Spain) with TNBC, 96 of them with prescribed histological grade, and 72 of them controlled for lymph nodes metastasis.
The methodology used in this paper is based on machine learning techniques and have been successfully previously applied in the prediction of treatment response in Hodgkin Lymphoma [18] and in addressing different Chronic Lymphocytic Leukemia decision-making problems [19] using clinical data. In the present case, the histological grade was predicted with a leave-out-one-cross-validation (LOOCV) accuracy of almost 64 and the lymph node metastasis with that of 84%. Besides, we provide analysis of the confusion matrix corresponding to the optimum classifier and different associations of prognostic variables with high predictive accuracy that serve to appraise the uncertainty of the corresponding prediction problems and to better understand the genesis of this disease [20,21].

Methods
The methodology presented here aims at assessing the histological grade of new TNBC incoming samples and understanding the main prognostic variables involved in lymph nodes metastasis. The aim of this analysis is also to provide clinicians with expert systems to assist medical decisions.

Dataset Description
A cohort of 102 Caucasian women diagnosed in Hospital Universitario Central de Asturias (Oviedo, Spain), with TNBC and ages between 30 and 94 years were enrolled in this study, which was developed in accordance with the Helsinki Declaration of 1975. This study was approved by the ethics committee (IRB approval 193/17) with the patient informed consent.
Tumor samples were obtained from surgical resection. Samples were fixed in 10% formaldehyde and paraffin embedded, then cut 4μm thick, mounted on treated slides, and stained with Hematoxylin and Eosin (H&E) stain. Finally, these sections were studied and photographed at two different resolutions (100X and 400X) using an Olympus light microscope. Most of the cancers in this cohort were classified as histological grades 2 and 3. The clinical and pathological characteristics of the cohort are provided in the supplementary material (Tables 1 and 2). The survival time in this cohort has a median of 40 weeks and lower and upper quartiles of 25 and 61 weeks respectively.
The TNBC samples are categorized into HG2 when the total score falls between 6 and 7 points and HG3 when it falls between 8 and 9. In our database, 75% of the samples belong to the HG3 group (see table 2). Besides, other variables used by the pathologists are the TNM stage that takes into consideration the Tumor size (T) and the presence of any lymph Nodes metastases (N) or distant organ Metastases (M); the Vascular and Perineural invasion that indicates the presence or absence of tumor cells inside the vessels and nerves, the Nipple and/or skin invasion, and also the Necrosis. The tumor size in our cohort varies between 0.1 and 6.5cm with a median size of 2 cm. Different immuno-histochemical variables were also monitored due to its importance in predicting TNBC prognosis and treatment response [22][23][24][25][26][27][28][29]: the hormone receptor status (ER, PR and Androgen Receptor-AR); HER2, Ki67, Bcl2, p53, CK5/6, CK14 and Col11A1 expressions. Table 3 shows the list of all the clinical variables used in this study, together with their sampling frequency.
It can be observed that all immuno-histochemical variables are sampled in100% of the samples. Variables with sample frequencies lower than 100 are imputed. Perineural invasion was the only pathological variable that needed to be imputed (sampled in 98% of the samples). In the case of clinical characteristics, Menopause is the only variable that has been determined on all samples. On the other hand, the Histological Grade (HG) and the lymph Nodes metastasis (N) will be used for the class assignment in the two-different machine learning classification problems that are analyzed in this paper.

Machine Learning Methodology
The machine learning methodology used in this paper is described in figure 1. The first step consists in pre-processing the database and reading the different clinical variables of the samples that are involved in the class assignment for the different classification problems (histological grade and lymph nodes metastasis).  The discriminatory power of the different variables is established according to their Fisher's Ratio (FR). In a binary classification, the Fisher's ratio of the attriute j is defined as: where μ ji is a measure of the center of mass of the probability distribution of the attribute j in class i, and σ ji is a measure of its dispersion within this class. Discriminatory attributes correspond to higher Fisher's ratios since they have a low intra-class dispersion (intra-class homogeneity) and high inter-class distance that accounts for the separation between the centers of the corresponding prognostic variable distributions.
The algorithm used for prediction and finding the minimum-size list of prognostic variables is composed of 4 steps.

•
The first one is the variable ranking and selection. Attributes are ranked decreasingly according to their Fisher's ratio values. The attribute with the highest score is the main prognostic variable for the class discrimination. The algorithm finds the minimum size list of prognostic variables with the optimum accuracy by adding iteratively attributes with smaller Fisher's ratios in order to span high frequency details of the class discrimination [30][31][32]. The set with the optimum accuracy and minimum size is therefore selected. The accuracy of the different lists is based on Leave-One -Out Cross-Validation (LOOCV) via a simple distancebased classifier built with the reduced set of discriminatory variables [18,19]. LOOCV is a well-established method in which a single sample from the original dataset serves as the validation data (sample test), and the remaining samples as training data. The class assignment is based on a nearest-neighbor classifier in the reduced base, that is, the class with the minimum distance in the reduced base to the sample test is assigned to the sample test. The average LOOCV predictive accuracy is calculated by iterating over all the samples. This algorithm serves to find the Small-Scale Signature with the highest LOOCV predictive accuracy, which provides an estimation of the predictive accuracy of new incoming samples.

•
The second step is the Random sampling of prognostic networks to find other different networks of highly discriminatory prognostic variables. The existence of these networks has been explored in [20,21]. The prior sampling probability of a prognostic variable is considered to be proportional to its Fisher's ratio. That way the most discriminatory variables are preferentially sampled. After the sampling has been accomplished, the most discriminatory networks are determined, and the posterior sampling frequencies of the main prognostic variables involved in these networks are analyzed.
• The third step is the Stability and ROC Analysis. The stability of these signatures is also examined by random 75/25 (75% of samples used for training and 25% for validation) hold-out experiments. The aim of the hold-out procedure is two-fold checking the stability of the predictive accuracy of the smallscales signatures found via LOOCV when the number of training samples is decreased. In this case, the minimum-scale signature is read in the training data set for building the nearest-neighbor classifier and it is applied to establish the small-scale signature predictive accuracy in the validation set. The cumulative distribution function of the smallscale predictive accuracies found in different holdouts is finally presented and serves to account for the variability in its predictive accuracy with partial information. A statistical analysis is performed providing theinter-quartile, standard deviation, mean and median bounds that could be expected in an independent dataset. These cross-validation techniques serve to take into account the effect of the limited size of the biomedical datasets used for training and validation, in order to predict how these signatures would generalize for new incoming samples, since in real problems it is very difficult to find a database with a similar design, to perform an independent validation of the results. In the ROC analysis (see for instance Park et al, 2004) [33]. We provide various metrics of the diagnostic ability of the most predictive signatures derived from the www.openaccesspub.org JMID CC-license DOI : 10.14302/issn.2641-5526.jmid-18-2488 Vol-1 Issue 1 Pg. no.-23 confusion matrix: sensitivity (true positive rate or the probability of detection) and specificity (true negative rate). Finally, the correlation network is built using the minimum spanning tree via Kruskal's algorithm [34] and the Pearson Correlation coefficient among the most discriminatory variables. The weights of this connected graph are the correlation coefficients between the corresponding prognostic variables. The head of the tree is the most discriminatory variable of the corresponding classification problem, whereas the branches contain the variables that are weakly correlated to the headers. The analysis of the correlation networks between prognostic variables might be of help for the physicians to understand the genesis of the disease.
The implicit idea behind this algorithm is that the classification problem becomes linearly separable when the most discriminatory prognostic variables are selected [35,37]. This is a powerful idea based on the principle of parsimony, which should be used in all disciplines. Besides, when these accuracies are low, other nonlinear classification algorithms (black-box models) should be used instead. If despite all these trials, no improvement in the accuracy is observed, then the data set (data and class) is noisy or that the variables do not convey enough information to answer the proposed question [20,21].

Histological Grade Prediction
The aim of this analysis is to establish the discriminatory power of the immuno-histochemical, pathological and clinical variables for HG prediction. For that purpose, we did not use any of the three pathological variables involved in the Scarff-Bloom-Richardson definition: Mitotic count, Nuclear pleomorphism and Tubule formation. This analysis established the optimum variables networks for the HG prediction, and showed how the clinical and pathological variables influence the disease development, particularly the patients' daily habits (oral contraceptives intake, tobacco smoking (or tobacco consumption) and alcohol consumption). We had at disposal the histological grade of 96 TNBC samples: 21 samples in HG 2 and 75 samples in HG3.
The variables used in this classification problem are presented in Table 4, ranked by their discriminatory power given by their Fisher's ratios in decreasing order.
The maximum Fisher's ratio (FR) is 1.28 and corresponds to Ki67 expression, followed by AR expression with a Fisher's ratio of 1.03, and Oral contraceptives with 0.50. The rest of the variables have a lower FR and can only expand high frequency details of the classification problem [37]. In this case, using the most discriminatory variable (Ki67 expression) we have obtained a LOCCV predictive accuracy of 72.9%. The accuracy has increased to 81.3by adding the second discriminatory variable (AR expression), and up to 85.4% by adding Oral contraceptives. The maximum accuracy (90.6%) is obtained using the list containing the 8 first prognostic variables, which is the minimum-size list in this case. This table also shows their mean and standard deviation within each class (HG2 and HG3) and the LOOCV predictive accuracies of the corresponding ranked lists of prognostic variables, as explained in the machine learning algorithm description, and the minimum-size signature with the highest predictive accuracy. Fisher's ratio can be interpreted as a prior discriminatory power of the variables considered individually, while the LOOCV accuracy is the posterior discriminatory power of these variables working in synergy. Table 5 shows the optimum classifier found by the random sampler with an accuracy of 93.8% using a list of only 6 prognostic variables: Ki67 expression, Oral contraceptives, Col11A1 score, E-cad truncated, Tumor Size, and Col11A1 expression and other networks of high discriminatory prognostic variables with a LOOCV predictive accuracy higher than 92%, together with their corresponding stability analysis and ROC analysis. Besides, these high predictive classifiers are very stable, with median accuracies of 91.7% and mean accuracies slightly lower, a low inter-quartile range (8.3) and the standard deviation (5.5) of the predictive accuracy. Subsequently, the ROC analysis shows a very high sensitivity (97%) and specificity (76%).
Besides, we provide a simple linear regression formula to perform a fast and useful estimation of the histological grading: This regression formula has a low RMS error of 0.2 , that is, estimated histological grades lower than 2.3 belong almost surely to HG2. This method complements the HG assessment provided by the Nottingham grading system in appraising this important decision problem concerning the patient treatment and prognosis.

Lymph Nodes Metastasis Prediction
This classification problem tries to predict the presence or absence of lymph nodes metastasis, without making use of the HG variable, nor any of the pathological variables involved in the Nottingham score, and unraveling other prognostic variables at disposal that could be linked to this important problem in TNBC prognosis. In this case, we have at disposal 72 samples where 27 of them had one or two lymph nodes. Table 7 shows the information concerning the ranked lists of prognostic variables used in the lymph nodes metastasis prediction problem. The maximum Fisher's ratio in the Lymph Nodes Metastasis prediction is 0.45 and corresponds to Vascular invasion, followed by Tumor Size (0.19), and Perineural invasion (0.14), meanwhile the rest of variables show a very low FR (close to zero). Due to these low Fisher's ratios, it is expected that this classification problem will be harder in terms of achieving a high predictive accuracy. The maximum accuracy (75%) is provided by the Vascular invasion alone. Then, the LOOCV accuracy drops to 73.61% considering the list of the first seven most discriminatory variables: Vascular invasion, Tumor Size, Perineural invasion, Age First Child, CK14 expression, CK5/6 expression, and E-cad expression. This accuracy remains the same when we also add to the list the Family history. Table 8 presents the optimum classifier found by the random sampler with an accuracy of 84.72% using a list of seven variables: Vascular invasion, Tumor Size, Perineural invasion, Family history, Age at diagnosis, Ki67 expression, and Col11A1 score. We also present and other networks of high discriminatory prognostic variables with a LOOCV predictive accuracy higher than 83%. Their stability analysis shows that the median accuracies vary from 78% to 83.3%, the mean accuracies from 79% to 81.7%, the inter-quartile range from 5.5% to 11% and the standard deviation is around 5 to 8%. In addition, the ROC rates prove a good ability of diagnostic of all the classifiers with sensitivities between 78% and 81% and specificities between 84% and 89%. Table 9 shows the median, mean, interquartile range (IQR) and the standard deviation of the predictive variables of the optimum classifier in the different groups of the confusion matrix. The confusion matrix of the optimum classifier is:  Table   6.    Table 8. Lymph nodes metastasis prediction. Other high discriminatory networks of prognostic variables with predictive accuracies greater than 83% ad their respective stability and ROC analysis. Table   9.   figure 3 shows the correlation network for the Lymph Nodes prediction problem and shows the relationships between the most discriminatory variables.

Discussion
Regarding the most discriminatory prognostic variables of the histological grade, it is interesting to note that women in the HG2 group did not have any Oral contraceptives intake. Population studies aimed at exploring associations between oral contraceptive use and cancer risk have shown that the risks of endometrial and ovarian cancer appear to be reduced with the use of oral contraceptives, whereas the risks of breast, cervical, and liver cancer appear to be increased [30]. Other relevant values related with patients in the HG2 group with respect to the HG3 group are: higher Age at diagnosis, Lactation habits, and number of Pregnancies (an average of 2.3 children for women in HG2 group vs 1.7 in HG3 group); lower tumor size (Tsize) and Tobacco smoking; and lower values of the immuno-histochemical variables, except for the AR (Androgen Receptor) expression. These results provide new insights concerning the clinical features and habits that might influence a better prognosis.
The best prediction of the HG (disregarding the Nottingham grading system) was performed by a list of only 6 prognostic variables: Ki67 expression, Oral contraceptives, Col11A1 score, E-cad truncated, Tumor Size, and Col11A1 expression, with a very stable accuracy (93.8%), sensitivity (97.0%) and specificity (76.0%). Once again, the importance of Oral contraceptives in the HG prediction is highlighted. All these variables are crucial for breast cancer diagnosis and treatment [11][12][13][14][22][23][24][25][26][27][28][29], but their combination has never been explored for HG assignment. The analysis of other equivalent networks has confirmed that Tumor size, Ki67 expression, Oral contraceptives, E-cad truncated, Col11A1 expression, p53 expression and Age at diagnosis are the most important prognostic variables in this prediction problem, and should be compulsory monitored to establish this important medical decision. The role of Ki67 expression as a prognostic marker in breast cancer has been also outlined by [39] in a large-base cohort study, concluding that it is associated with common histopathological parameters and as an additional independent prognostic factor for disease free and overall survivals. The relationship with the epithelial /mesenchymal (EMT) transition, expressed by the presence of ColA11, the truncated E-Cadherin and with the oral contraceptives intake are two main novelties of this analysis, since the samples with null Oral contraceptives intake fall in the HG2 group. Obviously, these values only provide general trends due to the possible presence of behavioral outliers.
The correlation network shows two main branches connecting Ki67 expression to Tumor size and AR expression, both with low correlation coefficients. Two branches start from AR through CK14 expression and E-cad truncated, both weakly correlated to the AR node with negative coefficients. In the tumor size branch, all the variables seem to be related to habits and clinical features, Age at diagnosis, Menopause, Tobacco smoking, Oral contraceptives, etc.
The low correlation among all these variables implies that they should be considered as independent prognostic factors. This graphic also confirms the strong correlation between the three representations of the Col11A1 protein. The role of the Androgen Receptor in breast cancer has been reviewed by [40], concluding that AR expression might play a role during tumor progression. Although histologic grading has become widely accepted as a powerful indicator of prognosis in breast cancer, no connections with other biomarkers has been made relevant. In our opinion this is one major findings of this research that will serve to improve the actual methods of prognosis.
In the case of the lymph nodes metastasis, the most important variables are Vascular invasion, Tumor size, Perineural invasion, Family history, Age at diagnosis, Ki67 expression and COl11A1 score, with a high predictive accuracy (84.7%), sensitivity (78.0%) and specificity (89.0%). All the samples presenting metastasis have positive Vascular invasion (vs almost null in the non-metastasis group), a higher Tumor size mean of 2.74 cm (vs. 1.92 cm), positive Perineural invasion, highest age for first child (25.78 vs 24.62) and higher CK14 and CK5/6 expressions. The analysis of the equivalent networks with accuracies higher than 83% show high stability and a good ability for diagnostic. All these signatures share the Vascular invasion and Tumor Size as leading prognostic variables. Likewise, Col11A1 score, Perineural invasion and/or Necrosis also appear in these networks. The ROC analysis established Vascular invasion and Tumor size as the main differences between the true positive (TP) and true negative (TN) groups, and also showed the existence of a group of TNBC cancers with absence of Vascular and Perineural invasion that presents lymph nodes metastasis (FN group). This kind of cancers have a lower median Tumor size (around 2.15 cm) than the FP group, and a median Col11A1 score value of 2. This knowledge is very important to improve the prediction of Lymph Nodes Metastasis at diagnostic. The correlation network shows one main branch starting from Vascular invasion and linking to Alcohol Consumption and other personal habits (Tobacco consumption) and clinical features (Age at First Child, and Tumor Size). Again, the correlations coefficients among these variables are very low.
Interestingly, the immuno-histochemical variables appear at the base of the tree, indicating their lower importance in the metastasis prediction.
Finally, an interesting remark is that the HG and lymph node metastasis predictions share the Tumor size, Ki67 expression, and Col11A1 score as high discriminatory prognostic variables, confirming a certain link between both problems. Besides, Col11A1 score has a much higher predictive power than the other two representations of this protein. It is not surprising the relationships with vascular and perineural invasions, as well as with the tumor size or ki67 expression, but this analysis provides novel relationships with the expression of ColA11 protein and also with the patient's age.

Conclusions
This study was dedicated to the HG and the lymph nodes metastasis prediction, crucial for developing more suitable treatment strategies. As results, we present the main clinical and pathological variables and their correlation networks for both prediction problems, via novel machine learning techniques. These variables are currently being used for prognosis and treatment in medical practice. HG was predicted with an accuracy of 93.8% using a list of 6 prognostic variables with significant implications: Ki67 expression, use of Oral contraceptives, Col11A1 expression, Col11A1 score, E-cad truncated and Tumor size. The lymph nodes metastasis was predicted with an accuracy of almost 85% using only 6 prognostic variables: Vascular invasion, Tumor size, Perineural invasion, Age at diagnosis, Ki67 expression, and Col11A1 score. This analysis also served to establish the median signatures of the groups with and without lymph node metastasis, and proved the existence of a kind of small-size tumors (around 2.15 cm) with lymph node metastasis but not showing vascular and perineural invasions and higher protein Col11A1 score. Besides, these signatures proved to be very stable. The additional information conveyed by the prognostic variables found in these two classification problems provides new insight about the genesis and progression of this disease and can be used in medical practice to improve decisions in patient diagnosis and further treatment.
We expect that the conclusions attained by this analysis will contribute to improve the understanding, diagnosis and prognosis of this important type of heterogeneous cancers. This methodology could be also used to predict treatment response when this kind of information is available, as we have shown in the case