Keywords:
Breast cancer
Data mining
Principal component analysis
Random forest tree
Blood routine analysis
GENE-EXPRESSION DATA
PREDICTION
BIOMARKER
Abstract:
Breast cancer is the most common cancer among women worldwide. Procedures used to diagnose it include mammography, breast ultrasound, biopsy, breast magnetic resonance imaging, and blood tests such as a complete blood count. Detecting breast cancer at an early stage plays an important role in diagnostic and curative procedures. This paper aims to develop a predictive model for detecting breast cancer from blood sample data comprising age, body mass index (BMI), glucose, insulin, homeostasis model assessment (HOMA), leptin, adiponectin, resistin, and chemokine monocyte chemoattractant protein 1 (MCP-1). The two main challenges in this process are identifying biomarkers and achieving accurate disease prediction. The proposed methodology applies principal component analysis in a distinctive way, followed by a random forest prediction model, to discriminate between healthy subjects and breast cancer patients. The approach systematically extracts components with high communalities, each a linear combination of the input attributes, as principal axis elements. The iteratively extracted principal axis elements, combined with a minimum number of input attributes, predict the disease with higher classification accuracy and with increased sensitivity and specificity. The results show that the proposed approach achieves better predictive performance than previously reported results by selecting the relevant extracted principal axis elements and attributes, which improve the classifier's performance measures.
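The general idea of combining principal components with input attributes before a random forest can be illustrated with a minimal sketch. It is not the paper's iterative extraction procedure, which the abstract does not specify. It assumes scikit-learn, uses synthetic data in place of the blood sample measurements, and all function names, the number of components, and the forest size are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def make_synthetic_data(n_samples=120, n_features=9, seed=0):
    """Synthetic stand-in for nine blood-derived attributes (assumption, not real data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)   # 0 = healthy, 1 = patient
    X = rng.normal(size=(n_samples, n_features))
    X[:, 2] += 1.5 * y                        # make two columns weakly informative
    X[:, 7] += 1.0 * y
    return X, y


def build_model(n_components=2, seed=0):
    """Standardize, append principal components to the original attributes, then classify."""
    features = make_union(
        FunctionTransformer(),                # identity: keeps the scaled input attributes
        PCA(n_components=n_components),       # adds linear combinations of the attributes
    )
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    return make_pipeline(StandardScaler(), features, forest)


def evaluate(model, X, y, seed=0):
    """Cross-validated accuracy, sensitivity, and specificity."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    pred = cross_val_predict(model, X, y, cv=cv)
    tn, fp, fn, tp = confusion_matrix(y, pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / len(y),
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
    }


if __name__ == "__main__":
    X, y = make_synthetic_data()
    print(evaluate(build_model(), X, y))
```

Fitting the scaler and PCA inside the pipeline means they are re-estimated on each training fold, so the reported sensitivity and specificity are not inflated by information from the held-out fold. Selecting which components and attributes to keep, as the abstract describes, would be an additional step on top of this sketch.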