Breast cancer is one of the leading causes of cancer-related mortality worldwide, and early, accurate diagnosis is key to improving patient survival and treatment outcomes. Although medical imaging techniques such as mammography and ultrasound are widely used, current computer-aided diagnostic approaches often fail to exploit the spatial interdependencies and complementary information across imaging modalities. To overcome this limitation, this paper introduces a multi-modal breast cancer detection framework that represents mammogram and ultrasound images as graphs and leverages Graph Neural Networks (GNNs) to learn complex spatial relationships among regions of interest. Two architectures are explored: a Spatial-Temporal Graph Neural Network (ST-GNN) designed to capture contextual relationships along the spatial dimension, and an LSTM-enhanced GNN (LSTM-GNN) designed to model sequential dependencies along the temporal dimension. Experiments are conducted on a clinically validated dataset of 205 cases comprising 405 high-resolution mammogram and ultrasound images acquired with standardized medical imaging equipment. Quantitative analysis shows that the ST-GNN substantially outperforms the LSTM-GNN, achieving an accuracy of 85%, a precision of 74%, a recall of 93%, an F1-score of 82%, and an AUC of 0.87, demonstrating the effectiveness of spatial graph modeling for breast cancer detection. The findings support the hypothesis that graph-based multi-modal fusion offers a robust and scalable solution for improving diagnostic accuracy, sensitivity, and clinical decision support in automated breast cancer screening systems.
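
To make the graph-based multi-modal fusion idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes region-of-interest (ROI) features have already been extracted (here, 64-dimensional embeddings), builds a single graph whose nodes are mammogram and ultrasound ROIs with an assumed intra- and cross-modal edge scheme, and applies one GCN-style propagation step in plain PyTorch. All names, dimensions, and the edge-construction rule are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation): fuse mammogram
# and ultrasound ROI features as nodes of one graph with a single GCN-style step.
import torch
import torch.nn as nn


class SimpleGraphFusionLayer(nn.Module):
    """One propagation step: mean-aggregate neighbour features, then project."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.size(0))              # add self-loops
        deg = a_hat.sum(dim=1, keepdim=True)              # node degrees
        aggregated = (a_hat @ node_feats) / deg           # mean over neighbourhood
        return torch.relu(self.linear(aggregated))


# Hypothetical setup: 4 ROIs from the mammogram and 3 from the ultrasound,
# each described by a 64-dimensional feature vector (e.g., CNN embeddings).
mammo_rois = torch.randn(4, 64)
us_rois = torch.randn(3, 64)
node_feats = torch.cat([mammo_rois, us_rois], dim=0)     # 7 nodes in one graph

# Assumed edge scheme: connect ROIs within each modality and link every
# mammogram ROI to every ultrasound ROI for cross-modal information flow.
adj = torch.zeros(7, 7)
adj[:4, :4] = 1.0   # intra-mammogram edges
adj[4:, 4:] = 1.0   # intra-ultrasound edges
adj[:4, 4:] = 1.0   # cross-modal edges
adj[4:, :4] = 1.0
adj.fill_diagonal_(0.0)

layer = SimpleGraphFusionLayer(in_dim=64, out_dim=32)
fused = layer(node_feats, adj)                            # (7, 32) fused node embeddings
graph_embedding = fused.mean(dim=0)                       # pooled vector for classification
print(graph_embedding.shape)                              # torch.Size([32])
```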