Abstract
BACKGROUND AND PURPOSE: Fetal MR imaging can enhance the identification of perinatal developmental disorders and improve on the accuracy of ultrasound. Manual MR imaging measurements require training and time and are subject to intraobserver variability; pediatric neuroradiologists are also in short supply. Our purpose was to develop a deep learning model and pipeline that automatically identifies anatomic landmarks on the pons and vermis in fetal brain MR imaging and suggests suitable images for measuring the pons and vermis.
MATERIALS AND METHODS: We retrospectively included 55 pregnant patients who underwent fetal brain MR imaging with a HASTE protocol. Pediatric neuroradiologists selected the studies for landmark annotation on sagittal single-shot T2-weighted images, and the clinically established measurement method was used as the criterion standard for the measurement of the pons and vermis. A U-Net-based deep learning model was developed to automatically identify fetal brain anatomic landmarks, including the 2 anterior-posterior landmarks of the pons and the 2 anterior-posterior and 2 superior-inferior landmarks of the vermis. Four-fold cross-validation was performed to test the accuracy of the model using both randomly divided and gestational age–sorted data sets. A confidence score of model prediction was generated for each testing case.
RESULTS: Overall, 85% of the testing results showed a confidence of >90%, with a mean error of <2.22 mm. The anterior and posterior pons and anterior vermis showed better localization (ie, fewer landmark-localization errors) and higher confidence scores than the other landmarks. We also developed a graphic user interface for clinical use.
CONCLUSIONS: This deep learning–facilitated pipeline substantially shortens the time radiologists spend selecting good-quality fetal brain images and performing anatomic measurements.
ABBREVIATIONS:
- AI = artificial intelligence
- AP = anterior-posterior
- DL = deep learning
- GA = gestational age
- SI = superior-inferior
CNS abnormalities are relatively common in fetuses, ranging from 0.1% to 0.2% in live births and 3% to 6% in stillbirths.1 A diagnosis of fetal brain abnormalities at an early stage is essential. Fetal sonography is considered the criterion standard of anatomic measurements. MR imaging is often performed when sonography is inconclusive to provide additional information for assessing fetal anatomy during all phases of gestation.2 MR imaging provides superior soft-tissue contrast and spatial resolution for differentiating highly variable fetal brain tissue.3 Fetal MR imaging combined with fetal sonography increases confidence in the early detection of perinatal disorders of development.
Manual measurements have several disadvantages, including clinicians’ training requirements, time commitment, and inter- and intraobserver variability.4 Radiologists must choose the highest-quality image series without motion artifacts or missing anatomy and then identify and measure various anatomic structures.5 Accurate measurements of fetal brain anatomy are critical to differentiate hypoplastic, absent, or malformed brains from normal brain structures.6 Even small measurement errors can have significant consequences in clinical practice because they may lead to misdiagnosis and misguided pregnancy management.5
Fetal MR imaging interpretation also requires specialized training. However, there is a shortage of pediatric neuroradiologists, resulting in limited availability.
During the past decade, artificial intelligence (AI) algorithms, specifically deep learning (DL), have significantly advanced image-recognition tasks.7 Machine learning approaches have the potential to aid in the early detection of fetal brain abnormalities, thereby enhancing the diagnostic and follow-up processes.
AI-based processing of fetal brain MR imaging has been investigated with models that automatically predict specific landmarks and segmentations, using various architectures (primarily convolutional neural networks and U-Net).8,9 Some models achieved an accuracy of ≥95%. AI can aid in pre- and postprocessing10 and reconstruction,11 predicting gestational age (GA) (with an accuracy of 1 week),12 fetal brain extraction,13 and fetal brain segmentation.11,14 AI can also help in fetal motion detection, motion tracking, pose estimation,15 and super-resolution reconstruction.11 Recently, several publications have developed AI models for automatic fetal brain anatomic measurements such as the biparietal diameter,16 derived from identified landmarks after several preprocessing steps: computation of an ROI, reference section selection, segmentation, midsagittal line and fetal brain orientation determination, and, finally, measurement.
In this work, we focused on identifying 2 anterior-posterior (AP) landmarks of the pons and 2 AP and 2 superior-inferior (SI) landmarks of the vermis. All the landmarks on each structure were predicted simultaneously using U-Net multisegmentation features. We exploited U-Net17 to determine imaging features surrounding a landmark point and calculated the probability of any image pixel being the defined landmark point.
In addition, the image pixel with the highest probability within the output Gaussian distribution mask was predicted as the landmark point by the U-Net model. Finally, we developed a tool that could be extended to clinical use on the basis of the prediction model to help radiologists select the best image series for interpretation and perform fetal brain anatomic measurements more efficiently.
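As an illustration, here is a minimal Python sketch (with hypothetical names) of how a predicted Gaussian heatmap can be reduced to a landmark coordinate and a confidence score in this way:

```python
import numpy as np

def heatmap_to_landmark(heatmap: np.ndarray):
    """Return the (row, col) of the heatmap peak and its value.

    The pixel with the highest predicted probability is taken as the
    landmark point; its peak value (in [0, 1]) serves as a simple
    confidence score, as described in the text.
    """
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    confidence = float(heatmap[row, col])
    return (row, col), confidence

# Example: a 512 x 512 heatmap predicted by the U-Net for one landmark.
heatmap = np.zeros((512, 512), dtype=np.float32)
heatmap[260, 240] = 0.97
point, conf = heatmap_to_landmark(heatmap)
print(point, conf)  # (260, 240) 0.97
```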
MATERIALS AND METHODS
Database
This retrospective study, approved by the institutional review board with a waiver of consent, included 55 fetal MR imaging studies at different GAs. The studies were selected from a database of pregnant women who underwent routine clinical fetal screening at Rush University Medical Center, Chicago, Illinois, between 2007 and 2020. All the selected studies confirmed normal fetal brain development based on radiology reports. Expert pediatric neuroradiologists performed image-quality screening and landmark annotation on the exported sagittal T2-weighted HASTE images. Of these 55 patients, some had >1 image series, for a total of 100 image series. Image series were added through data augmentation, increasing the data set.
Six landmarks, including AP landmarks on the pons and AP/SI landmarks on the vermis (drawn by the radiologist), served as the ground truth. In addition, manual biometric measurements on the pons and vermis were performed according to the standard clinical recommendations.18
MR Imaging Protocol
Fetal MR images were obtained at our institution using Siemens 1.5T MR imaging scanners (Siemens, Erlangen, Germany), without sedation. Single-shot HASTE images were acquired in the axial, coronal, and sagittal planes with the following parameters: TR = 1400 ms, TE = 120 ms, FOV = 230 × 230 mm2, and section thickness/gap = 3/0 mm, under free breathing. The fetal age range was 20–39 weeks, and the cases did not involve twins or significant maternal risk factors.
Image Preprocessing
All images were resized to 512 × 512 and augmented through rotating, flipping, adding Gaussian noise, motion blurring, median blurring, contrast-limited adaptive histogram equalization, sharpening, embossing, random brightness contrast adjustment, and random hue saturation adjustment using an open-source library (albumentations.ai; https://albumentations.ai/) with default parameter settings.
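For illustration, here is a sketch of this augmentation chain using the albumentations API with default parameters; the keypoint handling is our assumption, added so that landmark annotations stay aligned with the augmented images:

```python
import albumentations as A

augment = A.Compose(
    [
        A.Resize(512, 512),
        A.Rotate(),
        A.HorizontalFlip(),
        A.GaussNoise(),
        A.MotionBlur(),
        A.MedianBlur(),
        A.CLAHE(),           # contrast-limited adaptive histogram equalization
        A.Sharpen(),
        A.Emboss(),
        A.RandomBrightnessContrast(),
        A.HueSaturationValue(),
    ],
    keypoint_params=A.KeypointParams(format="xy"),  # keep landmarks in sync
)

# usage: augmented = augment(image=image, keypoints=landmarks)
```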
The study used original MR images without super-resolution reconstruction or other quality-selection preprocessing. A preselection process ensured suitable images for the study, focusing on landmark visibility and differentiation from neighboring structures. Exclusion criteria included motion artifacts, which hindered landmark identification and blurred anatomic borders. Consistency was maintained using midsagittal planes for pons and vermis measurements, avoiding oblique planes. Clinicians selected the data set for pons and vermis annotations, which underwent independent verification by radiologists and AI engineers for accuracy and consistency.
An innovative aspect of our research lies in the use of U-Net for landmark prediction. Rather than using the conventional binary segmentation output of 0/1, we modified the final output layer to generate a distribution map indicating the probability of landmark locations. This approach allowed us to extract richer information from the U-Net model and precisely predict the positions of the landmarks.
Model Performance Validation
The U-Net model was used to fit the Gaussian distribution function. After the radiologist provided specific landmarks, the AI model calculated the distribution probability of these landmarks on the image by collecting the image features.
Model performance evaluation focuses on the probability relationship between image features and landmarks. Given an image unsuitable for labeling the vermis, the AI will still estimate the most likely landmark location; however, confidence in this point may be low when the image features differ from those the model learned during training.
With this probability distribution, we can feed all the images into the pipeline and select the most suitable images for the physician’s annotation.
U-Net Model as a Transforming Function
The encoding path of the U-Net model incorporates convolutional and max-pooling layers for feature extraction and dimension reduction, while the decoding path uses up-sampling and concatenation for restoring spatial resolution and creating segmentation maps.
A U-Net model was built to transform an input MR image into a gray-scale image mask with its vertex representing the location of the predicted landmark point (Fig 1).
The transforming function expresses the impact of an arbitrary point within the mask on the predicted landmark point as a function of the radial distance $r$ from the annotated landmark point $(x_0, y_0)$ (Equation 1):

$$M(x, y) = f(r), \qquad r = \sqrt{(x - x_0)^2 + (y - y_0)^2} \le R.$$

The gray-scale mask is a rotationally symmetric function (radius of a circle: $R$) with a Gaussian distribution centered on the annotated landmark point (Fig 2), where $R$ is proportional to the fronto-occipital radius of the fetal brain (Fig 3). We chose the Gaussian distribution function

$$f(r) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(r - \mu)^2}{2\sigma^2}\right)$$

and simplified it by setting the mean $\mu = 0$ and the SD $\sigma$ proportional to $R$, as in Equation 2:

$$f(r) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{r^2}{2\sigma^2}\right).$$

In practical applications, we removed the coefficient $\frac{1}{\sigma \sqrt{2\pi}}$ and substituted the result into Equation 1, so that the mask value peaks at 1 at the landmark point.
We customized the U-Net for landmark-prediction reliability by implementing a Gaussian output. This step permits AI to consider landmark-location uncertainty. The probability of a predicted point being the desired landmark was represented using this Gaussian function. The confidence score for each landmark was based on this probability. The model was trained to detect multiple landmarks simultaneously by generating separate masks representing the Gaussian probability of each landmark.
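A minimal sketch of constructing such a Gaussian target mask (Equation 2 with the coefficient removed, mean 0); the sigma-to-R ratio and the hard cutoff at radius R are our assumptions:

```python
import numpy as np

def gaussian_mask(shape, landmark, R, sigma_ratio=3.0):
    """Rotationally symmetric Gaussian centered on the landmark, peak = 1."""
    rows, cols = np.indices(shape)
    r = np.sqrt((rows - landmark[0]) ** 2 + (cols - landmark[1]) ** 2)
    sigma = R / sigma_ratio          # assumption: SD tied to mask radius R
    mask = np.exp(-(r ** 2) / (2 * sigma ** 2))
    mask[r > R] = 0.0                # zero outside the circle of radius R
    return mask.astype(np.float32)

# One mask per landmark; stacking them gives the multi-landmark target.
targets = np.stack(
    [gaussian_mask((512, 512), p, R=40) for p in [(260, 240), (300, 250)]],
    axis=-1,
)
```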
We balanced model complexity and computational efficiency by optimizing the hyperparameters of the model. We used a 3 × 3 kernel size to capture local contextual information efficiently. The channel depth gradually doubled after each max-pooling operation to learn more complex representations at different levels. The number of layers was chosen considering the task complexity and available computational resources, finding a suitable balance for the model.
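A compact Keras sketch consistent with this description (3 × 3 kernels, channel depth doubling after each max-pooling step, one heatmap channel per landmark); the depth, base filter count, and sigmoid output activation are our assumptions, not the authors’ published configuration:

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(512, 512, 1), n_landmarks=6, base=16, depth=4):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for d in range(depth):                   # encoding path
        x = conv_block(x, base * 2 ** d)     # channels double each level
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base * 2 ** depth)     # bottleneck
    for d in reversed(range(depth)):         # decoding path
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[d]])
        x = conv_block(x, base * 2 ** d)
    # One Gaussian heatmap channel per landmark, values in [0, 1].
    outputs = layers.Conv2D(n_landmarks, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)

model = build_unet()
```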
Our study demonstrated that splitting the DL model outputs into separate models for pons and vermis landmarks resulted in improved accuracy. This approach allowed fine-tuned adjustments and enhanced detection of each landmark. By focusing on specific imaging features, the individual models improved the identification of anatomic structures. Compared with predicting all landmarks at once with a single model, this approach achieved superior performance and increased accuracy in detecting landmarks in fetal brain MR imaging.
Model Training
In the training process, a normalized weighted binary mean squared error loss function was used to compensate for the data imbalance. Other model parameters included a batch size of 15 for both training and validation and 100 epochs, with early stopping after 10 consecutive epochs of increasing loss.
We used the Adam optimizer to update model weights efficiently, the EarlyStopping strategy to prevent overfitting, and the ModelCheckpoint callback to save weights of the best-performing model, ensuring reproducibility. The entire training process on an NVIDIA GeForce RTX 2080 Ti (11 GB) graphics processing unit took <2 hours.
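Continuing from the U-Net sketch above, here is a sketch of this training configuration; the exact weighting and normalization of the loss are not specified in the text, so the weight map below is an assumption, and the training arrays are placeholders:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import callbacks

def weighted_mse(y_true, y_pred):
    # Up-weight pixels near landmarks (y_true close to 1) to offset the
    # background/foreground imbalance; this weight map is an assumption.
    weights = 1.0 + 10.0 * y_true
    return tf.reduce_mean(weights * tf.square(y_true - y_pred))

# Placeholder data; in practice these are the augmented HASTE images
# and their stacked Gaussian target masks.
train_x = np.zeros((15, 512, 512, 1), np.float32)
train_y = np.zeros((15, 512, 512, 6), np.float32)

model.compile(optimizer="adam", loss=weighted_mse)
model.fit(
    train_x, train_y,
    validation_split=0.2,
    batch_size=15,
    epochs=100,
    callbacks=[
        callbacks.EarlyStopping(patience=10, restore_best_weights=True),
        callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    ],
)
```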
K-Fold Cross-Validation
Two 4-fold cross-validation methods were implemented for model training and testing (Fig 4). In the first method, the data set was divided into 4 groups by a sorted range of GAs (ie, 20–26 weeks, 27–29 weeks, 30–33 weeks, and 34–39 weeks) (Fig 4A). In the second method, the data set was randomly divided with mixed GAs without overlapping among all groups (Fig 4B). In each cross-validation fold, 3 groups of patient images were used as the training data set, and the other group, as the testing data set.
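A sketch of the 2 splitting schemes; the GA values are placeholders, and the paper’s actual per-series assignment is not reproduced here:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical per-series gestational ages in weeks (placeholder values).
gestational_ages = np.random.default_rng(0).integers(20, 40, size=100)

# Method 1: 4 folds defined by the sorted GA ranges used in the paper.
bins = [(20, 26), (27, 29), (30, 33), (34, 39)]
ga_folds = [np.flatnonzero((gestational_ages >= lo) & (gestational_ages <= hi))
            for lo, hi in bins]

# Method 2: random folds with mixed GAs and no overlap among groups.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
random_folds = [test_idx for _, test_idx in kf.split(gestational_ages)]

# In each fold, 3 groups form the training set; the held-out group tests.
```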
RESULTS
Model Performance
The randomly divided mixed GA method outperformed the sorted GA-divided method by providing smaller prediction errors (P < .001) with higher confidence scores (P < .001).
In the first cross-validation method, in which the data set was divided by sorted GA weeks, the prediction error distribution with associated confidence scores for 6 predicted landmarks in 100 image series (ie, the number of total landmarks: 6 × 100 = 600) is shown in Fig 5 (A, scatterplot; B, contour line plot). Among all 600 predicted landmark locations, 73% (440/600) showed a confidence score of >90%, with a mean prediction error of <2.25 mm. Among the 6 landmarks, the anterior vermis and anterior and posterior pons were best predicted, with fewer errors and higher confidence scores (Online Supplemental Data). The anterior and posterior pons and anterior vermis had significantly fewer errors compared with the posterior vermis (P < .01, P < .01, and P < .001, respectively). Additionally, the anterior pons had a significantly higher confidence score than the superior/inferior/posterior vermis (P < .05, P < .05, and P < .01, respectively), while the posterior pons had a significantly higher confidence score than the superior/posterior vermis (P < .05 and P < .01, respectively).
In the second cross-validation method, in which the data set was randomly divided with mixed GA weeks, the prediction error distribution with an associated confidence score for the 600 predicted landmarks is shown in Fig 6 (A, scatterplot; B, contour line plot). Among all 600 predicted landmarks, 85% (511/600) showed a confidence score of >90%, with a mean prediction error of <2.22 mm. Among the 6 landmarks, the posterior pons was the best-predicted landmark, with the smallest error and highest confidence score (Online Supplemental Data). The posterior pons had significantly lower error compared with all other landmarks (P < .05 for the anterior pons and anterior vermis, and P < .001 for the superior/inferior/posterior vermis). Additionally, the posterior pons had significantly higher confidence scores than the superior, inferior, and posterior vermis (P < .05, P < .05, and P < .01, respectively).
Automatic Landmark Detection
We evaluated the differences between manual landmark localization performed by a radiologist and an expert pediatric neuroradiologist, as shown in the Table. The variations between their manual measurements ranged from a mean of 0.42 (SD, 0.59) mm for vermis1 to 1.87 (SD, 1.81) mm for vermis2. These disparities emphasize the presence of interrater variability and the possibility of measurement inconsistencies with manual assessments.
By automating landmark identification and using DL capabilities, our AI model provides more reliable and consistent measurements.
Figure 7 shows examples of model-predicted landmark locations, with the biometric measurements between each landmark pair, compared with manual detections performed by the radiologist.
Despite variations in image quality and white-noise levels, the AI system maintained accuracy, highlighting its robustness and adaptability. Figure 7 shows the precision and confidence of our AI model in 1 case. The left image has a pixel spacing of 0.41 mm, a relatively low resolution; the middle and right images have a pixel spacing of 0.35 mm, a better resolution, but the right image has significantly more white-noise points. The Online Supplemental Data show confidence and measurements in these 3 series. The AI provided confidence scores for each prediction and selected the optimal predictions with an error below 0.5 mm, enhancing trust in its outputs. Therefore, this AI model offers a reliable, accurate, and consistent tool for measurements.
Distance Measurement.
The Online Supplemental Data illustrate the distance measurements of the pons and vermis, comparing manual and AI measurements and the corresponding errors. The error-to-total measurement ratio is reasonable. For instance, when the confidence threshold is set above 90%, the average pons distance is 10.81 mm, with an error of 1.12 mm.
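For reference, a minimal sketch of how a biometric distance between a landmark pair converts from pixel coordinates to millimeters via the pixel spacing (values here are illustrative):

```python
import math

def landmark_distance_mm(p1, p2, pixel_spacing_mm):
    """Euclidean distance between two (row, col) landmarks in mm."""
    return pixel_spacing_mm * math.hypot(p1[0] - p2[0], p1[1] - p2[1])

# e.g., AP pons diameter from the 2 pons landmarks at 0.41-mm spacing
pons_mm = landmark_distance_mm((260, 240), (262, 266), 0.41)
```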
Statistical Analysis
Data were transformed to achieve a normal distribution before we conducted statistical testing. The prediction errors of the 6 landmarks and confidence scores were compared using paired t tests across the two 4-fold cross-validation methods. The Tukey Honest Significant Difference test was used to compare and rank the prediction errors and confidence scores for each landmark among all 6 landmarks. All statistical analyses were performed using R Studio (http://rstudio.org/download/desktop). A significance level of P < .05 was used.
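For readers working in Python rather than R, an equivalent sketch of these tests with placeholder data (scipy/statsmodels):

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
errors_sorted = rng.gamma(2.0, 1.0, 600)   # placeholder: sorted-GA CV errors (mm)
errors_random = rng.gamma(1.8, 1.0, 600)   # placeholder: mixed-GA CV errors (mm)

# Paired t-test between the two cross-validation methods
# (the paper transformed the data toward normality first).
t_stat, p_value = ttest_rel(errors_sorted, errors_random)

# Tukey HSD across the 6 landmarks (100 errors per landmark).
landmarks = np.repeat(["ant_pons", "post_pons", "ant_vermis",
                       "post_vermis", "sup_vermis", "inf_vermis"], 100)
print(pairwise_tukeyhsd(errors_random, landmarks, alpha=0.05))
```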
Clinical Pipeline with Graphic User Interface
An interactive tool was developed on the basis of our model. This graphic user interface helps radiologists identify the landmarks of the pons and vermis and obtain biometric measurements on fetal MR imaging more efficiently (Fig 8). The interface also displays a confidence score for each suggested image, which clinicians can use to decide whether to accept, reject, or modify each landmark prediction.
DISCUSSION
We proposed a novel U-Net DL model that automatically detects anatomic landmarks on the pons and vermis in fetal brain MR imaging. Using these landmark locations, we predicted important fetal biometric parameters, including vermis diameter and height and pons diameter. The critical component of our model was using U-Net as a transforming function to generate a gray-scale image mask with a Gaussian distribution by extracting the imaging features adjacent to the center of the mask. Although we tested this model only for landmarks on the pons and vermis, with promising results, it may also be applied to detect landmarks on other brain structures with asymmetric image features.
A previous study16 used a 2-stage anisotropic 3D U-Net to detect fetal brain ROIs, identified landmarks on a reference image section with a fetal-measurement-by-landmarks approach, and estimated measurement reliability with a Gaussian mixture model. Compared with fetal MR imaging radiologists, that model yielded a 95% confidence interval agreement of 3.70 mm for cerebral biparietal diameter, 2.20 mm for bone biparietal diameter, and 2.40 mm for transcerebellar diameter. Our study is the first to automatically detect landmarks on the pons and vermis on fetal MR imaging using a U-Net as a transforming function with very few training parameters, reducing the computational complexity and shortening the training time.
The accuracy of our model depends on learning imaging features around landmarks. Given the varying fetal brain size and appearance at different GAs, we set the mask size proportional to the fronto-occipital diameter of each brain for adequate feature extraction. Among the 6 landmarks, 2 on the pons and 4 on the vermis, the anterior-posterior pons and anterior vermis had better accuracy, possibly due to distinctive adjacent image features like the fourth ventricle, aiding model learning.
We built a pipeline for automatic batch processing of multiple image series for landmark prediction. It selected the reference section on the basis of the highest confidence score and skipped poor-quality images. The reference section was presented to the radiologists for review, and they were alerted to manually adjust any landmark prediction with a confidence score of <0.8.
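A sketch of this selection logic, assuming the model outputs 1 heatmap channel per landmark and that the peak heatmap value serves as the confidence score:

```python
import numpy as np

CONFIDENCE_FLOOR = 0.8  # below this, the radiologist is alerted to adjust

def select_reference_section(series, model):
    """series: stack of slices from one image series, shape (n, 512, 512, 1)."""
    heatmaps = model.predict(series)              # (n, 512, 512, 6)
    conf = heatmaps.max(axis=(1, 2))              # peak per slice and landmark
    best = int(np.argmax(conf.mean(axis=1)))      # slice with highest mean confidence
    flagged = np.flatnonzero(conf[best] < CONFIDENCE_FLOOR)
    return best, conf[best], flagged              # `flagged` landmarks need review
```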
In model training and validation, we implemented 2 schemes for 4-fold cross-validation (dividing folds by sorted GA weeks versus dividing folds randomly with mixed GA weeks). The model trained and validated with mixed GA weeks provides overall better accuracy compared with the sorted GA week–divided approach, suggesting that a large number of imaging features extracted from fetal MR images with various white/gray matter contrast and anatomic details during fetal brain development are essential to include in model training.
In similar studies, Dovjak et al19 conducted manual studies on cerebellar vermian lobulation and vermis/brainstem–specific markers using prenatal MR images. Their research improved hindbrain malformation classification and provided insights into vermian growth patterns. In contrast, our study introduces a DL model for automated identification and measurement of pons and vermis landmarks in fetal MR imaging. Our approach enhances efficiency, reduces errors, and offers confidence scores for predictions.
Advantages of Our Model
Automated landmark localization: The U-Net algorithm offers automated landmark localization with associated confidence scores. This feature allows us to process all sagittal sequences of a patient and automatically identify the top 5 fetal brain MR image series with the highest confidence for physicians to choose from or manually adjust.
Time efficiency of the U-Net algorithm: Although the initial implementation and training of the U-Net algorithm require time and resources, its application on new fetal brain MR images is quick and efficient. The model can automatically identify landmarks and provide measurements without manual intervention. On our hardware (GeForce RTX 2080 Ti, 11 GB), the average processing time for our U-Net model to predict 6 markers on an image is as low as 0.23 seconds, which is negligible compared with the time required by a physician for manual screening and measurements.
Consistency and reduced interobserver variability: The U-Net algorithm offers consistent landmark identification across different images and cases, reducing interobserver variability. This standardized approach leads to more reliable and reproducible measurements.
Accommodation of different resolutions and image quality: Our U-Net model robustly handles variations in resolution and evaluates image quality. It assigns confidence scores to each image slice, prioritizing high-quality images for accurate measurements and increased diagnostic confidence, resulting in errors of <0.5 mm regardless of resolution and image-quality differences.
Potential of the U-Net algorithm: The model demonstrates promising accuracy levels, potentially matching or exceeding human expert annotations. Further validation and optimization can enhance its reliability for posterior fossa biometry quantification.
Choosing linear measurements: Our study prioritized using linear measurements, specifically the diameter of the pons and vermis, due to their well-established clinical relevance and diagnostic utility. These measurements have been widely adopted in clinical practice and have demonstrated their effectiveness in detecting various brain abnormalities, including vermian hypoplasia, Dandy-Walker malformation, and pontocerebellar hypoplasia. Moreover, monitoring changes in the pons and vermis offers valuable insights into the neurologic development of the fetus, identifying potential issues and evaluating posterior fossa lesions.
Running the Algorithm on a Public Data Set
As an additional test, we ran the algorithm on the larger, publicly available Fetal Tissue Annotation and Segmentation Challenge data set (https://feta.grand-challenge.org/), with good accuracy. Additional code and examples are provided in the Online Supplemental Data.
Limitations
This study had some limitations. First, the sample size in each GA week range was small, possibly leading to insufficient model training. To address this limitation, we used several strategies to enhance the accuracy and generalizability of our AI model. MR imaging selection was conducted by experienced medical professionals, ensuring high-quality training, validation, and testing data sets. We used transfer learning, enabling the model to identify distinctive features across a wider image range and increasing its applicability in diverse clinical scenarios. Despite the size of our data set, the promising performance of the model in this pilot study suggests adaptability across different institutional settings. The model provides landmark coordinates and confidence values, giving clinicians flexibility in MR image selection.
Second, this study did not assess fetal brain biometry in pathologic cases because of a paucity of cases with abnormal pons and vermis structures across different GA weeks. The study was also limited to a single scanner platform and a single cohort; as a result, its generalizability and utility on large-scale data platforms may be limited. We recognize the need for further validation with larger, diverse data sets to ascertain the robustness and generalizability of our model in varied clinical environments.
Ensemble learning is a valuable option for future studies if additional data are available. This technique involves using all models obtained through 4-fold cross-validation and selecting the landmark with the highest confidence, though it comes at the cost of increased processing time.
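A sketch of that ensemble strategy, assuming 4 trained fold models and the peak heatmap value as the confidence score:

```python
import numpy as np

def ensemble_predict(image, fold_models):
    """Keep, per landmark, the fold model's prediction with the highest peak."""
    best = {}
    for model in fold_models:                        # the 4 CV fold models
        heatmaps = model.predict(image[None])[0]     # (512, 512, 6)
        for k in range(heatmaps.shape[-1]):
            h = heatmaps[..., k]
            point = np.unravel_index(np.argmax(h), h.shape)
            conf = float(h[point])
            if k not in best or conf > best[k][1]:
                best[k] = (point, conf)              # retain the most confident
    return best
```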
CONCLUSIONS
A U-Net model was developed to detect the AP landmarks of the pons and the AP/SI landmarks of the vermis. Our pipeline includes image series screening and selection and landmark prediction, improving radiologists’ efficiency in identifying landmarks, performing anatomic measurements, and screening high-quality images. Using this U-Net-based DL model, we achieved a mean error of <2.22 mm with a >90% confidence score in 85% of the testing cases.
While manual measurements by radiologists often yield robust results, our AI model brings value by significantly reducing interrater variability and measurement errors. It accurately identifies high-confidence landmarks and optimizes image selection, even in instances in which blurred margins due to motion artifacts are present.
We also established a pipeline with a graphic user interface, consisting of image selection and landmark prediction, followed by an interactive second-check tool to help radiologists quickly locate, confirm, or adjust the landmarks on the autoselected image slices.
Using U-Net as a transforming function, our model accurately extracts imaging features around landmarks, particularly for the anterior and posterior pons and anterior vermis.
We validated this algorithm on a public data set, demonstrating good accuracy. The AI model addresses interrater variability, reduces measurement errors, and saves time, presenting advantages over manual measurements. Implementing AI-driven automation enables comprehensive and efficient fetal brain MR imaging assessment, potentially enhancing radiologists’ efficiency and diagnostic accuracy and improving patient outcomes in fetal brain MR imaging analysis.
Footnotes
This project was funded by the Colonel Robert R. McCormick Professorship of Diagnostic Imaging fund at Rush University Medical Center and the Swim Across America Pilot Project Grant from Rush University Medical Center.
Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.
- Received February 6, 2023.
- Accepted after revision August 2, 2023.
- © 2023 by American Journal of Neuroradiology