Abstract
BACKGROUND AND PURPOSE: MR imaging can be used to measure structural changes in the brains of individuals with multiple sclerosis and is essential for diagnosis, longitudinal monitoring, and therapy evaluation. The North American Imaging in Multiple Sclerosis Cooperative steering committee developed a uniform high-resolution 3T MR imaging protocol relevant to the quantification of cerebral lesions and atrophy and implemented it at 7 sites across the United States. To assess intersite variability in scan data, we imaged a volunteer with relapsing-remitting MS with a scan-rescan at each site.
MATERIALS AND METHODS: All imaging was acquired on Siemens scanners (4 Skyra, 2 Tim Trio, and 1 Verio). Expert segmentations were manually obtained for T1-hypointense and T2 (FLAIR) hyperintense lesions. Several automated lesion-detection and whole-brain, cortical, and deep gray matter volumetric pipelines were applied. Statistical analyses were conducted to assess variability across sites, as well as systematic biases in the volumetric measurements that were site-related.
RESULTS: Systematic biases due to site differences in expert-traced lesion measurements were significant (P < .01 for both T1 and T2 lesion volumes), with site explaining >90% of the variation (range, 13.0–16.4 mL in T1 and 15.9–20.1 mL in T2) in lesion volumes. Site also explained >80% of the variation in most automated volumetric measurements. Output measures clustered according to scanner models, with similar results from the Skyra versus the other 2 units.
CONCLUSIONS: Even in multicenter studies with consistent scanner field strength and manufacturer after protocol harmonization, systematic differences can lead to severe biases in volumetric analyses.
ABBREVIATIONS:
- NAIMS: North American Imaging in Multiple Sclerosis Cooperative
- T1LV: T1-hypointense lesion volume
- T2LV: T2 lesion volume
Conventional MR imaging is an established tool for measuring CNS lesions and tissue compartment volumes in vivo in individuals with multiple sclerosis. In the brain and spinal cord, inflammatory demyelinating lesions appear hyperintense on T2-weighted images. Total cerebral T2 lesion volume (T2LV) is a key metric for the longitudinal monitoring of disease severity, as well as a standard outcome in clinical trials of MS therapeutics.1–3 Many T2 lesions exhibit pulse-sequence-dependent hypointensity on T1-weighted images, which has been shown to be associated with more severe (destructive) histopathology and worse clinical outcomes.4–8 MR imaging is also used to measure cerebral atrophy, a commonly used supportive outcome measure of the neurodegenerative aspects of the disease in both relapsing-remitting and progressive forms of MS.9–18 Together, lesion and atrophy measures provide complementary quantitative information about disease progression that is considered central to patient assessment.19
Unfortunately, differences in acquisition methods have the potential to bias MR imaging metrics. Factors such as equipment manufacturer, magnetic field strength, and acquisition protocol can affect image contrast and resultant volumetric data. Indeed, several groups have investigated the reliability of volumetric measurements across scanners,20–27 but little is understood about the variability in volumetric measurements of lesions and atrophy in individuals with MS. Furthermore, many automated segmentation algorithms depend on statistical atlases or models that are built with healthy volunteers or that depend on registration, which can be compromised by the presence of MS pathology.28
The North American Imaging in Multiple Sclerosis Cooperative (NAIMS) was established to accelerate the pace of imaging research. As a consortium, our first aim was to facilitate multicenter imaging studies by creating harmonized MR imaging protocols across sites. In this article, we describe initial results from our pilot study, which tested the feasibility of multisite standardization of MR imaging acquisitions for the quantification of lesion and tissue volumes. We compare inter- to intrasite scan-rescan variability in various MR imaging output metrics with consistently acquired 3T acquisitions.
Materials and Methods
Participant
A 45-year-old man with clinically stable relapsing-remitting MS and mild-to-moderate physical disability was imaged at 7 NAIMS sites across the United States (Table). He developed the first symptoms of the disease 13 years before study enrollment and had been relapse-free in the previous year after starting dimethyl fumarate. His last intravenous corticosteroid administration was 5 years previously. His timed 25-foot walk at study entry was 5.3 seconds. His Expanded Disability Status Scale score was 3.5, both at study entry and exit, without any intervening relapses on-study. The participant signed informed consent for this study, which was approved by the institutional review board of each site.
Scan Acquisition
Through consensus agreement in the Cooperative, NAIMS developed a standardized high-resolution 3T MR imaging brain scan protocol. All imaging was acquired with Siemens scanners, which, at the time of the study, were used by most NAIMS sites. Scan-rescan pairs were acquired on these scanners; the most relevant acquisition sequences are shown in the Table. At each site, the scan-rescan experiment was performed on the same day, with the participant removed and repositioned between scans. None of the participant's scans were coregistered to each other, to replicate a “real world” clinical trial setting. The volunteer was also imaged at the National Institutes of Health NAIMS site at the beginning and end of the study (5 months later) to assess disease stability. Raw MR imaging scans were distributed to 4 NAIMS sites for postprocessing.
Expert Lesion Tracing
De-identified images underwent manual quantification to assess total cerebral T1-hypointense lesion volume (T1LV) and T2LV from the native 3D FLAIR and T1 images by the consensus of trained observers (G.K., F.Y.) under the supervision of an experienced observer (S.T.). For T2LV, this process involved manually identifying all lesions on the FLAIR images. For T1LV, lesions were required to show hypointensity on T1-weighted images and at least partial hyperintensity on FLAIR images. The lesions were then segmented by 1 observer (G.K.) with a semiautomated edge-finding tool in Jim (Version 7.0; http://www.xinapse.com/home.php) to determine lesion volumes. Images were presented to the same reading panel for all of the above steps in random order in 1 batch and mixed into a stack of 50 other MS images to reduce scan-to-scan memory effects and preserve blinding.
Automated Analysis
Several fully automated pipelines were also used to estimate T2LV and the volumes of total brain, normal-appearing white matter, and both cortical and deep gray matter structures. To prevent overfitting, we used all pipelines with their default settings, according to published recommendations for each method separately, in which appropriate images were inhomogeneity corrected, rigidly aligned across sequences from each scan session, processed for removal of extracerebral voxels for all processing pipelines, and intensity normalized. For lesion measurements, several algorithms were applied by the laboratories that developed or codeveloped the various methods: Lesion-TOADS (TOpology-preserving Anatomical Segmentation; https://www.nitrc.org/projects/toads-cruise/),29 a fuzzy C-means-based segmentation technique with topologic constraints; Automated Statistical Inference for Segmentation (OASIS),30 a logistic-regression-based segmentation method leveraging statistical intensity normalization; Subject Specific Sparse Dictionary Learning (S3DL; https://www.nitrc.org/projects/s3dl/),31 a patch-based dictionary learning multiclass method; and White Matter Lesion Segmentation (WMLS; https://www.nitrc.org/projects/wmls/),32 a local support vector machine-based segmentation algorithm developed for vascular lesions that also uses corrective learning. To estimate the volume of gray matter structures, we used Lesion-TOADS; FMRIB Integrated Registration and Segmentation Tool (FSL-FIRST; http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST)33 (a Bayesian appearance method); Multi-atlas Segmentation with Brain Surface Estimation (MaCRUISE)34 (a combined multiatlas segmentation and cortical reconstruction algorithm); and MUlti-atlas region Segmentation utilizing Ensembles of registration algorithms (MUSE)35 (an ensemble multiatlas label-fusion method). 
The FSL-FIRST33 analysis was applied directly to the raw T1 images according to common practice, and OASIS30 was applied to the T1, FLAIR, and a 3D T2 high-resolution sequence after preprocessing; all other pipelines were applied to appropriately preprocessed T1 and FLAIR images. Not all algorithms measured volumes of the same set of structures. Lesion-filling was not performed. Lesion-TOADS, MaCRUISE, and MUSE also yielded estimates for total brain volume.
Statistical Analysis
All statistical analyses were conducted in the R software environment (http://www.r-project.org/).36 To compare estimated volumes within and across sites, we computed mean volumes and SDs. T tests were also used to test for differences in within-site averages between scanner platforms. Correlations between these averages across segmentation algorithms were also explored. The proportion of variation explained by site was computed, and the association with site was assessed with permutation testing. The coefficients of variation were also estimated across sites. To assess associations between session-average measured total brain and lesional volumes and time of day (morning versus afternoon), we used Wald testing within a linear model framework, both marginally and adjusting for scanner platform.
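The proportion of variation explained by site and its permutation test can be illustrated with a minimal sketch. The article does not give the authors' code (their analyses were run in R), so the Python below is only an assumption about one common way to compute these quantities: the between-site sum of squares divided by the total sum of squares, with a P value obtained by shuffling scans across site labels. The function names, the number of permutations, and the example volumes are hypothetical and are not study data.

```python
import random
from statistics import mean

# Hypothetical scan-rescan volumes (mL) for 7 sites; NOT the study data.
EXAMPLE_VOLUMES = {
    "site_1": [10.0, 10.2], "site_2": [11.0, 11.1], "site_3": [12.0, 12.3],
    "site_4": [13.0, 12.9], "site_5": [14.0, 14.2], "site_6": [15.0, 15.1],
    "site_7": [16.0, 16.2],
}

def proportion_explained_by_site(volumes_by_site):
    """Between-site sum of squares divided by the total sum of squares."""
    all_values = [v for scans in volumes_by_site.values() for v in scans]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0:
        raise ValueError("all measurements are identical")
    ss_between = sum(
        len(scans) * (mean(scans) - grand_mean) ** 2
        for scans in volumes_by_site.values()
    )
    return ss_between / ss_total

def permutation_p_value(volumes_by_site, n_permutations=5000, seed=0):
    """Share of random site relabelings that explain at least as much variation."""
    rng = random.Random(seed)
    sites = list(volumes_by_site)
    sizes = [len(volumes_by_site[s]) for s in sites]
    pooled = [v for s in sites for v in volumes_by_site[s]]
    observed = proportion_explained_by_site(volumes_by_site)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between scan and site
        shuffled, start = {}, 0
        for site, size in zip(sites, sizes):
            shuffled[site] = pooled[start:start + size]
            start += size
        # Small tolerance guards against floating-point ties.
        if proportion_explained_by_site(shuffled) >= observed - 1e-12:
            at_least_as_large += 1
    # Add-one correction keeps the estimated P value strictly positive.
    return (at_least_as_large + 1) / (n_permutations + 1)

if __name__ == "__main__":
    print(proportion_explained_by_site(EXAMPLE_VOLUMES))
    print(permutation_p_value(EXAMPLE_VOLUMES))
```

With only 2 scans per site, a high proportion explained means that the two scans from a site agree with each other far better than scans from different sites do, which is the pattern the permutation test checks against chance.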
Results
The participant was found to be stable regarding cerebral lesion load during the study. When we compared images acquired at the National Institutes of Health at study entry and exit, the manually measured T2LV in the participant was similar (17.9 mL in September 2015 versus 17.8 mL in February 2016). The T1LV was also stable (15.5 versus 15.1 mL). This imaging stability paralleled his clinical stability (see “Materials and Methods”).
The manually estimated T1LV and T2LV for each scan are shown in Fig 1. Site explained 95% of the variation observed in the estimated T2LV and 92% of the variation in the estimated T1LV, indicating marked scanner-to-scanner differences despite protocol harmonization, which clearly exceeded scan-rescan variability within sites. The range of T2LVs was 15.9 to 20.1 mL, indicating that differences of up to 25% of the lesion volume were observed across sites. The range of T1LVs was similarly wide, ranging from 13.0 to 16.4 mL. Further inspection of these volumes across platforms indicated that Skyra (Magnetom Skyra; Siemens, Erlangen, Germany) scanners showed larger lesion volumes compared with other Siemens platforms both on T1LV (Skyra: mean T1, 15.2 mL compared with non-Skyra: mean T1, 13.8 mL; P < .05) and T2LV (Skyra: mean T2, 18.9 mL compared with non-Skyra: mean T2, 16.6 mL; P < .01). An example of the segmented lesions across scanners is provided in Fig 2.
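As a worked check on the size of the across-site range, the short sketch below expresses the reported T2LV range (15.9 to 20.1 mL) as a percentage of a reference volume. The text does not state which reference volume underlies the "up to 25%" figure, so the three choices shown (smallest value, largest value, midpoint) are assumptions for illustration, and the function name is hypothetical.

```python
def relative_range_percent(low, high, reference):
    """Width of the range [low, high] as a percentage of a reference value."""
    return 100.0 * (high - low) / reference

# Range of manually traced T2LV across sites, as reported in the text (mL).
T2_LOW, T2_HIGH = 15.9, 20.1

if __name__ == "__main__":
    midpoint = (T2_LOW + T2_HIGH) / 2
    for label, reference in [("smallest", T2_LOW),
                             ("largest", T2_HIGH),
                             ("midpoint", midpoint)]:
        percent = relative_range_percent(T2_LOW, T2_HIGH, reference)
        print(f"relative to {label}: {percent:.1f}%")
```

The 4.2-mL spread corresponds to roughly 26% of the smallest measurement, 21% of the largest, and 23% of the midpoint, so the size of the relative difference depends on the denominator chosen.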
Results from the automated techniques for delineating and measuring T2LV are shown in Fig 3. The automated lesion segmentations showed marked disagreement in the average lesional volume measurements compared with the manually assessed volumes, and all methods showed large site-to-site differences (in some cases up to 7.5 mL, or almost 50% of the manually measured lesion volume), except for Lesion-TOADS (range, 10.5–11.0 mL), which was more stable. Site explained >50% of the observed variation for 3 of the 4 methods: 53% for S3DL (permutation P = .36), 54% for Lesion-TOADS (P = .41), and 83% for WMLS (P = .002), which was clearly the most prone to site-related variation; for OASIS, site explained 44% (P = .57).
To measure brain structure volumes, we used several automated methods. As an example, results for the thalamus are shown in Figs 4 and 5. While Lesion-TOADS estimated smaller volumes, MUSE, FSL-FIRST, and MaCRUISE yielded similar average measurements. Nonetheless, site was strongly associated with measured thalamic volume, explaining 96% of the Lesion-TOADS volume variation (P < .01), 89% of MUSE (P < .01), 84% of FSL-FIRST (P = .04), and 65% of MaCRUISE (P = .17). Similar results for the putamen, caudate, cortical gray matter, normal-appearing white matter, and total brain volume were found, as provided in On-line Figs 1–5. Summaries of the coefficient of variation give an intuitive measure of the scale of the combined scan-rescan and across-site variation as shown in Fig 6. Finally, the proportion of variation explained by site is shown in Fig 7. Note that in almost all cases, site explained >50% of the variation, with most measurement techniques showing >80% variation due to site for all structures assessed.
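The coefficient of variation summarized in Fig 6 can be sketched as follows. This is a minimal illustration that assumes the usual definition (sample SD divided by the mean, as a percentage) applied to all scans pooled across sites, so that it reflects the combined scan-rescan and across-site variation. The function name and the example volumes are hypothetical and are not study data.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample SD divided by the mean, expressed as a percentage."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical structure volumes (mL): 2 scans at each of 7 sites, pooled.
EXAMPLE_POOLED_VOLUMES = [
    6.1, 6.2, 6.4, 6.4, 5.9, 6.0, 6.6,
    6.5, 6.2, 6.3, 6.0, 6.1, 6.7, 6.6,
]

if __name__ == "__main__":
    print(f"{coefficient_of_variation(EXAMPLE_POOLED_VOLUMES):.2f}%")
```

Because the coefficient of variation is unitless, it allows small structures such as the thalamus to be compared with large compartments such as total brain volume on a common scale.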
While all images were acquired on 3T Siemens scanners, the model type appeared to influence the results; there was evidence of systematic differences in many measurements between Skyra and non-Skyra scanners. Figure 8 shows the negative log P values for the comparison of volumes averaged across scan-rescan measurements, with larger values indicating more systematic differences between platforms. The largest platform-associated differences were observed in MaCRUISE measurements of normal-appearing white matter, cortical gray matter, and, consequently, total brain volume. Lesion-TOADS also showed large differences in total brain volume attributable to cortical gray matter, as did S3DL for T2LV measurements. MUSE showed major differences in thalamic volume across scanner models, and FSL-FIRST showed similar discrepancies in the thalamus and caudate. The correlation between site-averaged measurements varied dramatically, especially for lesional and total brain volume measurements (On-line Fig 6); this variation indicates that site differences resulted in contrasting effects on output from the different algorithms. While the other measurements showed less scanner model–related variation, most still showed prominent differences between Skyra and non-Skyra scanners.
The time of day of scan acquisition was not associated with manually segmented T1 lesion volumes (t = 0.45) or T2 lesion volumes (t = 0.38) or total brain volume, as measured by any of the automated algorithms (On-line Figs 7 and 8).
Discussion
Clinical MS therapeutic trials have traditionally used 1.5T MR imaging platforms to provide metrics on cerebral lesions and atrophy as supportive outcome measures. However, there is growing interest in the use of high-resolution 3T imaging to assess disease activity and disease severity in MS. Such 3T imaging has the potential for increased sensitivity to lesions37,38 and atrophy,39 higher reliability,39,40 and closer relationships to clinical status,38,39 compared with scanning at 1.5T. The purpose of this study was to evaluate the consistency of metrics obtained from a single MS participant with a high-resolution 3T brain MR imaging protocol distributed to 7 sites. The results of our study indicate that even in multicenter acquisitions from the same scanner vendor after careful protocol harmonization, systematic differences in images led to severe biases in volumetric analyses. These biases were present in manually and automatically measured volumes of white matter lesions, as well as in automatically measured volumes of whole-brain and gray and white matter structures. These biases were also highly dependent on scanning equipment, which resulted from a higher sensitivity to lesions in newer scanners from the same manufacturer compared with earlier models, even at the same field strengths.
In comparison with past estimates of reliability of volumetric measurements of brain structures, our findings point to higher between-site variation than previously documented. In particular, Cannon et al27 reported that between 3% and 26% of the observed variation in global and subcortical volumes was attributable to site; this was a study of 8 healthy participants imaged on 2 successive days across 8 sites with 3T Siemens and GE Healthcare scanners. However, the proportion of explained variation has a different interpretation from that reported here. The total variation in Cannon et al consisted of 4 contributors to variance: first, across-site differences; second, across-scan differences; third, across-day differences; and fourth, across-subject differences. In our single-participant study, we isolated only the first 2 variance components, allowing us to compare variation as it is relevant to precision medicine (subject-specific) applications.
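To make the two isolated variance components concrete, the sketch below estimates an across-site and an across-scan (within-site) component from a balanced scan-rescan design. It uses the standard method-of-moments estimator for a one-way random-effects layout, which is an assumption for illustration and not the analysis reported in this article. The function name and the example data are hypothetical.

```python
from statistics import mean

def variance_components(volumes_by_site):
    """Return (across-site variance, across-scan variance) for a balanced design."""
    groups = list(volumes_by_site.values())
    n_scans = len(groups[0])
    if n_scans < 2 or any(len(g) != n_scans for g in groups):
        raise ValueError("need the same number (>= 2) of scans at every site")
    n_sites = len(groups)
    if n_sites < 2:
        raise ValueError("need at least 2 sites")
    grand_mean = mean(v for g in groups for v in g)
    # Mean squares from the one-way ANOVA table.
    ms_between = (
        n_scans * sum((mean(g) - grand_mean) ** 2 for g in groups) / (n_sites - 1)
    )
    ms_within = (
        sum((v - mean(g)) ** 2 for g in groups for v in g)
        / (n_sites * (n_scans - 1))
    )
    # Method of moments: E[MS_between] = var_scan + n_scans * var_site.
    var_site = max(0.0, (ms_between - ms_within) / n_scans)
    return var_site, ms_within

if __name__ == "__main__":
    # Hypothetical volumes (mL); NOT the study data.
    example = {"site_1": [10.0, 10.2], "site_2": [12.0, 12.3], "site_3": [14.0, 13.9]}
    var_site, var_scan = variance_components(example)
    print(var_site / (var_site + var_scan))  # share attributable to site
```

In a multi-participant design such as that of Cannon et al, additional across-day and across-subject components enter the denominator, which is one reason a site share computed there is not directly comparable with the single-participant share reported here.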
Previous work indicated that the observed variation attributable to scanning occasion was small25,27; indeed, Cannon et al27 found this to constitute <1% of the variation. Thus, we did not scan our participant on subsequent days but rather simply repositioned the participant between scans during the same imaging session. A notable difference between our study and that of Cannon et al is that we did not use data from a standardized phantom concurrently acquired for correction of between-scanner variations in gradient nonlinearity and scaling. Cannon et al found that this correction improved between-site intraclass correlations and greatly reduced differences between scanner manufacturers. Similarly, Gunter et al41 reported the usefulness of a phantom for scanner harmonization and quality control in the Alzheimer's Disease Neuroimaging Initiative (http://www.adni-info.org/). In future studies, we will focus on applying phantom calibrations across NAIMS sites to extend our current observations. Despite the growing literature on the importance of diurnal variation and hydration status for volumetric analyses,42–45 we found no significant associations between time of day and measured volumes. This may indicate that in single-participant analyses, time of day and day-to-day variation may be of less concern than the much larger source of variation of scanner platform. Most interesting, Cannon et al also found that measurements acquired with scanners from the same manufacturer and similar receive coils had higher reliability. In our study, we found that even scanner models (ie, Skyra versus non-Skyra) from the same manufacturer varied markedly in their estimates of lesion volume; this variation highlights the importance of between-scanner differences for assessing MS-related structural changes.
To assess differences across processing pipelines, we used a variety of techniques for automated segmentation of lesion and white and gray matter volumes. Different segmentation algorithms showed a range of variability in their estimates, as well as their sensitivity to differences between scanners. For example, Lesion-TOADS showed much less variable lesion measurements than any other technique and was not as sensitive to differences in scanner platform. Lesion-TOADS was the only unsupervised lesion-segmentation technique used. Contrast differences between the participant data and the training data of the other supervised methods could be associated with greater sensitivity to scanner differences, and this might be mitigated by specific (albeit potentially laborious) tuning to individual platforms. However, while sensitivity to biologic change is generally higher for methods yielding less noisy estimates, because only a single individual was studied here, our data cannot be taken to indicate that Lesion-TOADS is superior to other methods of estimating thalamic volume, for example. Additionally, both purely intensity-based segmentation algorithms, OASIS and WMLS, appeared to be more sensitive to site differences, which may indicate that methods that rely more on topology, shape, or spatial context may be more stable across scanners. This finding indicates that across-scanner differences may be driven by contrast differences rather than geometric distortions. Future investigation to extend these findings could involve quantitative contrast-to-noise and signal-to-noise comparisons across scanners. Allowing segmentation parameters to vary across sites could also help stability.
A limitation of this study is its single-subject and single time point design, which makes the generalizability of the findings dependent on further investigation. In particular, the degree to which across-site differences might vary by lesion burden and degree of atrophy, as well as demographic variables, requires additional study. Future larger studies of multiple participants across disease stages, including longitudinal measurements, are necessary for understanding the implications of the biases described in this pilot study. Indeed, such studies would also allow the assessment of the trade-off between stability in measures across sites and sensitivity to biologic differences. Differences between scanning equipment and scanner software versions have also been noted in past studies of reliability,23,25,27,46,47 but their implications for the assessment of pathology remain unclear. In particular, repeat acquisitions on scanners with different receive coils could provide additional insight concerning reliability. In addition, our study was from a single time point across scanners, whereas clinical trials rely on the quantification of intrasubject longitudinal change.48 Each participant is typically scanned on the same platform, which may limit the variability in on-study change between participants. Further studies are necessary to assess whether scan platform introduces the same level of acquisition-related variability when assessing longitudinal changes.
Given the intersite differences observed in lesional measurements, statistical adjustment for site is clearly necessary for across-site inference when analyzing volumetrics from multisite studies, even when images are acquired with a harmonized protocol on 3T scanners produced by the same manufacturer. From a single participant, it is unclear what the role of differential sensitivity to lesions might be across individuals with heterogeneity in lesion location. For example, while lesion detection in the supratentorial white matter might be more straightforward and comparable across individuals, detection of lesions in the brain stem, cerebellum, and spinal cord may be more sensitive to differences in equipment. New statistical methods for measuring and correcting systematic biases are warranted, especially for studies in which patient populations may differ across sites. Indeed, intensity normalization and scan-effect removal techniques49–55 (akin to batch-effect removal methods in genomic studies56) are an active area of methodologic research and promise to improve comparability of volumetric estimates from automated segmentation methods. After volumes are measured, statistical techniques for modeling estimated volumes from multicenter studies are also rapidly evolving.18,57 These techniques bring the potential to mitigate site-to-site biases in group-level analyses, with better external validity at the cost of increased sample size.
Conclusions
By imaging the same subject with stable relapsing-remitting MS over 5 months, we assessed scanner-related biases in volumetric measurements at 7 NAIMS centers. Despite careful protocol harmonization and the acquisition of all imaging at 3T on Siemens scanners, we found significant differences in lesion and structural volumes. These differences were especially pronounced when comparing Skyra scanners with other Siemens 3T platforms. The results from this study highlight the potential for interscanner and intersite differences that, unless properly accounted for, might confound MR imaging volumetric data from multicenter studies of brain disorders.
Our findings raise a key issue of the interpretability of MR imaging measurements in the context of personalized medicine, even in carefully controlled studies with harmonized imaging protocols.
Acknowledgments
The following is a full list of individuals who contributed to this NAIMS study—Brigham and Women's Hospital, Harvard Medical School (Boston, Massachusetts): Rohit Bakshi, Renxin Chu, Gloria Kim, Shahamat Tauhid, Subhash Tummala, Fawad Yousuf; Cedars-Sinai Medical Center (Los Angeles, California): Nancy L. Sicotte; Henry M. Jackson Foundation for the Advancement of Military Medicine (Bethesda, Maryland): Dzung Pham, Snehashis Roy; National Institutes of Health (Bethesda, Maryland): Frances Andrada, Irene C.M. Cortese, Jenifer Dwyer, Rosalind Hayden, Haneefa Muhammad, Govind Nair, Joan Ohayon, Daniel S. Reich, Pascal Sati, Chevaz Thomas; Johns Hopkins University (Baltimore, Maryland): Peter A. Calabresi, Sandra Cassard, Jiwon Oh; Oregon Health & Science University (Portland, Oregon): William Rooney, Daniel Schwartz, Ian Tagge; University of California (San Francisco, California): Roland G. Henry, Nico Papinutto, William Stern, Alyssa Zhu; University of Pennsylvania (Philadelphia, Pennsylvania): Christos Davatzikos, Jimit Doshi, Guray Erus, Kristin Linn, Russell Shinohara; University of Toronto (Toronto, Ontario, Canada): Jiwon Oh; Yale University (New Haven, Connecticut): R. Todd Constable, Daniel Pelletier.
Footnotes
Disclosures: Russell T. Shinohara—RELATED: Grant: National Institutes of Health*; Support for Travel to Meetings for the Study or Other Purposes: Race to Erase MS, Comments: travel to consortium meetings; UNRELATED: Board Membership: Genentech, Comments: Scientific Advisory Board; Consultancy: Hoffmann-La Roche, Comments: expert legal consulting; Grants/Grants Pending: Gates Foundation*; Travel/Accommodations/Meeting Expenses Unrelated to Activities Listed: Government of Canada–Banff Research Institute–European Committee for Treatment and Research in Multiple Sclerosis, Comments: conference travel.* Jiwon Oh—UNRELATED: Consultancy: Consortium of Multiple Sclerosis Centers, EMD Serono, Novartis, Hoffmann-La Roche, Biogen Idec, Teva Pharmaceuticals; Grants/Grants Pending: MS Society of Canada, National MS Society, Biogen Idec, Genzyme*; Support for Travel to Meetings for the Study or Other Purposes: Consortium of Multiple Sclerosis Centers. Peter Calabresi—RELATED: Grant: Race to Erase MS, Comments: foundation grant*; Support for Travel to Meetings for the Study or Other Purposes: Race to Erase MS, Comments: The foundation pays for my travel to semiannual meetings; UNRELATED: Consultancy: Biogen Idec, Vertex Pharmaceuticals; Grants/Grants Pending: Biogen Idec, Teva Pharmaceuticals, Annexon Biosciences, Novartis, Medimmune*; Royalties: Cambridge Press, Comments: for editing a book on optical coherence tomography. Christos Davatzikos—RELATED: Grant: National Institutes of Health/National Institute on Aging computational neuroanatomy of aging and Alzheimer disease via pattern analysis, Comments: R01-AG014971.* Roland G. Henry—RELATED: Grant: Race to Erase MS, Comments: nominal/standard cost for MRI scans*; UNRELATED: Consultancy: Hoffmann-La Roche, AbbVie, Novartis, Genzyme, StemCells Inc*; Grants/Grants Pending: Hoffmann-La Roche*; Payment for Lectures Including Service on Speakers Bureaus: Genzyme.* Daniel Pelletier—UNRELATED: Consultancy: Genzyme, Novartis, EMD-Serono, Genentech; Grants/Grants Pending: Biogen Idec, Comments: investigator-initiated research grant.* Dzung L. Pham—RELATED: Grant: National MS Society, Comments: RG-1507-05243.* Daniel S. Reich—RELATED: Support for Travel to Meetings for the Study or Other Purposes: Race to Erase MS.* William Rooney—RELATED: Grant: Race to Erase MS, Comments: This organization provided pilot funds for the study*; UNRELATED: Employment: Oregon Health & Science University, Comments: employs me as professor/director; Patents (Planned, Pending, or Issued): Oregon Health & Science University, Brookhaven National Laboratory; Royalties: Oregon Health & Science University. Rohit Bakshi—RELATED: Grant: Race to Erase MS.* Nancy L. Sicotte—RELATED: Grant: Race to Erase MS.* *Money paid to the institution.
Major support for this study was provided by the Race to Erase MS. Additional support came from RO1NS085211, R21NS093349, R01EB017255, and S10OD016356 from the National Institutes of Health and RG-1507-05243 from the National Multiple Sclerosis Society. The study was also partially supported by the Intramural Research Program of the National Institute of Neurological Disorders and Stroke.
Paper previously presented in preliminary form at: Annual Meeting of the European Committee on Treatment and Research in Multiple Sclerosis, September 14–17, 2016; London, UK.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
REFERENCES
- Received February 2, 2017.
- Accepted after revision April 6, 2017.
- © 2017 by American Journal of Neuroradiology