Automated joint skull-stripping and segmentation with Multi-Task U-Net in large mouse brain MRI databases

Skull-stripping and region segmentation are fundamental steps in preclinical magnetic resonance imaging (MRI) studies, and these common procedures are usually performed manually. We present Multi-task U-Net (MU-Net), a convolutional neural network designed to accomplish both tasks simultaneously. MU-Net achieved higher segmentation accuracy than state-of-the-art multi-atlas segmentation methods, with an inference time of 0.35 seconds and no pre-processing requirements. We evaluated the performance of our network in the presence of skip connections and recently proposed framing connections, finding the simplest network to be the most effective. We tested MU-Net on an unusually large dataset combining several independent studies, consisting of 1,782 mouse brain MRI volumes of both healthy and Huntington animals, and measured average Dice scores of 0.906 (striati), 0.937 (cortex), and 0.978 (brain mask). These high evaluation scores demonstrate that MU-Net is a powerful tool for segmentation and skull-stripping, decreasing the inter- and intra-rater variability of manual segmentation. The MU-Net code and the trained model are publicly available at https://github.com/Hierakonpolis/MU-Net.

1 Introduction

[...] images typically employed in drug development. We tested MU-Net on an independent test set of 1,782 MRI volumes acquired over the course of four years from both wild type (WT) and Huntington (HT) C57BL/6J mice, allowing us to evaluate MU-Net in a variety of experimental conditions. We demonstrate that with these data MU-Net achieves significantly higher accuracy than state-of-the-art multi-atlas segmentation methods [16] in a fraction of the segmentation time (approximately 0.35 s). Additionally, we trained MU-Net for the segmentation of mouse MRI with isotropic voxels into 37 ROIs and demonstrate that the segmentation accuracy of MU-Net was equal to or better than that of a state-of-the-art multi-atlas segmentation method [15].

MU-Net is a U-Net-based network combining region segmentation and skull-stripping to build a shared representation for both tasks, providing at the same time a more efficient feature encoding and a regularizing effect. We present an overview of MU-Net's architecture in Figure 1. Briefly, MU-Net presents an encoder-decoder U-Net-like architecture, with each branch articulated in four dense blocks. Unlike U-Net, the final block of the decoder branch further branches into two different output maps representing our two tasks, sharing the same feature representation. MU-Net was trained using the Adam optimizer [27] to jointly minimize the Dice loss for both skull-stripping and region segmentation [28]. For further details refer to section 4.4.
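As a concrete illustration of this two-headed design, the following PyTorch sketch shows a toy encoder-decoder whose final decoder features feed both a sigmoid skull-stripping head and a softmax region head. `TinyMUNet`, its channel counts, and its block structure are illustrative simplifications of the idea, not the actual four-dense-block MU-Net.

```python
import torch
import torch.nn as nn

class TinyMUNet(nn.Module):
    """Toy two-headed encoder-decoder (NOT the exact MU-Net architecture)."""

    def __init__(self, ch=8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(ch, 2 * ch, 3, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
        )
        # The two task heads share the decoder's feature representation:
        self.mask_head = nn.Conv2d(ch, 1, 1)    # skull-stripping (sigmoid)
        self.region_head = nn.Conv2d(ch, 5, 1)  # 4 ROIs + background (softmax)

    def forward(self, x):
        feats = self.dec(self.enc(x))
        mask = torch.sigmoid(self.mask_head(feats))
        regions = torch.softmax(self.region_head(feats), dim=1)
        return mask, regions
```

Sharing all features up to the final 1x1 convolutions is what builds the joint representation for both tasks.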

MU-Net was trained and validated on 128 mouse MRI volumes acquired from 32 animals at four different ages (5, 12, 16, and 32 weeks). The training set was annotated for the brain mask and four more regions: cortex, striati, hippocampi and ventricles, each category including both left and right regions. For validation, we used 5-fold cross-validation and measured overlaps using the Dice score [29]. Hypothesis tests between the Dice coefficients of different segmentation methods were performed using permutation tests, and all reported p-values were obtained in this way. Validation and metrics are further expanded upon in section 4.3.

Using the training and validation dataset, we compared the performance of different network architectures based on the presence of skip connections, framing connections, and 2D or 3D filters. Furthermore, we compared MU-Net with multi-atlas segmentation and evaluated the impact of mouse age on the accuracy of our segmentation maps. Finally, we tested MU-Net on a large dataset including 1,782 MRI volumes from 817 mice. These volumes were acquired as part of 10 studies on Huntington's disease and included both WT and HT mice with ages ranging from 4 to 60 weeks. For this heterogeneous dataset, which included three segmented regions, we measured average Dice scores of 0.906 (striati), 0.937 (cortex), and 0.978 (brain mask) between MU-Net and manual segmentations.

Table 1 caption: Listed values are the average Dice scores between automatic and manual segmentation in 5-fold CV ± standard deviations of these Dice scores. The ROI mean column refers to the mean Dice coefficient of the cortex, the hippocampi, the ventricles and the striati. SC and FC indicate the presence of skip connections and framing connections. MU-Net results are displayed in the first row.
STEPS refers to STEPS using randomly selected templates; STEPS* refers to STEPS runs randomly selecting mice of the same age only; affine indicates that only affine registration was used, whereas diffeo indicates this was followed by a diffeomorphic registration step; majority voting refers to the selection of the most occurring label after diffeomorphic registration; 3DConv 2DPool: network featuring no in-block skip connections or framing connections, with 3D filtering and 2D pooling in the coronal plane; 2D SLP: 2D network with in-block skip connections and a limited number of parameters; 2D +N3: 2D network trained on data bias-corrected using the N3 algorithm. Boldface characters indicate the best performing network, achieving significantly higher Dice scores than all other networks for that ROI.

[...] in the mouse brain (Dice scores from 0.80 to 0.90 [6]).

The network displaying the highest average Dice scores was, in fact, the simplest one, including no in-block skip connections nor framing connections, and using 2D convolutions. The accuracy of this network was significantly higher than the accuracy of all other 2D networks (p < 0.00003; all hypothesis testing in this work was performed with a paired permutation test, as detailed in section 4.3). Because of its excellent performance and simplicity, this network is our choice for the MU-Net architecture.
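A paired permutation test of the kind used for these comparisons can be sketched as below, by sign-flipping the paired Dice differences; `n_perm`, the seed, and the two-sided mean-difference statistic are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean Dice-score difference.

    Under the null hypothesis the paired differences are exchangeable in
    sign, so each permutation randomly flips the sign of every difference.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    # +1 in numerator and denominator keeps the p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)
```

Because the per-volume Dice scores of two methods are computed on the same volumes, the paired version is the appropriate one for most comparisons in the text.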

The choice between 2D and 3D architectures was the most important factor in increasing performance, resulting in a marked increase in mean Dice scores for both tasks (p < 0.00001) between all 2D networks compared to the 3D ones. We further compared MU-Net with a variant featuring fewer channels per filter (49, 49, 50, 50, from the shallowest to the deepest dense block) to match its number of parameters to that of the simplest 2D network. We registered a slightly (but not significantly, p = 0.077) lower accuracy compared to MU-Net, indicated as 2D SLP in Table 1.

To test whether the increased performance of 2D architectures compared to the 3D implementation depended on the reduced number of parameters or on an excessive loss of information when pooling in the anterior-posterior direction, we trained a network using 3D filters while only pooling in the coronal plane. This network achieved intermediate performance between the 3D and 2D implementations (Table 1), suggesting that both of the above-mentioned aspects were relevant in increasing the algorithm's performance.

We trained MU-Net on MR images without bias correction and on bias-corrected MR images using the N3 method [31]. The validation accuracy achieved with bias correction was indistinguishable from the accuracy of MU-Net trained without bias correction (see Table 1).

July 23, 2020

2.3 Age stratified training sets

We evaluated the performance of MU-Net when restricting the training set to mice of a specific age. Networks trained on data from mice of 12, 16 and 32 weeks achieved higher accuracy, both on their respective validation sets and on the overall ground truth, compared to the network trained on 5-week-old mice (p < 0.00001). As shown in Fig. 5, while all networks trained on one specific age displayed a statistically significant (p < 0.05, unpaired) decrease in mean accuracy when validated on animals of a different age, this difference was highest between the 5-week data and the other datasets.

Limiting the training data to one specific age implies that these networks were trained on only a quarter of the data used to train the networks in section 2.2.

Irrespective of that, these networks still achieved average Dice scores on the mixed-age validation dataset comparable with the accuracy of manual segmentation.

[...] improve on the segmentation based on the quality of the registration itself. We repeated this procedure using both diffeomorphic and affine registration methods, with randomly selected templates restricted to same-age mice. The brain mask segmentation was not evaluated, as the manually drawn mask was used during the diffeomorphic registration procedure.

MU-Net achieved higher accuracy than all STEPS implementations (p < 0.00001). There was a marked qualitative difference between STEPS segmentation and MU-Net (Fig. 2), the latter achieving results visually indistinguishable from manual segmentation. Implementing STEPS segmentation using only templates of the same age led to a small but significant (p < 0.0007) performance improvement over randomly choosing templates of any age. The employment of diffeomorphic registration was the most important factor affecting the performance of STEPS, as displayed in Table 1. A simple majority voting strategy led to significantly lower performance in all ROIs compared to all other label fusion strategies (p < 0.003). Furthermore, we trained MU-Net on the outputs of the implemented STEPS procedures and measured the Dice scores of each network's output against the ground truth (Table 2). As evidenced in Tables 1 and 2 [...]

We trained and evaluated MU-Net on the MRM NeAt dataset, which includes atlases of 10 individual T2*-weighted in vivo brain MR images of 12-14-week-old C57BL/6J mice, each with 37 manually labelled anatomical structures [33]. This same database was selected by Ma et al. [15] to evaluate the STEPS multi-atlas segmentation algorithm on mouse brain MRI. To compare MU-Net with STEPS, we followed the STEPS implementation by Ma et al. [15] as released by the authors.

We used a 5-fold cross-validation scheme for evaluation (8 templates for training and 2 templates for testing). As displayed in Fig. 4, the accuracy of MU-Net was comparable to or better than that of STEPS: while in a majority of regions MU-Net's accuracy was higher than the accuracy of STEPS, this was statistically significant only for the brain mask, external capsule, hypothalamus and brain stem. In the left inferior colliculi, [...]

The MU-Net model trained on the training and validation dataset was tested on a large test set of 1,782 MRI volumes, acquired from 817 mice with ages ranging from 4 to 60 weeks, and including both WT and HT mice. As the 5-fold cross-validation experiment produced five different MU-Net models, the segmentation maps for the test set were obtained by averaging the prediction maps produced by each of the five models.
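The five-model averaging step can be sketched as follows. The array layout and the 0.5 brain-mask threshold are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def ensemble_segment(region_probs, mask_probs, mask_threshold=0.5):
    """Fuse the predictions of the five cross-validation models.

    region_probs: (n_models, n_classes, H, W) softmax outputs
    mask_probs:   (n_models, H, W) sigmoid outputs for the brain mask
    Returns a hard label map and a boolean brain mask.
    """
    labels = np.argmax(np.mean(region_probs, axis=0), axis=0)  # (H, W)
    mask = np.mean(mask_probs, axis=0) > mask_threshold        # (H, W)
    return labels, mask
```

Averaging probability maps before the argmax acts as a simple ensemble, smoothing out disagreements between the five folds.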

Out of the entire test set, segmentation failed completely on two volumes, where no brain mask was detected at all. The remaining 1,780 volumes were successfully segmented with average Dice scores of 0.978 ± 0.012 for the brain mask, 0.906 ± 0.041 for the striati, and 0.937 ± 0.035 for the cortex, distributed as illustrated in Fig. 7. There was no significant difference between the segmentation accuracy of male and female animals (p > 0.1, unpaired). However, there was a significant difference in accuracy between HT and WT mice (p < 0.00001, unpaired) for all ROIs. Dice scores of WT animals were 0.4% higher for the brain mask, 1.7% higher for the cortex, and 1.9% higher for the striati. Applying N3 bias correction to all volumes before segmentation did not result in a significant Dice score difference. A detailed list of the Dice scores for each animal and each ROI is available as a supplementary CSV file.

[...] MR images with relatively long acquisition times, favored in basic research. We observed that the accuracy of atlas-based methods can vary markedly based on the specific use case, depending on the number of outlined ROIs, voxel size, and image quality. The best performance was achieved using advanced atlas-based methods [15] on the high resolution data [33] with a densely labeled atlas of 37 ROIs, and the lowest [...]

[...] [36] in the convolution blocks of U-Net while keeping the number of output channels constant; Han and Ye [30] proposed two variants based on signal processing arguments for the reduction of artifacts in a sparse image reconstruction task. We, however, found that a more complex model did not improve and in fact lowered the accuracy of our results, perhaps given the simplicity of the task. Thus, in agreement with Isensee et al. [37], we found that a 2D approach was preferable to a 3D approach in the presence of anisotropic voxels.
We also found the Dice loss to be sufficient to effectively train our model without the addition of a cross-entropy loss.

Much like the human eye, MU-Net was not significantly affected by the presence of the bias field, and did not benefit from N3 bias correction. Correcting for the bias field might still be beneficial, as the field depends on the specific experimental setup, and N3 bias correction might thus avoid specializing the network to one particular acquisition procedure. For this reason, we release the trained parameters for MU-Net trained on both the non-corrected and the N3-corrected data.

Our results indicate that, to ensure the network generalizes to a wide age range, the distinctive features present before adulthood need to be adequately represented in the training data. This is evidenced by the degraded performance observed when testing networks trained on 5-week-old mice on the volumes acquired from older ones, and vice versa. As mice are typically weaned at 3-4 weeks and attain sexual maturity at 8-12 weeks [38], 5-week-old mice are not adults. In contrast, training solely on male mice did not significantly influence MU-Net's performance on female animals.

An obvious limitation of our approach is its specialization for the specific MRI contrast the algorithm is trained on, in this case T2-weighted MRI. Expanding our approach to a different contrast would require expanding the training set, image translation, or transfer learning. Even considering these limitations, MU-Net successfully generalized to a variety of transgenic mice over an age range wider than that of the training set.

The employment of CNNs for the segmentation of mouse brain MRI provides a number of benefits for preclinical researchers. Beyond allowing for the employment of large datasets in a time-efficient manner, the ability to generalize and abstract from the training data results in more robust and reproducible predictions. We can thus expect these methods to reduce the confounding effect of intra- and inter-rater variability [...]

The test set animals were part of 10 studies scanned at a single age or multiple ages from 4 up to 60 weeks, and included both WT and several HT genotypes: R6/2, Q175, Q175DN, Q111, Q50 and Q20 (Supplementary Table S1), for a total of 1,782 MRI scans. The groups included both males and females. These volumes were acquired as part of ten studies of Huntington's disease, kindly provided by the CHDI 'Cure Huntington's Disease Initiative' foundation.

All mice were housed in groups of up to 4 per cage (single sex) in a temperature (22±1°C) and humidity (30-70%) controlled environment with a normal light-dark cycle (7:00-20:00). All animal experiments were carried out according to the United States National Institutes of Health (NIH) guidelines for the care and use of laboratory animals, and approved by the National Animal Experiment Board.

[...] Gating System, Small Animal Instruments, Inc., Stony Brook, NY. The temperature was maintained at ∼37°C using a Small Animal Instruments feedback water heating system. All acquisitions were performed using a horizontal 11.7 T magnet with a bore size of 160 mm, equipped with a gradient set capable of a maximum gradient strength of [...]

Volumes within each study were manually segmented by an experienced rater, who had received training and passed the qualification tests according to the SOP (Standard Operating Procedure) for volumetric analysis in mice. Different studies were analyzed by different raters. Each training volume was manually segmented by drawing the brain mask and delineating 4 regions of interest: cortex, hippocampi, striati and ventricles.

The brain mask did not include the olfactory bulb or the cerebellum. For the test set, only 3 regions were manually labeled: brain mask, cortex and striati.

[...] Figure 4) in addition to the brain mask [33]. This dataset was downloaded from https://github.com/dancebean/mouse-brain-atlas, where an improved atlas is available (bias correction has been applied, left and right labels have been separated, and a 4th ventricle label has been added). This dataset was used to evaluate the STEPS algorithm by Ma et al. [15] and is used here for the purpose of comparing MU-Net and STEPS on a larger number of ROIs on isotropic resolution MRI. As detailed in [33], T2-weighted MR data with a voxel size of 0.1 mm³, requiring about 2.8 hours of scan time, were acquired with a 3D large flip angle spin echo sequence using a super-conducting 9.4T/210 mm horizontal bore magnet (Magnex) controlled by an ADVANCE console (Bruker) and equipped with an actively shielded 11.6 cm gradient set (Bruker, Billerica, MA).

To assess the overlap between the ground truth and the predicted segmentation masks, we used the Dice coefficient [29]. The Dice coefficient is defined as two times the size of the intersection over the sum of the sizes of the two regions:

D(A, B) = 2|A ∩ B| / (|A| + |B|)

This coefficient ranges from 0, meaning no overlap, to 1, indicating a complete overlap between the two regions. A visual rendition of the intuition behind the Dice score is provided in the supplementary Figure S1.
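A direct implementation of this definition for binary masks might look as follows; the value returned for two empty masks is a convention we assume here, not something specified in the text.

```python
import numpy as np

def dice(a, b):
    """Dice coefficient D(A, B) = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks overlap perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom
```

For example, masks [1, 1, 0, 0] and [1, 0, 1, 0] share one voxel out of a size sum of four, giving a Dice coefficient of 0.5.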

We segmented mouse brain MRI volumes into five regions (cortex, hippocampi, ventricles, striati and background), simultaneously generating a binary brain mask segmentation via a multi-task learning approach. We refer to the first task as region segmentation and to the latter as skull-stripping. We implemented and compared different variants of the U-Net architecture.

[...] dense layer in the encoding path with the respective un-pooled feature map of the same size before feeding it as input to the decoding dense block.

The output of the last decoding layer acts as the input of two different classification layers, which share the same feature representation up to this point: a 1×1 single-channel convolution with a sigmoid activation function, and a 1×1 five-channel convolution followed by a softmax activation function, for the skull-stripping task and the region classification task, respectively.

[...] For these reasons, we compared 2D and 3D implementations of our network, using 5×5×5 filters and 2×2×2 max-pooling layers, replacing the filters and pooling layers [...]

Recent literature suggests that Dice-based loss functions [19,28,34] constitute an improvement over cross-entropy losses for the segmentation of medical images [42]. We optimized a joint loss function L, the sum of two Dice loss functions corresponding to the skull-stripping task (L_SS) and the region classification task (L_RS). Let p(i) be the predicted probability of voxel i belonging to the brain mask, and g(i) the ground truth for voxel i (g(i) = 1 if the voxel is in the brain mask). Further, let p_l(i) and g_l(i) be the same quantities for label l (l = 1, 2, 3, 4, 5, referring to cortex, hippocampi, ventricles, striati and background). The skull-stripping term takes the usual Dice-loss form

L_SS = 1 − (2 Σ_i p(i) g(i)) / (Σ_i p(i) + Σ_i g(i)),

with L_RS defined analogously over the five labels l, and the joint loss is L = L_SS + L_RS. In this formulation each task (skull-stripping and region segmentation) is defined independently from the other.

The networks were implemented using the PyTorch framework and trained with stochastic gradient descent using the Adam optimizer [27]. [...] normalized to 0 mean and unit variance.
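A minimal PyTorch sketch of this joint Dice loss follows. The small smoothing constant `eps` and the plain average over the five label channels are our assumptions for numerical stability and aggregation; the paper does not spell out these details.

```python
import torch

def soft_dice_loss(p, g, eps=1e-6):
    """1 - soft Dice between a predicted probability map p and ground truth g."""
    num = 2.0 * (p * g).sum()
    den = p.sum() + g.sum() + eps  # eps avoids division by zero (assumption)
    return 1.0 - num / den

def mu_net_loss(mask_p, mask_g, region_p, region_g):
    """Joint loss L = L_SS + L_RS.

    mask_p/mask_g:     (N, 1, H, W) brain-mask probabilities and labels
    region_p/region_g: (N, 5, H, W) region probabilities and one-hot labels
    """
    l_ss = soft_dice_loss(mask_p, mask_g)
    # Average the per-label Dice losses over the five region channels:
    l_rs = torch.stack([soft_dice_loss(region_p[:, l], region_g[:, l])
                        for l in range(region_p.shape[1])]).mean()
    return l_ss + l_rs
```

A perfect prediction drives both terms toward zero, while each task contributes its own gradient independently of the other.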

Networks used for the comparison on the data released by Ma et al. [15] were trained for up to 24 hours on one Nvidia Volta V100 each, provided by CSC - IT Center for Science, Finland.

[...] Before registration, each volume underwent non-parametric N3 bias field correction [31] implemented within the ANTs toolset [43]. Taking each volume as reference, all other volumes were then registered with an affine transformation using FSL FLIRT [44] with a correlation ratio cost function, and then nonlinearly registered via FSL FNIRT [45,46] with the aid of the manually drawn brain mask. Label fusion was achieved with the STEPS algorithm distributed in the NiftySeg package [16,32].

STEPS depends on the number of templates employed and the standard deviation of its Gaussian kernel. We performed a grid search to select the optimal parameters, randomly selecting 10 volumes and labeling them using STEPS. We sampled the standard deviation of the Gaussian kernel between 0.5 and 6 with a stride of 0.5, and the number of templates ranged between 1 and 20 randomly selected volumes. This same process was performed both using diffeomorphic registration and using affine registration only (supplementary Figures 2 and 3), selecting 16 templates and a kernel standard deviation of 1.5 for the diffeomorphic case, and 18 templates with a kernel standard deviation of 2.5 for the affinely registered volumes. Exploring both grids required 287 hours in total.
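The search space described above can be enumerated as in the sketch below. Scoring each setting would require actually running the NiftySeg STEPS label fusion on the 10 validation volumes, which is not reproduced here; `select_best` therefore operates on precomputed mean Dice scores and is purely illustrative.

```python
from itertools import product

# Grid described in the text: kernel standard deviations 0.5-6.0 in steps
# of 0.5, and 1-20 randomly selected template volumes.
kernel_stds = [0.5 * k for k in range(1, 13)]   # 0.5, 1.0, ..., 6.0
n_templates = list(range(1, 21))                # 1, 2, ..., 20
grid = list(product(kernel_stds, n_templates))  # 12 * 20 = 240 settings

def select_best(mean_dice):
    """mean_dice: dict mapping (kernel_std, n_templates) -> mean Dice over
    the validation volumes; returns the best-scoring setting."""
    return max(mean_dice, key=mean_dice.get)
```

The 240 settings per registration type explain why exploring both grids was computationally expensive.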

Each volume was then segmented using these parameters, randomly selecting the appropriate number of mice as templates for the STEPS algorithm, as emerged from the parameter grid search outlined above. We repeated this procedure randomly selecting the same number of templates from mice of the same age only. The mice randomly selected as reference atlases were drawn from the training set associated with each volume, according to the same 5-fold cross-validation scheme used to train the CNNs as outlined in section 4.3.

When evaluating STEPS on the MRM NeAt dataset, we used the scripts provided by Ma et al. [15] at https://github.com/dancebean/multi-atlas-segmentation, as this implementation is optimized for this dataset.

The only post-processing steps applied to the segmentation maps were the filling of holes in the resulting 3D volume, the selection of the largest connected component as the brain mask for the skull-stripping task, and the assignment of all voxels predicted as non-brain to the background class.

The training and validation dataset is the property of Charles River Discovery Services, and the test dataset is the property of the CHDI 'Cure Huntington's Disease Initiative' foundation. The MRM NeAt dataset is freely available at https://github.com/dancebean/mouse-brain-atlas. All Dice scores between MU-Net and manual segmentations are available as supplementary files to this manuscript.
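These three post-processing steps can be sketched with `scipy.ndimage` as follows. The function is a generic illustration (shown here on a 2D array for brevity, though it works unchanged on the 3D volumes the paper uses); it is not the authors' exact code.

```python
import numpy as np
from scipy import ndimage

def postprocess(mask, labels):
    """Fill holes, keep only the largest connected component as the brain
    mask, and set voxels predicted as non-brain to background (label 0)."""
    mask = ndimage.binary_fill_holes(mask)
    components, n_components = ndimage.label(mask)
    if n_components > 1:
        # Keep the component with the most voxels.
        sizes = ndimage.sum(mask, components, range(1, n_components + 1))
        mask = components == (np.argmax(sizes) + 1)
    labels = np.where(mask, labels, 0)  # non-brain voxels -> background
    return mask, labels
```

Discarding small disconnected components removes spurious detections outside the skull, while hole filling repairs isolated false negatives inside the brain.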