The Learn2Reg challenge has an automatic evaluation system for validation scans running on grand-challenge.org. You can submit your deformation fields as a zip file on the Create Challenge Submission page, and results for each task will be published on the validation leaderboard. Note that this leaderboard does not reflect the final ranking: the test scans are different, and ranks will be computed based on significance, weighted scores, etc. Docker submissions must be sent as download links to email@example.com. Test set deformation fields can also be sent as download links by email or submitted to the test leaderboard (note that no results will be published before the challenge deadlines).
Submissions must be uploaded as a zip file containing displacement fields (displacements only; the identity grid is added during evaluation) for all validation pairs of all tasks. If you participate in only a subset of the tasks, submit deformation fields of zeros for the remaining tasks. You can find the validation pairs for each task as CSV files on the Datasets page. The convention used for displacement fields follows scipy's map_coordinates() function, which expects displacement fields in the format [[x, y, z], X, Y, Z], where x, y, z are the voxel displacements and X, Y, Z are the image dimensions. The evaluation script expects .npz files in half-precision format ('float16') with shapes 3x128x128x144 for task 1 (half resolution), 3x96x96x104 for task 2 (half resolution), 3x96x80x128 for task 3 (half resolution) and 3x64x64x64 for task 4 (full resolution). The file structure of your submission should look as follows:
The first four digits represent the case id of the fixed image (as specified in the corresponding pairs_val.csv) with leading zeros; the second four digits represent the case id of the moving image. For the paired registration tasks, the fixed and moving images are defined as MR and US (task 1) and as exhale and inhale scan (task 2), respectively. Note that in conventional lung registration tasks the exhale scan is registered to the inhale scan. In this dataset, however, the field of view of the exhale scan is partially cropped, which leads to missing correspondences in the inhale scan. We further provide a Python script to create a submission zip file from a folder of uncompressed, full-precision (float32) and full-resolution (image resolution) deformation fields (same file structure as above): create_submission.py (zero deformation fields, output: submission.zip). If you have any problems with your submissions or find errors in the evaluation code (see below), please contact Adrian Dalca, Alessa Hering, Lasse Hansen and Mattias Heinrich at firstname.lastname@example.org.
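The displacement convention and storage format described above can be illustrated with a minimal sketch. The function names `warp_moving` and `to_submission_array` are hypothetical and not part of the challenge code. The downsampling step assumes that "half resolution" means resampling each spatial axis by a factor of 0.5, and it leaves the displacement values unscaled; check create_submission.py for the exact behaviour before relying on it.

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom


def warp_moving(moving, disp):
    """Warp `moving` (X, Y, Z) with a voxel displacement field `disp` (3, X, Y, Z)."""
    # Identity grid: identity[i] holds the voxel index along axis i.
    identity = np.meshgrid(*[np.arange(s) for s in moving.shape], indexing="ij")
    # Sampling coordinates = identity grid + displacements (in voxels).
    coords = np.stack(identity).astype(np.float32) + disp
    return map_coordinates(moving, coords, order=1)


def to_submission_array(disp_full):
    """Assumed packing: halve each spatial axis, then cast to float16."""
    half = np.stack([zoom(disp_full[i], 0.5, order=2) for i in range(3)])
    return half.astype(np.float16)


# Hypothetical usage (the file name below is a placeholder, not the required one):
# packed = to_submission_array(disp_full)
# np.savez_compressed("example_disp.npz", packed)
```

With a zero displacement field, `warp_moving` returns the moving image unchanged, which is a quick sanity check for the axis ordering.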
Note for PyTorch users: When using PyTorch as your deep learning framework, you will most likely transform your images with the grid_sample() routine. Please be aware that this function uses a different convention than ours: it expects displacement fields in the format [X, Y, Z, [z, y, x]] and coordinates normalized to the range -1 to 1. Before submitting, you should therefore convert your displacement fields to match our convention (see above).
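One possible conversion is sketched below in NumPy, operating on a field that has already been moved off the GPU (for example with `.detach().cpu().numpy()`). The function name `grid_sample_disp_to_voxel` is hypothetical. The sketch assumes the input is the normalized displacement (sampling grid minus identity grid) for a single volume, and that the scale factor must match the `align_corners` setting used when calling grid_sample().

```python
import numpy as np


def grid_sample_disp_to_voxel(disp_norm, align_corners=True):
    """Convert a normalized displacement of shape (X, Y, Z, 3) to voxel units, shape (3, X, Y, Z).

    The last axis of `disp_norm` follows grid_sample ordering: component 0 moves
    along the LAST spatial axis (Z), component 2 along the FIRST spatial axis (X).
    """
    disp_norm = np.asarray(disp_norm, dtype=np.float32)
    size_x, size_y, size_z = disp_norm.shape[:3]
    # Sizes listed in the same order as the last axis of disp_norm.
    sizes = np.array([size_z, size_y, size_x], dtype=np.float32)
    # Normalized range [-1, 1] spans (size - 1) voxels if align_corners=True,
    # and size voxels if align_corners=False.
    scale = (sizes - 1) / 2 if align_corners else sizes / 2
    disp_vox = disp_norm * scale
    # Reverse the component order and move the component axis to the front.
    return np.ascontiguousarray(np.moveaxis(disp_vox[..., ::-1], -1, 0))
```

For a PyTorch tensor of shape (N, X, Y, Z, 3), this would be applied to each batch element separately.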
Metrics and Evaluation
Since registration is an ill-posed problem, the following metrics will be used to determine per-case ranks between all participants:
- TRE: target registration error of landmarks (Tasks 1, 2)
- DSC: Dice similarity coefficient of segmentations (Tasks 3, 4)
- DSC30: robustness score (30% lowest DSC of all cases) (Tasks 3, 4)
- HD95: 95th percentile of Hausdorff distance of segmentations (Tasks 3, 4)
- SDlogJ: standard deviation of log Jacobian determinant of the deformation field (Tasks 1, 2, 3, 4)
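As an illustration of two of the metrics listed above, the following is a minimal sketch of DSC and SDlogJ. It is not the routine used in evaluation.py: the function names are hypothetical, the Jacobian is estimated with simple finite differences, and clipping non-positive determinants to a small `eps` before taking the logarithm is an assumption made here to keep the result finite.

```python
import numpy as np


def dice(fixed_seg, warped_seg, label):
    """Dice similarity coefficient for one label of two label maps."""
    a = np.asarray(fixed_seg) == label
    b = np.asarray(warped_seg) == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # label absent in both maps
    return 2.0 * np.logical_and(a, b).sum() / denom


def sd_log_jacobian(disp, eps=1e-9):
    """Standard deviation of log det Jacobian for `disp` of shape (3, X, Y, Z), in voxels."""
    disp = np.asarray(disp, dtype=np.float64)
    # grads[i, j] = derivative of displacement component i along axis j.
    grads = np.stack([np.stack(np.gradient(disp[i]), axis=0) for i in range(3)], axis=0)
    # Jacobian of the full transformation: identity + displacement gradient.
    jac = grads + np.eye(3).reshape(3, 3, 1, 1, 1)
    det = np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))
    # Assumption: clip folded voxels (det <= 0) so the logarithm stays finite.
    return float(np.log(np.clip(det, eps, None)).std())
```

A zero displacement field gives a determinant of 1 everywhere and therefore an SDlogJ of 0.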
DSC measures accuracy; HD95 measures reliability; outliers are penalised with the robustness score (DSC30: 30% of lowest mean DSC). The smoothness of transformations (SD of log Jacobian determinant) is important in registration; see the references of Kabus and Leow. For the final evaluation on the test sets, all metrics except robustness (DSC30) use the mean rank per case (ranks are normalised to between 0.1 and 1, higher being better). For multi-label tasks, the ranks are computed per structure and later averaged. As done in the Medical Segmentation Decathlon, we will employ "significant ranks" (http://medicaldecathlon.com/files/MSD-Ranking-scheme.pdf). Across all metrics, an overall score is aggregated using the geometric mean. This encourages consistency across criteria. Missing results will be awarded the lowest rank (potentially shared and averaged across teams). For further insight into the metrics and evaluation routines, we provide the evaluation script that runs on the automatic evaluation system: evaluation.py
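To see why a geometric mean rewards consistency, consider this small sketch. The function name `overall_score` and the example rank values are made up for illustration; the actual aggregation, including the significance-based ranking, is defined by the organisers' evaluation code.

```python
import numpy as np


def overall_score(metric_ranks):
    """Geometric mean of normalised per-metric ranks (each assumed to lie in [0.1, 1])."""
    ranks = np.asarray(metric_ranks, dtype=np.float64)
    return float(np.exp(np.mean(np.log(ranks))))


# Two hypothetical teams with the same arithmetic mean rank (0.55):
consistent = overall_score([0.55, 0.55])  # equally good on both metrics
uneven = overall_score([0.1, 1.0])        # best on one metric, worst on the other
```

Here the consistent team keeps a score of 0.55, while the uneven team drops to about 0.32 (the square root of 0.1), even though both have the same arithmetic mean.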
- AD Leow, et al.: "Statistical properties of Jacobian maps and the realization of unbiased large-deformation nonlinear image registration" TMI 2007
- S Kabus, et al.: "Evaluation of 4D-CT Lung Registration" MICCAI 2009