Evaluation on the secret-test dataset and the network's output
Additional secret-test dataset
We will keep a part of the provided datasets secret. This part will be used to provide an objective evaluation of the models’ performance. The evaluation will be performed by running the evaluation.py script on the model's predictions for the secret-test dataset. The evaluation.py script generates a final quantitative score that will be used to rank all of the submitted projects. However, this score will not influence the decision of whether the project is passed or failed (as long as the model produces something reasonable).
The evaluation on the secret-test dataset will be performed by us (including the implementation of the evaluation metrics in evaluation.py). The students are only required to ensure that their network's output is in the correct format. It is not necessary to perform a numerical evaluation yourselves, as the final scores will be generated on the secret-test dataset. Of course, it is recommended to use the evaluation script to check that the models work correctly.
The evaluation script can be downloaded below. Since this implementation has not been used before, small bugs and unintended behaviour are possible. Although all of the functions were tested extensively to ensure correctness and robustness, we cannot guarantee that there are no errors. If you notice unexpected or incorrect results, please report them to a tutor. If it is a major problem, it will be fixed and an updated version will be issued. Otherwise, the implementation may be left as-is, since all students are going to be ranked using the same script.
Quantitative metrics and the network's output
We will use a set of quantitative evaluation metrics to build the final ranking on the secret-test dataset. All metrics are implemented in the evaluation.py file.
DET:
[network output-> a set of .csv files (one for each image) with the following columns: 'filename', 'xmin', 'xmax', 'ymin', 'ymax', and 'confidence']
Metric: Average precision score (based on the interpolated precision-recall curve).
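For orientation, here is a minimal sketch of how an average precision score based on an interpolated precision-recall curve could be computed. The rule for marking a detection as a true positive (typically an IoU threshold against a not-yet-matched ground-truth box) is left outside the sketch and may differ from what evaluation.py does:

    import numpy as np

    def interpolated_average_precision(confidences, is_true_positive, num_ground_truth):
        # Sketch of AP as the area under the interpolated precision-recall curve.
        # confidences      -- one score per detection (pooled over all images)
        # is_true_positive -- one boolean per detection (matched to a ground-truth box or not)
        # num_ground_truth -- total number of ground-truth boxes
        if len(confidences) == 0 or num_ground_truth == 0:
            return 0.0
        order = np.argsort(-np.asarray(confidences))        # sort by confidence, descending
        tp = np.asarray(is_true_positive, dtype=float)[order]
        fp = 1.0 - tp

        cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
        recall = cum_tp / num_ground_truth
        precision = cum_tp / (cum_tp + cum_fp)

        # Interpolation: at every point, use the best precision reached at an
        # equal or higher recall.
        precision = np.maximum.accumulate(precision[::-1])[::-1]

        # Area under the step-wise precision-recall curve.
        recall = np.concatenate(([0.0], recall))
        return float(np.sum(np.diff(recall) * precision))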
CLA:
[network output-> a single .csv file with 2 columns: 'filename' and 'class_id']
Metric: F1-Score(predicted_classes_list, reference_classes_list)
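A minimal sketch of this evaluation, with placeholder file names and assuming macro averaging (the averaging mode used in evaluation.py is not specified here):

    import pandas as pd
    from sklearn.metrics import f1_score

    # File names here are placeholders; evaluation.py may read and merge the data differently.
    pred = pd.read_csv("predictions.csv")   # columns: 'filename', 'class_id'
    ref = pd.read_csv("reference.csv")      # same columns, ground-truth labels

    # Align predictions with references by filename, then compare class ids.
    merged = ref.merge(pred, on="filename", suffixes=("_ref", "_pred"))
    score = f1_score(merged["class_id_ref"], merged["class_id_pred"], average="macro")
    print(f"F1 score (macro averaging assumed): {score:.4f}")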
SEG:
[network output-> RGB mask]
Metric: Class-wise IoU for masks + weighted mean as a final score.
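A minimal sketch of a class-wise IoU with a weighted mean, assuming the RGB mask has already been mapped to class ids and that classes are weighted by their reference pixel count (the actual weighting used in evaluation.py may differ):

    import numpy as np

    def classwise_iou(pred_mask, ref_mask, class_ids):
        # Per-class IoU between two label masks. An RGB mask would first have to
        # be converted to class ids using the project's colour table.
        ious, weights = {}, {}
        for c in class_ids:
            pred_c, ref_c = (pred_mask == c), (ref_mask == c)
            union = np.logical_or(pred_c, ref_c).sum()
            ious[c] = np.logical_and(pred_c, ref_c).sum() / union if union else np.nan
            weights[c] = int(ref_c.sum())
        return ious, weights

    def weighted_mean_iou(ious, weights):
        # Weighted mean over classes; weighting by reference pixel count is an
        # assumption and may differ from the official script.
        valid = [c for c in ious if not np.isnan(ious[c]) and weights[c] > 0]
        total = sum(weights[c] for c in valid)
        return sum(ious[c] * weights[c] for c in valid) / total if total else 0.0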
COL:
[network output-> RGB image]
Metric: average DSSIM(img1, img2) = 1 - SSIM(img1, img2)
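A minimal sketch of the per-image DSSIM, assuming 8-bit RGB images and scikit-image's default SSIM settings (the exact SSIM parameters used in evaluation.py may differ):

    from skimage.metrics import structural_similarity

    def dssim(img1, img2):
        # DSSIM = 1 - SSIM between two images; assumes 8-bit RGB arrays and
        # scikit-image >= 0.19 (channel_axis argument).
        ssim = structural_similarity(img1, img2, channel_axis=-1, data_range=255)
        return 1.0 - ssim

    # The final score would then be dssim(...) averaged over all test images.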