72 Hours Part2: Fish Detection & Classification Solution

Feb 16, 2021

This is the second instalment of our two-part blog on our approach to the Nvidia GPU Hackathon with NOAA Fisheries. Part 1 gave a general introduction and described the data used for the hackathon; this part focuses on the algorithms used and the results obtained.

Object detection

Our partner in the hackathon, AI.Fish, were in charge of creating the object detector. AI.Fish decided to use a Faster-RCNN model pretrained on the COCO dataset.

AI.Fish used 15,000 bounding box annotations across 1,733 images to fine-tune the model and create a fish detection system. The object detector was then run on new images from the GroundFish dataset, and image chips were created by cropping the area within each inferred bounding box (examples are shown below). This process produced roughly 14,000 image chips, often several from the same image, which the team at Lynker Analytics used in the active learning process.

Several images used in the building of the detection and classification system
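The cropping step can be sketched as follows. This is only an illustration, not AI.Fish's pipeline: the image is a plain list of pixel rows, boxes are assumed to be (x_min, y_min, x_max, y_max) in pixels, and the `crop_chips` helper is a hypothetical name.

```python
def crop_chips(image, boxes):
    """Return one cropped chip per bounding box.

    image: list of rows, each row a list of pixel values.
    boxes: list of (x_min, y_min, x_max, y_max) tuples in pixels
           (assumed format; x_max and y_max are exclusive).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    chips = []
    for x_min, y_min, x_max, y_max in boxes:
        # Clamp the box to the image bounds.
        x0, y0 = max(0, int(x_min)), max(0, int(y_min))
        x1, y1 = min(width, int(x_max)), min(height, int(y_max))
        if x1 <= x0 or y1 <= y0:
            continue  # skip empty or fully out-of-frame boxes
        chips.append([row[x0:x1] for row in image[y0:y1]])
    return chips


# Toy 4x6 "image" with two detections, so one image yields two chips.
image = [[10 * r + c for c in range(6)] for r in range(4)]
boxes = [(1, 0, 4, 2), (3, 2, 6, 4)]
chips = crop_chips(image, boxes)
```

Because each detection becomes its own chip, a single frame with several fish contributes several chips, which is how a modest number of images can produce a much larger set of chips.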

Active Learning — Classification

The active learning process began by training a weak Inception V3 model on 200 images from our 11 classes. This weak model was then used to infer on the 14,000 image chips. Images were then chosen, using maximum entropy plus some random sampling, to be shown to a human reviewer through the active learning tool. An example of the active learning tool is shown below.

In this example, the active learning tool shows an image of a rockfish. Below the image are the model’s predicted class and the confidence of the prediction. On the right is the class menu, from which we could select the correct class (also rockfish in this case).
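The selection step (maximum entropy plus some random sampling) can be sketched roughly as below. This is a minimal sketch under assumptions: the function names, the split between entropy-ranked and random picks, and the toy probabilities are all hypothetical and not taken from our actual tool.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_review(predictions, n_entropy, n_random, seed=0):
    """Pick chip ids for human review.

    predictions: dict mapping chip id -> list of class probabilities.
    Returns the n_entropy most uncertain chips, followed by n_random
    chips drawn uniformly from the remainder.
    """
    ranked = sorted(predictions, key=lambda cid: entropy(predictions[cid]),
                    reverse=True)
    uncertain = ranked[:n_entropy]
    remainder = ranked[n_entropy:]
    rng = random.Random(seed)
    extra = rng.sample(remainder, min(n_random, len(remainder)))
    return uncertain + extra


# Toy softmax outputs for four chips over three classes (made-up values).
predictions = {
    "chip_a": [0.98, 0.01, 0.01],  # confident, low entropy
    "chip_b": [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    "chip_c": [0.70, 0.20, 0.10],
    "chip_d": [0.50, 0.50, 0.00],
}
selected = select_for_review(predictions, n_entropy=2, n_random=1)
```

The entropy-ranked picks focus the reviewer on chips the model is least sure about, while the random picks keep some coverage of chips the model is confidently (and possibly wrongly) classifying.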

The unclassified_remove class was reserved for species that were not included in our class list at the time of the hackathon.

If an image contained more than one sea animal, it was discarded to avoid training the model on image chips with multiple classes. If an error was made, the "undo previous" option could be used to drop the last correction from the dataset.

The active learning stage was an iterative process and was run several times over the hackathon, with the highest entropy samples shown in the active learning tool being updated each time the inception model was retrained.
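The iterative loop can be sketched as follows. This is an illustration under stated assumptions rather than our actual code: `run_active_learning`, `toy_train`, the chip ids and the toy scores are hypothetical, and the "model" is a stand-in that does not really learn.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def run_active_learning(labelled, pool, train, ask_human, rounds, batch_size):
    """Repeat: retrain, score the pool, review the most uncertain chips.

    labelled: dict chip id -> class label (the seed set).
    pool: list of unlabelled chip ids.
    train: callable(labelled) returning predict(chip) -> class probabilities.
    ask_human: callable(chip) -> label, or None to discard the chip.
    """
    labelled = dict(labelled)
    pool = list(pool)
    for _ in range(rounds):
        if not pool:
            break
        predict = train(labelled)  # retrain on everything labelled so far
        ranked = sorted(pool, key=lambda chip: entropy(predict(chip)),
                        reverse=True)
        for chip in ranked[:batch_size]:
            label = ask_human(chip)
            pool.remove(chip)
            if label is not None:  # None means the reviewer discarded it
                labelled[chip] = label
    return labelled, pool


# Toy stand-ins: a fixed score per chip and a lookup table as the "reviewer".
scores = {"chip_1": 0.10, "chip_2": 0.45, "chip_3": 0.50,
          "chip_4": 0.80, "chip_5": 0.60}
truth = {"chip_1": "flatfish", "chip_2": "flatfish", "chip_3": "rockfish",
         "chip_4": "rockfish", "chip_5": None}  # chip_5: multiple animals


def toy_train(labelled):
    # A real model would change as `labelled` grows; this one does not.
    return lambda chip: [scores[chip], 1 - scores[chip]]


labelled, remaining = run_active_learning(
    {}, list(scores), toy_train, truth.get, rounds=2, batch_size=2)
```

In this toy run the two rounds review the four most uncertain chips, one of which is discarded, and the most confidently predicted chip is never shown to the reviewer.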

Final Image Classification

By the end of the active learning process we had a sizeable dataset of around 5,000 images with the highest information gain. An EfficientNet B4 model was then trained on this data.

Training an EfficientNet model was not essential to complete the project, but EfficientNet is a higher-performing model than Inception V3, so it was a relatively easy step to improve our final classification accuracy.

Results

The results presented below are from the active learning system and the classification system developed by the Lynker Analytics team. We will discuss four measures for understanding the performance of the systems: accuracy, precision, recall and a confusion matrix to understand which classes performed well and which classes did not.

Before we get into the results here is a quick refresher on accuracy, precision & recall.

What is Accuracy?

The simplest explanation of accuracy would be the fraction of predictions our model got right.

What is Precision?

Precision is the fraction of the predictions for a class that were actually correct.

If our model predicted 100 flatfish but only 80 of those were actually flatfish, our precision would be 80%.

(the 80 the model got correct)/(the 80 the model got correct + the 20 the model wrongly predicted as flatfish)

What is Recall?

Recall is the fraction of the actual members of a class that the model predicted correctly.

If there were 100 round fish but our model correctly classified only 50 of them, our recall would be 50%.

(50 it predicted correctly)/(the 50 it predicted correctly + the 50 it missed)
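The same arithmetic can be written as a short sketch. The function names and the tiny label lists are hypothetical; the 80/20 and 50/50 counts are the flatfish and round fish examples above.

```python
def precision(true_positives, false_positives):
    """Fraction of predictions for a class that were correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of the actual members of a class that were found."""
    return true_positives / (true_positives + false_negatives)


def accuracy(y_true, y_pred):
    """Fraction of all predictions that were correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def counts_for_class(y_true, y_pred, cls):
    """Return (true positives, false positives, false negatives) for cls."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, fp, fn


# The worked examples from the text.
flatfish_precision = precision(true_positives=80, false_positives=20)
round_fish_recall = recall(true_positives=50, false_negatives=50)

# A made-up four-image example to show the counting step.
y_true = ["flatfish", "flatfish", "rockfish", "skate"]
y_pred = ["flatfish", "rockfish", "rockfish", "flatfish"]
tp, fp, fn = counts_for_class(y_true, y_pred, "flatfish")
```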

Active Learning System Statistics

Evaluation metrics for the active learning system

The active learning model, which used Inception V3, reached an accuracy of 78%, with the weighted averages for precision and recall also at 78%.

For the Skate class, recall was low but precision was high: we found only 56% of skates (recall), but 83% of everything the model classified as a skate was correct (precision).

For the Urchin class we had a precision of 100%, meaning there were no false positives for this class, i.e. everything the model predicted as an Urchin was correct. This suggests that the class is easily differentiated from the other classes, which is true for the most part, as it is very easy to tell the difference between an Urchin and a fish.

However, the recall for the Urchin class was 92%, meaning 8% of actual Urchins were misclassified. A confusion matrix lets us look further into this issue.

Confusion matrix for active learning system

The confusion matrix for the active learning system helps us understand that the model has performed excellently when it comes to certain species which are easily identifiable and have distinct features.

Urchins and Sponges are the best performing classes: the model saw a total of 50 urchins and misclassified only 4 of them (as flatfish, rockfish and sponge), and saw a total of 39 sponges and misclassified only 10 (largely as the invertebrate class).

Clipped object detection inferences of Starfish (top) and sponge (bottom)

The model struggled with Shortspine Thornyheads, which it often mistook for rockfish. The cause was fairly clear, as the two species have similar characteristics.

The model saw 79 Shortspine Thornyhead images, classified only 40 of them correctly and misclassified 37 as rockfish.

Clipped object detection inferences of Rockfish (top) and Shortspine Thornyhead (bottom)
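To show how these counts relate to the metrics, here is a small sketch that builds one confusion matrix row from (true, predicted) pairs and reads recall off it. The helper names are hypothetical. The 40 and 37 counts come from the text; the remaining 2 of the 79 images are not broken down in the text, so they are lumped into a placeholder "other" label.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count occurrences of each (true_class, predicted_class) pair."""
    return Counter(pairs)


def recall_for(counts, cls):
    """Correct predictions for cls divided by all true members of cls."""
    total = sum(n for (true, _), n in counts.items() if true == cls)
    return counts[(cls, cls)] / total if total else 0.0


# Shortspine Thornyhead row of the active learning confusion matrix.
pairs = (
    [("thornyhead", "thornyhead")] * 40
    + [("thornyhead", "rockfish")] * 37
    + [("thornyhead", "other")] * 2  # placeholder: class not stated in text
)
counts = confusion_counts(pairs)
thornyhead_recall = recall_for(counts, "thornyhead")        # 40 / 79
share_as_rockfish = counts[("thornyhead", "rockfish")] / 79  # 37 / 79
```

Read this way, roughly half of the Shortspine Thornyheads were found, and almost all of the misses went to a single class, which points to a specific confusion with rockfish rather than general noise.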

Classification System Statistics

Evaluation metrics for classification system

The final classification model, which used EfficientNet-B4, reached an accuracy of 86%, higher than the active learning system which used Inception, highlighting the benefit of using a more recent state-of-the-art model. On the same dataset, the EfficientNet architecture gave an improvement of 8 percentage points.

The precision and recall per class can be seen in the image above. 91% of Flatfish predictions were correct (precision), and the model correctly identified 95% of all Flatfish images (recall). The Shortspine Thornyhead class also had a precision of 91%, but the model correctly identified only 49% of all Shortspine Thornyhead images.

The weighted average precision & recall rose from 78% on the Inception model to 91% and 86% respectively with EfficientNet.

Confusion matrix for classification system on holdout dataset

The confusion matrix helps us understand that the best performing class is the flatfish because the model saw a total of 123 flatfish and 117 of them were correctly classified by the model.

Clipped object detection inferences of flatfish

This is a good classification result for this species because, as seen in the images above, flatfish have distinct features and are easily identifiable.

The worst performing class was the Shortspine Thornyhead: of the 41 images seen by the model, only 20 were correctly identified as Shortspine Thornyhead and 21 were identified as rockfish.
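As a quick check, the holdout counts quoted above reproduce the per-class recall figures, and the same counts illustrate how a support-weighted average works. This is a sketch with hypothetical function names; it averages only the two classes whose counts appear in the text, so it is not expected to match the 86% figure, which covers all 11 classes.

```python
def recall(correct, total):
    """Fraction of a class's images that were classified correctly."""
    return correct / total


def weighted_average(values, supports):
    """Average of per-class values, weighted by images per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


flatfish_recall = recall(117, 123)    # about 95%
thornyhead_recall = recall(20, 41)    # about 49%

# Weighted over just these two classes (123 and 41 images respectively).
two_class_average = weighted_average(
    [flatfish_recall, thornyhead_recall], [123, 41])
```

Because the average is weighted by the number of images per class, a large, well-classified class such as flatfish pulls the overall figure up far more than a small, poorly classified class pulls it down.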

Conclusion

Overall, the experience was amazing for both me and the team. From collaborating with new team members across the globe and dealing with various time zone issues to working on a large project in such a short timeframe, it was undoubtedly a new type of challenge.

Through both the hackathon and the meet-up presentation that followed, I was able to develop my technical knowledge and my confidence in presenting and communicating to a large audience.

The results obtained are impressive bearing in mind the time scale and the lack of human-annotated data. The use of an active learning system helped build a definitive dataset with annotations over a very short timeframe, and the use of state-of-the-art model architectures pushed performance on this data to its current limits.

Finally, I would like to thank Lynker Analytics, NOAA Fisheries and Nvidia for giving us the opportunity to partake in such an interesting challenge.