GastroView - angiodysplasia detection and localization
September, 2017 [authors: Adam Brzeski, Jan Cychnerski, CTA.ai, Gdańsk University of Technology]
The angiodysplasia detection algorithm in the GastroView system uses a convolutional neural network as its base classification model. The network is trained on image patches rather than entire images so that it can operate in sliding-window mode at test time. The outputs are then aggregated and post-processed to provide detection (full-frame classification) and localization outputs.

Base classification model outline

  • Google MobileNet architecture
  • 6-fold cross validation
  • Trained on image patches of relative sizes 0.1, 0.2, 0.33 and 0.5

The classification model uses the MobileNet network architecture, which offers a good trade-off between accuracy and computational cost [1]. To address the relatively small training set, the base classification model is trained on image patches cropped from the original images, which significantly increases the number of training samples (at the cost of losing some context information, however). Four rectangular patch sizes were used, with the following ratios relative to image size: 0.1, 0.2, 0.33 and 0.5. During training, several random patches are selected from each input image and added to the input batch. Batches are also balanced to a positive-to-negative ratio of 1:2.
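The patch sampling and batch balancing described above can be sketched as follows. This is a minimal illustration, not the actual GastroView code; `labels` is a hypothetical binary mask marking angiodysplasia pixels, and a patch is counted as positive if it overlaps any marked pixel (the report does not state the exact labelling rule).

```python
import numpy as np

def sample_patches(image, labels, patch_ratios=(0.1, 0.2, 0.33, 0.5),
                   n_patches=8, rng=None):
    """Crop random square patches sized at each ratio of the shorter side.

    `labels` is a hypothetical binary mask of angiodysplasia pixels; a
    patch is labelled positive if it overlaps any marked pixel.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    patches, patch_labels = [], []
    for _ in range(n_patches):
        ratio = rng.choice(patch_ratios)
        size = max(1, int(min(h, w) * ratio))
        y = int(rng.integers(0, h - size + 1))
        x = int(rng.integers(0, w - size + 1))
        patches.append(image[y:y + size, x:x + size])
        patch_labels.append(int(labels[y:y + size, x:x + size].any()))
    return patches, patch_labels

def balance_batch(patches, patch_labels, rng=None):
    """Resample so positives:negatives in the batch is 1:2, as above."""
    rng = rng or np.random.default_rng()
    pos = [p for p, l in zip(patches, patch_labels) if l == 1]
    neg = [p for p, l in zip(patches, patch_labels) if l == 0]
    if not pos or not neg:
        return patches, patch_labels       # nothing to balance
    n_pos = min(len(pos), len(neg) // 2) or 1
    n_neg = 2 * n_pos
    pos_idx = rng.choice(len(pos), n_pos, replace=False)
    neg_idx = rng.choice(len(neg), n_neg, replace=len(neg) < n_neg)
    batch = [pos[i] for i in pos_idx] + [neg[i] for i in neg_idx]
    return batch, [1] * n_pos + [0] * n_neg
```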


  • Transformations: rotation, flip, skew, perspective, blur, noise, hue distortion, saturation distortion, PCA color distortion

In addition, input images are subject to random augmentation transforms, including rotation, flip, skew and perspective transforms, blur and noise, as well as color variance transforms: hue distortion, saturation distortion and PCA color augmentation [2].

Hand-crafted features

As an additional means of addressing the small training set, the concept of dedicated feature maps is introduced. For this purpose, seven high-level features of angiodysplasia were identified and implemented in the form of simple image processing algorithms. Each feature produces an activation map. The resulting 7 feature maps are appended to the regular B, G, R channels of the original image, forming a 10-channel image that is passed as input to the neural network, with the feature maps suppressed by a factor of 0.1 to keep the network focused on the original channels. The features are based on previous work [3] and include: three features detecting areas of high color similarity to angiodysplasia regions (based on smoothed histograms), namely color similarity, area smoothness and contour clearness; a similar set of three features for regions of moderate angiodysplasia color similarity; and lastly, a feature describing the domination of red color.
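Assembling the 10-channel input from the BGR image and the seven feature maps is straightforward; a minimal sketch, assuming the feature maps arrive as single-channel float arrays in [0, 1] (the suppression factor of 0.1 is from the text, everything else is illustrative):

```python
import numpy as np

def build_network_input(bgr_image, feature_maps, suppression=0.1):
    """Stack BGR channels with the 7 hand-crafted feature maps into a
    10-channel input, suppressing the feature maps by a 0.1 factor so
    the network stays focused on the original channels."""
    assert len(feature_maps) == 7
    bgr = bgr_image.astype(np.float32) / 255.0       # normalize to [0, 1]
    feats = np.stack([f.astype(np.float32) * suppression
                      for f in feature_maps], axis=-1)
    return np.concatenate([bgr, feats], axis=-1)     # H x W x 10
```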

After initial training of the classifier, hard example mining was applied: the 30% of all training samples that produced the worst classification results were selected and placed in a hard-example set. In the final training, samples were drawn in a 1:1 ratio from the original training set and from the hard-example set, amplifying hard examples roughly 4-fold in the training process.
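The mining and resampling steps can be sketched as below. The function names are illustrative; the roughly 4x figure follows because hard examples (30% of the data) receive half the draws directly plus their share of the uniform half, while easy examples receive only their share of the uniform half.

```python
import numpy as np

def mine_hard_examples(losses, fraction=0.3):
    """Return indices of the `fraction` of samples with the highest
    (worst) loss, forming the hard-example set."""
    losses = np.asarray(losses)
    n_hard = max(1, int(round(len(losses) * fraction)))
    return np.argsort(losses)[::-1][:n_hard]

def sample_final_batch(all_idx, hard_idx, batch_size, rng=None):
    """Draw half the batch uniformly from the full training set and half
    from the hard-example set (the 1:1 ratio described above); hard
    samples then appear roughly 4x as often as easy ones."""
    rng = rng or np.random.default_rng()
    half = batch_size // 2
    a = rng.choice(all_idx, half, replace=True)
    b = rng.choice(hard_idx, batch_size - half, replace=True)
    return np.concatenate([a, b])
```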

The final base classifier is an ensemble of the 6 top-performing models, each obtained from a different split of 6-fold cross-validation applied over the training set.
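The report does not state how the 6 fold models are combined; a simple and common choice is to average their output probabilities, sketched here as an assumption:

```python
import numpy as np

def ensemble_score(model_fns, patch):
    """Average the probabilities of the fold models. Plain averaging is
    an assumption -- the report does not state the combination rule."""
    return float(np.mean([f(patch) for f in model_fns]))
```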

Detection & localization

In order to classify a full image frame, a sliding-window detector based on the base model is applied over the image, using patch sizes of {0.1, 0.2, 0.33, 0.5} and respective overlaps of {0.5, 0.5, 0.5, 0.25}. A separate threshold value is defined for each patch size. The final detector returns a positive output if at least one base model output on any image patch exceeds the threshold for its size. The four thresholds were optimized automatically using random search to minimize the full-frame classification error, evaluated using cross-validation over the training set.
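The sliding-window pass with per-size thresholds can be sketched as follows. `score_fn` stands in for the base classifier (any callable mapping a patch to a probability), and `thresholds` maps each patch ratio to its threshold; both names are illustrative.

```python
import numpy as np

def sliding_window_detect(score_fn, image, thresholds,
                          ratios=(0.1, 0.2, 0.33, 0.5),
                          overlaps=(0.5, 0.5, 0.5, 0.25)):
    """Slide the base classifier over the image at each patch size and
    return True if any patch score exceeds the threshold for its size."""
    h, w = image.shape[:2]
    for ratio, overlap in zip(ratios, overlaps):
        size = max(1, int(min(h, w) * ratio))
        stride = max(1, int(size * (1.0 - overlap)))   # overlap -> step
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                patch = image[y:y + size, x:x + size]
                if score_fn(patch) > thresholds[ratio]:
                    return True
    return False
```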

The base model outputs collected over the image during the sliding-window process are converted into an activation map: each positive output from the detection is added to the map, while negative outputs are subtracted. To smooth the activation map, patch scores are represented as circles instead of rectangles and a blur is applied, followed by normalization of the map. The base model activation map is then multiplied by the fourth of the introduced feature maps, denoting moderate color similarity to angiodysplasia regions, which yields the final activation map.
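A minimal sketch of this accumulation is below. The input format (`(cy, cx, radius, score)` tuples) is illustrative, and the separable box blur is a placeholder for whichever blur the system actually applies.

```python
import numpy as np

def build_activation_map(shape, patch_scores, color_feature_map, blur_k=5):
    """Accumulate signed patch scores as filled circles, blur, normalize,
    then multiply by the moderate-color-similarity feature map."""
    h, w = shape
    amap = np.zeros((h, w), dtype=np.float64)
    yy, xx = np.mgrid[0:h, 0:w]
    for cy, cx, radius, score in patch_scores:
        circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        amap[circle] += score                      # positive adds, negative subtracts
    # crude separable box blur as a stand-in for the smoothing step
    kernel = np.ones(2 * blur_k + 1) / (2 * blur_k + 1)
    amap = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, amap)
    amap = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, amap)
    peak = np.abs(amap).max()
    if peak > 0:
        amap /= peak                               # normalize to [-1, 1]
    return amap * color_feature_map                # final activation map
```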

Next, a new threshold value is defined for each patch size. The final localization algorithm returns one angiodysplasia location point for each base model output that exceeds the threshold for its size; the returned point is the location of the maximum of the activation map within the area of the considered patch. As with the detection thresholds, the localization thresholds are optimized automatically using cross-validation and random search to minimize the localization error over the training set.
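Extracting the location points from the activation map can be sketched as follows; the `(y, x, size, ratio, score)` tuple format for above-threshold patch outputs is an assumption for illustration.

```python
import numpy as np

def localize(activation_map, patch_hits, loc_thresholds):
    """For each patch output exceeding its size's localization threshold,
    return the coordinates of the activation-map maximum inside the patch."""
    points = []
    for y, x, size, ratio, score in patch_hits:
        if score > loc_thresholds[ratio]:
            region = activation_map[y:y + size, x:x + size]
            dy, dx = np.unravel_index(np.argmax(region), region.shape)
            points.append((int(y + dy), int(x + dx)))
    return points
```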


[1] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).

[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[3] Brzeski, Adam. "Visual Features for Endoscopic Bleeding Detection." British Journal of Applied Science & Technology 4.27 (2014): 3902.