Contextual Encoder-Decoder Network for Visual Saliency Prediction

Alexander Kroner; Mario Senden; Kurt Driessens; Rainer Goebel

doi:10.48550/arXiv.1902.06634

Alexander Kroner*, Mario Senden, Kurt Driessens, Rainer Goebel

*Corresponding author for this work

Research output: Working paper / Preprint › Preprint

Abstract

Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information for accurately predicting visual saliency. Our model achieves competitive results on two public saliency benchmarks and we demonstrate the effectiveness of the suggested approach on selected examples. The network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources to estimate human fixations across complex natural scenes.
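
The multi-scale idea in the abstract (parallel convolutions at different dilation rates, combined with global scene information) can be illustrated with a small pure-Python toy. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the 3x3 box kernel, the dilation rates (1, 2, 3), and the use of a global mean as the "scene context" are all hypothetical choices made for illustration, and the pre-trained encoder and the decoder are omitted.

```python
def dilated_conv2d(image, kernel, rate):
    """Apply a 3x3 kernel with the given dilation rate (zero padding, same-size output)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    # Kernel taps are spread `rate` pixels apart around (y, x).
                    sy = y + (ky - 1) * rate
                    sx = x + (kx - 1) * rate
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out


def global_context(image):
    """Broadcast the global mean of the map to every position (assumed stand-in for scene context)."""
    h, w = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (h * w)
    return [[mean] * w for _ in range(h)]


def multi_scale_features(image, kernel, rates=(1, 2, 3)):
    """Run the dilated branches in parallel and append the global-context branch."""
    branches = [dilated_conv2d(image, kernel, r) for r in rates]
    branches.append(global_context(image))
    return branches


# Usage: a single bright pixel in the centre of a 7x7 map.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
box = [[1.0] * 3 for _ in range(3)]  # illustrative kernel, not learned weights
features = multi_scale_features(image, box)
```

Each dilated branch responds to the same pixel at a different spatial extent, while the last branch carries one image-wide value to every location. A real network would stack these branches along the channel axis and pass them to a decoder.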

Original language	English
Publisher	Cornell University - arXiv
Pages	261-270
Number of pages	10
Volume	129
DOIs	https://doi.org/10.48550/arXiv.1902.06634
Publication status	Published - Sept 2020

Publication series

Series	arXiv.org
ISSN	2331-8422

Keywords

cs.CV
Deep learning
Computer vision
ATTENTION
Human fixations
INTEGRATION
INFORMATION
Saliency prediction
Convolutional neural networks

Access to Document

10.48550/arXiv.1902.06634

https://arxiv.org/abs/1902.06634
Licence: Unspecified

1 Article

Contextual encoder-decoder network for visual saliency prediction
Kroner, A., Senden, M., Driessens, K. & Goebel, R., Sept 2020, In: Neural Networks. 129, p. 261-270 10 p.
Research output: Contribution to journal › Article › Academic › peer-review

Open Access

Cite this

@techreport{d5c32c3f38ec426e9cb845e284178e24,
title = "Contextual Encoder-Decoder Network for Visual Saliency Prediction",
abstract = "Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information for accurately predicting visual saliency. Our model achieves competitive results on two public saliency benchmarks and we demonstrate the effectiveness of the suggested approach on selected examples. The network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources to estimate human fixations across complex natural scenes.",
keywords = "cs.CV, Deep learning, Computer vision, ATTENTION, Human fixations, INTEGRATION, INFORMATION, Saliency prediction, Convolutional neural networks",
author = "Alexander Kroner and Mario Senden and Kurt Driessens and Rainer Goebel",
year = "2020",
month = sep,
doi = "10.48550/arXiv.1902.06634",
language = "English",
volume = "129",
series = "arXiv.org",
pages = "261--270",
publisher = "Cornell University - arXiv",
address = "United States",
type = "WorkingPaper",
institution = "Cornell University - arXiv",
}

TY - UNPB
T1 - Contextual Encoder-Decoder Network for Visual Saliency Prediction
AU - Kroner, Alexander
AU - Senden, Mario
AU - Driessens, Kurt
AU - Goebel, Rainer
PY - 2020/9
Y1 - 2020/9
N2 - Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information for accurately predicting visual saliency. Our model achieves competitive results on two public saliency benchmarks and we demonstrate the effectiveness of the suggested approach on selected examples. The network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources to estimate human fixations across complex natural scenes.
KW - cs.CV
KW - Deep learning
KW - Computer vision
KW - ATTENTION
KW - Human fixations
KW - INTEGRATION
KW - INFORMATION
KW - Saliency prediction
KW - Convolutional neural networks
U2 - 10.48550/arXiv.1902.06634
DO - 10.48550/arXiv.1902.06634
M3 - Preprint
VL - 129
T3 - arXiv.org
SP - 261
EP - 270
BT - Contextual Encoder-Decoder Network for Visual Saliency Prediction
PB - Cornell University - arXiv
ER -
