Aerial to Street View Image Translation using Cascaded Conditional GANs

K. Singh; A. Briassouli; M. Popa

doi:10.5220/0010814000003124

K. Singh^*, A. Briassouli, M. Popa

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

Cross-view image translation is a challenging case of viewpoint translation that involves generating a street view image from a given aerial view image, or vice versa. Because the two views do not overlap, a single-stage generation network fails to capture the complex scene structure of the objects in them. Our work tackles the generation of street-level images from aerial images on the CVUSA benchmark dataset with a cascade pipeline of three smaller stages: street view image generation, semantic segmentation map generation, and image refinement, trained together in a constrained manner in a Conditional GAN (CGAN) framework. Our contributions are twofold: (1) The first stage of our pipeline examines the use of the alternative architectures ResNet and ResUNet++ in a framework similar to the current State-of-the-Art (SoA), leading to useful insights and comparable or improved results in some cases. (2) In the third stage, ResUNet++ is used for the first time for image refinement. U-net performs best for street view image generation and semantic map generation as a result of the skip connections between encoders and decoders, while ResUNet++ performs best for image refinement because of the attention module in its decoders. Qualitative and quantitative comparisons with existing methods show that our model outperforms all others on the KL Divergence metric and ranks among the best on the other metrics.
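
For readers unfamiliar with the KL Divergence metric named in the abstract, the following is a minimal sketch of the quantity itself for two discrete distributions. The abstract does not say how the distributions being compared are obtained from real and generated images, so the function name `kl_divergence`, the `eps` guard, and the example probability vectors are illustrative assumptions, not the authors' evaluation code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are sequences of non-negative weights of equal length.
    Both are normalised to sum to 1 before the comparison.
    eps is an assumed guard against division by zero when q has a zero entry.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / p_sum, qi / q_sum
        # Terms with p_i = 0 contribute nothing, by the usual convention.
        if pi > 0:
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical class-probability vectors for a real and a generated image.
real = [0.7, 0.2, 0.1]
generated = [0.6, 0.3, 0.1]
score = kl_divergence(real, generated)
```

A lower value means the two distributions are closer, and the value is zero when they are identical. The measure is not symmetric, so the order of the arguments matters.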

Original language	English
Title of host publication	PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4
Editors	GM Farinella, P Radeva, K Bouatouch
Publisher	SCITEPRESS
Pages	372-379
Number of pages	8
ISBN (Print)	9789897585555
DOIs	https://doi.org/10.5220/0010814000003124
Publication status	Published - 2022
Event	17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP), Online Streaming. Duration: 6 Feb 2022 → 8 Feb 2022. http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=132548&copyownerid=45217

Publication series

Series	VISIGRAPP. Proceedings
ISSN	2184-4321

Conference

Conference	17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP)
Period	6/02/22 → 8/02/22
Internet address	http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=132548&copyownerid=45217

Keywords

Cross View Image Translation
Conditional GANs
Semantic Segmentation
U-net

Access to Document

10.5220/0010814000003124
Licence: CC BY-NC-ND

Cite this

Singh, K., Briassouli, A., & Popa, M. (2022). Aerial to Street View Image Translation using Cascaded Conditional GANs. In GM. Farinella, P. Radeva, & K. Bouatouch (Eds.), PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4 (pp. 372-379). SCITEPRESS. https://doi.org/10.5220/0010814000003124

@inproceedings{c1f608f7f9ac49e1a2919bd97e0a9ad5,

title = "Aerial to Street View Image Translation using Cascaded Conditional GANs",

abstract = "Cross view image translation is a challenging case of viewpoint translation which involves generating the street view image when the aerial view image is given and vice versa. As there is no overlap in the two views, a single stage generation network fails to capture the complex scene structure of objects in these two views. Our work aims to tackle the task of generating street level view images from aerial view images on the benchmarking CVUSA dataset by a cascade pipeline consisting of three smaller stages: street view image generation, semantic segmentation map generation, and image refinement, trained together in a constrained manner in a Conditional GAN (CGAN) framework. Our contributions are twofold: (1) The first stage of our pipeline examines the use of alternate architectures ResNet, ResUnet++ in a framework similar to the current State-of-the-Art (SoA), leading to useful insights and comparable or improved results in some cases. (2) In the 3rd stage, ResUNet++ is used for the first time for image refinement. U-net performs the best for street view image generation and semantic map generation as a result of the skip connections between encoders and decoders, while ResU-Net++ performs the best for image refinement because of the presence of the attention module in the decoders. Qualitative and quantitative comparisons with existing methods show that our model outperforms all others on the KL Divergence metric and ranks amongst the best for other metrics.",

keywords = "Cross View Image Translation, Conditional GANs, Semantic Segmentation, U-net",

author = "K. Singh and A. Briassouli and M. Popa",

year = "2022",

doi = "10.5220/0010814000003124",

language = "English",

isbn = "9789897585555",

series = "VISIGRAPP. Proceedings",

publisher = "SCITEPRESS",

pages = "372--379",

editor = "GM Farinella and P Radeva and K Bouatouch",

booktitle = "PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4",

note = "17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP) ; Conference date: 06-02-2022 Through 08-02-2022",

url = "http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=132548&copyownerid=45217",

}

Singh, K, Briassouli, A & Popa, M 2022, Aerial to Street View Image Translation using Cascaded Conditional GANs. in GM Farinella, P Radeva & K Bouatouch (eds), PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4. SCITEPRESS, VISIGRAPP. Proceedings, pp. 372-379, 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP), 6/02/22. https://doi.org/10.5220/0010814000003124

Aerial to Street View Image Translation using Cascaded Conditional GANs. / Singh, K.; Briassouli, A.; Popa, M.
PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4. ed. / GM Farinella; P Radeva; K Bouatouch. SCITEPRESS, 2022. p. 372-379 (VISIGRAPP. Proceedings).


TY - GEN

T1 - Aerial to Street View Image Translation using Cascaded Conditional GANs

AU - Singh, K.

AU - Briassouli, A.

AU - Popa, M.

PY - 2022

Y1 - 2022

AB - Cross view image translation is a challenging case of viewpoint translation which involves generating the street view image when the aerial view image is given and vice versa. As there is no overlap in the two views, a single stage generation network fails to capture the complex scene structure of objects in these two views. Our work aims to tackle the task of generating street level view images from aerial view images on the benchmarking CVUSA dataset by a cascade pipeline consisting of three smaller stages: street view image generation, semantic segmentation map generation, and image refinement, trained together in a constrained manner in a Conditional GAN (CGAN) framework. Our contributions are twofold: (1) The first stage of our pipeline examines the use of alternate architectures ResNet, ResUnet++ in a framework similar to the current State-of-the-Art (SoA), leading to useful insights and comparable or improved results in some cases. (2) In the 3rd stage, ResUNet++ is used for the first time for image refinement. U-net performs the best for street view image generation and semantic map generation as a result of the skip connections between encoders and decoders, while ResU-Net++ performs the best for image refinement because of the presence of the attention module in the decoders. Qualitative and quantitative comparisons with existing methods show that our model outperforms all others on the KL Divergence metric and ranks amongst the best for other metrics.

KW - Cross View Image Translation

KW - Conditional GANs

KW - Semantic Segmentation

KW - U-net

U2 - 10.5220/0010814000003124

DO - 10.5220/0010814000003124

M3 - Conference article in proceeding

SN - 9789897585555

T3 - VISIGRAPP. Proceedings

SP - 372

EP - 379

BT - PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4

A2 - Farinella, GM

A2 - Radeva, P

A2 - Bouatouch, K

PB - SCITEPRESS

T2 - 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP)

Y2 - 6 February 2022 through 8 February 2022

ER -

Singh K, Briassouli A, Popa M. Aerial to Street View Image Translation using Cascaded Conditional GANs. In Farinella GM, Radeva P, Bouatouch K, editors, PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4. SCITEPRESS. 2022. p. 372-379. (VISIGRAPP. Proceedings). doi: 10.5220/0010814000003124