Content-Conditioned Style Encoder

The Content-Conditioned Style Encoder, abbreviated COCO, is the style encoder used for few-shot image-to-image translation in the COCO-FUNIT architecture.

What is COCO?

COCO differs from the style encoder used in the original FUNIT: it takes both the content image and the style image as input. This creates a direct feedback path during learning through which the content image influences how the style code is computed, which in turn reduces the direct influence of the style image on the extracted style code.

How does COCO work?

The COCO architecture consists of an encoder that computes a spatial feature map from the content image. This content feature map is then mean-pooled and mapped to a vector $\zeta_{c}$. The style image is also fed into an encoder to compute a spatial feature map, which is then mean-pooled and concatenated with a constant style bias vector. This concatenated vector is then mapped to a vector $\zeta_{s}$ via a fully connected layer.

Next, the final style code is computed as the element-wise product of the vectors $\zeta_{c}$ and $\zeta_{s}$. Because $\zeta_{c}$ is derived from the content image, the resulting style code is customized to the input content image rather than depending on the style image alone.
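The computation described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the function name `coco_style_code`, the weight matrices, and all dimensions are assumptions, and the convolutional encoders that produce the feature maps are treated as given inputs.

```python
import numpy as np

def coco_style_code(content_feat, style_feat, W_c, W_s, style_bias):
    """Sketch of the COCO style-code computation (shapes are illustrative).

    content_feat, style_feat: spatial feature maps of shape (C, H, W),
    assumed to come from convolutional encoders that are not shown here.
    """
    # Mean-pool the content feature map and map it to zeta_c.
    zeta_c = W_c @ content_feat.mean(axis=(1, 2))
    # Mean-pool the style feature map, concatenate the constant style
    # bias vector, and map the result to zeta_s via a fully connected layer.
    pooled_s = style_feat.mean(axis=(1, 2))
    zeta_s = W_s @ np.concatenate([pooled_s, style_bias])
    # Element-wise product of zeta_c and zeta_s gives the final style code.
    return zeta_c * zeta_s

# Toy usage with made-up dimensions:
rng = np.random.default_rng(0)
content = rng.normal(size=(8, 4, 4))   # content feature map
style = rng.normal(size=(8, 4, 4))     # style feature map
W_c = rng.normal(size=(16, 8))
W_s = rng.normal(size=(16, 12))        # 8 pooled dims + 4 bias dims
bias = rng.normal(size=4)              # constant style bias vector
code = coco_style_code(content, style, W_c, W_s, bias)
```

Because the final code is a product of a content-derived vector and a style-derived vector, perturbations in the style features are modulated by the content pathway rather than passing straight through.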

How is COCO used in FUNIT?

COCO is used as a drop-in replacement for the traditional style encoder in FUNIT. The translation output is computed using the COCO mapping, and the resulting style code is more robust to variations in the style image.
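The "drop-in replacement" amounts to changing a single call site in the translation pipeline: the style encoder receives the content image as an extra argument. The sketch below uses toy stand-in functions (identity encoders, scalar style codes) purely to show the changed signature; none of these stubs reflect the real networks.

```python
import numpy as np

def funit_translate(x_content, x_style, enc_content, enc_style, decoder):
    """Original FUNIT: the style code depends only on the style image."""
    return decoder(enc_content(x_content), enc_style(x_style))

def coco_funit_translate(x_content, x_style, enc_content, enc_coco, decoder):
    """COCO-FUNIT: the style encoder also sees the content image,
    so only this one call site changes."""
    return decoder(enc_content(x_content), enc_coco(x_content, x_style))

# Toy stand-ins (illustrative only, not the actual networks):
enc_content = lambda x: x                        # identity "content encoder"
enc_style = lambda s: s.mean()                   # scalar code from style image alone
enc_coco = lambda x, s: x.mean() * s.mean()      # code conditioned on content too
decoder = lambda feat, code: feat * code         # modulate content features by the code

x = np.ones((3, 4, 4))
s = np.full((3, 4, 4), 2.0)
out_funit = funit_translate(x, s, enc_content, enc_style, decoder)
out_coco = coco_funit_translate(x, s, enc_content, enc_coco, decoder)
```

Keeping the rest of the FUNIT pipeline unchanged is what makes COCO a drop-in replacement: only the style-encoder call gains the content argument.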

Overall, COCO is an architecture that makes image-to-image translation more robust by conditioning the style code on both the content and style images during training.
