import * as React from 'react'
  /* @jsx mdx */
import { mdx } from '@mdx-js/react';
/* @jsxRuntime classic */

/* @jsx mdx */

import DefaultLayout from "/home/runner/work/coqui-ai.github.io/coqui-ai.github.io/src/templates/BlogTemplate.tsx";
import { graphql } from 'gatsby';
export const pageQuery = graphql`
  query($fileAbsolutePath: String) {
    ...SidebarPageFragment
  }
`;
export const _frontmatter = {};

const makeShortcode = name => function MDXDefaultShortcode(props) {
  console.warn("Component " + name + " was not imported, exported, or provided by MDXProvider as global scope");
  return <div {...props} />;
};

const Link = makeShortcode("Link");
const layoutProps = {
  pageQuery,
  _frontmatter
};
const MDXLayout = DefaultLayout;
export default function MDXContent({
  components,
  ...props
}) {
  return <MDXLayout {...layoutProps} {...props} components={components} mdxType="MDXLayout">



    <p>{`Despite the success of recent attention-based end-to-end text-to-speech
(TTS) models, they suffer from attention alignment problems at inference time.
These problems occur especially with long input texts or out-of-domain
character sequences. Here I’d like to propose a novel technique to fight these
alignment problems, which I call Double Decoder Consistency (DDC) (with limited
creativity). DDC consists of two decoders that learn synchronously with
different reduction factors. We use the level of consistency between these
decoders to attain better attention performance.`}</p>
    <div align="center">
  <iframe width="560" height="315" src="https://www.youtube.com/embed/ADnBCz0Wd1U" frameBorder="0" allow="accelerometer;
        autoplay; encrypted-media; gyroscope; picture-in-picture" allowFullScreen></iframe>
    </div>
    <h3 {...{
      "id": "end-to-end-tts-models-with-attention",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#end-to-end-tts-models-with-attention",
        "aria-label": "end to end tts models with attention permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`End-to-End TTS Models with Attention`}</h3>
    <p>{`Good examples of attention-based TTS models are Tacotron and Tacotron2
[`}<a parentName="p" {...{
        "href": "https://arxiv.org/abs/1703.10135"
      }}>{`1`}</a>{`][`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1712.05884"
      }}>{`2`}</a>{`].
Tacotron2 is also the main architecture used in this work. These models
comprise a sequence-to-sequence architecture with an encoder, an attention
module, a decoder, and an additional stack of layers called the Postnet. The
encoder takes an input text and computes a hidden representation from which
the decoder computes predictions of the target acoustic feature frames. A
context-based attention mechanism aligns the input text with the predictions.
Finally, the decoder predictions are passed through the Postnet, which
predicts residual information to improve the reconstruction performance of the
model. In general, mel-spectrograms are used as acoustic features to represent
audio signals at a lower temporal resolution and in a perceptually meaningful way.`}</p>
    <p>{`Tacotron proposes to compute multiple non-overlapping output frames per
decoder step. The number of output frames per decoder step is configurable and
is called the “reduction rate” or reduction factor (r). The larger the
reduction rate, the fewer decoder steps the model needs to produce an output
of the same length. Thereby, the model achieves faster training convergence
and easier attention
alignment, as explained in [`}<a parentName="p" {...{
        "href": "https://arxiv.org/abs/1703.10135"
      }}>{`1`}</a>{`]. However,
larger r values also produce smoother output frames and therefore reduce
frame-level detail.`}</p>
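    {/*
    A minimal Python sketch of how the reduction factor r changes the number of
    decoder steps. The function name and the 800-frame utterance are
    hypothetical and only illustrate the arithmetic described above.

```python
def decoder_steps(num_frames: int, r: int) -> int:
    """Decoder steps needed to emit num_frames frames at r frames per step."""
    if num_frames < 0 or r < 1:
        raise ValueError("num_frames must be >= 0 and r must be >= 1")
    # Integer ceiling division: the last step may be only partially filled.
    return (num_frames + r - 1) // r


if __name__ == "__main__":
    # Hypothetical utterance of 800 mel-spectrogram frames.
    for r in (1, 2, 7):
        print(f"r={r}: {decoder_steps(800, r)} decoder steps")
```

    A larger r means fewer attention steps to align, at the cost of
    predicting several frames from a single decoder state.
    */}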
    <p>{`Although these models are used in TTS systems for more natural-sounding speech,
they frequently suffer from attention alignment problems, especially at
inference time, because of out-of-domain words, long input texts, or
intricacies of the target language. One solution is to use a larger r for
better alignment; however, as noted above, this reduces the quality of the
predicted frames. DDC tries to mitigate these attention problems by introducing
a new architecture.`}</p>
    <div align="center">
      <p><span parentName="p" {...{
          "className": "gatsby-resp-image-wrapper",
          "style": {
            "position": "relative",
            "display": "block",
            "marginLeft": "auto",
            "marginRight": "auto",
            "maxWidth": "413px"
          }
        }}>{`
      `}<span parentName="span" {...{
            "className": "gatsby-resp-image-background-image",
            "style": {
              "paddingBottom": "151.2%",
              "position": "relative",
              "bottom": "0",
              "left": "0",
              "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAeCAYAAAAsEj5rAAAACXBIWXMAAAsTAAALEwEAmpwYAAADIElEQVRIx6VWaU8qQRDc//97+GKMERBNQI6ATw4FEQQ5BVGOvftV9WPIiivic5LJLrs9NTXV1b1Y8sUIw1Anh+/7sl6vpVgsytnZmTw8PEgQBLu46LDkwDDBruvK4+OjDIdDabfb0uv1PsV8C2gCl8uljMdj6XQ6MhgMxHGcTyf4EeDLy4tcX19LKpWSZDKpTKPvf8yQV+p3d3enYK+vr7FARwOa8fz8LNVqVTzP+x0gGXW7Xc3sfD7XhER1PAowKniz2ZSrqyvJZDKSz+elUqnI+/v7/wOS1WQy0cQ8PT3JYrFQX/4IME5HepAa2rb9Ow3NdTqdqp5kF62S6FTA/YdxQa1WS05OTuT09FR1jAMz0zqmllkt5+fnkk6n5e3t7fCRA1gg2GzEh3n91Uo8TF5DPAv02VIgnCyQmCWqhvceNjCxvPeQKE5l6GFHf7OW/f1YHWwKIfRi4Gw2O6idB31DmN5SNttAZpALWQ3sKiw3JoLg/F2v179MmjJUQNy4oL0AU1YDQdhdVtiIg1dqSNuwYrjBPtMPgDZ0maOc2ij8RqOhi5hVtioupKlZKYlEQnK5nL6PJmbXM1EAAQEd7Djv92UCnzVwJC5iyzLlNRqN9LiFQkF7IgEpwf5wsSaA5hapBtjRg2GLWMSFLDEelaDUlZpSCtMUDDtuZKRxQEgBXWYPLKfYIQmvEZCJ4WIemWDMNpmZxeaYlGiOWAWENP8AAWTjB8cGRyErAm7gQyapUi5LHo3h8vJSMhcXks1mpYcTrAE+5CdhW9sOEklPK6ALuvp1A6MNAv6gCYwQMMbzKnStYVZubqRxfy/3SNgMCfDB0otUy05DZseBJfyt40MwdMBwgQ5NS9GwPmYDUqzGI60UYSyeOYjhUW1egUE/WyGSwbJj+fE+8Fz1EwRUG9h45wGkCb360NGhLKwKvF+j+a448c7f6hvfYM0xAEjb1LGgVqtJDlpmYasOrQMC0ViTLOtQ22KFMLu3t7fqT7auMpLE70o04x/64aHmSsuYT0CpVNINaGpj7NgGe+xnlLV86P23H6l9BmS7zybu78hf8VIRuiBoTcEAAAAASUVORK5CYII=')",
              "backgroundSize": "cover",
              "display": "block"
            }
          }}></span>{`
  `}<img parentName="span" {...{
            "className": "gatsby-resp-image-image",
            "alt": "IMAGE",
            "title": "IMAGE",
            "src": "/static/e4c7359592414d8f0b25f9c24e75284e/e1a93/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-model-overview.png",
            "srcSet": ["/static/e4c7359592414d8f0b25f9c24e75284e/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-model-overview.png 250w", "/static/e4c7359592414d8f0b25f9c24e75284e/e1a93/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-model-overview.png 413w"],
            "sizes": "(max-width: 413px) 100vw, 413px",
            "style": {
              "width": "100%",
              "height": "100%",
              "margin": "0",
              "verticalAlign": "middle",
              "position": "absolute",
              "top": "0",
              "left": "0"
            },
            "loading": "lazy"
          }}></img>{`
    `}</span></p>
    </div>
    <p>{`The bare-bones model used in this work is formalized as follows:`}</p>
    <div align="center">
      <p><span parentName="p" {...{
          "className": "gatsby-resp-image-wrapper",
          "style": {
            "position": "relative",
            "display": "block",
            "marginLeft": "auto",
            "marginRight": "auto",
            "maxWidth": "455px"
          }
        }}>{`
      `}<span parentName="span" {...{
            "className": "gatsby-resp-image-background-image",
            "style": {
              "paddingBottom": "92.4%",
              "position": "relative",
              "bottom": "0",
              "left": "0",
              "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAASCAYAAABb0P4QAAAACXBIWXMAAAsTAAALEwEAmpwYAAAB2ElEQVQ4y32U547DMAyD+/7Plh8FinTv3bRN9x46fMIx8OXSGjAcKxZNkUpK9jve77ev5/PZttutrddre71edjqdbLVa2X6/99hms8nW4/Foh8PhT34pD5imqXU6Het2uw6i/Ww2s16vZ8vl0vfMfr/v+0LAMMh4PB7OFFDAYJkkiTOjCp19Pp/fGWql1MVi4SVNJhMbjUbOiBig9/vdJVFOIcMQ8HK52HQ6dYYAUe54PPZnaSj9wtyPgJQMKEwol2QkYKIrhux2u0wCzn5lyCG0AwCD2u22z8FgYNVqNWPNO2nsgKF2oRasapv5fO7iw5Y1NE/VSM9SHkjjer26ATgbRZENh0NnKCZFef80JKA2CEde/E/DAXFRrVCr1axcLlur1XJ9EJ6hr4MJS9iTTFMzMQmDvGR0wQBYyFVWLhFbgG+3WxbnLHviEMI4dHRAbm02m1av101sYYdmJLKP49jdrFQq7iguNxoNZ4ammJaVLPe4Uf3EzRzU16CVBC7FeUqEDDnMQlNIQkdup89UIntYskpfnmHL16RyHTC0X30HELpQMjFYMXmGFXGM4lOEKec//hw0KBMHYYHG6ku0pkQq4VLA0DYzJf+nyTcsrOS29OZdWKZijB/1dGxxHQvMlAAAAABJRU5ErkJggg==')",
              "backgroundSize": "cover",
              "display": "block"
            }
          }}></span>{`
  `}<img parentName="span" {...{
            "className": "gatsby-resp-image-image",
            "alt": "IMAGE",
            "title": "IMAGE",
            "src": "/static/62f5e5526dd116e8aa7c2ac1736457e6/d2e8e/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-1.png",
            "srcSet": ["/static/62f5e5526dd116e8aa7c2ac1736457e6/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-1.png 250w", "/static/62f5e5526dd116e8aa7c2ac1736457e6/d2e8e/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-1.png 455w"],
            "sizes": "(max-width: 455px) 100vw, 455px",
            "style": {
              "width": "100%",
              "height": "100%",
              "margin": "0",
              "verticalAlign": "middle",
              "position": "absolute",
              "top": "0",
              "left": "0"
            },
            "loading": "lazy"
          }}></img>{`
    `}</span></p>
    </div>
    <p>{`where y`}<sub>{`k`}</sub>{` is a sequence of acoustic feature frames. x`}<sub>{`l`}</sub>{` is
a sequence of characters or phonemes, from which we compute a sequence of encoder
outputs h`}<sub>{`l`}</sub>{`. r is the reduction factor, which defines the number of
output frames per decoder step. The attention alignments, query vector, and decoder
output at decoder step t are denoted by a`}<sub>{`t`}</sub>{`, q`}<sub>{`t`}</sub>{`, and
o`}<sub>{`t`}</sub>{` respectively. Also, o`}<sub>{`t`}</sub>{` is a set of output frames
whose size is determined by r. The total number of decoder steps is denoted by T.`}</p>
    <p>{`Note that teacher forcing is applied at training. Therefore, K=Tr at training
time. At inference, however, the decoder is instructed to stop by a separate
network (Stopnet), which predicts a value in the range `}{`[0, 1]`}{`. If its prediction
is larger than a defined threshold, the decoder stops inference.`}</p>
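    {/*
    The stopping rule can be sketched as a plain loop. This is only an
    illustration: step_fn, the 0.5 threshold, and the max_steps guard are
    hypothetical names and values, and the toy step function stands in for a
    real decoder and Stopnet.

```python
def run_decoder(step_fn, stop_threshold=0.5, max_steps=1000):
    """Collect frames from step_fn(t) -> (frames, stop_prob) until told to stop."""
    outputs = []
    for t in range(max_steps):
        frames, stop_prob = step_fn(t)
        outputs.extend(frames)
        # The Stopnet value lies in [0, 1]; stop once it exceeds the threshold.
        if stop_prob > stop_threshold:
            break
    return outputs


def toy_step(t, r=2):
    """Stand-in decoder step: emits r dummy frames and stops at the 4th step."""
    frames = [t * r + i for i in range(r)]
    stop_prob = 0.9 if t >= 3 else 0.1
    return frames, stop_prob


if __name__ == "__main__":
    print(len(run_decoder(toy_step)))
```

    The max_steps guard is a practical safeguard assumed here, so that a
    Stopnet that never fires cannot make inference run forever.
    */}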
    <h3 {...{
      "id": "double-decoder-consistency",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#double-decoder-consistency",
        "aria-label": "double decoder consistency permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Double Decoder Consistency`}</h3>
    <p>{`DDC is based on two decoders working simultaneously with different reduction
factors (r). One decoder (coarse) works with a large reduction factor, and the
other decoder (fine) works with a small one.`}</p>
    <p>{`DDC is designed to settle the trade-off between attention alignment and
predicted frame quality, which is tuned by the reduction factor. In general,
standard models have more robust attention performance with a larger r, but due
to the smoothing effect of predicting multiple frames per iteration, the final
acoustic features are coarser than those of models with a lower reduction factor.`}</p>
    <p>{`DDC combines these two properties at training time as it uses the coarse
decoder to guide the fine decoder to preserve the attention performance without
a loss of precision in acoustic features. DDC achieves this by introducing an
additional loss function comparing the attention vectors of these two decoders.`}</p>
    <p>{`For each training step, both decoders compute their respective attention
vectors and outputs. Due to the difference in their r values, their attention
vectors have different lengths: the coarse decoder produces a shorter vector
than the fine decoder. To compensate for this, we interpolate the coarse
attention vector to match the length of the fine attention vector. After
bringing them to the same length, we use a loss function to penalize the
difference between the alignments. This loss synchronizes the two decoders
with respect to their alignments.`}</p>
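    {/*
    A minimal sketch of the consistency loss in plain Python. The post does not
    state which interpolation or which distance is used, so nearest-neighbour
    upsampling and a mean absolute difference are assumptions made here for
    illustration. Alignments are lists of decoder steps, each a list of weights
    over the input tokens; all names are hypothetical.

```python
def upsample_nearest(coarse, target_len):
    """Stretch the coarse alignment to target_len decoder steps."""
    src_len = len(coarse)
    return [coarse[min(j * src_len // target_len, src_len - 1)]
            for j in range(target_len)]


def attention_consistency_loss(fine, coarse):
    """Mean absolute difference between fine and upsampled coarse alignments."""
    upsampled = upsample_nearest(coarse, len(fine))
    total, count = 0.0, 0
    for fine_row, coarse_row in zip(fine, upsampled):
        for a, b in zip(fine_row, coarse_row):
            total += abs(a - b)
            count += 1
    return total / count


if __name__ == "__main__":
    # Fine decoder: 4 steps; coarse decoder: 2 steps; both over 2 input tokens.
    fine = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    agreeing = [[1.0, 0.0], [0.0, 1.0]]
    disagreeing = [[0.0, 1.0], [1.0, 0.0]]
    print(attention_consistency_loss(fine, agreeing))
    print(attention_consistency_loss(fine, disagreeing))
```

    In a real model this term would be added to the two decoders'
    reconstruction losses and computed on tensors instead of nested lists.
    */}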
    <p><span parentName="p" {...{
        "className": "gatsby-resp-image-wrapper",
        "style": {
          "position": "relative",
          "display": "block",
          "marginLeft": "auto",
          "marginRight": "auto",
          "maxWidth": "1000px"
        }
      }}>{`
      `}<span parentName="span" {...{
          "className": "gatsby-resp-image-background-image",
          "style": {
            "paddingBottom": "79.6%",
            "position": "relative",
            "bottom": "0",
            "left": "0",
            "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsTAAALEwEAmpwYAAACMUlEQVQ4y3VUiXaiQBD0/z9t80yyxvXEgyjigdwIDEdvVb81QeLOe/0YmJnqquoeBoLRtq30n8/m/XFf68agu3gfruvKarV6AMvzXMIwlDRNJY5jqev6aZJBn8XhcJDRaCRZlj0kIQgBgyCQ4/GogMYYqapKyrLU/Uz6AzCKIpnP53ogSRJltF6vZb/fi+d5Ot9utzpfLpdiWZaqmUwmcrlcvgHvT26eTqe6SDAyul6vGkxApvzGxI7jaAKukaV62DWUg0yGw6HK6w9KvN1uKpPATHo6nTTRg4ddliUOjcdj9aOC7KIoJEfQIwKRPdf+V/FBi0NNXQmf+CIGh8coSn7LJGYRICeFvBgyDdh9vL1JinmDfYZJslTqIv8CHTQG2ptaSrYFNtZgGEehJigAoAF5lMvBYpEpmdNjA+9q7PkCLPCR5m5QDBrNlngFiwgglMqDSeCrVN/3ZTabqb9N04ht21rEPAy6gInMsYkHWKnz+SxTtECOrCwATfcOjvx+f1cgAhJY/cS+5WIhVRJ/A7YA2W42skW2DABHAFjoqxBsU3iUQEENPymV7/bnp4aDC7BnoCsakGrvgIYv8CdAo3oEA4MRJLv7nRahovlxJBnCB/sDwNzdTjxYY3gFURRidIpiRAvDatNsgNho1gCyCkhMrIVWEqaxItoJGtgb/hlLBjse2kZf/vUSjV7Ak18vLyqbnja8Afde6wUJtEzU6eMfN4UNzFvQ69qnv6r+peDzL6oh0h8mn7MgAAAAAElFTkSuQmCC')",
            "backgroundSize": "cover",
            "display": "block"
          }
        }}></span>{`
  `}<img parentName="span" {...{
          "className": "gatsby-resp-image-image",
          "alt": "IMAGE",
          "title": "IMAGE",
          "src": "/static/ef79d367f6ad0564ef7c89cf33a51067/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-ddc-overview.png",
          "srcSet": ["/static/ef79d367f6ad0564ef7c89cf33a51067/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-ddc-overview.png 250w", "/static/ef79d367f6ad0564ef7c89cf33a51067/c6e3d/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-ddc-overview.png 500w", "/static/ef79d367f6ad0564ef7c89cf33a51067/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-ddc-overview.png 1000w", "/static/ef79d367f6ad0564ef7c89cf33a51067/f1720/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-ddc-overview.png 1024w"],
          "sizes": "(max-width: 1000px) 100vw, 1000px",
          "style": {
            "width": "100%",
            "height": "100%",
            "margin": "0",
            "verticalAlign": "middle",
            "position": "absolute",
            "top": "0",
            "left": "0"
          },
          "loading": "lazy"
        }}></img>{`
    `}</span></p>
    <p>{`The two decoders take the same input from the encoder. They also compute the
outputs in the same way except they use different reduction factors. The coarse
decoder uses a larger reduction factor compared to the fine decoder. These two
decoders are trained with separate loss functions comparing their respective
outputs with the real feature frames. The only interaction between these two
decoders is the attention loss applied to compare their respective attention
alignments.`}</p>
    <div align="center">
      <p><span parentName="p" {...{
          "className": "gatsby-resp-image-wrapper",
          "style": {
            "position": "relative",
            "display": "block",
            "marginLeft": "auto",
            "marginRight": "auto",
            "maxWidth": "706px"
          }
        }}>{`
      `}<span parentName="span" {...{
            "className": "gatsby-resp-image-background-image",
            "style": {
              "paddingBottom": "33.199999999999996%",
              "position": "relative",
              "bottom": "0",
              "left": "0",
              "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAAAsTAAALEwEAmpwYAAAA0UlEQVQoz32RiQqEMAxE+/9/WZB6X/U+s7xAiyxioMaE5s1ETVEU0ratdF0nzjl9H8dR0jSVLMukqirNTdNo33svxH3fev7DhAEul2WpwGVZpO97HSYDoz9Nk977BFprBZfrukqSJFLXtYIQeToEyBmGIQLfwvAAhpM8z9UNDgFRk1kfMYSv64ru3o4JStu26TAugAIBSEbsue5XmOe3CN+QtYHhlB4gtjjPU+Z51rWpMUGN2HEcei86ZBWcAcNR+CnUQBlCjLXp7fsehekDZv4HoP4cKys79uIAAAAASUVORK5CYII=')",
              "backgroundSize": "cover",
              "display": "block"
            }
          }}></span>{`
  `}<img parentName="span" {...{
            "className": "gatsby-resp-image-image",
            "alt": "IMAGE",
            "title": "IMAGE",
            "src": "/static/370c32ce06b13bf6e357b23241e24ab6/a1ee8/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-2.png",
            "srcSet": ["/static/370c32ce06b13bf6e357b23241e24ab6/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-2.png 250w", "/static/370c32ce06b13bf6e357b23241e24ab6/c6e3d/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-2.png 500w", "/static/370c32ce06b13bf6e357b23241e24ab6/a1ee8/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-equation-2.png 706w"],
            "sizes": "(max-width: 706px) 100vw, 706px",
            "style": {
              "width": "100%",
              "height": "100%",
              "margin": "0",
              "verticalAlign": "middle",
              "position": "absolute",
              "top": "0",
              "left": "0"
            },
            "loading": "lazy"
          }}></img>{`
    `}</span></p>
    </div>
    <h3 {...{
      "id": "other-model-update",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#other-model-update",
        "aria-label": "other model update permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Other Model Update`}</h3>
    <h4 {...{
      "id": "batch-norm-prenet",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#batch-norm-prenet",
        "aria-label": "batch norm prenet permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Batch Norm Prenet`}</h4>
    <p>{`The Prenet is an important part of Tacotron-like auto-regressive models. It
projects model output frames before they are passed to the decoder.
Essentially, it computes an embedding space of the feature (spectrogram) frames
by which the model de-factors the distribution of upcoming frames.`}</p>
    <p>{`I replaced the original Prenet (PrenetDropout) with the one using Batch
Normalization [`}<a parentName="p" {...{
        "href": "https://arxiv.org/abs/1502.03167"
      }}>{`3`}</a>{`] (PrenetBN) after each
dense layer, and I removed the Dropout layers. Dropout is necessary for
learning attention, especially when the data quality is low. However, it causes
problems at inference due to distributional differences between training and
inference time. Using Batch Normalization is a good alternative. It avoids the
issues of Dropout and also provides a certain level of regularization due to
the noise of batch-level statistics. It also normalizes computed embedding
vectors and generates a well-shaped embedding space.`}</p>
    <h4 {...{
      "id": "gradual-training",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#gradual-training",
        "aria-label": "gradual training permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Gradual Training`}</h4>
    <p>{`I used a gradual training scheme for the model training. I’ve introduced
gradual training in a previous `}<Link to='/blog/tts/gradual-training-with-tacotron-for-faster-convergence' mdxType="Link">{`blog
post`}</Link>{`. In short, we start the model training with a larger reduction
factor and gradually reduce it as the model saturates.`}</p>
    <p>{`Gradual Training shortens the total training time significantly and yields
better attention performance due to its progression from coarse to fine
information levels.`}</p>
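    {/*
    A minimal sketch of a gradual training schedule. The post says r is
    reduced as the model saturates; fixed step thresholds are used here only as
    a simple stand-in, and the schedule values and function name are
    hypothetical.

```python
# Hypothetical schedule: (first training step, reduction factor r),
# sorted by ascending step.
SCHEDULE = [(0, 7), (10_000, 5), (50_000, 3), (130_000, 2), (290_000, 1)]


def reduction_factor_at(schedule, global_step):
    """Return r of the latest stage whose start step has been reached."""
    r = schedule[0][1]
    for start_step, stage_r in schedule:
        if global_step >= start_step:
            r = stage_r
    return r


if __name__ == "__main__":
    for step in (0, 9_999, 10_000, 300_000):
        print(step, reduction_factor_at(SCHEDULE, step))
```

    Training starts coarse (large r) and moves towards fine (small r), so
    the attention is learned on the easier problem first.
    */}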
    <h4 {...{
      "id": "recurrent-postnet-at-inference",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#recurrent-postnet-at-inference",
        "aria-label": "recurrent postnet at inference permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Recurrent PostNet at Inference`}</h4>
    <p>{`The Postnet is the part of the network applied after the Decoder to improve the
Decoder predictions before the vocoder. Its output is summed with the Decoder’s
to form the final output of the model. In other words, it predicts a residual
that improves the Decoder output. We can therefore apply the Postnet more than
once, assuming it computes useful residual information each time. I applied this
trick only at inference and observed that, up to a certain number of iterations,
it improves the performance. For my experiments, I set the number of iterations
to 2.`}</p>
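    {/*
    A minimal sketch of applying the Postnet repeatedly at inference. It assumes
    each pass adds a fresh residual to the current estimate; the function name
    and the toy Postnet (which moves every value halfway towards 1.0) are
    hypothetical stand-ins for the real network.

```python
def refine_with_postnet(decoder_frames, postnet, num_iters=2):
    """Add the Postnet residual to the current frames num_iters times."""
    frames = list(decoder_frames)
    for _ in range(num_iters):
        residual = postnet(frames)
        frames = [f + r for f, r in zip(frames, residual)]
    return frames


def toy_postnet(frames):
    """Stand-in Postnet: residual that moves each value halfway towards 1.0."""
    return [0.5 * (1.0 - f) for f in frames]


if __name__ == "__main__":
    print(refine_with_postnet([0.0, 2.0], toy_postnet, num_iters=2))
```

    With num_iters=1 this reduces to the usual single Postnet pass.
    */}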
    <h4 {...{
      "id": "mb-melgan-vocoder-with-multiple-random-window-discriminator",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#mb-melgan-vocoder-with-multiple-random-window-discriminator",
        "aria-label": "mb melgan vocoder with multiple random window discriminator permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`MB-Melgan Vocoder with Multiple Random Window Discriminator`}</h4>
    <p>{`As a vocoder, I use the Multi-Band Melgan [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/2005.05106"
      }}>{`11`}</a>{`]
generator. It is trained with the Multiple Random Window Discriminator
(RWD) [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1909.11646"
      }}>{`13`}</a>{`], which differs from the
original work [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/2005.05106"
      }}>{`11`}</a>{`], where the authors used the
Multi-Scale Melgan Discriminator (MSMD) [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1909.11646"
      }}>{`12`}</a>{`].`}</p>
    <p>{`The main difference between these two is that RWD uses audio-level information
and MSMD uses spectrogram-level information. More specifically, RWD comprises
multiple convolutional networks, each of which takes audio segments of a
different length at a different sampling rate and performs classification,
whereas MSMD uses convolutional networks to perform the same classification on
the STFT output of the target voice signal.`}</p>
    <p>{`In my experiments, I observed that RWD yields better results, with a more
natural and less aberrated voice.`}</p>
    <h3 {...{
      "id": "related-work",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#related-work",
        "aria-label": "related work permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Related Work`}</h3>
    <p>{`Guided attention [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1710.08969"
      }}>{`4`}</a>{`] uses a soft diagonal
mask to force the attention alignment to be diagonal. Like DDC, it penalizes
the model with an additional loss term at training time, computed from this
constant mask. However, because the mask is constant, it imposes a fixed prior
on the model that does not always hold, especially for long sentences with
various pauses. It also caused skipping in my experiments, a problem the
authors try to solve with a windowing approach at inference time.`}</p>
    <p>{`Using multiple decoders was first introduced by
[`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1907.09006"
      }}>{`5`}</a>{`]. They use two decoders that run in
forward and backward directions through the encoder output. The main problem
with this approach is that, because the two decoders have identical
reduction factors, training is almost 2 times slower than with a vanilla
model. We solve the problem by giving the second decoder a higher reduction
factor. This accelerates training significantly and also lets the user
choose between the two decoders depending on run-time
requirements. DDC also does not use any complex scheduling or multiple loss
signals that complicate model training.`}</p>
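    <p>{`A rough way to see why a higher reduction factor is cheaper: assuming, as in
Tacotron-style decoders, that r is the number of mel frames predicted per
decoder step, the number of autoregressive steps shrinks by about a factor of r.
The helper name and the utterance length below are hypothetical.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-python"
      }}>{`import math


def decoder_steps(n_frames, r):
    """Number of decoder steps needed to emit n_frames with r frames per step."""
    return math.ceil(n_frames / r)


n_frames = 700  # hypothetical utterance length in mel frames
fine_steps = decoder_steps(n_frames, 1)    # second decoder with the same r
coarse_steps = decoder_steps(n_frames, 7)  # second decoder with r=7
`}</code></pre>
    <p>{`With these numbers the r=7 decoder runs 100 steps instead of 700, so adding
it costs far less than adding a second decoder with r=1.`}</p>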
    <p>{`Lately, new TTS models introduced by
[`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1905.09263"
      }}>{`7`}</a>{`][`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/2005.11129"
      }}>{`8`}</a>{`][`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/2006.04558"
      }}>{`9`}</a>{`][`}<a parentName="p" {...{
        "href": "https://doi.org/10.1109/icassp40776.2020.9054484"
      }}>{`10`}</a>{`]
predict output duration directly from the input characters. These models train
a duration predictor or use approximation algorithms to find the duration of
each input character. However, listening to their samples, one can observe
that these models lead to degraded timbre and naturalness. This is because of
the indirect hard alignment produced by these models. In contrast, models with
soft-attention modules can adaptively emphasize different parts of the speech,
producing more natural speech.`}</p>
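    <p>{`As an illustration of what "hard alignment" means here, the sketch below
expands each input character into a fixed number of output frames given by its
duration. This is a simplified picture under my own assumptions, not the exact
code of any of the cited models, and the function name is hypothetical.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-python"
      }}>{`def expand_by_duration(char_states, durations):
    """Repeat each character state durations[i] times (a hard alignment)."""
    if len(char_states) != len(durations):
        raise ValueError("one duration is required per input character")
    frames = []
    for state, n_frames in zip(char_states, durations):
        # Every output frame is tied to exactly one input character.
        frames.extend([state] * n_frames)
    return frames
`}</code></pre>
    <p>{`Each frame sees exactly one character, whereas a soft-attention decoder can
blend neighbouring characters with different weights at every step.`}</p>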
    <h3 {...{
      "id": "results-and-experiments",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#results-and-experiments",
        "aria-label": "results and experiments permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Results and Experiments`}</h3>
    <h4 {...{
      "id": "experiment-setup",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#experiment-setup",
        "aria-label": "experiment setup permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Experiment Setup`}</h4>
    <p>{`All the experiments are performed using the LJSpeech dataset
[`}<a parentName="p" {...{
        "href": "https://keithito.com/LJ-Speech-Dataset/"
      }}>{`6`}</a>{`]. I use a sampling rate of 22050
Hz and mel-scale spectrograms as the acoustic features. Mel-spectrograms are
computed with a hop length of 256 and a window length of 1024 samples, and are
normalized into `}{`[-4, 4]`}{`. The audio parameters used are shown below in our TTS
config format.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-js"
      }}>{`// AUDIO PARAMETERS
    "audio":{
        // stft parameters
        "num_freq": 513,         // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024,      // stft window length in samples.
        "hop_length": 256,       // stft window hop-length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data,
                                //   it is resampled.
        "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured.
                                //  If 0.0, no pre-emphasis.
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (false),
                                //  TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 0.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices.
                                //   Tune for dataset!!
        "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise
                                //   range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
    },
`}</code></pre>
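    <p>{`The sketch below shows how the normalization parameters above map a dB
value into [-4, 4], and what the hop and window lengths mean in milliseconds at
22050 Hz. It is a simplified reading of the config: the function name is
hypothetical, and it assumes the input is already a dB value with the reference
level subtracted.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-python"
      }}>{`def normalize_db(level_db, min_level_db=-100.0, max_norm=4.0,
                 symmetric_norm=True, clip_norm=True):
    """Map a dB value to [-max_norm, max_norm] (or [0, max_norm])."""
    # Scale so that min_level_db maps to 0 and 0 dB maps to 1.
    scaled = (level_db - min_level_db) / -min_level_db
    if symmetric_norm:
        value = 2.0 * max_norm * scaled - max_norm
        low = -max_norm
    else:
        value = max_norm * scaled
        low = 0.0
    if clip_norm:
        value = max(low, min(max_norm, value))
    return value


sample_rate = 22050
hop_ms = 1000.0 * 256 / sample_rate   # about 11.6 ms between frames
win_ms = 1000.0 * 1024 / sample_rate  # about 46.4 ms per analysis window
`}</code></pre>
    <p>{`With these defaults, -100 dB maps to -4, -50 dB to 0 and 0 dB to 4; anything
louder is clipped to 4.`}</p>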
    <p>{`I used Tacotron2 [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1712.05884"
      }}>{`2`}</a>{`] as the base architecture
with location-sensitive attention and applied all the model updates described
above. The model was trained for 330k iterations, which took 5 days on a
single GPU, although the model seems to produce satisfying quality after only 2
days of training with DDC. I used the gradual training schedule shown below. The
model starts with r=7 and batch size 64 and gradually reduces to r=1 and
batch size 32. The coarse decoder is kept at r=7 for the whole training.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-js"
      }}>{`{
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], // [first_step, r, batch_size]
}
`}</code></pre>
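    <p>{`To read the schedule: each entry takes effect from its first_step onwards.
A minimal lookup, assuming that interpretation of the [first_step, r,
batch_size] triples (the function name is hypothetical), could look like this:`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-python"
      }}>{`GRADUAL_TRAINING = [[0, 7, 64], [1, 5, 64], [50000, 3, 32],
                    [130000, 2, 32], [290000, 1, 32]]


def schedule_at(step, schedule=GRADUAL_TRAINING):
    """Return (r, batch_size) active at a given global training step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for first_step, new_r, new_batch_size in schedule:
        # Entries are sorted by first_step; the last one reached wins.
        if step >= first_step:
            r, batch_size = new_r, new_batch_size
    return r, batch_size
`}</code></pre>
    <p>{`For example, step 49999 still trains with r=5 and batch size 64, while step
50000 switches to r=3 and batch size 32.`}</p>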
    <p>{`I trained the MB-MelGAN vocoder using real spectrograms for up to 1.5M steps,
which took 10 days on a single GPU machine. For the first 600K iterations, it is
pre-trained with only the supervised loss as in
[`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/2005.05106"
      }}>{`11`}</a>{`], and then the discriminator is enabled
for the rest of the training. I did not apply any learning rate schedule and
used a constant learning rate of 1e-4 for the whole training.`}</p>
    <h4 {...{
      "id": "ddc-attention-performance",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#ddc-attention-performance",
        "aria-label": "ddc attention performance permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`DDC Attention Performance`}</h4>
    <p>{`The image below shows the validation alignments of the fine and the coarse
decoders, which have r=1 and r=7 respectively. We observe that the two decoders
show almost identical attention alignments, with slight roughness in the coarse
decoder's alignment due to the interpolation.`}</p>
    <p>{`DDC significantly shortens the time required to learn the attention alignment.
In my experiments, the model is able to align after just 1k steps, as opposed to
~8k steps with normal location-sensitive attention.`}</p>
    <p><span parentName="p" {...{
        "className": "gatsby-resp-image-wrapper",
        "style": {
          "position": "relative",
          "display": "block",
          "marginLeft": "auto",
          "marginRight": "auto",
          "maxWidth": "1000px"
        }
      }}>{`
      `}<span parentName="span" {...{
          "className": "gatsby-resp-image-background-image",
          "style": {
            "paddingBottom": "30%",
            "position": "relative",
            "bottom": "0",
            "left": "0",
            "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAYAAADDl76dAAAACXBIWXMAAAsTAAALEwEAmpwYAAABmUlEQVQY002QT0hUURSHj5iUoq1a5R/CbYERVMZYOjrPN+orx9LnjEENNKWCRUWmFYajjrQR0k0rIdCFTLmQoTZBhDWWtBhdFDhkugwhce+MX+e9EfLCj7s4H9/9nSvzUwmaioKEKrqxSyPYxyN0lt+mrfQWg9YTtneWcc676Q9YR68TrLjznytTrjzC3dpH/N35plQGmRmNc0Z8+PLa8UpAcxXvERuPXOHGqT5Wt1Zd4duXCc5KIw1yTZm2HHfYpkZascsipP6kyLrCsTjnxY95yMZXoCkMYuhdp+DN0/dZ3PyVE04muCDNmPkO14GvqHOfCxCq7OHTRppdVcps7A3VKvTrOkZxCCO/A1NBp2246iHJn79dofM1HmnBXxLKcVrAKVGvbbsqe/n8Y53dvT3kdXSOk9LAJR14xHJTI5c5JybBE718XUq7wvjEAlVicFEfOsg52wWOhVlKrpHJasOP80nueYd4Zo0z0DTCY3+Up1aM/sZhJrpfkV7ZdIVf3n/ngfFcuZjL9ZvD+1yUF+Ep1lIbZDNZ/gG2qRGhZ08LsQAAAABJRU5ErkJggg==')",
            "backgroundSize": "cover",
            "display": "block"
          }
        }}></span>{`
  `}<img parentName="span" {...{
          "className": "gatsby-resp-image-image",
          "alt": "IMAGE",
          "title": "IMAGE",
          "src": "/static/b2806805fe690ce3e06f39c3cb3e635e/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignment.png",
          "srcSet": ["/static/b2806805fe690ce3e06f39c3cb3e635e/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignment.png 250w", "/static/b2806805fe690ce3e06f39c3cb3e635e/c6e3d/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignment.png 500w", "/static/b2806805fe690ce3e06f39c3cb3e635e/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignment.png 1000w", "/static/b2806805fe690ce3e06f39c3cb3e635e/99b7b/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignment.png 1360w"],
          "sizes": "(max-width: 1000px) 100vw, 1000px",
          "style": {
            "width": "100%",
            "height": "100%",
            "margin": "0",
            "verticalAlign": "middle",
            "position": "absolute",
            "top": "0",
            "left": "0"
          },
          "loading": "lazy"
        }}></img>{`
    `}</span></p>
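    <p>{`The roughness of the coarse alignment comes from stretching it to the
length of the fine one before the two are compared. The sketch below uses
nearest-neighbour upsampling and a mean L1 difference; both choices, the function
names, and the layout (one row per decoder step, one column per input character)
are assumptions for illustration and may differ from the actual implementation.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-python"
      }}>{`def upsample_nearest(coarse, n_fine_steps):
    """Stretch a coarse alignment to n_fine_steps rows by repeating rows."""
    n_coarse = len(coarse)
    return [coarse[min(n_coarse - 1, t * n_coarse // n_fine_steps)]
            for t in range(n_fine_steps)]


def mean_l1(a, b):
    """Mean absolute difference between two alignments of equal shape."""
    total = sum(abs(x - y)
                for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


coarse = [[1.0, 0.0], [0.0, 1.0]]                       # 2 coarse steps
fine = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]  # 4 fine steps
consistency = mean_l1(fine, upsample_nearest(coarse, len(fine)))
`}</code></pre>
    <p>{`Repeating rows produces the blocky, step-like look of the coarse alignment;
the smaller the difference, the more the two decoders agree.`}</p>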
    <p>{`At inference time, we ignore the coarse decoder and use only the fine decoder.
The image below depicts the model outputs and attention alignments at inference
time with 4 different sentences that are not seen at training time. This shows
us that the fine decoder is able to generalize successfully to novel sentences.`}</p>
    <p><span parentName="p" {...{
        "className": "gatsby-resp-image-wrapper",
        "style": {
          "position": "relative",
          "display": "block",
          "marginLeft": "auto",
          "marginRight": "auto",
          "maxWidth": "1000px"
        }
      }}>{`
      `}<span parentName="span" {...{
          "className": "gatsby-resp-image-background-image",
          "style": {
            "paddingBottom": "39.6%",
            "position": "relative",
            "bottom": "0",
            "left": "0",
            "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAYAAAD5nd/tAAAACXBIWXMAAAsTAAALEwEAmpwYAAACHklEQVQoz1WR60uTcRTHHwiiP6BQgiRp2gVasm6GYAyE6VpO5mWWkruY6HTpKtGFecm5WjqjbC1TNIrqXS/7h4Rmbu6ic27P8+z59Nt8Uy++HM73fOCcw1caNDzFKNkwHuvEddPHdj5Fx/QGdT0B6u4v0D4ZIavs4zVMcPt4O8ZKBz1XR4jnUtiWv3LRE+KSb4luR5h4Mo7kuODlhmSmXshe62FrP4N5YZXavll07jlM4+9JHxwwcMXH9RNWwdmwnnYTywju3TpnvXNUPQtievSG2M4u0sy9JXrPD/NA58FvCbKT3MPz8zOWyGvMHxbpX90gmzpkfixCr36EvnPDPDHNkdzdw/drk5a1EM2bYZxf1olvZ5CKahFVVlFVIUVF0zTkolJWQTuqJU8tcQXlP04V84J6xMiK4IoakiIguSBTOJRRxBCKKHkFpdRnZYp5WXga/3JqmSt5atmTS4sUwYklkt8cwF7ppqt6iIneF6TzCQbCP7A2+TE9XMQ7ExUXZJi6G6KzwoVdJ17umiIjpxgKf8PSNk2TfwXPTIS9gwRSnwjlmmQVwbRirxliK5em5dUn9LfGqB6cpXl0hXQuS/9lHwbJIkJppbWqn5gIzzq9ht7wmBpXgDvOZWKJJNJ4R5C2M25sJx2MNj7nTzqN83uURm+QxuV57G+j7Kdy+LtDWCscdJxyMlg/yW4yg2tznQbXSxrCQTqDH9n5neIvQKfD2Urvl0EAAAAASUVORK5CYII=')",
            "backgroundSize": "cover",
            "display": "block"
          }
        }}></span>{`
  `}<img parentName="span" {...{
          "className": "gatsby-resp-image-image",
          "alt": "IMAGE",
          "title": "IMAGE",
          "src": "/static/09e724b596058252333f24f77120914e/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png",
          "srcSet": ["/static/09e724b596058252333f24f77120914e/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png 250w", "/static/09e724b596058252333f24f77120914e/c6e3d/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png 500w", "/static/09e724b596058252333f24f77120914e/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png 1000w", "/static/09e724b596058252333f24f77120914e/2e9ed/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png 1500w", "/static/09e724b596058252333f24f77120914e/9fabd/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png 2000w", "/static/09e724b596058252333f24f77120914e/7b24f/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-alignments.png 2048w"],
          "sizes": "(max-width: 1000px) 100vw, 1000px",
          "style": {
            "width": "100%",
            "height": "100%",
            "margin": "0",
            "verticalAlign": "middle",
            "position": "absolute",
            "top": "0",
            "left": "0"
          },
          "loading": "lazy"
        }}></img>{`
    `}</span></p>
    <p>{`I used the 50 hard sentences introduced by [`}<a parentName="p" {...{
        "href": "http://arxiv.org/abs/1905.09263"
      }}>{`7`}</a>{`]
to check the attention quality of the DDC model. As you can see in the
`}<a parentName="p" {...{
        "href": "https://colab.research.google.com/gist/erogol/32d22e21eaa1d0cc0cb52f0fd0c72c55/ddc_sentece_test_330k.ipynb"
      }}>{`notebook`}</a>{`
(Open it on Colab to listen to Griffin-Lim based voice samples), the DDC model
performs without any alignment problems. It is the first model, to my
knowledge, which performs flawlessly on these sentences.`}</p>
    <h4 {...{
      "id": "recurrent-postnet",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h4" {...{
        "href": "#recurrent-postnet",
        "aria-label": "recurrent postnet permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Recurrent Postnet`}</h4>
    <p>{`In the image below we see the average L1 difference between the real
mel-spectrogram and the model prediction for each Postnet iteration. The
results improve until the 3rd iteration. We also observe that some of the
artifacts left after the first iteration are removed by the second iteration,
which yields a better L1 value. This shows how effective the iterative
application of the Postnet is at improving the final model predictions.`}</p>
    <p><span parentName="p" {...{
        "className": "gatsby-resp-image-wrapper",
        "style": {
          "position": "relative",
          "display": "block",
          "marginLeft": "auto",
          "marginRight": "auto",
          "maxWidth": "1000px"
        }
      }}>{`
      `}<span parentName="span" {...{
          "className": "gatsby-resp-image-background-image",
          "style": {
            "paddingBottom": "70%",
            "position": "relative",
            "bottom": "0",
            "left": "0",
            "backgroundImage": "url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAAsTAAALEwEAmpwYAAADiUlEQVQ4yyVT209TBxzu25K5l6k4NAg4vA2VobTFwjmnt9PTlspkm8JkoAXaQlvOaemNUi7uYYBSkIK36LIs0QcTl/mibFnM4mXZYmJcZEOjJsseZvawPZBsf8C375w+fPmd3zm/3/e7fcfUsTsF9+ZBHNmRgL9Ghb9WRVu1ikCtBrkyioh7Bq/X72MiXIKzku/3pOHfOQI/83Qb2JWCXD2MobYF/PH3CkxKZQzNb5yCs2IIDhI7KohNg3BuDMO2oQ+9zVN4/s+PGOkuwrohBMd2FfZtcTiqhuHYSjD/8NthBO0zePLX9zC1s5KdZP6aBLwM9tZo8FZr8BHOrXGEnJ/j/us1pPovkEiFb08Gys4UFOYpu9LwEfYdSXZYxJ3fn8AU2J2GtHkI3toEPLVJIgGlSoVCa982jAHnNO6+fIFE3yVIVRqUvVnIOkjsIbGHz2LdCAYDRdxaXWWH/GCviMJflypXZjUv4aPvIGHINY17T18iFbwEOwt594/Co0Pv9L0svPW5coft87j9868wyQyyvtln7ELSsSUKiQXstIffGkCv7TQePHiG+LESrBsjkDiFyPWIXI+4XYNEa+WEQXkGKyscOXNsEVHvLEY65qG1z0H9YA5a4CyS9OO+WUzHvsRvj15hefImou1FJI+XoH10DqoOxiSYH+P7M+lrePzDGkyjA1egdi4h03MBia4S1K5F+otI91yE1lnCbO4G/v3vMa6WvkH8xEVkOLr26XkMdy9B7V5GOnjZsGfGv8b6+kOY3NybmS2LdRyFhxDeLY8k8cqWd6Lo5lGeUTZa3yKatsQhUIcCc1p5ZYF7Frk/c1Ucvb5iWTYBywREilNpLBjLlt8nGvJQCGlfFv1HF3DvzzUkY1cg8qIe8wTkxjHIB8fgJjwHCxCYE+lcxm1dNr7mSbSwqnyoAJcOPfBA3ghubcgh2LFgyEaNXUVLfRbu5gk4rYR53Ih1WcZhO5RHqGupLJs2fhSoRQ+JXXw2KjcVIDNBPJBD/4cLhmwSsS/QSt8tfQanMGXEumyTcNum0NI0hvCJ5bJs3I15WPjfSg2jEPbnINRniCwkwsq99hyZw0PKJjZwGU17uTdLgQR5tLIrnUigNe/L4CSP+d23v8AU+2QJp9rOYvDjc4iwGwNH5xGhHyTZpPoVnv70AsWpmzhJmUSohBBVEKYCwsdp9XzmnM5dx6O7q/gfk0JYhvkA5uEAAAAASUVORK5CYII=')",
            "backgroundSize": "cover",
            "display": "block"
          }
        }}></span>{`
  `}<img parentName="span" {...{
          "className": "gatsby-resp-image-image",
          "alt": "IMAGE",
          "title": "IMAGE",
          "src": "/static/e4a0e3fe77a7f947321db273f2d8b2ea/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png",
          "srcSet": ["/static/e4a0e3fe77a7f947321db273f2d8b2ea/43fa5/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png 250w", "/static/e4a0e3fe77a7f947321db273f2d8b2ea/c6e3d/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png 500w", "/static/e4a0e3fe77a7f947321db273f2d8b2ea/da8b6/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png 1000w", "/static/e4a0e3fe77a7f947321db273f2d8b2ea/2e9ed/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png 1500w", "/static/e4a0e3fe77a7f947321db273f2d8b2ea/9fabd/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png 2000w", "/static/e4a0e3fe77a7f947321db273f2d8b2ea/0404f/blog-tts-solving-attention-problems-of-tts-models-with-double-decoder-consistency-recurrent-postnet.png 2044w"],
          "sizes": "(max-width: 1000px) 100vw, 1000px",
          "style": {
            "width": "100%",
            "height": "100%",
            "margin": "0",
            "verticalAlign": "middle",
            "position": "absolute",
            "top": "0",
            "left": "0"
          },
          "loading": "lazy"
        }}></img>{`
    `}</span></p>
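    <p>{`The measurement above can be sketched as a loop that re-applies the Postnet
and records the L1 error after each pass. This assumes the Postnet predicts a
residual that is added to its input; the toy postnet here is a stand-in that
only makes the example runnable, not the real network, and all names are
hypothetical.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-python"
      }}>{`def mean_l1(pred, target):
    """Mean absolute difference between two flat lists of values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def iterate_postnet(decoder_output, postnet, n_iters):
    """Apply a residual postnet n_iters times, keeping every intermediate."""
    outputs = []
    current = decoder_output
    for _ in range(n_iters):
        residual = postnet(current)
        current = [c + r for c, r in zip(current, residual)]
        outputs.append(current)
    return outputs


# Toy stand-in: pulls each value halfway towards 1.0 on every pass.
def toy_postnet(frame):
    return [0.5 * (1.0 - v) for v in frame]


target = [1.0, 1.0]
errors = [mean_l1(out, target)
          for out in iterate_postnet([0.0, 0.0], toy_postnet, 3)]
`}</code></pre>
    <p>{`In this toy setting the error halves on every pass; with a real Postnet the
per-iteration errors are what the plot above reports.`}</p>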
    <h3 {...{
      "id": "future-work",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#future-work",
        "aria-label": "future work permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Future Work`}</h3>
    <p>{`First of all, I hope this section will not turn out to be a “here are the
things we’ve not tried and will not try” section.`}</p>
    <p>{`There are three specific aspects of DDC which I would like to
investigate further. The first is sharing the weights between the fine and the
coarse decoders to reduce the total number of model parameters and observing
how the shared weights benefit from different resolutions.`}</p>
    <p>{`The second is to measure the level of complexity required by the coarse
decoder. That is, how much simpler the coarse architecture can get without
performance loss.`}</p>
    <p>{`Finally, I would like to try DDC with different model architectures.`}</p>
    <h3 {...{
      "id": "conclusion",
      "style": {
        "position": "relative"
      }
    }}><a parentName="h3" {...{
        "href": "#conclusion",
        "aria-label": "conclusion permalink",
        "className": "anchor before"
      }}><svg parentName="a" {...{
          "xmlns": "http://www.w3.org/2000/svg",
          "width": "16",
          "height": "16",
          "focusable": "false",
          "viewBox": "0 0 16 16"
        }}>{`
  `}<path parentName="svg" {...{
            "fill": "currentColor",
            "d": "M4.441 7.38l.095.083.939.939-.708.707-.939-.939-2 2-.132.142a2.829 2.829 0 003.99 3.99l.142-.132 2-2-.939-.939.707-.708.94.94a1 1 0 01.083 1.32l-.083.094-2 2A3.828 3.828 0 01.972 9.621l.15-.158 2-2A1 1 0 014.34 7.31l.101.07zm7.413-3.234a.5.5 0 01.057.638l-.057.07-7 7a.5.5 0 01-.765-.638l.057-.07 7-7a.5.5 0 01.708 0zm3.023-3.025a3.829 3.829 0 01.15 5.257l-.15.158-2 2a1 1 0 01-1.32.083l-.094-.083-.94-.94.708-.707.939.94 2-2 .132-.142a2.829 2.829 0 00-3.99-3.99l-.142.131-2 2 .939.939-.707.708-.94-.94a1 1 0 01-.082-1.32l.083-.094 2-2a3.828 3.828 0 015.414 0z"
          }}></path>
        </svg></a>{`Conclusion`}</h3>
    <p>{`Here I have summarized a new method that significantly accelerates model
training, provides robust attention alignment, and offers a trade-off between
quality and speed by switching between the fine and the coarse
decoders at inference. The user can choose depending on run-time requirements.`}</p>
    <p>{`You can replicate all this work using our
`}<a parentName="p" {...{
        "href": "https://github.com/coqui-ai/TTS"
      }}>{`TTS`}</a>{`. You can also see voice samples and
Colab Notebooks from the links above. Let me know how it goes if you try DDC in
your project.`}</p>
    <p>{`If you would like to cite this work, please use:`}</p>
    <p><em parentName="p">{`Gölge E. (2020) Solving Attention Problems of TTS models with Double Decoder Consistency.
erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/`}</em></p>


    </MDXLayout>;
}
;
MDXContent.isMDXComponent = true;
      