Improving Video Accessibility with WebVTT

“The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect.”
– Tim Berners-Lee

Accessibility is an important element of web development, and with the ever-growing prevalence of video content, the necessity for captioned content is growing as well. WebVTT is a technology that solves helps with captioned content as a subtitle format that integrates easily with already-existing web APIs.

That’s what we’re going to look at here in this article. Sure, WebVTT is captioning at its most basic, but there are ways to implement it to make videos (and the captioned content itself) more accessible for users.

Hi, meet the WebVTT format

First and foremost: WebVTT is a type of file that contains the text “WebVTT” and lines of captions with timestamps. Here’s an example:

WEBVTT

00:00:00.000 --> 00:00:03.000
- [Birds chirping]
- It's a beautiful day!

00:00:04.000 --> 00:00:07.000
- [Creek trickling]
- It is indeed!

00:00:08.000 --> 00:00:10.000
- Hello there!

A little weird, but makes pretty good sense, right? As you can see, the first line is “WEBVTT” and it is followed by a time range (in this case, 0 to 3 seconds) on Line 3. The time range is required. Otherwise, the WEBVTT file will not work at all and it won’t even display or log errors to let you know. Finally, each line below a time range represents captions contained in the range.

Note that you can have multiple captions in a single time range. Hyphens may be used to indicate the start of a line, though it’s not required and more stylistic than anything else.

The time range can be one of two formats: hh:mm:ss.tt or mm:ss.tt. Each part follows certain rules:

Hours (hh): Minimum of two digits
Minutes (mm): Between 00 and 59, inclusive
Seconds (ss): Between 00 and 59, inclusive
Milliseconds (tt): Between 000 and 999, inclusive

This may seem rather daunting at first. You’re probably wondering how anyone can be expected to type and tweak this all by hand. Luckily, there are tools to make this easier. For example, YouTube can automatically caption videos for you with speech recognition in addition to allowing you to download the caption as a VTT file as well! But that’s not it. WebVTT can also be used with YouTube as well by uploading your VTT file to your YouTube video.

Once we have this file created, we can then embed it into an HTML5 video element.

<video autoplay="autoplay" controls="controls" width="300" height="150">
  <source src="your_video.mp4" type="video/mp4">
  <track default="" kind="captions" srclang="en" label="English" src="your_caption_file.vtt">
</video>

The tag is sort of like a script that “plays” along with the video. We can use multiple tracks in the same video element. The default attribute indicates that a the track will be enabled automatically.

Let’s run down all the attributes while we’re at it:

srclang indicates what language the track is in.
kind represents the type of track it is and there are five kinds:
- subtitles are usually translations and descriptions of different parts of a video.
- descriptions help unsighted users understand what is happening in a video.
- captions provide un-hearing users an alternative to audio.
- metadata is a track that is used by scripts and cannot be seen by users.
- chapters assist in navigating video content.
label is a title for the text track that appears in the caption track
src is the source file for the track. It cannot come from a cross-origin source unless crossorigin is specified.

While WebVTT is designed specifically for video, you can still use it with audio by placing an audio file within a <video> element.

Digging into the structure of a WebVTT file

MDN has great documentation and outlines the body structure of a WebVTT file, which consists of up to six components. Here’s how MDN breaks it down:

An optional byte order mark (BOM)

The string “WEBVTT“

An optional text header to the right of WEBVTT.

There must be at least one space after WEBVTT.

You could use this to add a description to the file.

You may use anything in the text header except newlines or the string “-->“.

A blank line, which is equivalent to two consecutive newlines.

Zero or more cues or comments.

Zero or more blank lines.

Note: a BOM is a unicode character that indicates the unicode encoding of the text file.

Bold, italic, and underline — oh my!

We can absolutely use some inline HTML formatting in WebVTT files! These are the ones that everyone is familiar with: <b>, <i>, and <u>. You use them exactly as you would in HTML.

WEBVTT

00:00:00.000 --> 00:00:03.000 align:start
This is <b>bold text</b>

00:00:03.000 --> 00:00:06.000 align:middle
This is <i>italic text</i>

00:00:06.000 --> 00:00:09.000 vertical:rl align:middle
This is <u>underlined  text</u>

Cue settings

Cue settings are optional strings of text used to control the position of a caption. It’s sort of like positioning elements in CSS, like being able to place captions on the video.

For example, we could place captions to the right of a cue timing, control whether a caption is displayed horizontally or vertically, and define both the alignment and vertical position of the caption.

Here are the settings that are available to us.

Setting 1: Line

line controls the positioning of the caption on the y-axis. If vertical is specified (which we’ll look at next), then line will instead indicate where the caption will be displayed on the x-axis.

When specifying the line value, integers and percentages are perfectly acceptable units. In the case of using an integer, the distance per line will be equal to the height (from a horizontal perspective) of the first line. So, for example, let’s say the height of the first line of the caption is equal to 50px, the line value specified is 2, and the caption’s direction is horizontal. That means the caption will be positioned 100px (50px times 2) down from the top, up to a maximum equal to coordinates of the boundaries of the video. If we use a negative integer, it will move upward from the bottom as the value decreases (or, in the case of vertical:lr being specified, we will move from right-to-left and vice-versa). Be careful here, as it’s possible to position the captions off-screen in addition to the positioning being inconsistent across browsers. With great power comes great responsibility!

In the case of a percentage, the value must be between 0-100%, inclusive (sorry, no 200% mega values here). Higher values will move the caption from top-to-bottom, unless vertical:lr or vertical:rl is specified, in which case the caption will move along the x-axis accordingly.

As the value increases, the caption will appear further down the video boundaries. As the value decreases (including into the negatives), the caption will appear further up.

Tough picture this without examples, right? Here’s how this translates into code:

00:00:00.000 --> 00:00:03.000 line:50%
This caption should be positioned horizontally in the approximate center of the screen.

00:00:03.000 --> 00:00:06.000 vertical:lr line:50%
This caption should be positioned vertically in the approximate center of the screen.

00:00:06.000 --> 00:00:09.000 vertical:rl line:-1
This caption should be positioned vertically along the left side of the video.

00:00:09.000 --> 00:00:12.000 line:0
The caption should be positioned horizontally at the top of the screen.

Setting 2: Vertical

vertical indicates the caption will be displayed vertically and move in the direction specified by the line setting. Some languages are not displayed left-to-right and instead need a top-to-bottom display.

  00:00:00.000 --> 00:00:03.000 vertical:rl
This caption should be vertical.

00:00:00.000 --> 00:00:03.000 vertical:lr
This caption should be vertical.

Setting 3: Position

position specifies where the caption will be displayed along the x-axis. If vertical is specified, the position will instead specify where the caption will be displayed on the y-axis. It must be an integer value between 0% and 100%, inclusive.

00:00:00.000 --> 00:00:03.000 vertical:rl position:100%
This caption will be vertical and toward the bottom.

00:00:03.000 --> 00:00:06.000 vertical:rl position:0%
This caption will be vertical and toward the top.

At this point, you may notice that line and position are similar to the CSS flexbox properties for align-items and justify-content, and that vertical behaves a lot like flex-direction. A trick for remembering WebVTT directions is that line specifies a position perpendicular to the flow of the text, whereas position specifies the position parallel to the flow of the text. That’s why line suddenly moves along the horizontal axis, and position moves along the vertical axis if we specify vertical.

Setting 4: Size

size specifies the width of the caption. If vertical is specified, then it will set the height of the caption instead. Like other settings, it must be an integer between 0% and 100%, inclusive.

00:00:00.000 --> 00:00:03.000 vertical:rl size:50%
This caption will fill half the screen vertically.

00:00:03.000 --> 00:00:06.000 position:0%
This caption will fill the entire screen horizontally.

Setting 5: Align

align specifies where the text will appear horizontally. If vertical is specified, then it will control the vertical alignment instead.

The values we’ve got are: start, middle, end, left and right. Without vertical specified, the alignments are exactly what they sound like. If vertical is specified, they effectively become top, middle (vertically), and bottom. Using start and end as opposed to left and right, respectively, is a more flexible way of allowing the alignment to be based on the unicode-bidi CSS property’s plaintext value.

Note that align is not unaffected by vertical:lr or vertical:rl.

WEBVTT

00:00:00.000 --> 00:00:03.000 align:start
This caption will be on the left side of the screen.

00:00:03.000 --> 00:00:06.000 align:middle
This caption will be horizontally in the middle of the screen.

00:00:06.000 --> 00:00:09.000 vertical:rl align:middle
This caption will be vertically in the middle of the screen.

00:00:09.000 --> 00:00:12.000 vertical:rl align:end
This caption will be vertically at the bottom right of the screen regardless of vertical:lr or vertical:rl orientation.

00:00:12.000 --> 00:00:15.000 vertical:lr align:end
This caption will be vertically at the bottom of the screen, regardless of the vertical:lr or vertical:rl orientation.

00:00:12.000 --> 00:00:15.000 align:left
This caption will appear on the left side of the screen.

00:00:12.000 --> 00:00:15.000 align:right
This caption will appear on the right side of the screen.

WebVTT Comments

WebVTT comments are strings of text that are only visible when reading the source text of the file, the same way we think of comments in HTML, CSS, JavaScript and any other language. Comments may contain a new line, but not a blank line (which is essentially two new lines).

WEBVTT

00:00:00.000 --> 00:00:03.000
- [Birds chirping]
- It's a beautiful day!

NOTE This is a comment. It will not be visible to anyone viewing the caption.

00:00:04.000 --> 00:00:07.000
- [Creek trickling]
- It is indeed!

00:00:08.000 --> 00:00:10.000
- Hello there!

When the caption file is parsed and rendered, the highlighted line above will be completely hidden from users. Comments can be multi-line as well.

There are three very important characters/strings to take note of that may not be used in comments: <, &, and -->. As an alternative, you can use escaped characters instead.

Not Allowed	Alternative
`NOTE PB&J`	`NOTE PB&J`
`NOTE 5 < 7`	`NOTE 5 < 7`
`NOTE puppy --> dog`	`NOTE puppy --> do`

A few other interesting WebVTT features

We’re going to take a quick look at some really neat ways we can customize and control captions, but that are lacking consistent browser support, at least at the time of this writing.

Yes, we can style captions!

WebVTT captions can, in fact, be styled. For example, to style the background of a caption to be red, set the background property on the ::cue pseudo-element:

video::cue {
  background: red;
}

Remember how we can use some inline HTML formatting in the WebVTT file? Well, we can select those as well. For example, to select and italic (<i>) element:

video::cue(i) {
  color: yellow;
}

Turns out WebVTT files support a style block, a lot like the way HTML files do:

WEBVTT

STYLE
::cue {
  color: blue;
  font-family: "Source Sans Pro", sans-serif;
}

Elements can also be accessed via their cue identifiers. Note that cue identifiers use the same escaping mechanism as HTML.

WEBVTT

STYLE
::cue(#middle\ cue\ identifier) {
  text-decoration: underline;
}
::cue(#cue\ identifier\ \33) {
  font-weight: bold;
  color: red;
}

first cue identifier
00:00:00.000 --> 00:00:02.000
Hello, world!

middle cue identifier
00:00:02.000 --> 00:00:04.000
This cue identifier will have an underline!

cue identifier 3
00:00:04.000 --> 00:00:06.000
This one won't be affected, just like the first one!

Different types of tags

Many tags can be used to format captions. There is a caveat. These tags cannot be used in a element where kind attribute is chapters. Here are some formatting tags you can use.

The class tag

We can define classes in the WebVTT markup using a class tag that can be selected with CSS. Let’s say we have a class, .yellowish that makes text yellow. We can use the tag in a caption. We can control lots of styling this way, like the font, the font color, and background color.

/* Our CSS file */
.yellowish {
  color: yellow;
}
.redcolor {
  color: red;
}

WEBVTT

00:00:00.000 --> 00:00:03.000
This text should be yellow. This text will be the default color.

00:00:03.000 --> 00:00:06.000
This text should be red. This text will be the default color.

The timestamp tag

If you want to make captions appear at specific times, then you will want to use timestamp tags. They’re like fine-tuning captions to exact moments in time. The tag’s time must be within the given time range of the caption, and each timestamp tag must be later than the previous.

WEBVTT

00:00:00.000 --> 00:00:07.000
This <00:00:01.000>text <00:00:02.000&>will <00:00:03.000>appear <00:00:04.000>over <00:00:05.000>6 <00:00:06.000>seconds.

The voice tag

Voice tags are neat in that they help identify who is speaking.

WEBVTT

00:00:00.000 --> 00:00:03.000
How was your day, Bob?

00:00:03.000 --> 00:00:06.000
Great, yours?

The ruby tag

The ruby tag is a way to display small, annotative characters above the caption.

WEBVTT

00:00:00.000 --> 00:00:05.000
<ruby>This caption will have text above it<rt>This text will appear above the caption.</rt></ruby>

Conclusion

And that about wraps it up for WebVTT! It’s an extremely useful technology and presents an opportunity to improve your site’s accessibility a great deal, particularly if you are working with video. Try some of your own captions out yourself to get a better feel for it!

Mitch

# July 17, 2019

Do you know about browser support for the voice tag? Doesn’t seem to be showing up for me. I’m using videoJS.

Flimm

# July 18, 2019

Do WebVTT files need to be encoded in UTF-8? Or can they be encoded in other encodings?

Malc

Permalink to comment# August 17, 2019

Thanks!

Gary Katsevman

# July 22, 2019

Voice tags are supported but I don’t think any browser has default colors for the different voices. You have to provide styles manually.
Styling itself is also complicated as most browsers don’t support it. The vtt.js project, which Video.js uses for captions by default on browsers other than Safari, doesn’t support the CSS extensions in webvtt.

The webvtt file must be utf-8 as per the specification: https://www.w3.org/TR/webvtt/#file-structure