By Gabriel, 30 Jun 2024, updated 30 Jun 2024
This post is about self-hosting videos on a website. The most common way to host videos is to use a video hosting platform like YouTube, Vimeo, or Dailymotion, but there are reasons why you might want to host videos on your own server instead. In this post I will review the pros and cons of self-hosting videos and provide a walkthrough on how to do it. For now it is an experiment to see how it goes; I haven't made up my mind yet on the best approach.
With self-hosting, you upload the video file to your own server (a VPS, a dedicated server, or a shared hosting account) and embed the video player on your website. In this approach you are responsible for the video encoding, storage, and delivery.
The other approach is to use a video hosting platform like YouTube, Vimeo, or Dailymotion: you upload the video to the platform, which takes care of the encoding, storage, and delivery. You can then embed the platform's player on your website or simply share a link to the video on the platform.
There is a lot of reading on the internet about the pros and cons of each approach. For a video hosting platform, the important benefits to me are: a good CDN; availability of the video in a large number of resolutions and encodings; and, for YouTube especially, (potentially) bigger exposure, because the recommendation algorithm can push your video to new users. For self-hosting, what seems important to me is full control of the player and the recommendations. Then of course there is the financial aspect: a video hosting platform is either free but with ads served to your users, or paid.
I do feel that over the years the technical advantages of video platforms have shrunk with the development of web standards for video, and some problems with self-hosting videos are now just myths.
Of course the biggest platform of them all, YouTube, is likely to give your content better discovery, ultimately more views, and potentially more revenue if that is the goal. But this post is about the technical aspect: is it reasonably easy to self-host video, and will it provide a good enough UX to your audience? I want to think it is, and I believe some types of video usage are better served by this approach.
How do you go from several raw clips on your phone to a video ready to be uploaded to your website, with good enough compression and the right format? Put another way: how can you replace all the convenient tools of a video hosting platform with off-the-shelf tools?
In the description below I have chosen to make two important simplifications: I serve a single resolution (no adaptive streaming) and a single format (one video/audio codec pair, rather than multiple sources for wider browser coverage).
I have used iMovie (on Mac) to put the right content together from multiple phone-recorded clips: select, cut, combine, add transitions, and finally export to a file. This was the first time I used the tool, but for a non-specialist like me it was quite easy to use and I was able to produce a video I was happy with.
I then exported the video at a 540p resolution (final resolution 304x540 for the example video).
Miscellaneous: I recorded all the clips in portrait mode on my phone (720x1280, i.e. 9:16), and iMovie, AFAIK, doesn't let the user choose the aspect ratio of the produced video when creating a "Movie" (it defaults to 16:9). Creating a Movie would, in my case, produce a video with two big black bars on either side to fit 16:9 without cropping any part of the input clips. The trick I found (on Reddit) was to create an "App Preview" (instead of a "Movie") in iMovie: this is a 9:16 video.
The videos produced by my phone have a high bitrate (3.7Mb/s), i.e. they are big files, not optimized for web delivery. The video produced by iMovie is not much different (same codecs), except that the lower resolution reduces the file size a bit (bitrate 2.3Mb/s).
The goal is to serve a single-resolution video with the smallest possible bitrate while keeping quality good enough for viewing on a mobile device. Some compression artifacts are OK, but they shouldn't distract the viewer from the content. After comparing with the bitrates of a few videos from elsewhere (see tables below), I settled on a bitrate of around 1Mb/s.
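To measure those bitrates, ffprobe (bundled with ffmpeg, which is introduced below) does the job; a minimal sketch, using the phone clip from the first table below:

# Print resolution and bitrate of the video stream, plus overall file duration and bitrate
ffprobe -v error -select_streams v:0 -show_entries stream=width,height,bit_rate:format=duration,bit_rate -of default=noprint_wrappers=1 VID_20240530_132534.mp4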
Video producers can choose the file format and the codecs used for the video and the audio. I wanted a modern, open-source codec supported by most browsers. I am not aiming for 100% browser support, because the last ~1% of users force video producers to provide several codecs and formats for the same video. I have chosen the VP9 codec for video, the Opus codec for audio, and the WebM container for the final video.
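For the record, covering those last users would mean also producing an H.264/AAC MP4 with a tool like ffmpeg (introduced below) and listing it as a second source in the markup. A sketch of the extra encode I chose to skip, with placeholder filenames:

# Optional H.264/AAC fallback for browsers without VP9/Opus support (not part of my pipeline)
ffmpeg -i input.mp4 -c:v libx264 -crf 23 -preset medium -c:a aac -b:a 128k fallback.mp4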
This is where the great tool ffmpeg comes in. Initially developed by Fabrice Bellard, it is now maintained by the FFmpeg team. It "is a free and open-source software project consisting of a suite of libraries and programs for handling video, audio, and other multimedia files and streams. At its core is the command-line ffmpeg tool itself, designed for processing video and audio files." (from Wikipedia). From my understanding, it is the Swiss Army knife of video processing, and it is very popular.
With ffmpeg, I used the VP9 codec for video, the Opus codec for audio, and the WebM container. Two-pass encoding is recommended for VP9, but I used a single pass because it was much quicker and more appropriate for this small experiment (and the time I had for it!).
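For reference, the two-pass variant would look something like this (a sketch following the libvpx-vp9 documentation; input.mp4 and output.webm are placeholders):

# Pass 1: analysis only; discard the output, keep the log file
ffmpeg -i input.mp4 -c:v libvpx-vp9 -b:v 0 -crf 35 -pass 1 -an -f null /dev/null
# Pass 2: the real encode, reusing the pass-1 log
ffmpeg -i input.mp4 -c:v libvpx-vp9 -b:v 0 -crf 35 -pass 2 -c:a libopus output.webm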
On my machine (MacBook Pro 2019 - Intel i9 8 cores - 16 GB), the encoding speed was about 0.5x real time (e.g. a 7min24s video took 15min to encode).
I have used ffmpeg to extract a frame from the video to use as a thumbnail.
I often find myself watching videos with captions on. It is not that I have a hearing problem; I just find it easier to follow the content. I would consider (without any user research to back that up :) ) that it is an important feature to have when serving videos. Transcription is a time-consuming task, but I think that thanks to machine learning, automatic speech recognition is now a solved problem. Solved the way chess is solved: the machine has reached human-level performance, and one can get there using open-source tools running on a modest personal computer. I have used OpenAI's Whisper, which is to my knowledge the first such tool available.
More precisely, I used G. Gerganov's great C++ port of OpenAI's Whisper: whisper.cpp. It runs on a CPU-only machine and is available as a Docker image. It is a command-line tool that takes an audio file as input and outputs a transcript in the VTT format. It comes with models of different sizes: the bigger the model, the better the quality of the transcript but the longer the processing time. I tested the base model initially, but found that medium.en was a better fit for my use case.
See caption extracts below for the video I have posted here.
Using the base model:
[00:00:00.000 --> 00:00:11.400] Hi, so today the walk will be to finish the concrete here so remove all the water
[00:00:11.400 --> 00:00:19.360] accumulated over a weak here heavy rain and concrete it and then do a bit
[00:00:19.360 --> 00:00:24.840] the same thing on the other bearer on the deck so this bearer was put a few
[00:00:24.840 --> 00:00:32.940] months ago now the change it's to it's not kind of that way so I'm gonna unscrew
(...)
Using the medium.en model (with the option to limit segment length to 32 characters for better on-screen readability):
[00:00:00.000 --> 00:00:07.490] Hi, so today the work will be
[00:00:07.490 --> 00:00:10.710] to finish the concrete here, so
[00:00:10.710 --> 00:00:11.820] remove all the water that
[00:00:11.820 --> 00:00:15.010] accumulated over a week here,
[00:00:15.010 --> 00:00:18.070] heavy rain, and do a bit of the
[00:00:18.070 --> 00:00:20.520] same thing on the other
[00:00:20.520 --> 00:00:23.950] bearer on the deck. So this be
[00:00:23.950 --> 00:00:27.270] arer was put a few months ago,
[00:00:27.270 --> 00:00:29.920] now the change it's not
[00:00:29.920 --> 00:00:32.080] far enough that way. So I'm
[00:00:32.080 --> 00:00:34.270] going to unscrew the four
[00:00:34.270 --> 00:00:36.700] attachments to the four posts,
(...)
-> fewer mistakes and better punctuation.
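For reference, whisper.cpp writes these cues as a standard WebVTT file, which is exactly what the <track> element shown later consumes; the start of the file looks like this:

WEBVTT

00:00:00.000 --> 00:00:07.490
Hi, so today the work will be

00:00:07.490 --> 00:00:10.710
to finish the concrete here, so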
On my machine (MacBook Pro 2019 - Intel i9 8 cores - 16 GB), the processing times were:
- base: about 8x real time (e.g. 312 sec of audio -> 41 sec of processing)
- medium.en: about 1x real time (e.g. 312 sec of audio -> 328 sec of processing)

Putting it all together, here are the commands for the whole pipeline:

export FILENAME=20240530--deck
# Encode the video to a webm file, using vp9 codec for video and opus codec for audio
ffmpeg -i ${FILENAME}.mp4 -c:v libvpx-vp9 -crf 35 -b:v 0 -c:a libopus ${FILENAME}.webm
# Extract the audio track, then convert it to 16 kHz mono wav (the input format whisper.cpp expects)
ffmpeg -i ${FILENAME}.mp4 -c:a copy -vn audios/${FILENAME}.m4a
ffmpeg -i audios/${FILENAME}.m4a -ar 16000 -ac 1 -c:a pcm_s16le audios/${FILENAME}.wav
# Generate captions with whisper.cpp; -ml 32 limits segment length to 32 characters, --output-vtt writes a .vtt file
docker run -it --rm -v ${PWD}/models:/models -v ${PWD}/audios:/audios -e FILENAME=${FILENAME} ghcr.io/ggerganov/whisper.cpp:main "./main -m /models/ggml-medium.en.bin -f /audios/${FILENAME}.wav -ml 32 --output-vtt -of /audios/${FILENAME}"
# Extract a frame from the video to use as a thumbnail
ffmpeg -ss 00:00:13 -i ${FILENAME}.mp4 -frames:v 1 -q:v 2 ${FILENAME}.jpg
Example given for a 1min15sec video, processed through the steps described above, and published here.
Source | Filename | res. | format | video codec | audio codec | size | bitrate |
---|---|---|---|---|---|---|---|
Phone | VID_20240530_132534.mp4 | 720x1280 | mp4 | h264 | aac | 34.8MB | 3727 kb/s |
iMovie | 20240530--deck.mp4 | 304x540 | mp4 | h264 | aac | 21.7MB | 2321 kb/s |
ffmpeg | 20240530--deck.webm | 304x540 | webm | vp9 | opus | 5.7MB | 641 kb/s |
For comparison, here are stats about other videos on different platforms:
Source | Name | duration | res. | format | video codec | audio codec | size | bitrate |
---|---|---|---|---|---|---|---|---|
a video received 1 (Jan-2024) | n/a | 0m52s | 848x480 | mp4 | h264 | aac | 10.0MB | 1575 kb/s |
a video sent 1 (May-2024) | n/a | 0m58s | 848x478 | mp4 | h264 | aac | 12.0MB | 1695 kb/s |
YouTube | Aznavour la boheme (live) | 5m20s | 320x240 | mp4 | h264 | aac | 11.2MB | 286 kb/s |
YouTube | DIY DECK Part 6 - Railing | 12m11s | 1290x720 (720p) | mp4 | h264 | aac | 91.6MB | 1.0 Mb/s |
YouTube | How to install Joist hanger | 1m01s | 1290x720 (720p) | mp4 | h264 | aac | 4.01MB | 0.5 Mb/s |
YouTube | How to build deck part 4 - Boards | 3m37s | 1290x720 (720p) | mp4 | h264 | aac | 25.7MB | 0.9 Mb/s |
Note: the sizes of the YouTube videos are the sizes of the video files downloaded with NewPipe on my Android phone.
The video element is supported by all modern browsers. I have used very few of the available attributes and no custom styling at all for this experiment so far.
Example of the rendered markup for this video:
<video id="video" controls preload="metadata" poster="/assets/videos/posters/20240530--deck.jpg" style="aspect-ratio: 304 / 540">
<source src="/assets/videos/20240530--deck.webm" type="video/webm" />
<track
label="English"
kind="subtitles"
srclang="en"
src="/assets/videos/captions/20240530--deck.vtt"
default />
Download the
<a href="/assets/videos/20240530--deck.webm">WEBM</a>
video.
</video>
Reviewing the main attributes: controls displays the browser's native playback controls; preload="metadata" fetches only the video metadata (duration, dimensions) until the user hits play, which saves bandwidth; poster shows the extracted thumbnail before playback; the aspect-ratio style reserves the right space in the layout while the video loads; and the track element attaches the VTT captions, with default enabling them automatically.
I have added VideoObject structured data markup to help search engines (mainly Google) understand the content of the video, which in turn should help with indexing and discovery. Just trying to get back a little of the discoverability lost by not using YouTube!
<script type="application/ld+json">{
"@context": "https://schema.org",
"@type": "VideoObject",
"name": "Deck Building Journal - Protecting the Deck",
"description": "Adding 2 layers of water-based varnish to protect the deck.",
"thumbnailUrl": [
"https://www.info2007.net/assets/videos/posters/20240530--deck.jpg"
],
"uploadDate": "2024-05-30T00:00:00+00:00",
"duration": "PT1M14S",
"contentUrl": "https://www.info2007.net/assets/videos/20240530--deck.webm"
}</script>
I have followed that approach to publish 11 videos related to a home project (without editing), see Videos. I am happy with the result. The video quality is good enough for the content. The auto-generated captions are not perfect, but I'll blame my accent and the sometimes poor audio captured with my phone outside! My server is located in France with no CDN, and when I tested from my current location in Australia, the videos loaded fast enough not to be a noticeable problem.
The research work to put the pipeline described above in place took a couple of days, but now adding a video to the website is quite straightforward.
This is just the initial experiment. Over the coming months I will probably add more videos, which will create more scenarios, and I will collect usage feedback from users and from myself (different browsers, networks, etc.). All of this will refine the experiment and help reach a broader conclusion as to what can and cannot be done with self-hosted videos.