M*A*S*H Archive
Problem
I have about 180GB of MASH episodes in .mkv and .mp4, plus the transcripts of each episode in two different formats, and I need a way to convert the video from .mkv and .mp4 into .webm for easy web playback in Chrome. That makes about 360GB of MASH episodes I have stored. After trying to find a compatible web media player, I realized that the .mkv format is not suitable for streaming in a browser, and the .mp4 files aren't in the right codec for browser playback either.
That fucking sucks.
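A quick way to confirm the problem is to ask ffprobe what's actually inside one of the rips (the filename here is just an example):

```bash
# Dump each stream's type and codec for one episode (example path).
# Chrome can natively play VP8/VP9 + Vorbis/Opus in .webm (or H.264 + AAC in .mp4),
# so anything else showing up here explains the playback failure.
ffprobe -v error \
  -show_entries stream=index,codec_type,codec_name \
  -of default=noprint_wrappers=1 \
  S01E01.mkv
```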
Encoding video is expensive and time consuming.
But I've always wanted to learn more about the process.
What I didn't want to do is pipe this TV show into AWS's custom-built video transcoding solutions.
That just seems like a bad idea.
I already probably shouldn't be putting DVD rips in cloud storage.
I can hack my way through this.
Solutions
Start up a huge EC2 instance and work really quickly, mounting an Elastic File System with the episodes already loaded from Filebase. This kinda worked, but I pushed it too hard. The problem was that FFmpeg could only use 4 threads and 3 cores for this encoding, so I spun up a 96-core instance and then started a few dozen conversions in parallel. It was probably the equivalent of busting out a blowtorch to power wash your driveway. I had no idea how many videos at once was too much, so I got as close as I could to 90% CPU usage. It also cost me $4 an hour to run. I ran it for about 3 hours before it stopped encoding altogether. For some reason it got each episode to about 50MB before dying, which seems to be about a quarter to a third of the episode. I'm not sure why it failed, and I'm not tempted to try again with such a big instance. The odds of me forgetting it's on, falling asleep, and waking up to a $200+ bill are high.

Total investment: $25
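For reference, the "few dozen conversions in parallel" part looked roughly like this. This is a minimal sketch rather than the exact commands I ran, and the paths, thread cap, and concurrency level are all illustrative:

```bash
# Kick off up to 24 encodes at once, each ffmpeg capped at 4 threads
# (libvpx-vp9 wasn't making use of more than that for me anyway).
ls downloads/*.mkv | xargs -P 24 -I {} sh -c '
  name=$(basename "{}" .mkv)
  ffmpeg -i "{}" -threads 4 -vcodec libvpx-vp9 -b:v 1M -acodec libvorbis "output/${name}.webm"
'
```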
This does bring up the question: if the huge AWS EC2 instance cost me $4.10/h, and it was (unsuccessfully) transcoding 30 episodes at once, and there are over 250 episodes, and it took about four hours to fail at 25%... that's about $524.80 for encoding 250 episodes. That's insane. I really didn't want to spend a bunch of time transcoding video this weekend, and I really didn't want to feel the burden of managing close to 1/3 of a terabyte on my MacBook's 1-terabyte hard drive for a TV show that premiered on September 17, 1972.

Time to scale.
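Before scaling, though, here's the back-of-the-envelope behind that $524.80. It's my own rough math, not an actual AWS bill:

```bash
# 250 episodes / 30 at a time ≈ 8 batches (bc's integer division floors this);
# 4 hours only got ~25% done, so call it 4*4 = 16 hours per batch, at $4.10/hour.
echo "(250 / 30) * (4 * 4) * 4.10" | bc
# → 524.80
```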
Let's try the other extreme.
My new idea is to create 251 small instances.
Each of these instances is launched from an AMI I configured with credentials to my Filebase account, and each instance has rclone, ssh, and ffmpeg installed. On initialization, each instance executes the following script with an iterator as an argument:
```bash
#!/bin/bash
# First argument: this instance's index into the episode list
ID=$1

# Read the episode list into an array
readarray -t list < /home/4077/encoder/list.txt

# Use the instance id (iterator) as the position in the list
EPISODE=${list[ID]}

# Print out the episode name for the cloud-init logs
echo "$EPISODE"

echo "Starting file download."
rclone copy filebase:BUCKET_URL/${EPISODE}.mkv /home/4077/encoder/downloads/
echo "Finished file download."

echo "Starting conversion."
ffmpeg -i /home/4077/encoder/downloads/${EPISODE}.mkv -vcodec libvpx-vp9 -b:v 1M -acodec libvorbis /home/4077/encoder/output/${EPISODE}.webm
echo "Finished conversion."

echo "Starting file download deletion."
rm /home/4077/encoder/downloads/${EPISODE}.mkv
echo "Finished file download deletion."

echo "Starting file upload."
rclone copy /home/4077/encoder/output/${EPISODE}.webm filebase:BUCKET_URL/web/
echo "Finished file upload."
```
For reference, list.txt is a list of the episode filenames, without an extension:

```
S01E01
S01E02
S01E03
```
These instances will be provisioned with infrastructure as code, in this case Pulumi. I can write a quick Python script that launches each instance from the pre-configured AMI and runs the bash script above as a cloud-init script.
```python
import base64

import pulumi
import pulumi_aws as aws

# Get the config ready to go.
config = pulumi.Config()

# If keyName is provided, an existing KeyPair is used; else if publicKey is
# provided, a new KeyPair derived from the publicKey is created.
key_name = config.get('keyName')
public_key = config.get('publicKey')

# The privateKey associated with the selected key must be provided
# (either directly or base64 encoded).
def decode_key(key):
    try:
        key = base64.b64decode(key.encode('ascii')).decode('ascii')
    except Exception:
        pass
    if key.startswith('-----BEGIN RSA PRIVATE KEY-----'):
        return key
    return key.encode('ascii')

private_key = config.require_secret('privateKey').apply(decode_key)

# Allow SSH (for snooping on progress) and HTTP in, and everything out.
secgrp = aws.ec2.SecurityGroup('secgrp',
    description='Foo',
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(protocol='tcp', from_port=22, to_port=22, cidr_blocks=['0.0.0.0/0']),
        aws.ec2.SecurityGroupIngressArgs(protocol='tcp', from_port=80, to_port=80, cidr_blocks=['0.0.0.0/0']),
    ],
    egress=[aws.ec2.SecurityGroupEgressArgs(
        from_port=0,
        to_port=0,
        protocol="-1",
        cidr_blocks=["0.0.0.0/0"],
        ipv6_cidr_blocks=["::/0"],
    )],
)

# The pre-configured AMI with rclone, ssh, and ffmpeg installed.
ami = "ami-XXXXXXXXXXXXXXXXXXXX"

if key_name is None:
    key = aws.ec2.KeyPair('key', public_key=public_key)
    key_name = key.key_name

episodes = 251

# One instance per episode: the cloud-init script passes the loop index
# to encoder.sh, which uses it to pick a line from list.txt.
for i in range(0, episodes):
    user_data = f"""#!/bin/bash
/home/4077/encoder/encoder.sh {i}
"""

    encoder = aws.ec2.Instance(f"encoder-{i}",
        ami=ami,
        instance_type="t2.large",
        user_data=user_data,
        key_name=key_name,
        vpc_security_group_ids=[secgrp.id],
        associate_public_ip_address=True,
        tags={
            "Name": f"{i}-encoder",
        })
```
This approach will:
- Let me start instances at a size of my choosing, run one command on each to start the conversion, and snoop on the progress over SSH
- Spare me from downloading every episode and storing it on EFS beforehand, which costs $$$
- Control the entire deployment pipeline of each instance using a simple [pulumi](https://www.pulumi.com/) script and a for loop
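Once the Pulumi program is written, launching and destroying the whole fleet is just the standard Pulumi workflow (stack setup and config omitted here):

```bash
# Preview and create all 251 encoder instances defined in the program
pulumi up

# Tear every instance back down once the uploads have finished
pulumi destroy
```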
Little swarms of computers
I tested this out with 100 t2.medium instances to start with.
Creating and destroying all the resources using Pulumi was a breeze.
I could bring up and down these instances a hundred at a time, and watch the instances do their thing with some cool metrics.
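Snooping on a single worker is just SSH plus the cloud-init log, something like the following; the key name is hypothetical and the login user and log path depend on the AMI:

```bash
# Follow one encoder's progress: the echo statements in encoder.sh land in the
# cloud-init output log on typical Ubuntu/Amazon Linux images.
ssh -i mash-key.pem ubuntu@<instance-public-ip> \
  tail -f /var/log/cloud-init-output.log
```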
Each instance took 12 hours to finish the average 22-minute M*A*S*H episode. About six episodes are double features, so those took about 16 hours to finish. I was able to convert every episode of the show into webm format, and it only took me 3 days.

Continuation
This project has continued to be a side project of mine.
I have collected data from each episode's wiki pages and compiled that information into a web application that can play any and all episodes of MASH for my binging consumption.
I really enjoy having this show on in the background while working. It's a nice comfort show for passing the time. I often forget it's on, and then I smile, because that's what it's there for. Then I rewind!
All in all, I ended up paying about $1.90 an episode to transcode the video; the project cost about $500 in total. The monthly cost for bandwidth from streaming the episodes has been under $2 a month while using decentralized storage CDNs. I'd say that I learned a lot from the experience, and I have some ideas on how I could improve the project in the future, or when I decide to build upon my encoding solution. Perhaps next time I'll try using a bare metal instance and a single Micro VM per episode and compare the results.
Variables to further investigate:
- Instance Size
- Multithreading/Multicore processing
- ffmpeg options (see the sketch after this list)
  - Best quality
  - Best speed
  - Best resolution
  - Compromise
- Highest CPU utilization
- GPU-Intensive video transcoding?
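For the ffmpeg options in particular, the two extremes I'd compare look roughly like this. These are untested settings, the filenames are placeholders, and the exact numbers would be part of the investigation:

```bash
# "Best quality" direction: constant-quality (CRF) mode instead of a fixed 1M bitrate
ffmpeg -i input.mkv -vcodec libvpx-vp9 -crf 30 -b:v 0 -acodec libvorbis quality.webm

# "Best speed" direction: trade quality for encode time and enable row-based multithreading
ffmpeg -i input.mkv -vcodec libvpx-vp9 -b:v 1M -deadline realtime -cpu-used 8 \
  -row-mt 1 -threads 8 -acodec libvorbis fast.webm
```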
Screenshots
The website is also a Progressive Web Application, so you can install it from your Chrome browser!