🏥

M*A*S*H Archive

Problem

I have about 180GB of M*A*S*H episodes in .mkv and .mp4, plus transcripts of each episode in two different formats, and I need a way to convert the video from .mkv/.mp4 into .webm for easy web playback in Chrome. All told, that's about 360GB of M*A*S*H sitting in storage.
 
After trying to find a compatible web media player, I realized that the .mkv container isn't suitable for streaming in a browser, and the .mp4 files aren't encoded with a browser-friendly codec either.
That fucking sucks.
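For the curious, a quick way to check what a file is actually encoded with is ffprobe, which ships with ffmpeg (the filename below is a placeholder):

```bash
# List each stream's type and codec; browsers generally want VP8/VP9/AV1 video
# and Vorbis/Opus audio inside a webm container.
ffprobe -v error -show_entries stream=codec_type,codec_name \
  -of default=noprint_wrappers=1 episode.mkv
```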
 
Encoding video is expensive and time consuming.
But I've always wanted to learn more about the process.
 
What I didn't want to do is pipe this TV show into AWS's custom-built video transcoding solutions.
That just seems like a bad idea.
 
I already probably shouldn't be putting DVD rips in cloud storage.
 
I can hack my way through this.

Solutions

Start up a huge EC2 instance and work really quickly, mounting an Elastic File System with the episodes already loaded from Filebase. This kinda worked, but I pushed it too hard. The problem was that ffmpeg could only use about 4 threads (roughly 3 cores) per encode, so I spun up a 96-core instance and started a few dozen conversions in parallel. It was probably the equivalent of busting out a blowtorch to power wash your driveway. I had no idea how many videos at once was too much, so I got as close as I could to 90% CPU usage.
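For a rough idea of what that looked like (a sketch only, with placeholder EFS paths rather than my exact setup): loop over the episodes on the mount and background a batch of ffmpeg jobs.

```bash
# Blowtorch mode, roughly: kick off a pile of VP9 encodes in the background
# and watch CPU usage climb. Paths are placeholders.
for f in /mnt/efs/episodes/*.mkv; do
  name=$(basename "$f" .mkv)
  nohup ffmpeg -i "$f" -vcodec libvpx-vp9 -b:v 1M -acodec libvorbis \
    "/mnt/efs/web/${name}.webm" > "/tmp/${name}.log" 2>&1 &
done
```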
 
It also cost me $4 an hour to run. I ran it for about 3 hours before it stopped encoding altogether. For some reason it got each episode to about 50MB before dying, which seems to be about a quarter to a third of an episode. I'm not sure why it failed, and I'm not tempted to try again with such a big instance. The odds of me forgetting it's on, falling asleep, and waking up to a $200+ bill are high.
 
Total investment: $25
 
This does bring up the question: if the huge AWS EC2 instance cost me $4.10/hour while (unsuccessfully) transcoding 30 episodes at once, there are over 250 episodes, and it took about four hours to fail at 25%... that works out to roughly 8 batches of 16 hours each, or about $524.80 to encode 250 episodes. That's insane. I really didn't want to spend a bunch of time transcoding video this weekend, and I really didn't want the burden of managing close to a third of a terabyte on my MacBook's 1TB hard drive for a TV show that premiered on September 17, 1972.
 
Time to scale.
Let's try the other extreme.
 
My new idea is to create 251 small instances.
Each of these instances is launched from an AMI I configured with credentials for my Filebase account, with rclone, ssh, and ffmpeg installed.
On initialization, each instance executes the following script with an iterator as its argument:
```bash
#!/bin/bash
# First argument
ID=$1
# Read the episode list into an array
readarray -t list < /home/4077/encoder/list.txt
# Use the instance ID (iterator) as the position in the list
EPISODE=${list[ID]}
# Print out the episode name for cloud-init logs
echo $EPISODE
echo "Starting file download."
rclone copy filebase:BUCKET_URL/${EPISODE}.mkv /home/4077/encoder/downloads/
echo "Finished file download."
echo "Starting conversion."
ffmpeg -i /home/4077/encoder/downloads/${EPISODE}.mkv -vcodec libvpx-vp9 -b:v 1M -acodec libvorbis /home/4077/encoder/output/${EPISODE}.webm
echo "Finished conversion."
echo "Starting file download deletion."
rm /home/4077/encoder/downloads/${EPISODE}.mkv
echo "Finished file download deletion."
echo "Starting file upload."
rclone copy /home/4077/encoder/output/${EPISODE}.webm filebase:BUCKET_URL/web/
echo "Finished file upload."
```
For reference, list.txt is a list of the episode filenames, without an extension.
```
S01E01
S01E02
S01E03
```
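If you want to build that list straight from the bucket, something like this works (one way to do it; the bucket name is a placeholder):

```bash
# List every .mkv in the bucket, strip the extension, and save one name per line.
rclone lsf filebase:BUCKET_URL --include "*.mkv" | sed 's/\.mkv$//' | sort > list.txt
```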
 
These instances will be provisioned by writing infrastructure as code. In this case, Pulumi.
I can write a quick Python script that launches each instance from the pre-configured AMI and runs the bash script above as a cloud-init script.
 
```python
import base64

import pulumi
import pulumi_aws as aws

# Get the config ready to go.
config = pulumi.Config()

# If keyName is provided, an existing KeyPair is used, else if publicKey is
# provided a new KeyPair derived from the publicKey is created.
key_name = config.get('keyName')
public_key = config.get('publicKey')

# The privateKey associated with the selected key must be provided (either
# directly or base64 encoded), along with an optional passphrase if needed.
def decode_key(key):
    try:
        key = base64.b64decode(key.encode('ascii')).decode('ascii')
    except:
        pass
    if key.startswith('-----BEGIN RSA PRIVATE KEY-----'):
        return key
    return key.encode('ascii')

private_key = config.require_secret('privateKey').apply(decode_key)

secgrp = aws.ec2.SecurityGroup('secgrp',
    description='Foo',
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(protocol='tcp', from_port=22, to_port=22, cidr_blocks=['0.0.0.0/0']),
        aws.ec2.SecurityGroupIngressArgs(protocol='tcp', from_port=80, to_port=80, cidr_blocks=['0.0.0.0/0']),
    ],
    egress=[aws.ec2.SecurityGroupEgressArgs(
        from_port=0,
        to_port=0,
        protocol="-1",
        cidr_blocks=["0.0.0.0/0"],
        ipv6_cidr_blocks=["::/0"],
    )],
)

ami = "ami-XXXXXXXXXXXXXXXXXXXX"

if key_name is None:
    key = aws.ec2.KeyPair('key', public_key=public_key)
    key_name = key.key_name

episodes = 251

for i in range(0, episodes):
    # Cloud-init user data: run the encoder script with this instance's episode index.
    user_data = f"""#!/bin/bash
/home/4077/encoder/encoder.sh {i}
"""
    encoder = aws.ec2.Instance(f"encoder-{i}",
        ami=ami,
        instance_type="t2.large",
        user_data=user_data,
        key_name=key_name,
        vpc_security_group_ids=[secgrp.id],
        associate_public_ip_address=True,
        tags={
            "Name": f"{i}-encoder",
        })
```
This approach will:
  1. Let me start instances in a size of my choosing, kick off the conversion with a single command, and snoop on the progress over SSH (see the snippet below).
  2. Avoid downloading every episode and storing it on EFS beforehand, which costs $$$.
  3. Control the entire deployment pipeline of each instance using a simple [pulumi](https://www.pulumi.com/) script and a for loop.
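Snooping looks roughly like this (the user and IP are placeholders; on Ubuntu-based AMIs, cloud-init output generally lands in /var/log/cloud-init-output.log):

```bash
# Tail the cloud-init log to watch the download/encode/upload echoes go by.
# User and IP address are placeholders.
ssh ubuntu@203.0.113.17 tail -f /var/log/cloud-init-output.log
```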
 

Little swarms of computers

I tested this out with 100 t2.medium instances to start with.
Creating and destroying all the resources using Pulumi was a breeze.
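The whole loop is just the standard Pulumi CLI (the stack name here is a placeholder):

```bash
# Spin up all the encoder instances defined in the script above...
pulumi up --yes --stack mash-encoder
# ...and once the uploads land in the bucket, tear everything down.
pulumi destroy --yes --stack mash-encoder
```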
 
I could bring up and down these instances a hundred at a time, and watch the instances do their thing with some cool metrics.
 
[screenshot: instance metrics]
 
Each instance took 12 hours to finish the average 22-minute M*A*S*H episode.
 
About six episodes are double features, so those took about 16 hours to finish.
 
I was able to convert every episode of the show into webm format and it only took me 3 days.
 

Continuation

This project has continued to be a side project of mine.
 
I have collected data from each episode's wiki pages and compiled that information into a web application that can play any and all episodes of MASH for my binging consumption.
 
I really enjoy having this show on in the background while working. It's a nice comfort show for passing the time. I often forget it's on, and then I smile, because that's what it's there for. Then I rewind!
 
All in all, I ended up paying about $1.90 an episode to transcode the video. The project cost about $500 in total.
 
The monthly cost for bandwidth from streaming the episodes has been under $2 while using decentralized storage CDNs.
 
I'd say that I learned a lot from the experience, and I have some ideas on how I could improve the project in the future, whenever I decide to build on my encoding solution. Perhaps next time I'll try using a bare-metal instance with a single micro VM per episode and compare the results.
 
Variables to further investigate:
  • Instance Size
  • Multithreading/Multicore processing
  • ffmpeg options (see the sketch after this list)
    • Best quality
    • Best speed
    • Best resolution
    • Compromise
  • Highest CPU utilization
  • GPU-Intensive video transcoding?
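On the ffmpeg options front, libvpx-vp9 has a few well-documented knobs worth experimenting with. A minimal sketch of the two extremes (filenames are placeholders, not settings I've benchmarked):

```bash
# Quality-first: constant-quality mode (-b:v 0 with -crf) and the slowest -cpu-used.
ffmpeg -i episode.mkv -c:v libvpx-vp9 -b:v 0 -crf 31 \
  -deadline good -cpu-used 0 -row-mt 1 -c:a libvorbis episode_hq.webm

# Speed-first: cap the bitrate, raise -cpu-used, and enable row-based
# multithreading (-row-mt) so the encoder can actually use more cores.
ffmpeg -i episode.mkv -c:v libvpx-vp9 -b:v 1M \
  -deadline good -cpu-used 5 -row-mt 1 -c:a libvorbis episode_fast.webm
```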

Screenshots

The website is also a Progressive Web Application, so you can install it from your Chrome browser!
[screenshots of the web application]