Building Streaming Video Apps With HTML: Technical Considerations

Erik Uggeldahl

In the Media & Entertainment industry, we have the unlucky situation of needing to handle quite a few more platforms than other neighboring industries. Consider that for most app developers, mobile and web are sufficient, meaning at most three codebases assuming native iOS and Android, though with diligent mobile web best practices this can all be folded under one web codebase. For most TV-connected devices like Apple TV or Amazon Fire TV, it really is only the media consumption use case that is compelling. The interaction medium of the remote, the distance from the screen, and the fixed location of the device limit other applications.

At You.i TV we’ve observed that differences between platforms, or “fragmentation”, have an exponential rather than linear effect on complexity. That is to say, each factor that a developer must account for is multiplicative with the others rather than additive. Unfortunately, these problems are unavoidable for those in the OTT space. The expectation of the end-user is that it should not matter which device I have purchased (or was gifted!), I should be able to watch your content. This implies somewhere on the order of 5-15 platforms, depending on your market and willingness to chase the tail of the distribution. In North America, for example, OTT devices such as Roku, Android TV, Apple TV, Amazon Fire TV, and next-gen consoles in PlayStation and Xbox are household staples. We’ve also seen that departing from the norm of covering all platforms can put user reception at risk. Take for example Quibi, which initially launched on mobile only, later delivering on TV-connected devices around the same time as their closure.

Thus it falls to engineering management and teams to choose a technology strategy that will determine many factors for the upcoming development cycle of a streaming video app. Product Management will no doubt have high expectations. It might not be until mid-development that the downsides of a selection become apparent, just as you’re scaling from the first API connections, authentication, and playback into the full app experience.

With this pressure to choose correctly, it isn’t unreasonable for us to fall back on what we know. For many front-end developers that means either web or native. In this blog post, I’ll be focusing on the former, though it’s important to note that on many platforms, such as Samsung Tizen and LG webOS, the two are in fact one and the same!

Specifically, I’d like to argue that because we often choose technologies that we are comfortable with, we are at risk of becoming blind to their drawbacks. We become accustomed to thinking of them in aggregate, rather than with the individual nuance that is inherent to cross-platform development. Let’s examine web (or HTML) development for multiscreen under this lens.

The Allure of Web

We can begin by addressing why it’s compelling to choose a fully web-based stack. Firstly, it’s free, and already that’s a great proposition. Web development has been built up from the combination of many tens of thousands of development hours from a diverse community with diverse incentives. The last decade has seen stabilization of the JavaScript fatigue problem, where choices such as NPM, TypeScript, Webpack, ESLint, React, and Babel aren’t controversial and have little in the way of equal competitors (maybe I’m being too generous here, but State of JS 2019 has interesting trend data).

Web is often one of the first “must-have” platforms, followed by mobile. As such, hiring for web front-end engineers is a routine prospect. Many bootcamps and online courses are focused on producing readily skilled applicants in the most common front-end stacks.

Common problems are well documented on StackOverflow, and APIs are covered in depth on MDN or on dedicated websites, such as a personal favorite, Flexbox Froggy. GitHub issues track decisions made and planned progress in a transparent manner. And for whichever technology most interests you, you’re likely to find a community and even a conference to suit you.

With these benefits in mind, it’s natural to gravitate towards web as a ubiquitous solution to app development.

Platform Reach

The first problem arises when considering how many of our target platforms we can reach using web technologies. Let’s enumerate:

Additionally, you may encounter proprietary solutions with either scant documentation or documentation behind NDAs, such as for PlayStation, Vewd, or set-top boxes. Since we cannot discuss these with public information, we’ll focus on the above list, but be aware that items identified below will likely be exacerbated in these closed-door platforms.

Notably absent in this list are the Apple TV and Roku platforms, the latter holding a significant market share in North America. If you plan to have a presence on these two platforms, you may be looking at two additional codebases. For Roku, you’re looking for elusive BrightScript developers on a separate team or contracting the work out to a third party. For Apple TV, it’s tempting to look at TVMLKit and consider it part of your web strategy, but further inspection reveals that it is quite divorced from web development, bearing only the similarity of using JavaScript. This means, similar to Roku, hiring Apple-focused engineers on a separate team, perhaps to share with iOS on a combined codebase, or contracting the work.

Disparate Platform APIs

Assuming that we go forward with the above list while handling Roku and Apple TV as exceptions, the next step is to scrutinize how much codeshare we can achieve under a single umbrella codebase.

Inspecting the documentation of the above, we find that many platforms present unique APIs to access functions on the device, as such functionality is device-specific and hasn’t been commonly ratified under the W3C. Areas that differ between platforms include:

  • App lifecycle particular to a given operating system and with different policies for backgrounding, resuming, and terminating
  • Handling input specific to the capabilities of the device’s input method, such as LG’s “Magic Remote” which acts as a pointer, or the various bindings to the back and media buttons
  • Similarly, display and behavior of the keyboard, including keyboard types and whether it should push content on the screen to avoid hiding forms
  • In-app purchasing and advertisement APIs particular to the vendor’s policies
  • File I/O, storage policy, cookies, and local database availability may vary
  • Network availability, HTTP/2 support, IPv6 support, TLS version, and cache policy and size, including whether and how it can be configured
  • Service workers, whether they exist and are threaded, and how they interact with the main app
  • Device information, such as unique IDs, MAC addresses, and versions
  • Expectations and capabilities of:
    • Offline usage
    • Deep linking
    • Voice interface
    • Accessibility
  • Ability to host all or portions of the app to update dynamically, and how that is arranged

While this list is not exhaustive, it should provide a sense of the scope of the problem. There’s also one major rabbit hole that deserves special consideration, that being the media playback which lies at the heart of our streaming video app. For that, we can compose a list unto its own.

  • Supported streaming formats (HLS, DASH, Smooth) and encodings (H.264, HEVC, AV1, etc.)
  • Media Source Extensions (MSE) and Encrypted Media Extensions (EME) support
  • Supported DRM formats (Widevine, PlayReady, FairPlay, AES encryption, other)
  • Support for:
    • Multiple audio tracks
    • Multiple subtitle tracks and their formats
    • Trickplay
    • Adaptive bitrates and Quality of Service (QoS) analytics
    • Ads, Client-Side Ad Insertion (CSAI) or Server-Side Ad Insertion (SSAI)
  • Interface to streaming metadata
  • Ability to position and animate the video as an element on the screen, e.g. within list items
  • Ability to pop the video out as an overlay for Picture-in-Picture (PiP)
  • Default player chrome and ability to skin
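
To make the first few items of this list concrete, here is a minimal sketch of choosing a playable stream per device. All type and function names are hypothetical, and the support record is assumed to be filled in elsewhere; on browser-based platforms it might be populated from checks such as MediaSource.isTypeSupported and navigator.requestMediaKeySystemAccess where those exist.

```typescript
type StreamFormat = "hls" | "dash" | "smooth";
type Codec = "h264" | "hevc" | "av1";
type Drm = "none" | "widevine" | "playready" | "fairplay";

// What one device is believed to support (hypothetical shape).
interface DeviceMediaSupport {
  formats: StreamFormat[];
  codecs: Codec[];
  drm: Drm[];
}

// One packaged version of a title.
interface StreamVariant {
  url: string;
  format: StreamFormat;
  codec: Codec;
  drm: Drm;
}

// True when the device supports the variant's format, codec, and DRM.
function canPlay(support: DeviceMediaSupport, v: StreamVariant): boolean {
  return (
    support.formats.indexOf(v.format) !== -1 &&
    support.codecs.indexOf(v.codec) !== -1 &&
    (v.drm === "none" || support.drm.indexOf(v.drm) !== -1)
  );
}

// Returns the first playable variant, or undefined if none match.
// Variants are assumed to be listed in order of preference.
function chooseVariant(
  support: DeviceMediaSupport,
  variants: StreamVariant[]
): StreamVariant | undefined {
  for (const variant of variants) {
    if (canPlay(support, variant)) return variant;
  }
  return undefined;
}
```

Keeping the decision in one pure function means the per-device differences live in data rather than in branches scattered through the player code.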

The combination of these two lists poses a daunting challenge for development teams in a multiscreen capacity. For each inconsistency between platforms, we will need to craft an appropriate abstraction that isolates it from the rest of the system. If we go by the venerable Joel Spolsky’s Law of Leaky Abstractions, “All non-trivial abstractions, to some degree, are leaky”, then some platform-specific details will unavoidably escape our abstraction into our common codebase. The quality of our abstractions will be a function of how much time and experience we apply to them, and if we apply too little, the leak will become a metaphorical flood, drastically increasing the complexity of our app code for routine operations.
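
As a small illustration of such an abstraction, the sketch below normalizes remote-control input behind one interface. The adapter shape and names are hypothetical, and the back key codes and user-agent markers are assumptions that should be verified against each vendor's documentation.

```typescript
type RemoteAction = "back" | "select" | "up" | "down" | "left" | "right";

// The only surface the shared app code is meant to see.
interface PlatformAdapter {
  readonly name: string;
  // Maps a raw keyCode from a keydown event to a logical action.
  toAction(keyCode: number): RemoteAction | undefined;
}

// Key codes shared by standard keyboards and most remotes.
const COMMON_KEYS: { [keyCode: number]: RemoteAction } = {
  13: "select",
  37: "left",
  38: "up",
  39: "right",
  40: "down",
};

function makeAdapter(name: string, backKeyCode: number): PlatformAdapter {
  return {
    name,
    toAction(keyCode) {
      if (keyCode === backKeyCode) return "back";
      return COMMON_KEYS[keyCode];
    },
  };
}

// Assumed back key codes: verify against the vendor documentation.
const ADAPTERS = {
  tizen: makeAdapter("tizen", 10009),
  webos: makeAdapter("webos", 461),
  browser: makeAdapter("browser", 27), // Escape stands in for "back"
};

// Assumed user-agent markers: verify on real devices.
function detectAdapter(userAgent: string): PlatformAdapter {
  if (/Tizen/i.test(userAgent)) return ADAPTERS.tizen;
  if (/Web0S|webOS/i.test(userAgent)) return ADAPTERS.webos;
  return ADAPTERS.browser;
}
```

The leak the quote warns about shows up as soon as a platform needs something the interface cannot express, such as a pointer-style remote, at which point the interface has to grow or the app code has to special-case.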

As web developers, we may be accustomed to browsers handling the brunt of cross-platform concerns, but the unfortunate reality is that this buffer is greatly diminished on OTT devices and we will need to adjust expectations accordingly.

The Evergreen Ecosystem

Similarly, in the last decade, we have grown accustomed to browsers adopting an evergreen model, meaning that nearly all users will be at or close to the latest versions of their browsers. The days of Internet Explorer-specific compatibility are largely behind us, as well as the necessity for Babel to polyfill large pieces of absent JS functionality. Browsers are quick to adopt new W3C standards, going so far as to implement new features in ECMAScript and CSS that are still in their proposal stages.

Because of this reality, the NPM ecosystem has been built up with this in mind. Package developers, by and large, are more concerned with leveraging modern features than with compatibility with older browsers which hold diminishing market share. This effect also applies transitively, as even if a package doesn’t use a modern API, one of its dependencies, or a dependency of its dependencies may do so in a way that isn’t readily apparent.

When we turn our attention back to the platforms that fall under our web strategy, this can pose a massive problem. The internal browser version of our devices is not guaranteed to be at the same level of compatibility as an evergreen desktop browser, and in fact, explicitly isn’t in many cases. Consider for instance Samsung Tizen TVs, which feature different versions of either WebKit or Chromium depending on the year and partial or absent support for W3C APIs, again dependent on the year. The same can be said for LG webOS TVs. Sometimes this information isn’t collocated and becomes difficult to track down, such as Fire TV mentioning lack of Blob (binary data) support in an FAQ page while mentioning the lack of WebGL support on stick devices as a footnote on their Getting Started page.

These conditions make it difficult to assume which packages will be viable to use in our project. This ranges from small convenience wrappers all the way to framework-level choices such as React, Vue, and Angular. The best hope we have is to get all platforms integrated into a Continuous Integration (CI) system with smoke tests that can identify build-level errors when adding a new package. This also implies that we have chosen to build all of our platforms simultaneously, as we wouldn’t want to make the mistake of adding a dependency only to find out it isn’t compatible when later starting development on a new platform and having to excise it from the codebase. It also doesn’t protect us from runtime errors, which will also need to be automation tested as we use more of the API surface of a given library, nor from updates to the package or the package’s dependencies altering our stable state.
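
One inexpensive complement to build-level smoke tests is a startup check that reports which runtime features a device lacks. The following is a minimal sketch with a hypothetical helper name; the list of required features is an assumption to be derived from what our dependencies actually use.

```typescript
// Returns the dotted paths from `required` that cannot be resolved on `root`.
// `root` would typically be the global object of the device's web runtime.
function missingFeatures(root: unknown, required: string[]): string[] {
  const missing: string[] = [];
  for (const path of required) {
    let current: any = root;
    for (const key of path.split(".")) {
      // Stop descending once a segment is absent.
      current = current == null ? undefined : current[key];
    }
    if (current === undefined) missing.push(path);
  }
  return missing;
}
```

A smoke test might call this with the global object and a list such as "Promise", "fetch", "Blob", and "Intl.DateTimeFormat", then fail the run if the result is non-empty. It only proves that a name exists, not that the implementation behaves correctly, so it narrows the search rather than replacing on-device tests.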

Of course, given that these packages are open source, nothing is to stop us from filing issues and requesting changes, but we cannot assume any level of timeliness or commitment for support here, especially if our issues are edge cases due to running on a device. Alternatively, we may consider patching and perhaps upstreaming any fixes, but that can be a hefty time sink that distracts from the realities of delivering our product to the market.

It then becomes reasonable to avoid potential dependency nightmare scenarios by avoiding most or all dependencies altogether. This has the obvious downside of forfeiting one of the greatest strengths of web development, but also imposes a different kind of burden on the development team. Packages are included to solve problems, and in their absence, these problems still need solving, only this time using homegrown solutions. While these often have a smaller footprint and can be fully understood by the team that built them, they must also be rigorously tested for cross-platform compatibility, and they lack the battle-tested robustness that often only comes with time and widespread use.

Likely the correct approach is somewhere between these two. It will mean choosing fewer dependencies than we would typically when developing for desktop web, and highly scrutinizing those choices, while also being open to building and maintaining where necessary. And of course, the door is always open to negotiating scope with product owners to drop a feature entirely if it imposes undue risk on the project!

Developer Experience Upsets

Next we must consider the impacts on our developer experience. When working with devices instead of browsers, we will need to make some concessions compared to our typical workflow.

The first is packaging and builds. Unlike websites that can be served from a local development server, web apps for devices must be packaged with a manifest format particular to that device’s app store requirements, as well as any additional signing, permission, and visual asset requirements. These packaging requirements differ significantly per platform and will need to be maintained as individual build paths that ensure we can build a valid package for a given platform at any time, especially for the sake of a working CI system. Developers will need to ensure they have the correct signing identities, environment variables, paths, and build toolchains set up correctly for local builds. Investment here is important as, if done haphazardly, it can become a significant time loss over the lifespan of a project.

When testing, we cannot assume that a web browser’s representation of our app is accurate to what we will see on a device. As our project matures, testing in a desktop browser may become impossible altogether unless the device API abstractions discussed above are mocked to work on browsers as well. Thankfully, some platforms have emulators, but these will require installation of their SDKs and understanding of how to deploy to these emulators in an automated fashion. Others will require physical hardware put into developer mode, either shared by the development team or per seat, especially considering the constraints of our current distributed lifestyle.

Connecting a debugger to a device will be particular to a given platform. We cannot assume hot or live reloading will work. Console logs may be exposed in a variety of ways, including obscure physical key combinations or through remote CLI commands.

We must decide if the CI system will deploy to real devices, potentially in a dedicated device lab, to ensure smoke and automation tests are accurate. While a significant investment to set up, the alternative exposes us to the risk that our mainline branch has incompatibilities that will go unnoticed. This also raises the question of which devices to maintain in a device lab, which by proxy asks which models of a given platform we’re willing to support. For instance, will we go hunting for 2016 model LG and Samsung TVs, older Fire TV sticks, Android TV devices such as the Mi Box and Nvidia Shield, and a collection of iOS and Android handsets? If not, how will we ensure that these work correctly when going to market?

Adjusting for User Experience

If we intend to unify web, mobile, and TV (or “10ft” due to the distance between user and screen) under one codebase, we must make arrangements for the very distinct form factors between them. The best practices of adjusting between desktop and mobile web are reasonably well established, such as the interaction target size between a pointer and tap, navigation paradigms, information and typography density, viewport responsive scaling, accounting for high-density displays using device-independent pixels (DIPs) and respecting dark-mode settings. Libraries and samples are readily available to accommodate for these differences.

Unfortunately, the same cannot be said when adding 10ft to the mix. Consider the following list of user experience (UX) topics that we will need to solve for in a unified way with our mobile and web codebase:

  • Element focus. There must always be exactly one element in focus and it must be clearly and consistently articulated to the user. Navigation must be arranged to minimize the number of clicks to shift focus to a desired item. It must be clear to the user which item will be next for a given direction. Modals and other overlays must correctly capture focus to avoid focusing on non-visible elements.
  • TV safe area – a region around the edges of the screen that must be left largely devoid of content to account for potential overscan. We may be able to combine this with accounting for mobile notches.
  • Screen density. Similar to high-density displays, we must account for 720p, 1080p, and 4K display density differences, including potentially having different assets for each breakpoint.
  • Information density. 10ft displays are the least forgiving for information density. Users typically reject experiences that require too much reading at distance. Instead, information is usually condensed and, if possible, represented visually such as with icons or images. Text and other UI elements must be large enough to be legible at a distance, limiting the amount of information on a given page.
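
The element focus item is the one most likely to need custom code. As a rough sketch of directional focus, the hypothetical function below picks the next element for a D-pad press from element center points. The scoring weight is an arbitrary assumption, and a production implementation would also need to handle focus traps, scrolling containers, and overlays.

```typescript
type Direction = "up" | "down" | "left" | "right";

interface Focusable {
  id: string;
  x: number; // center x in pixels
  y: number; // center y in pixels, growing downward as on screen
}

// Picks the nearest candidate lying in the requested direction.
// Distance along the axis of travel counts once; drift off that axis is
// weighted double (an arbitrary choice) so that aligned items win.
// Returns `current` when nothing lies in that direction, so exactly one
// element always stays in focus.
function nextFocus(
  current: Focusable,
  all: Focusable[],
  dir: Direction
): Focusable {
  const horizontal = dir === "left" || dir === "right";
  const positive = dir === "right" || dir === "down";
  let best = current;
  let bestScore = Infinity;
  for (const candidate of all) {
    if (candidate.id === current.id) continue;
    const dx = candidate.x - current.x;
    const dy = candidate.y - current.y;
    const along = horizontal ? dx : dy;
    const across = horizontal ? dy : dx;
    const forward = positive ? along : -along;
    if (forward <= 0) continue; // behind us or level with us
    const score = forward + 2 * Math.abs(across);
    if (score < bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Because the function is deterministic for a given layout, it is also straightforward to test that every item on a page is reachable and that the next item in each direction matches what the design intends.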

The above list will in practice result in a breakpoint for 10ft designs to accommodate the needs of those platforms. If your team works with a design system, you may be able to reuse elements of the system, sometimes referred to as “atoms” or “molecules”, if they can be scaled appropriately. However, when it comes to macro-level design, navigation, and page layout, it may be best to involve separate code paths for 10ft. Similar to device APIs, we will want to be sure that these splits are architected correctly so as to ensure the robustness of the overall codebase.

Accounting for Performance

When working with the web, we must account for the Document Object Model (DOM) as a potential bottleneck for our performance. As the DOM was originally architected for static document display, dynamic elements such as CSS3 animations, canvas and WebGL features, and single-page applications which change their structure regularly can have varying degrees of support and baseline performance. Since we’ve already discussed feature support, let’s instead turn our attention to performance.

Among the devices covered above, a number have limited hardware to power our web app. For instance, the Fire TV Stick (Generation 2), available from 2016-2019 and a popular gift item, has 1GB of memory and a quad-core 1.3 GHz processor. Given that we’re working with the web and JavaScript, though, we have to assume that most of our bottlenecks will be on the main thread, meaning we will be limited to mostly one core. It is also not clear how much of the hardware is available to the app developer versus being reserved by the operating system for core functionality.

With smart TVs, the available information can be much more scarce. Examining a 2017-era LG TV, for example, it’s not easy to find information on its hardware profile. But with some searching we may come across the getSystemInfo API, which lists a chipset “M14”, perhaps meaning the year 2014. Further searching reveals the hardware specifications for signage, which has information on an M16 chipset found in 2016 and 2017 era digital sign hardware. It isn’t a stretch to imagine these are the same chips used in TVs, but the fact that the information isn’t readily available is concerning. If this assumption is correct, we are working with 2GB of memory and a quad-core 1.0 GHz processor.

On the opposite end of the spectrum is the NVIDIA Shield TV Pro, using the Tegra X1+, an updated version of the same processor found in the Nintendo Switch, featuring a quad-core 1.9 GHz processor coupled with 3GB of memory.

These varying and sometimes difficult to track down hardware profiles pose a problem for developers looking to provide a smooth experience for all users. Two symmetrical but opposite techniques are graceful degradation and progressive enhancement. In the former, a premium experience is first created targeting the best hardware profile and is then pared back for devices that cannot handle certain features. In the latter, we start at the lowest profile and enhance as hardware allows. Importantly, the codebase must be architected in a way that either of these approaches does not significantly increase complexity.
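
A minimal sketch of the progressive enhancement variant follows: every device starts with no optional features, and each feature is switched on only when the hardware profile meets its requirement. The feature names and thresholds are hypothetical placeholders that would need to come from measurement on real devices.

```typescript
type Feature = "highResImages" | "animations" | "videoPreviews";

interface DeviceProfile {
  memoryMb: number;
  cpuGhz: number;
}

interface Requirement {
  feature: Feature;
  minMemoryMb: number;
  minCpuGhz: number;
}

// Hypothetical thresholds, for illustration only.
const REQUIREMENTS: Requirement[] = [
  { feature: "highResImages", minMemoryMb: 1536, minCpuGhz: 1.0 },
  { feature: "animations", minMemoryMb: 1024, minCpuGhz: 1.2 },
  { feature: "videoPreviews", minMemoryMb: 2048, minCpuGhz: 1.5 },
];

// Starts from an empty baseline and adds each feature the device can afford.
function enabledFeatures(device: DeviceProfile): Feature[] {
  const enabled: Feature[] = [];
  for (const r of REQUIREMENTS) {
    if (device.memoryMb >= r.minMemoryMb && device.cpuGhz >= r.minCpuGhz) {
      enabled.push(r.feature);
    }
  }
  return enabled;
}
```

Keeping the rules in a single table, rather than in conditionals spread through the UI code, is one way to stop either approach from inflating the complexity of the codebase.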

No advice on performance would be complete without repeating the mantra of measuring before and after all intended performance improvement changes. The alternative path is sure disaster as often performance is unintuitive, especially as the device hardware we are working on can have different optimal paths than our experience would lead us to believe. Examples of this include having tile-based integrated GPUs which behave very differently from desktop-class GPUs, or using ARM instruction sets, including NEON for SIMD, which varies significantly from x86 and AVX.

To further complicate the matter, the DOM’s operations are opaque and the implementation of its operations may vary between browsers and browser versions. The same can be said for the accompanying JavaScript interpreter, whether it be JavaScriptCore, V8, or Hermes. A given JavaScript function may not be just-in-time (JIT) compiled the same way or at all, and the progressive JIT approach may require a different number of iterations to reach the highest levels of optimization. Frameworks may also add a layer of abstraction, such as React’s virtual DOM, which will create a different set of best practices.

As app developers, we must balance between framerate, time to interactive (TTI), memory consumption, install size, and if on mobile, energy consumption. As we learn early in computer science with the time/space tradeoff, improving one of these metrics may negatively impact another. The Hermes introduction blog post shows that when these factors are considered explicitly, we can make the right set of tradeoffs for the realities of our hardware.

Under these extremely varied and uncertain circumstances, our best course of action is to take an empirical approach. Ideally, we will have set up our CI device lab mentioned earlier, and this can also be used to create continuous performance audits across a range of devices. These will ensure incoming code changes don’t inadvertently crater performance, or that well-meaning performance enhancements do in fact improve across multiple profiles, or at the very least can be contained to those device categories which do see improvements. As with any CI setup, the process should be automated to each submission, flagging regressions that exceed a threshold and notifying performance subject matter experts.
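
The flagging step of such an audit can be sketched as below. The data shapes and function name are hypothetical, and the sketch assumes metrics where lower is better, such as frame time or TTI in milliseconds.

```typescript
interface Sample {
  device: string;
  metric: string;
  baseline: number; // value measured on the mainline branch
  current: number; // value measured on the incoming change
}

interface Regression extends Sample {
  changePct: number;
}

// Flags samples whose metric got worse by more than `thresholdPct` percent.
// Assumes "lower is better" metrics.
function findRegressions(
  samples: Sample[],
  thresholdPct: number
): Regression[] {
  const regressions: Regression[] = [];
  for (const s of samples) {
    if (s.baseline <= 0) continue; // relative change is undefined
    const changePct = ((s.current - s.baseline) / s.baseline) * 100;
    if (changePct > thresholdPct) {
      regressions.push({ ...s, changePct });
    }
  }
  return regressions;
}
```

Reporting per device rather than as a single average matters here, since a change that helps high-end hardware can still regress the lowest profile.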

Closing Remarks

Viewed at a squint, it is tempting to see the web as a unified development platform. At least on media devices, I hope the above has outlined that while the abstraction may be unified, the varying implementations necessitate a more diligent, methodical approach compared to desktop and even mobile web development.

This is also the time to close out with an alternative. At You.i TV, our You.i Engine One SDK is much closer to a gaming engine than it is to a web browser. We believe that this represents a better, lower level of abstraction suitable for the array of circumstances under which it needs to operate. Unlike the DOM, our scene tree (DOM equivalent) is exposed as a singular C++ API underlying a React Native JavaScript surface. When appropriate, we encourage developers to write at this C++ layer to achieve native optimizations that might be difficult to achieve with the web. Examples of this may include using C++ arena or pool allocation to preload large numbers of repetitive items, or writing a shader in GLSL that achieves an effect on the GPU avoiding precious main thread execution time. Additionally, due to being a singular and more bare implementation, performance is easier to reason about and the likelihood that a change will improve all platforms increases.

This approach may not be suitable for everyone. Over the course of dealing with a number of large brands, we find the approach works best at scale when targeting multiple platforms across multiple hardware profiles. Regardless of your strategy, it is best to consider the scale of your delivery early and set up appropriate automations to manage that complexity.