How personalization crashed BBC iPlayer at the height of lockdown, and how it was fixed

Manisha Lopes, principal software engineer at the BBC, spoke to developers at QCon London this week, describing how a personalization service used by the iPlayer video streaming application crashed the service when usage peaked during lockdown.

The popularity of BBC iPlayer increased 61 percent during lockdown in May 2020, with the biggest day being May 10th when the UK prime minister broadcast a 7pm statement. “What happened when people came to us in record numbers? iPlayer crashed,” said Lopes. “At around 7pm there were a lot of errors coming out of UAS.”

This was traced to “a huge spike on DynamoDB,” which exceeded its provisioned capacity. The errors “cascaded to iPlayer which resulted in the crash,” Lopes told QCon.

Lopes defended the use of personalization, which requires user log-in. “We have seen how important personalization is to all of us. No industry has remained immune to it. Groceries, drugs, websites, travel, leisure, you name it,” she told attendees. She said that a video streaming service with no recollection of user preferences would not meet expectations.

The BBC’s iPlayer personalization depends on a project called the User Activity Service (UAS), which Lopes described as a “real-time service that remembers a person’s activity while they are using the BBC.” This service supports between 15,000 and 30,000 transactions per second and stores around 150 million “activities” per day. The data is used for around 75 “user experiences” across BBC products, said Lopes.

The UAS runs on virtual machines, Lopes said, and makes use of a queuing service as well as a NoSQL database, AWS DynamoDB. Data is sent asynchronously to UAS when a user takes an action in iPlayer.

In response to the outage, BBC engineers ran a simulation of the incident to understand it better, and settled on a “circuit breaker” pattern, which monitors the success of requests to a service and, if they start failing, switches to a backup service or stops sending requests. This meant that iPlayer could degrade gracefully instead of crashing, showing a message stating that the service was non-personalized.
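The circuit breaker idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BBC's implementation: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and the `primary` and `fallback` callables are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker.

    Closed: calls go to the primary service.
    Open: after repeated failures, calls go straight to the fallback.
    After a cool-down, a trial call to the primary is allowed again.
    """

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, report "not open" so that
        # the next call is tried against the primary service.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            # Skip the struggling service entirely.
            return fallback()
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed trial call.
                self._opened_at = self._clock()
            return fallback()
        # Success: close the breaker and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In a setting like the one Lopes described, `primary` would stand for the request to the personalization service and `fallback` for whatever returns the non-personalized experience, so that a failing dependency costs users their recommendations rather than the whole application.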

The BBC initially used the AWS “classic” load balancer, which Lopes said was good for gradual changes but “not great for spikey traffic.” This was switched to a combination of the Application Load Balancer and the Network Load Balancer, which worked better. Another AWS tip was to migrate to newer-generation instances on EC2 (Elastic Compute Cloud) for better performance.

What about DynamoDB though? “Although we had auto-scaling on DynamoDB it was not scaling up fast enough to handle incoming traffic,” said Lopes. DynamoDB has two capacity modes: provisioned and on-demand. With on-demand mode, AWS charges per read or write request and handles demand automatically, whereas provisioned mode is generally cheaper but does not scale as easily. “On-demand is seven times the cost of provisioned,” said Lopes. “In the BBC we have to operate within the limits of a restricted budget.”

The corporation therefore decided to stick with provisioned mode. The engineers scrutinized the data model instead, discovering that it was not properly aligned with the requirements. The indexing was also fine-tuned. The outcome was a reduction in the amount of data fetched, which in turn improved performance.
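The cost trade-off behind staying on provisioned mode can be illustrated with some rough arithmetic. The sketch below assumes, purely for illustration, that the “seven times” figure means an on-demand request costs seven times as much as the same request served from fully used provisioned capacity. It ignores auto-scaling, storage, and every other pricing detail, and the function name `cheaper_mode` and its parameters are hypothetical.

```python
def cheaper_mode(avg_utilization: float, on_demand_multiplier: float = 7.0) -> str:
    """Return the cheaper capacity mode under a deliberately simplified model.

    avg_utilization: fraction (0 to 1) of the provisioned capacity that is
        actually consumed on average.
    on_demand_multiplier: assumed per-request price of on-demand relative to
        fully used provisioned capacity (7.0 is the ratio quoted in the talk;
        reading it as a per-request ratio is an assumption).
    """
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    provisioned_cost = 1.0  # pay for all provisioned capacity, used or not
    on_demand_cost = on_demand_multiplier * avg_utilization  # pay only for use
    return "on-demand" if on_demand_cost < provisioned_cost else "provisioned"
```

Under these assumptions the break-even point is an average utilization of one seventh, or roughly 14 percent: a table that is busy for more of the time than that is cheaper on provisioned capacity, provided the provisioned level can cope with the peaks.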

Despite the claimed advantages of personalization, a service that works in degraded form is better than one that does not work at all. “Have a good understanding of the complete use case,” said one of Lopes’ slides, distinguishing between the critical and the non-critical path.