If you want to know the “holy grail” for data scientists, I’ll tell you:
Predicting box office performance of movie scripts.
Here’s how it goes. An aspiring data scientist (anyone from a bright undergraduate in computer science to a Ph.D. candidate in statistics to even a tenured professor) looks for a new topic. They’re bored by analyzing mortgage applications and discover that no one is very good at predicting box office for movies. So they say to themselves, “I can do that.”
Sometimes they even build a model and/or publish papers. Then they go to the Hollywood studios and claim they can use an analysis of a script to predict box office success. Often this is touted alongside advanced analytics, machine learning, neural networks, or similar jargon.
We shouldn’t shame these data scientists for trying, though. I mean, the executives at streaming services like Netflix and Amazon Studios/Prime/Video claim they too can use complicated algorithms to pick TV shows or movies. Both of those streaming video platforms are constantly asked about, and in turn release vague hints about, the data and algorithms they use to pick TV series.
I have also fielded those types of questions, since I worked on strategy at a streaming platform with tons of data, as I mentioned in my second post, “Theme 1: It’s about decision-making, not data”. It typically went, “With all the customer viewing data, how did you use that to pick TV shows?” In my initial post, I specifically didn’t answer the question, but went off on a tangent.
But it is worth answering, because it illuminates a common Entertainment Strategy Guy theme: “Be skeptical”. In this case, be skeptical of the streaming services claiming they have esoteric data knowledge, and of the entertainment journalists who let them repeat that claim unchallenged.
Of course, I don’t blame the executives per se for claiming they have complicated algorithms. I blame the journalists who repeat it without questioning it. These media members don’t probe that audacious statement. A quick push would reveal those statements to be a house of cards, if you will. (Wow, brutal pun.) In reality, Netflix/Amazon/Hulu/other streaming services and traditional studios don’t have enough data for the data to actually drive those decisions.
So let’s push back, just a bit.
Unfortunately, this pushing will involve statistics.
(Don’t run away. I’m not going to go too deep into the statistics. But it’s worth covering just a bit, so that when execs tout “the data”, we understand that basic statistics argues against them. And this isn’t even really an analysis, just a simple explanation of the data situation. It’s about the immutable nature of the data itself, not some fancy algorithm.)
Statistics at its most basic is about gathering data points. The more data points you have, the more accurate your analysis of a situation. In statistics, the number of data points is called your “n”: the size of your sample or population. The important point is that as “n” gets larger, your analysis gets more accurate, and you can start making predictions.
(If you want more than this simple explanation, I really do recommend the Cartoon Guide to Statistics. My b-school recommended it and I loved it.)
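If you’d rather see that than take my word for it, here’s a minimal toy simulation (entirely made-up numbers, nothing any studio actually runs) showing how the error of an estimate shrinks as “n” grows:

```python
# A toy simulation: estimates get more accurate as "n" grows.
import random

random.seed(42)

TRUE_MEAN = 100  # the "true" average we're trying to estimate

def average_error(n, trials=1000):
    """Average distance between a sample mean of size n and the true mean."""
    total = 0.0
    for _ in range(trials):
        sample = [random.gauss(TRUE_MEAN, 25) for _ in range(n)]
        total += abs(sum(sample) / n - TRUE_MEAN)
    return total / trials

for n in [5, 20, 100, 1000]:
    print(f"n = {n:>4}: average estimation error = {average_error(n):.2f}")
```

Run that and the error drops steadily as “n” climbs. That’s all “more data is better” really means.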
Some real-world examples. Credit card companies have issued hundreds of millions of credit cards to hundreds of millions of Americans. They process multiple transactions per customer. That’s potentially billions of data points (“n’s”), each with its own set of variables describing how the transaction worked. That’s big data. That’s a data-rich environment.
On the other hand, look at Presidential elections. As Nate Silver will point out, there are far fewer “n’s” in this field. Since 1940, and the dawn of modern polling, there have been only 19 Presidential elections. That’s small data. That’s a data-poor environment.
“Filmed entertainment”, my preferred catch-all for TV and movies, is a data-poor environment.
Movies have more “n’s” than TV shows (over 500 are released each year), but many of those are very, very small independent films. TV just passed 400 scripted shows per year, but the bulk of those are continuations of previous seasons, which makes it hard to count each one as an independent, uncorrelated example.
Blast, I just dropped a few more statistics terms (independence, correlation) without explaining them.
See, I opened talking about “n’s”, or sample size. Those are unique examples. But each example has its own set of “variables”, and these variables show why picking TV shows with “complicated algorithms” is so hard.
(I realize I am explaining a lot of basic statistics here. I hope that everyone can either (a) learn something or (b) get a great refresher, and not be (c) bored by it.)
So if you take a single movie, that’s a data point. Let’s use Get Out as our example. Everything that describes that data point is a “variable”. The date it was released is a variable. Its genre (horror or thriller) is a variable. Production budget, cast and director: those are more variables describing our one data point.
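As a quick sketch, here’s that one data point written out in Python. (The figures are approximate, pulled from public reporting, and the field names are just my own illustration.)

```python
# One data point (a movie), described by its variables.
# Figures are approximate, from public reporting.
get_out = {
    "title": "Get Out",
    "release_date": "2017-02-24",   # a variable
    "genre": "horror/thriller",     # another variable
    "rating": "R",                  # another variable
    "director": "Jordan Peele",     # another variable
    "production_budget": 4_500_000, # reported at roughly $4.5 million
    "domestic_box_office": 176_000_000,  # the outcome we'd like to predict
}
```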
Some of these variables are “independent”: they aren’t affected by the other variables. For example, the release date isn’t determined by the type of film, and neither is the production budget. Those are independent variables we can track, which means they influence the outcome without influencing each other. It also often means you can use them to help make predictions.
But predictions of what? Well, usually of the dependent variable. This is the outcome you want to predict or explain. It’s often the variable you can’t control directly, or the one that’s determined after the others are set. In math terms, the dependent variable is the one you’re solving the equation for. To use our Get Out example, its performance at the box office is the most likely dependent variable (and usually what data scientists or entertainment execs are trying to predict).
This is a business, and most models are trying to predict if the movie or TV show will make money. Other dependent variables could be quality of the movie or awards potential. If you can quantify them, you can make them variables.
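To make the independent/dependent distinction concrete, here’s a minimal sketch that fits a straight line predicting box office (the dependent variable) from production budget (an independent variable). Every number below is invented for illustration; no real model is this simple.

```python
# A toy "solve for the dependent variable" example: fit a line
# predicting box office from production budget. All numbers invented.
import numpy as np

# Independent variable: production budgets for five films ($ millions).
budgets = np.array([5.0, 20.0, 40.0, 80.0, 150.0])

# Dependent variable: domestic box office ($ millions). Invented values.
box_office = np.array([30.0, 45.0, 90.0, 160.0, 310.0])

# Ordinary least squares fit: box_office ≈ slope * budget + intercept.
slope, intercept = np.polyfit(budgets, box_office, deg=1)

# Predict the dependent variable for a hypothetical $60M-budget film.
predicted = slope * 60 + intercept
print(f"Predicted box office for a $60M film: ~${predicted:.0f}M")
```

With an “n” of five, of course, that fitted line is nearly worthless, which is exactly the point of this post.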
One more note on variables (then I promise we are moving on to debunking the question at the top). Variables come in a couple of different types, but for our purposes, the most interesting is the “categorical variable”. Unlike, say, production budget, which is a range of numbers (from $0 to $1,000,000,000+), a categorical variable asks whether something fits into a category. Back to the Get Out example: its rating is one of a set of five categories (G, PG, PG-13, R and NC-17). Genre is also a categorical variable, and you could list potentially dozens of categories. And as I mentioned above, Get Out could be described as horror or thriller, and has been.
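In practice, a model can’t multiply “PG-13” by a coefficient, so data scientists typically “one-hot encode” categorical variables: each category becomes its own 0/1 column. A minimal sketch with pandas (real titles, approximate budgets):

```python
# One-hot encoding a categorical variable with pandas.
import pandas as pd

films = pd.DataFrame({
    "title":  ["Get Out", "Coco", "Dunkirk"],
    "rating": ["R", "PG", "PG-13"],   # categorical variable
    "budget": [4.5, 175.0, 100.0],    # numeric variable ($ millions, approx.)
})

# Each rating category becomes its own 0/1 column (rating_PG, rating_R, ...).
encoded = pd.get_dummies(films, columns=["rating"])
print(encoded)
```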
So here’s the key insight about categorical variables: they devastate your sample size or “n”, especially if you’re in a data poor environment.
The reason is that categorical variables can be hugely important. And if they are important enough, it means you mainly need to compare data points within the same category. So how many horror movies were released last year? Off the top of my head, I guessed 20. Using Box Office Mojo, I found 46 movies released between 2015 and 2017 described as “horror – r-rated”, roughly 15 per year.
If your sample is now only those movies, all of a sudden your data set of “500 movies released last year” got a whole lot smaller. You can see why, too: you shouldn’t use kids’ animated movies to predict the box office performance of horror films, right? The solution could be to reach further back in time to get your sample size back up, but then you’re introducing a new variable, time. That’s its own unique type of variable, which can complicate things further.
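Here’s what that collapse looks like in code. The dataset below is a stand-in I made up to match the counts above (500 releases, 18 of them R-rated horror); in practice you’d load real data from Box Office Mojo or similar.

```python
# How one categorical filter collapses your sample size ("n").
import pandas as pd

# Stand-in data matching the counts discussed above.
movies = pd.DataFrame({
    "title":  [f"Film {i}" for i in range(500)],
    "genre":  ["horror"] * 18 + ["other"] * 482,
    "rating": ["R"] * 18 + ["various"] * 482,
})

print("All releases:        n =", len(movies))   # n = 500

# Keep only the comparable films: R-rated horror.
comparable = movies[(movies["genre"] == "horror") & (movies["rating"] == "R")]
print("R-rated horror only: n =", len(comparable))  # n = 18
```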
So now we have all the statistics we need to propose the core challenge to Netflix, Amazon, traditional broadcasters and anyone else claiming they can use data to predict the success of TV shows. Namely: their sample sizes are SO SMALL that the predictions just aren’t that reliable. Don’t get me wrong, data helps people make decisions. But when credit goes to some algorithm, it implies a level of accuracy those studios just don’t have.
I want to answer one final question that naturally arises: are there any filmed entertainment services that could use big data?
There is one: YouTube. See, YouTube doesn’t just pay for tens of TV series; it hosts millions of videos. That’s a huge data set. Now, will it help them if they decide to make their own TV shows, the way YouTube Red is planning? Probably not, because original scripted programming is a huge categorical change that throws off comparisons with everything else in their data.
But does the huge number of videos help plan other content? Yeah, it helps. And when it comes to marketing or user behavior on their platforms, Netflix, YouTube, Amazon and all the streaming services have a lot of data to help market to customers, or to improve the user experience. (Which they also then use to push originals on people, but again, that’s another point for another post.) And like I said, the streaming services can, and do, use algorithms to help model the prices they pay for content, especially if it premiered in a previous window.
Guess what, so do traditional networks!
But know this: data analysis is complicated and predicting the future is really, really hard. Really hard. Doing it with confidence (confidence in the statistical sense) is even harder.
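One last toy sketch to drive that home: the width of an approximate 95% confidence interval around mean box office, with a sample of 18 films versus 1,800. All numbers are invented; only the comparison of the widths matters.

```python
# Why small "n" kills statistical confidence. All numbers invented.
import math
import random

random.seed(7)

def ci_width(n, mean=60.0, sd=40.0):
    """Approximate width ($M) of a 95% confidence interval for the mean."""
    sample = [random.gauss(mean, sd) for _ in range(n)]
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    return 2 * 1.96 * s / math.sqrt(n)  # normal approximation

print(f"n = 18:   95% CI width ≈ ${ci_width(18):.1f}M")
print(f"n = 1800: 95% CI width ≈ ${ci_width(1800):.1f}M")
```

With 18 comparable films, the interval comes out roughly ten times wider than with 1,800. And nobody has 1,800 comparable films.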
Here’s my final sarcastic point for journalists attending the TCAs or doing a sit-down interview with a streaming service exec. The next time an exec implies they’re using data to help predict which shows will succeed on their platform, ask, “How big is your sample size?” If they can’t answer, that will tell you something. If they answer “thousands of shows” or “millions of customers”, you’ll know they’re bullshitting.
Because they are.