9 August 2012

IMDb Top 250 Deconstructed / Part II

It turns out to be rather easy to extract more information about the movies in the IMDb Top 250 in a mostly automated way. So, in this part: genres, awards, money, ages and genders. Enjoy!

The genre distribution is pretty much as expected - almost 70% of the movies are dramas, and almost 30% are thrillers. Then come crime and adventure with a bit more than 20% each, and so on. The sum is obviously over 100% because most of the movies fall into several genre categories:

Budget information is available for 215 of the movies. It seems Hollywood et al. spent almost $6 billion on... um... first-class cinema, which means about $25 M per movie (these numbers are not adjusted for inflation). Here is the table of the most expensive movies (with a budget of more than $100 M) in the Top 250:

Obviously, the budget is correlated with the year the movie came out:

Unfortunately, IMDb gives very inconsistent box office information on the main page, so it is practically impossible to extract it in a reasonable way.

As for the awards - the Top 250 movies won 376 Oscars in total, which is almost exactly one and a half Oscars per movie (376 / 250 ≈ 1.5). What is interesting is that more than half of the movies have not won the most prestigious award. Here are the biggest winners:

Let me just remind you that the best movie of all time or whatever (The Shawshank Redemption) has no Oscar, and one of the movies with 11 Oscars (Titanic) is not in the Top 250.

I wanted to show you how the tables look by gender and age group, but it seems impossible to do so. As I mentioned in the previous part, the simple arithmetic average is heavily distorted by the fanboys' 10/10 and the haters' 1/10 votes. The vote breakdowns do not contain any information about the "regularity" of the voters, so the simple average gives an extremely distorted chart. For example, if the IMDb formula for calculating the Bayesian average (mentioned in the previous post) is not used, A Separation climbs from 102nd to 3rd place, just behind The Dark Knight Rises.
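
To make the distortion concrete, here is a minimal sketch in Python that compares the plain arithmetic mean with a Bayesian weighted rating of the form WR = (v·R + m·C) / (v + m), where R is the movie's mean vote, v its number of votes, m a minimum-votes threshold and C the mean vote across all movies. The two vote histograms and the values of `MIN_VOTES` and `GLOBAL_MEAN` are made-up assumptions for illustration, not real IMDb data or the exact parameters IMDb uses.

```python
def arithmetic_mean(histogram):
    """Plain average of a vote histogram {rating: number_of_votes}."""
    total_votes = sum(histogram.values())
    return sum(rating * count for rating, count in histogram.items()) / total_votes


def weighted_rating(mean, votes, min_votes, global_mean):
    """Bayesian average: pulls movies with few votes toward the global mean."""
    return (votes * mean + min_votes * global_mean) / (votes + min_votes)


# Hypothetical vote histograms (made-up numbers, not real IMDb data).
niche_favorite = {10: 9_000, 9: 3_000, 8: 1_500, 7: 500, 1: 1_000}
blockbuster = {10: 200_000, 9: 150_000, 8: 100_000, 7: 40_000, 1: 10_000}

MIN_VOTES = 25_000   # assumed threshold m
GLOBAL_MEAN = 7.0    # assumed mean vote C across all movies

for name, hist in [("niche favorite", niche_favorite), ("blockbuster", blockbuster)]:
    mean = arithmetic_mean(hist)
    votes = sum(hist.values())
    wr = weighted_rating(mean, votes, MIN_VOTES, GLOBAL_MEAN)
    print(f"{name}: mean={mean:.2f}, weighted={wr:.2f}, votes={votes}")
```

With these made-up numbers the niche movie edges out the blockbuster on the plain mean, but falls well behind it once the vote count is taken into account, which is the same kind of reshuffle as A Separation jumping to 3rd place.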

Taking the simple arithmetic average of the votes from 2 to 9 (ignoring the very biased ends of the spectrum) doesn't help much - the chart definitely looks a bit better, but without proper normalization (which is unknown in this case) we find The Intouchables at the very top, with votes from only 66,000 people. Unfortunately, my attempt to recalculate the parameters of the formula for the 2-to-9 average failed miserably, so instead of giving the full charts for males, females, kids and old men, I will show you some curious statistics related to those groups.
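
For reference, the 2-to-9 average can be sketched in a few lines of Python. The function name `trimmed_mean` and the histogram below are hypothetical, chosen only to illustrate the idea; the numbers are not taken from IMDb.

```python
def trimmed_mean(histogram, low=2, high=9):
    """Average only the votes whose rating lies in [low, high]."""
    kept = {r: c for r, c in histogram.items() if low <= r <= high}
    total_votes = sum(kept.values())
    if total_votes == 0:
        raise ValueError("no votes left after trimming")
    return sum(r * c for r, c in kept.items()) / total_votes


# Hypothetical histogram (made-up numbers): many 10s from fans, many 1s from haters.
votes = {1: 5_000, 2: 500, 3: 500, 4: 1_000, 5: 2_000,
         6: 4_000, 7: 10_000, 8: 20_000, 9: 15_000, 10: 30_000}

print(f"trimmed mean (2-9): {trimmed_mean(votes):.2f}")
print(f"votes kept: {sum(c for r, c in votes.items() if 2 <= r <= 9)}")
```

Note that this average still ignores how many people voted, so a movie with relatively few votes can land at the top, which is exactly the missing normalization.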

So, here is how the movies are ordered by percentage of male and female voters:

Notice how many western movies are featured at the top of the blue chart? And how much sugar there is in the pink one?

The average distribution by age groups is the following:

As expected, IMDb is dominated by the 18-30 age group with more than 50% of the voters, while, surprisingly, teenagers make up only about 1%.

Here are the charts, sorted by the percentage of voters in each age group:

I think the kids' chart is quite WTF by itself. Obviously, all the movies are from the last decade or two. However, I have no clue what The Artist and The King's Speech are doing so high in this chart... (Here, I must say that I find most of my personal favorites in the 30-44 chart, which is not a big surprise.)

The most interesting table is the last one, though (aged 45+). None of these 20 movies was made after 1970. Moreover, the first movie from this millennium appears at 127th position, and it's The Artist. Of course, at the very end of the chart is The Dark Knight Rises, with a miserable 2.6% of its voters in this age group.

Here I end the deconstruction of the current IMDb Top 250, but the fun is just starting. Next time: IMDb vs. Rotten Tomatoes.

