IMDB Top 100K Movies – Analysis in Depth, Part 2

izzydata · on Feb 24, 2014

When did IMDB go online? This seems like one of those things where nostalgia factor is pretty big. If people were rating those movies from a few decades ago at the time they were in theaters then these graphs might look pretty flat.

That, or it is becoming increasingly hard to entertain people with new ideas as they watch more and more movies. People could be getting desensitized to entertainment.

thrownaway2424 · on Feb 24, 2014

IMDB is practically the oldest thing on the web. The site launched in 1993 but it imported the archives of relevant usenet (rec.arts.movies) and mail archives going back to the 80s.

carlmcqueen · on Feb 24, 2014

My thoughts were along the exact same lines. I would like to see the Metacritic/IMDB results restricted to movies that came out after the sites actually started.

I also wonder about the people who spend a lot of time rating old movies that they've seen and how much of their score is influenced by 'bulk' ratings of movies.

Does your score of certain movies go up because you've just rated 15 terrible movies or even just alright movies?

Example: rating Batman 1, Batman 2, Batman 3 and Batman 4, then rating the new Batmans. How does the score change?

burntsushi · on Feb 24, 2014

> When did IMDB go online?

There is a fantastic chronology of the origins of IMDb here: ftp://ftp.fu-berlin.de/pub/misc/movies/database/tools/movie-database-faq

Do a CTRL-F for "8. A brief history of the whole project" (no quotes) to jump to the relevant section.

(Sorry about the FTP link, but any modern browser should be able to handle it.)

beloch · on Feb 24, 2014

Impressive work! I have questions/ideas:

Is there any way to extract movie ratings vs time from your data-set? I've subjectively noticed a few patterns in movie ratings, and would love to see quantitative confirmation.

1. Rating quality tends to spike at or shortly after a film's release, but then drops off gradually as the glow of the PR campaign wears off, or perhaps because people who go to see movies in their opening week are more predisposed to like them (e.g. fanboys, etc.). I usually expect a film that has just been released to be similar in quality to a film that's been on home video for a while with a rating half a point or more lower. (I am ashamed to admit that I watch enough movies that I can usually peg a film's IMDB rating to within a quarter of a point, independently of how much I personally enjoyed the film.)

2. Rating quality for older films tends to be held back by poor quality home video releases and often improves significantly, but gradually, after quality transfers are released. e.g. I distinctly remember seeing Zulu several years ago and thinking, "Gee, that film was way better than the IMDb rating!". Back then, it was in the high 6 range. Now it's up to 7.8!

I suspect IMDB does not keep track of times associated with votes, or at least does not provide that data publicly, so the best you could do would be to crawl IMDb periodically. It shouldn't take more than a year's worth of data to see if point #1 holds up. #2 is a lot more difficult because the effect is more gradual, and you'd need to start bringing in other data sources, like audio/video quality ratings from home video review sites that do tend to be somewhat unreliable. I find #2 to be a question of interest though. If you found any other form of correlations with titles whose ratings improve substantially after quality home video releases, you would have potentially found a way to identify under-appreciated films. If you pulled that off, in addition to discovering some good flicks to watch, companies like Kino and Criterion would probably start knocking at your door!

burntsushi · on Feb 24, 2014

> Is there any way to extract movie ratings vs time from your data-set? I've subjectively noticed a few patterns in movie ratings, and would love to see quantitative confirmation.

I'm not the OP, but I'm familiar with the data IMDb provides. The short answer to your question is: not easily.

IMDb provides a plain text dump of a subset of their data here: ftp://ftp.fu-berlin.de/pub/misc/movies/database/ --- it's updated (usually) once a week. So to get temporal data, I guess you'd have to track it yourself.

However, diffs are provided for each data update: ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/

It looks like this could give you temporal data at the granularity of once per week.
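To make the "track it yourself" idea concrete, here is a minimal sketch of turning weekly snapshots into a per-title rating history. It assumes each snapshot has already been parsed into a plain `{title: rating}` dictionary; the parsing of the actual dump and diff files is not shown, and the titles, dates and ratings below are made up for illustration.

```python
from datetime import date


def rating_history(snapshots, title):
    """Return [(snapshot_date, rating)] for one title, oldest first.

    `snapshots` maps a snapshot date to a {title: rating} dict.
    This layout is an assumption, not the format of the real dump.
    """
    history = []
    for day in sorted(snapshots):
        rating = snapshots[day].get(title)
        if rating is not None:  # title may be missing from early snapshots
            history.append((day, rating))
    return history


# Hypothetical weekly snapshots (not real IMDb data).
snapshots = {
    date(2014, 2, 7): {"Example Film": 8.1},
    date(2014, 2, 14): {"Example Film": 7.9},
    date(2014, 2, 21): {"Example Film": 7.8, "Other Film": 6.5},
}

history = rating_history(snapshots, "Example Film")
# Difference between the first and the latest observed rating.
drop = history[0][1] - history[-1][1]
print(history, round(drop, 2))
```

With about a year of such snapshots, the post-release drop described in point #1 above could be measured per title rather than guessed at.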

bugra · on Feb 25, 2014

1. The timestamps are only years, so I cannot use the information that IMDB provides for such an analysis. That would be quite interesting, though. Also, the time of year at which a movie is released, and how that affects the movie's overall success, could be another interesting point.

2. That is a good suggestion but I think we need more votes and more importantly the users whose votes could be taken as a basis of quality of the movie.

ape4 · on Feb 24, 2014

Even though the internet is mainstream, I think imdb's ratings are still a bit "nerdy". Granny doesn't go to imdb to rate a movie.

eg Batman: The Dark Knight Rises gets 8.6 http://www.imdb.com/title/tt1345836/

pmr_ · on Feb 24, 2014

No one ever considers what Granny thinks where movie or television ratings are concerned. The age groups normally considered are 18-34 or 18-49, because those are the ones which far outweigh all others in terms of commercial interest. Also, the other age groups don't make up a significant part of the moviegoing crowd, so even if they contributed ratings it wouldn't make much of a difference.

atestu · on Feb 24, 2014

I'm not sure it's "nerdy", since The Dark Knight Rises is one of the most commercially successful movies of all time. Hardly a niche film.

pessimizer · on Feb 24, 2014

"Yet, [old movies] are getting better although not as high as old movies."

1) Don't forget that the raters are self selecting. For very recent movies, the ratings will only be from people who were attracted by hype, and were excited enough to immediately rate a film.

2) All of those ratings will eventually drop as people outside of the target audience (outside of the natural consumers of initial hype) rate the film, reverting the rating to the mean.

3) Most people that are aware of a particular older film are people who remember it because it was/is a favorite, or people who encountered it recently and found it interesting enough to research it on imdb, both probably tending to skew positive.

If you were able to collect the changes in ratings over time, I'd bet that dip of low ratings would always be around the same distance from the present.

I might be off on this, though, because there will always be a qualitative separation between movies that were ratable at the time of release, and movies released pre-imdb.

bugra · on Feb 25, 2014

1) Yes, I agree with you, and as I wrote in the first post, that is mostly due to selection bias. http://bugra.github.io/work/notes/2014-02-15/imdb-top-100K-m...

2) That is true, but I think it holds for __all__ of the movies, not a subset of them.

3) I am not quite sure about that, but I think it would be quite interesting to look at how the temporal information correlates with the rating of the movie.

brownbat · on Feb 24, 2014

Really enjoying the analysis.

I hope you'll get a chance to check out writers after actors. It'd be a really interesting test of Schreiber theory, which argues that the writer is a better predictor of a film's quality than the director.

http://en.wikipedia.org/wiki/Schreiber_theory

(I had a similar comment last time, sorry for hounding the same issue, just not sure if you saw it.)

bugra · on Feb 25, 2014

I missed the comment in the previous one, sorry about that. I am not familiar with the theory, but if I could get the writers' information (right now I only have the directors), that would be an interesting aspect of the data to investigate.

AimHere · on Feb 24, 2014

One thing I was hoping this series would shed light on was an oddity in demographic voting patterns with the ratings system.

From my own (anecdotal) observation, there's a class of films where the 'Females 45+' rating is a big outlier in the voting, with a large proportion of '1' votes. It's as if there's a gang of a few hundred users, classed by IMDB as being part of that demographic, consistently downvoting films. I don't know if it's a default setting that some bots or trolls have used or if there really are a bunch of cinephobic old women out there just watching and hating on films, but it did spark my curiosity as to what the pattern was. It doesn't happen in every film, and it's rare that you see any other demographic sticking out like this.

The best examples are critically acclaimed films with low voter counts. More popular, or less well-received films tend to have a lot of negative noise to begin with, so the effect is less noticeable, if it's there at all.

Here are some examples:

Army of Shadows : http://www.imdb.com/title/tt0064040/ratings?ref_=tt_ov_rt

Spartacus : http://www.imdb.com/title/tt0054331/ratings?ref_=tt_ov_rt

Ikiru : http://www.imdb.com/title/tt0044741/ratings?ref_=tt_ov_rt

Passion of Joan of Arc : http://www.imdb.com/title/tt0019254/ratings?ref_=tt_ov_rt

Breathless : http://www.imdb.com/title/tt0053472/ratings?ref_=tt_ov_rt

Z : http://www.imdb.com/title/tt0065234/?ref_=fn_al_tt_2

Barry Lyndon : http://www.imdb.com/title/tt0072684/ratings?ref_=tt_ov_rt

bugra · on Feb 25, 2014

That is exactly why I generally prefer median-like averaging methods over mean-like ones in these types of crowdsourced systems. Generally speaking, votes in the middle of the scale are distributed fairly evenly, whereas the ratings at the extremes are not. The movies that you gave as examples could be due to the preference of voters for particular types of movies.
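As a rough illustration of the median-versus-mean point, the sketch below computes both from a vote histogram of the form `{score: number_of_votes}`, with and without a block of '1' votes. The vote counts are invented for the example and the function names are hypothetical; this is not how IMDb computes its published rating.

```python
def mean_from_histogram(hist):
    """Arithmetic mean of votes given as {score: number_of_votes}."""
    total = sum(hist.values())
    return sum(score * count for score, count in hist.items()) / total


def median_from_histogram(hist):
    """Lower median: the first score at which half of all votes are covered."""
    total = sum(hist.values())
    covered = 0
    for score in sorted(hist):
        covered += hist[score]
        if 2 * covered >= total:
            return score


# Made-up vote counts, not real IMDb data.
without_block = {7: 10, 8: 40, 9: 60, 10: 60}
with_block = {1: 30, **without_block}  # same film plus a block of 1-votes

for label, hist in (("without 1-votes", without_block),
                    ("with 1-votes", with_block)):
    print(label, round(mean_from_histogram(hist), 2), median_from_histogram(hist))
```

In this invented case the block of 1-votes pulls the mean from 9.0 down to 7.8, while the median stays at 9.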

AimHere · on Feb 25, 2014

I don't think it would make much odds for IMDB. Sure, you'll maybe negate the overly high proportion of extreme votes, but I don't think that effect is large. I did once take the top films on IMDB and remove the extreme votes, and the effect was marginal: a few films shuffled a place or so. In exchange, if you use the median, you could have polarising films, like Twilight, having their ratings change markedly over time, as people decide to start warring over them and the numbers swamp the small number of ambivalents.

For your other point, there will be voting patterns for various demographics and certain groups of people prefer certain movies. But I think the phenomenon I'm describing is a bit too extreme to be merely a bunch of old women disproportionately taking a dislike to Ikiru and then rushing onto the internet to tell the world they hate it. It is the smallest demographic IMDB has, so it's not hurting the ratings of the films in general, it's just an oddity.
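The concern about polarising films and the median can be sketched with a toy example. The vote counts below are invented: a film split almost evenly between 1s and 10s with a handful of ambivalent 5s, before and after a few more 10-votes arrive.

```python
from statistics import mean, median


def expand(hist):
    """Turn {score: number_of_votes} into a flat list of individual votes."""
    return [score for score, count in hist.items() for _ in range(count)]


# Made-up vote counts for a polarising film, not real IMDb data.
before = {1: 49, 5: 2, 10: 49}
after = {1: 49, 5: 2, 10: 55}  # six extra 10-votes tip the balance

summary = {
    name: (round(mean(expand(hist)), 2), median(expand(hist)))
    for name, hist in (("before", before), ("after", after))
}
print(summary)
```

Here six extra votes move the mean only from 5.49 to 5.75, but the median jumps from 5 to 10, which is the kind of instability described above.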

bemmu · on Feb 24, 2014

Next time please also include a simple list of top 1000 movies, it would be helpful (without movies that received too few votes, of course).

AimHere · on Feb 24, 2014

IMDB provides that by itself:

http://www.imdb.com/search/title?groups=top_1000&sort=user_r...

wslh · on Feb 24, 2014

Do you think that doing text mining on the films' subtitles would add interesting information?

bugra · on Feb 25, 2014

So, I tried to cluster the outlines, but generally they are quite short and do not reveal much about what the movies are about, but rather the plot. So I am a little skeptical that it would yield good results for subtitles.

wslh · on Feb 25, 2014

Ok, it was just an idea.

jedberg · on Feb 24, 2014

All of this data assumes that IMDB is a complete record of all films.

I don't believe it is, and I think the further back you go, the more likely it is that only well-regarded films are in the database.

pessimizer · on Feb 24, 2014

As databases go, I'd bet it's as complete as it gets. There are lots of awful films from the 20s and 30s in there.