The EVVE video event dataset


This dataset contains 2375 + 620 videos that were returned for 13 different queries on Youtube. The total length of the videos is 166 hours.

It was introduced in our CVPR 2013 paper Event retrieval in large video collections with circulant temporal encoding, by J. Revaud, M. Douze, C. Schmid and H. Jégou.

The dataset has been annotated by M. Douze, J. Revaud, J. Delhumeau and H. Jégou.

The benchmark evaluates how well algorithms recognize videos related to a given event. The only input describing the event is a query video. The videos to retrieve range from easy (near-duplicates of the query) to difficult (visually different videos of the same event).

The task is therefore challenging: some videos are hard to analyze even for humans (although all ambiguous videos were removed).

Annotation protocol

A single annotator is in charge of all the videos for a given query/event. The annotator first refines the definition of the event, then labels each returned video as positive or negative, discarding ambiguous videos.

Half of the positive videos are used as queries; the remaining half, plus the negatives, make up the database, so the queries and the database do not intersect.


Each of the text queries corresponds to an event. The events are identified by the query string that was used (for example universal+studios+jurassic+park+ride).

The videos are identified by their Youtube ID (11 alphanumeric characters). For example, the video with id YpctrE62nfs can be viewed at https://www.youtube.com/watch?v=YpctrE62nfs

The list of events and their definitions is here.

Annotation format

For each event, there is a plain-text annotation file in which each line gives a Youtube ID, a label (1 for positive, -1 for negative), and the subset the video belongs to (query or database), for example:
6QU0IG6ugCw -1 database
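Parsing these files is straightforward; the sketch below assumes the three-field format shown above (the field meanings follow from the annotation protocol, and the function name is illustrative):

```python
# Hypothetical parser for EVVE annotation lines, assuming each line is
# "<youtube_id> <label> <subset>", with label 1 (positive) or -1
# (negative) and subset "query" or "database".
def parse_annotations(lines):
    """Return a list of (video_id, label, subset) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        video_id, label, subset = line.split()
        records.append((video_id, int(label), subset))
    return records

records = parse_annotations(["6QU0IG6ugCw -1 database"])
print(records)  # [('6QU0IG6ugCw', -1, 'database')]
```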

The annotations and the event descriptions are available here, or as a single ZIP file.

Evaluation protocol

The indexing algorithm searches for each query video in the database, producing an ordered list of result videos. These ranked lists are used to compute the mean Average Precision (mAP).

The script computes the mAP given a result list. See the -h option for the result file format.
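The measure can be sketched as follows; this mirrors the standard definition of average precision over ranked result lists, not the exact script distributed with the dataset:

```python
# Minimal sketch of (mean) average precision for ranked retrieval.
def average_precision(ranked_ids, positives):
    """AP for one query: ranked_ids is the ordered result list,
    positives the set of relevant video ids."""
    hits = 0
    sum_prec = 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in positives:
            hits += 1
            sum_prec += hits / rank  # precision at this recall point
    return sum_prec / len(positives) if positives else 0.0

def mean_average_precision(results, ground_truth):
    """results: {query_id: ranked list}; ground_truth: {query_id: set of positives}."""
    aps = [average_precision(results[q], ground_truth[q]) for q in results]
    return sum(aps) / len(aps)

ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
# hits at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ≈ 0.833
```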


We provide the frame descriptors we used in the paper. They are 1024D multi-vlad descriptors (see Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening, H. Jégou and O. Chum, ECCV 2012). Videos are reduced to a common size and frame rate, and we compute one descriptor per frame.

The per-frame multi-vlad descriptors are available here, and can be retrieved with

wget -r -I /data2/evve/descs

The per-video MMV descriptors (much more compact) are available in Matlab format here.
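As a rough sketch, a per-video descriptor can be obtained by pooling the per-frame descriptors; the MMV descriptor is the mean of the frame multi-vlad descriptors, and the L2 re-normalization below is an assumption about the post-processing:

```python
import numpy as np

# Hedged sketch: pool per-frame descriptors into one per-video vector by
# averaging and L2-normalizing. The exact normalization used for the
# distributed MMV descriptors may differ.
def mean_descriptor(frame_descs):
    """frame_descs: (n_frames, d) array of per-frame descriptors."""
    v = frame_descs.mean(axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

descs = np.random.rand(100, 1024).astype(np.float32)  # e.g. 1024D multi-vlad
video_vec = mean_descriptor(descs)
print(video_vec.shape)  # (1024,)
```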

Contact us if you need help to get the videos themselves from Youtube. Contact us also if you need the 100k distractor set.

Reproducing the paper's results (only MMV descriptor)

This package reproduces the MMV result of Table 3 in the paper. Run search_mmv.m in Matlab/Octave after downloading the descriptors and annotations. It outputs a result file that can be evaluated with the script above, and uses fvecs_read from Yael.
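For readers not using Matlab, the same pipeline can be sketched in Python; fvecs_read below is an equivalent of the Yael function (the .fvecs layout is an int32 dimension followed by that many float32 values per vector), and the search function and its names are illustrative:

```python
import numpy as np

# Read a .fvecs file: each vector is stored as an int32 dimension d
# followed by d float32 values.
def fvecs_read(path):
    raw = np.fromfile(path, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)

# With L2-normalized descriptors, searching reduces to a dot product.
def search(queries, database):
    scores = queries @ database.T        # cosine similarities
    return np.argsort(-scores, axis=1)   # ranked database indices per query
```

search returns, for each query, the database indices sorted by decreasing similarity; such ranked lists are what the evaluation script consumes.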

Contact, legal

The software on this page is free (as in BSD); the descriptors are completely free. Details on the copyright of the original videos can be found on their Youtube pages.

For any comment or problem, contact me (matthijs dot douze at inria dot fr).


The Climbing and Madonna video alignment datasets.

Extracts & results


Here are a few representative images for each event.

Last modified: Mon Sep 2 12:56:32 CEST 2013