The Climbing and Madonna datasets

The Climbing and Madonna datasets are collections of videos of the same scene, taken with different types of cameras from different locations. The videos overlap in time, and the overlaps are annotated manually. The datasets can be used to evaluate content-based automatic alignment of video clips.

To reference the dataset, please use the following citation:

  TITLE = {{Circulant temporal encoding for video retrieval and temporal alignment}},
  AUTHOR = {Douze, Matthijs and Revaud, J{\'e}r{\^o}me and Verbeek, Jakob and J{\'e}gou, Herv{\'e} and Schmid, Cordelia},
  URL = {},
  JOURNAL = {{International Journal of Computer Vision}},
  PUBLISHER = {{Springer Verlag}},
  YEAR = {2016},
  PDF = {},

This table sums up the provided data for the datasets:

                    Climbing                     Madonna
Videos              climbing/                    madonna_align_ids.txt
Ground truth        gt_climbing.align            gt_madonna.align
SIFT descriptors    climbing_sift/               madonna_sift/
HOF descriptors     climbing_hof/                madonna_hof/
SIFT align result   found_climbing_sift.align    found_madonna_sift.align
HOF align result    found_climbing_hof.align     found_madonna_hof.align

Videos & descriptors

The Climbing and Madonna datasets contain 89 and 163 videos respectively.

The descriptors and videos are stored in a directory structure. The whole directory can be retrieved with e.g. wget -r

The descriptors are provided as matrices in the .fvecs file format, with one column per frame (at 15 fps) and 1024 dimensions per descriptor. The 1024 dimensions are sorted by decreasing order of the PCA components. To use them, keep the first d <= 1024 dimensions and L2-normalize; the dot product is then a reasonable comparison metric. Note that the HOF and SIFT descriptors do not have exactly the same number of frames, because different video decoders were used.
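The following is a minimal sketch of how the descriptors could be loaded and compared. It assumes the usual .fvecs layout (each vector stored as an int32 dimension followed by that many float32 values) and uses NumPy. The function names and the default d=512 are illustrative choices, not part of the dataset. The returned array has one row per frame, i.e. the transpose of the column-per-frame matrix described above.

```python
import numpy as np

def read_fvecs(path):
    """Read a .fvecs file into an (n_frames, dim) float32 array.

    Assumes the common layout: each record is an int32 dimension d
    followed by d float32 values, with the same d for every record.
    """
    raw = np.fromfile(path, dtype=np.int32)
    if raw.size == 0:
        return np.zeros((0, 0), dtype=np.float32)
    d = int(raw[0])
    records = raw.reshape(-1, d + 1)          # 1 header word + d values
    return records[:, 1:].copy().view(np.float32)

def prepare(frames, d=512):
    """Keep the first d PCA dimensions and L2-normalize each frame."""
    x = frames[:, :d].astype(np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # leave all-zero frames untouched
    return x / norms

def frame_similarities(a, b, d=512):
    """Dot products between every frame of a and every frame of b."""
    return prepare(a, d) @ prepare(b, d).T
```

Entry (i, j) of the result of frame_similarities compares frame i of the first video with frame j of the second. Smaller values of d trade accuracy for speed and memory.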


Each directory corresponds to one of the cameras that were used. Within each directory, the alphabetical order of the file names gives the ordering of the videos. All videos were converted to the same codec from widely diverging file types. If you are interested in the original files, contact us.
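As an illustration of this layout, the sketch below lists the videos of each camera in order. It assumes a local copy of the data with one sub-directory per camera. The function name is hypothetical, and Python's default string sort is used as a stand-in for alphabetical order.

```python
from pathlib import Path

def videos_by_camera(root):
    """Map each camera directory name to its sorted list of file names."""
    root = Path(root)
    result = {}
    for cam in sorted(p for p in root.iterdir() if p.is_dir()):
        # Sorting the names reproduces the per-camera video order.
        result[cam.name] = sorted(f.name for f in cam.iterdir() if f.is_file())
    return result

if __name__ == "__main__":
    # "climbing/" is the video directory named in the table above.
    if Path("climbing").is_dir():
        for camera, names in videos_by_camera("climbing").items():
            print(camera, len(names), "videos")
```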


The Madonna dataset was retrieved in the same way as the "mdna+rome+2012" event in EVVE. The list of YouTube ids is given in the table above. Contact us if you need help getting the videos themselves from YouTube.


Align file format

The align file format contains one line per video segment. For example:
mdna+rome+2012/UJiQE4hEtio 523.08 544.81 -521.30 18 0


Since the segments are rushes, i.e. they coincide with the videos, all segments start at 0 and end at the length of the video.
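A minimal sketch for reading such a file is given below. The meaning of the individual numeric columns is not spelled out in this section, so the sketch deliberately keeps them as an unnamed list of floats rather than guessing field names. The function names are hypothetical.

```python
def parse_align_line(line):
    """Split one align line into (video_name, [numeric values]).

    Returns None for blank lines. The numeric columns are kept as a
    plain list because their meaning is not assumed here.
    """
    parts = line.split()
    if not parts:
        return None
    return parts[0], [float(v) for v in parts[1:]]

def read_align(lines):
    """Parse an iterable of lines, skipping blank ones."""
    records = (parse_align_line(line) for line in lines)
    return [rec for rec in records if rec is not None]

if __name__ == "__main__":
    example = "mdna+rome+2012/UJiQE4hEtio 523.08 544.81 -521.30 18 0"
    name, values = parse_align_line(example)
    print(name, values)
```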


For Madonna, many videos are exact copies.

Annotation protocol

The annotation was done manually with AlignGUI, without using the automatic or semi-automatic mode, to avoid bias towards the indexing algorithm.

The alignment is accurate to within 0.5 s; for most clips it is considerably better.

There are additional videos in both Climbing and in Madonna (and additional unannotated footage in the latter). This data is not taken into account during evaluation.

Evaluation script

The evaluation script for the PAS is here: Usage:
  python result_file.align groundtruth_file.align
The script rejects a result file in which segments of the same video overlap. Not every possible segment of a video needs to appear in the file.
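To check a result file for this condition before running the evaluation, something like the sketch below could be used. It is not the evaluation script's own check. It assumes that each segment has already been reduced to a (video, begin, end) tuple, and that segments which merely touch (one ends exactly where the next begins) are not counted as overlapping. Both are assumptions made for illustration.

```python
from collections import defaultdict

def find_overlaps(segments):
    """Return overlapping neighbours among segments of the same video.

    segments: iterable of (video, begin, end) tuples (assumed layout).
    The result is non-empty exactly when some video has two segments
    that overlap; it lists neighbouring pairs in start-time order,
    not every overlapping pair.
    """
    by_video = defaultdict(list)
    for video, begin, end in segments:
        by_video[video].append((begin, end))

    overlaps = []
    for video, spans in by_video.items():
        spans.sort()  # order by start time
        for (b0, e0), (b1, e1) in zip(spans, spans[1:]):
            if b1 < e0:  # next segment starts before the previous one ends
                overlaps.append((video, (b0, e0), (b1, e1)))
    return overlaps
```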

The ground-truth alignment has the same format. To compute the maximum possible score (the PAS of the ground truth), run the script on the ground-truth file itself.

Note that for Climbing the PAS is not exactly 108, because alignment inaccuracies translate into incomplete overlaps.


The source for the GUI program that does (semi-)interactive alignment is here: align_gui-1.0.tgz. To make it work on e.g. the Climbing videos, the videos must be accessible in a directory along with thumbnails for the frames. For the Climbing dataset, the thumbnails are precomputed here: climbing_thumbnails/.

Result videos

At each time step, all aligned video frames are laid out together in a joint visualization, so that the quality of the temporal alignment is directly visible. The grouping of the videos, their alignment, and the layout of the output are all fully automatic. A few interesting examples from Madonna and other videos from the EVVE dataset are shown below.

An example from the Madonna dataset. Many people filmed the concert with their mobile phones and posted the result on YouTube (some videos are duplicates): vid1, vid2.

Eruptions of the Strokkur geyser are aligned temporally. Some of them show slight misalignments, due to the varying aspect of the geyser: vid.

Universal Studios' Jurassic Park attraction. This group of videos covers almost all of the ride: vid.

Legal, contact

The Climbing videos may be used for any scientific purpose, including to illustrate publications. The Madonna videos have the same license as the EVVE dataset.

The descriptors and annotations are free.

For any question, remark, bug report, contact matthijs dot douze at inria dot fr

Last modified: 2016-01-08