LETOR Dataset

Istella is glad to release the Istella Learning to Rank (LETOR) dataset to the public. This dataset has been used in the past to learn one of the stages of the Istella production ranking pipeline. To the best of our knowledge, this is the largest publicly available LETOR dataset, particularly useful for large-scale experiments on the efficiency and scalability of LETOR solutions.

To use the dataset, you must read and accept the Istella LETOR Licence Agreement. By using the dataset, you agree to be bound by the terms of the licence: the Istella dataset is for non-commercial use only.

The Istella LETOR full dataset is composed of 33,018 queries and 220 features representing each query-document pair. It consists of 10,454,629 examples labeled with relevance judgments ranging from 0 (irrelevant) to 4 (perfectly relevant). On average, each query has 316 examples. The dataset has been split into train and test sets according to an 80%-20% scheme.
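LETOR datasets such as this one are conventionally distributed in an SVMlight-style text format, one example per line: a relevance label, a `qid:` token identifying the query, and `feature_id:value` pairs. Assuming that layout (the exact file format should be checked against the downloaded archive), a minimal parser sketch might look like this:

```python
def parse_letor_line(line):
    """Parse one SVMlight/LETOR-style record:
    '<label> qid:<id> <fid>:<val> <fid>:<val> ... [# optional comment]'."""
    # Drop an optional trailing comment, then tokenize.
    parts = line.split("#", 1)[0].split()
    label = int(parts[0])                     # relevance judgment, 0..4
    qid = parts[1].split(":", 1)[1]           # query identifier
    features = {}                             # feature id -> value (220 in the real files)
    for token in parts[2:]:
        fid, val = token.split(":", 1)
        features[int(fid)] = float(val)
    return label, qid, features

# Example with a made-up three-feature record:
label, qid, feats = parse_letor_line("4 qid:10 1:0.5 2:12.0 3:0.0")
```

For larger-scale experiments, `sklearn.datasets.load_svmlight_file` with `query_id=True` can read the same format directly into sparse matrices, avoiding a hand-rolled parser.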

If you want to use the dataset in your research, you can download Istella LETOR here. In case you use it, we kindly ask you to acknowledge Istella SpA and cite the following publication in your research:

Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini
Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees.
ACM Trans. Inf. Syst. 35, 2, Article 15 (December 2016), 31 pages.
DOI: https://doi.org/10.1145/2987380

We also made available a smaller sample of the dataset (named Istella-S LETOR). Like the full Istella LETOR, it is composed of 33,018 queries and 220 features representing each query-document pair. Istella-S LETOR consists of 3,408,630 pairs, produced by sampling irrelevant pairs down to an average of 103 examples per query. It has been split into train, validation and test sets according to a 60%-20%-20% scheme. If you want to use the dataset in your research, you can download Istella-S LETOR here. In case you use it, we kindly ask you to acknowledge Istella SpA and cite the following publication in your research:
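Per-query statistics such as the averages quoted above (316 examples per query for the full dataset, 103 for Istella-S) can be recomputed from the files themselves. Assuming the standard SVMlight-style layout with a `qid:` token in each line, a small sketch:

```python
from collections import Counter

def examples_per_query(lines):
    """Count examples per qid in LETOR-format lines; return (counts, average)."""
    counts = Counter()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        qid = line.split()[1].split(":", 1)[1]
        counts[qid] += 1
    avg = sum(counts.values()) / len(counts)
    return counts, avg

# Toy example with three records over two queries:
sample = [
    "0 qid:1 1:0.1",
    "2 qid:1 1:0.3",
    "1 qid:2 1:0.7",
]
counts, avg = examples_per_query(sample)  # counts {'1': 2, '2': 1}, avg 1.5
```

Run over the actual train/validation/test files, this gives both the per-query distribution and the overall average for sanity-checking a download.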

Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Salvatore Trani
Post-Learning Optimization of Tree Ensembles for Efficient Ranking.
In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, USA, 949-952.
DOI: http://dx.doi.org/10.1145/2911451.2914763

We have also made available a larger dataset (named Istella-X, for eXtended, LETOR). It is composed of 10,000 queries and 220 features representing each query-document pair. Istella-X LETOR consists of 26,791,447 pairs, produced by retrieving up to 5,000 documents per query according to the BM25F ranking score. It has been split into train, validation and test sets according to a 60%-20%-20% scheme. If you want to use the Istella-X dataset in your research, you can download Istella-X LETOR here. In case you use it, we kindly ask you to acknowledge Istella SpA and cite the following publication in your research:

Claudio Lucchese, Franco Maria Nardini, Raffaele Perego, Salvatore Orlando, Salvatore Trani
Selective Gradient Boosting for Effective Learning to Rank.
In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, USA.
DOI: http://dx.doi.org/10.1145/3209978.3210048