Newsroom

Summarization Dataset

Terms of Use

This Dataset Usage Agreement ("Agreement") is a legal agreement with the Cornell Newsroom Summaries Team ("Newsroom") for the Dataset made available to the individual or entity ("Researcher") exercising rights under this Agreement. "Dataset" includes all text, data, information, source code, and any related materials, documentation, files, media, updates or revisions.

The Dataset is intended for non-commercial research and educational purposes only, and is made available free of charge without extending any license or other intellectual property rights. By downloading or using the Dataset, the Researcher acknowledges that they agree to the terms in this Agreement, and represent and warrant that they have authority to do so on behalf of any entity exercising rights under this Agreement. The Researcher accepts and agrees to be bound by the terms and conditions of this Agreement. If the Researcher does not agree to this Agreement, they may not download or use the Dataset.

By sharing content with Newsroom, such as by submitting content to this site or by corresponding with Newsroom contributors, the Researcher grants Newsroom the right to use, reproduce, display, perform, adapt, modify, distribute, have distributed, and promote the content in any form, anywhere and for any purpose, such as for evaluating and comparing summarization systems. Nothing in this Agreement shall obligate Newsroom to provide any support for the Dataset. Any feedback, suggestions, ideas, comments, improvements given by the Researcher related to the Dataset is voluntarily given, and may be used by Newsroom without obligation or restriction of any kind.

The Researcher accepts full responsibility for their use of the Dataset and shall defend indemnify, and hold harmless Newsroom, including their employees, trustees, officers, and agents, against any and all claims arising from the Researcher's use of the Dataset. The Researcher agrees to comply with all laws and regulations as they relate to access to and use of the Dataset and Service including U.S. export jurisdiction and other U.S. and international regulations.

THE DATASET IS PROVIDED "AS IS." NEWSROOM DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. WITHOUT LIMITATION OF THE ABOVE, NEWSROOM DISCLAIMS ANY WARRANTY THAT DATASET IS BUG OR ERROR-FREE, AND GRANTS NO WARRANTY REGARDING ITS USE OR THE RESULTS THEREFROM INCLUDING, WITHOUT LIMITATION, ITS CORRECTNESS, ACCURACY, OR RELIABILITY. THE DATASET IS NOT WARRANTIED TO FULFILL ANY PARTICULAR PURPOSES OR NEEDS.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT SHALL NEWSROOM BE LIABLE FOR ANY LOSS, DAMAGE OR INJURY, DIRECT AND INDIRECT, INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER FOR BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE) OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, INCLUDING BUT NOT LIMITED TO LOSS OF PROFITS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING ANY FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY.

This Agreement is effective until terminated. Newsroom reserves the right to terminate the Researcher's access to the Dataset at any time. If the Researcher breaches this Agreement, the Researcher's rights to use the Dataset shall terminate automatically. The Researcher will immediately cease all use and distribution of the Dataset and destroy any copies or portions of the Dataset in their possession.

This Agreement is governed by the laws of the State of New York, without regard to conflict of law principles. All terms and provisions of this Agreement shall, if possible, be construed in a manner which makes them valid, but in the event any term or provision of this Agreement is found by a court of competent jurisdiction to be illegal or unenforceable, the validity or enforceability of the remainder of this Agreement shall not be affected.

This Agreement is the complete and exclusive agreement between the parties with respect to its subject matter and supersedes all prior or contemporaneous oral or written agreements or understandings relating to the subject matter.

Download the Data

Download the complete training, development, and released test data, or scrape the data yourself using Archive.org URLs ("thin dataset") and the data builder scripts. Both downloads include tools for analyzing reference summaries, evaluating systems across extractive subsets of the dataset, and preparing system submissions for the unreleased test data.

Format

CORNELL NEWSROOM contains three large files for training, development, and released test sets. Each of these files uses the compressed JSON line format. Each line is an object representing a single article-summary pair. An example summary object:

{
        "text": "...",
     "summary": "...",
       "title": "...",
 "publication": "cnn.com",
     "archive": "http://...",
        "date": 20160302060024,
     "density": 1.25,
    "coverage": 0.75,
 "compression": 12.5,
      "extbin": "mixed",
      "sumbin": "long",
     "textbin": "medium",
      "subset": "train",
}

The date is an integer using the Internet Archive date format: YYYYMMDDHHMMSS. Density and coverage scores are provided for convenience, computed using the summary analysis tool also provided. Data subset and subsets by extractiveness, summary length, text length are also provided. For example, in Python, each data file can be read as follows:

import json, gz

path = "train.jsonl.gz"
data = []

with gz.open(path) as f:
    for ln in f:
        obj = json.loads(ln)
        data.append(obj)