Newsroom

Summarization Dataset

Terms of Use

Download the Data

Download the complete training, development, and released test data, or scrape the data yourself using Archive.org URLs ("thin dataset") and the data builder scripts. Both downloads include tools for analyzing reference summaries, evaluating systems across extractive subsets of the dataset, and preparing system submissions for the unreleased test data.

Format

CORNELL NEWSROOM contains three large files, one each for the training, development, and released test sets. Each file uses the compressed JSON Lines format: every line is a JSON object representing a single article-summary pair. An example object:

{
        "text": "...",
     "summary": "...",
       "title": "...",
 "publication": "cnn.com",
     "archive": "http://...",
        "date": 20160302060024,
     "density": 1.25,
    "coverage": 0.75,
 "compression": 12.5,
      "extbin": "mixed",
      "sumbin": "long",
     "textbin": "medium",
      "subset": "train"
}

The date is an integer in the Internet Archive date format, YYYYMMDDHHMMSS. Density and coverage scores are provided for convenience; they are computed with the summary analysis tool that is also provided. Each object also records its data subset, along with its bins by extractiveness, summary length, and text length. For example, in Python, each data file can be read as follows:

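import bz2
import json

path = "train.jsonl.bz2"
data = []

# Each line of the decompressed file is one JSON object (an article-summary pair).
with bz2.open(path) as f:
    for ln in f:
        obj = json.loads(ln)  # json.loads accepts the raw bytes line (Python 3.6+)
        data.append(obj)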
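
Because the files are large, it may be preferable to stream records rather than hold them all in a list. The following is a minimal sketch, not part of the released tools: the helper names parse_archive_date, iter_records, and filter_by_bin are hypothetical, and UTF-8 text encoding is an assumption. It shows one way to turn the YYYYMMDDHHMMSS integer into a datetime and to select records by a field such as "extbin".

```python
import bz2
import json
from datetime import datetime


def parse_archive_date(stamp):
    """Convert a YYYYMMDDHHMMSS integer (or string) into a datetime."""
    return datetime.strptime(str(stamp), "%Y%m%d%H%M%S")


def iter_records(path):
    """Yield one article-summary object per line, without loading the whole file."""
    # Assumption: the decompressed text is UTF-8 encoded.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for ln in f:
            yield json.loads(ln)


def filter_by_bin(records, field, value):
    """Lazily keep only records whose field (e.g. "extbin") equals value."""
    return (r for r in records if r.get(field) == value)


# Example usage (hypothetical, requires the downloaded file):
#   mixed = filter_by_bin(iter_records("train.jsonl.bz2"), "extbin", "mixed")
#   for obj in mixed:
#       print(parse_archive_date(obj["date"]), obj["title"])
```

Generators keep memory use roughly constant regardless of file size, at the cost of re-reading the file for each pass over the data.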