Resources | Computational Linguistics Lab

Datasets for evaluating compositional distributional models of meaning

Grefenstette and Sadrzadeh Compositional Distributional Model Evaluation Dataset, EMNLP 2011 (file)
Grefenstette and Sadrzadeh Compositional Distributional Model Evaluation Dataset, adjective-noun based transitive sentences, 2012 (file)
Kartsaklis, Sadrzadeh and Pulman Term-Definition Dataset, COLING 2012 (file)
Disambiguation Dataset used in Kartsaklis et al., CoNLL 2013. This previously unpublished dataset was produced by Mehrnoosh Sadrzadeh and Edward Grefenstette (file)
Kartsaklis and Sadrzadeh transitive sentence similarity dataset, EMNLP 2013. This dataset extends the verb-object part of the Mitchell and Lapata (2010) dataset by adding appropriate subject nouns. This version uses the original human judgements from the Mitchell and Lapata (2010) dataset (file)
Kartsaklis and Sadrzadeh transitive sentence similarity dataset, QPL 2014. This is the same dataset as the one used in the EMNLP 2013 paper, but with re-evaluated human scores collected via Amazon Mechanical Turk (file)
Kartsaklis and Sadrzadeh entailment datasets, LACL 2016/COLING 2016 (subject-verb, verb-object, and subject-verb-object)
Wijnholds and Sadrzadeh verb disambiguation and sentence similarity datasets involving verb phrase ellipsis, NAACL 2019 (repo)

Word embeddings

Cheng and Kartsaklis code (in Python/Theano) and word embeddings, EMNLP 2015 (link)

Code

Code for reproducing the self- and other-repair detection experiments of Purver, Hough and Howes (TopiCS 2018).
Code and documentation for reproducing the categorical compositional models of Grefenstette and Sadrzadeh (EMNLP 2011), Kartsaklis et al. (COLING 2012), and Milajevs et al. (EMNLP 2014).
STIR (“Strictly Incremental Repair Detection”) – an open-source set of tools for self-repair detection in dialogue data.
DyLan (“Dynamics of Language”) – an open-source Java implementation of Dynamic Syntax, including a word-by-word incremental semantic parser and generator, and integration with the Jindigo dialogue system.
DiaSim – an open-source Java project for calculating lexical, syntactic and semantic similarity in dialogue corpora, including within- and between-speaker similarity and comparison to various randomly re-ordered baselines – see this paper.