Argumentation sentences 1.0
https://doi.org/10.23695/56T6-RC52
I. IDENTIFYING INFORMATION
Title*
Argumentation sentences
Subtitle
A translated corpus for classifying sentence stance in relation to a topic.
Created by*
Anna Lindahl (anna.lindahl@svenska.gu.se)
Publisher(s)*
Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/en/resources/superlim
License(s)*
CC BY 4.0
Abstract*
Argumentation sentences is a translated corpus for the task of identifying stance in relation to a topic. It consists of sentences labeled with pro, con or non in relation to one of six topics. The original dataset [1] can be found at https://github.com/trtm/AURC. The test set consists of manually corrected translations; the training set is machine translated.
Funded by*
Vinnova (grant no. 2021-04165)
Cite as
Related datasets
Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim)
II. USAGE
Key applications
Machine learning, argumentation mining, stance classification
Intended task(s)/usage(s)
Evaluate models on the following task: Given a sentence and a topic, determine if the sentence is for, against or neutral in relation to the topic.
Recommended evaluation measures
Krippendorff’s alpha (the official SuperLim measure), MCC, F
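As an illustration of the first measure, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It assumes that the gold labels and the model predictions are treated as two annotators with no missing values; whether this matches the official SuperLim scoring script exactly is not confirmed here. The function name `krippendorff_alpha_nominal` and the toy labels are hypothetical.

```python
from collections import Counter


def krippendorff_alpha_nominal(gold, pred):
    """Krippendorff's alpha for nominal data with two 'annotators'.

    Assumption: gold labels and predictions are the two annotators,
    and every sentence has exactly one value from each.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one labeled sentence is required")

    # Each sentence contributes two values (one gold, one predicted).
    n = 2 * len(gold)
    value_counts = Counter(gold) + Counter(pred)

    # Observed disagreement: every mismatching pair is counted in both orders.
    observed = 2 * sum(g != p for g, p in zip(gold, pred))

    # Expected disagreement: sum of n_c * n_k over ordered pairs of distinct labels.
    expected = n * n - sum(count * count for count in value_counts.values())
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")

    return 1.0 - (n - 1) * observed / expected


# Toy example (made-up labels, not taken from the corpus).
gold_labels = ["pro", "pro", "con", "con"]
predicted_labels = ["pro", "pro", "con", "pro"]
alpha = krippendorff_alpha_nominal(gold_labels, predicted_labels)
```

Perfect agreement gives an alpha of 1.0, and systematic disagreement gives a negative value.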
Dataset function(s)
Training, testing
Recommended split(s)
Train, dev, test (provided)
III. DATA
Primary data*
Text
Language*
Swedish
Dataset in numbers*
5265 sentences split over 6 topics, 3450 train, 750 dev and 1065 test
Nature of the content*
Topics: Abortion, Death penalty, Nuclear power, Marijuana legalization, Minimum wage, Cloning. Each topic has a set of associated sentences, labeled with pro, con or non in relation to the topic.
Format*
JSONL with the following keys: sentence_id = the ID of the sentence, topic = the topic of the sentence, label = the label of the sentence (pro, con or non), sentence = the sentence itself
Tab-separated with 4 columns: the ID of the sentence, the topic of the sentence, the label of the sentence (pro, con or non), and the sentence itself
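The JSONL format described above could be read along the following lines. This is a sketch only: the key names and the three label values come from this documentation, while the function name `parse_jsonl`, the sample records, their IDs and the spelling of the topic strings are invented for illustration and are not taken from the corpus files.

```python
import json
from collections import Counter

# Keys and label values as listed in this documentation.
EXPECTED_KEYS = {"sentence_id", "topic", "label", "sentence"}
VALID_LABELS = {"pro", "con", "non"}


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of record dicts."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        if record["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_number}: unexpected label {record['label']!r}")
        records.append(record)
    return records


# Made-up sample records (hypothetical IDs, topic spellings and sentences).
sample_lines = [
    '{"sentence_id": "s1", "topic": "nuclear power", "label": "pro", "sentence": "Exempelmening 1."}',
    '{"sentence_id": "s2", "topic": "nuclear power", "label": "con", "sentence": "Exempelmening 2."}',
    '{"sentence_id": "s3", "topic": "cloning", "label": "non", "sentence": "Exempelmening 3."}',
]

records = parse_jsonl(sample_lines)
label_counts = Counter(record["label"] for record in records)
```

To read an actual file, an open file handle can be passed in place of `sample_lines`, since the function accepts any iterable of lines.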
Data source(s)*
The original data comes from the AURC dataset [1] (https://github.com/trtm/AURC). For this corpus, only the in-domain topics were used.
Data collection method(s)*
Collected from the Common Crawl archive. See [1]
Data selection and filtering*
A subset of the original data: only the in-domain topics are used.
Data preprocessing*
Sentences were machine translated. The test set was then manually corrected.
Data labeling*
The sentences are labeled with pro, con or non, signifying their stance in relation to a topic.
Annotator characteristics
IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for
V. ABOUT DOCUMENTATION
Data last updated*
20221215
Which changes have been made, compared to the previous version*
First version
Access to previous versions
This document created*
20221215 by Anna Lindahl
This document last updated*
20220203 by Anna Lindahl
Where to look for further details
Documentation template version*
v1.1
VI. OTHER
Related projects
References
[1] Trautmann, D., Daxenberger, J., Stab, C., Schütze, H., & Gurevych, I. (2020, April). Fine-grained argument unit recognition and classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 9048-9056).
