SUC Novels (StorSUC)
https://doi.org/10.23695/6BFJ-NE40
StorSUC is bonus material distributed with SUC 2.0 (see below). The material is not formally part of SUC and does not adhere to the corpus format. It has not been annotated; it is only structured into paragraph-like segments. StorSUC is not balanced.
The Stockholm-Umeå Corpus (SUC) is a collection of Swedish texts from
the 1990s, consisting of one million words in total. The corpus is
balanced, meaning that it contains various text types and stylistic
levels. The texts are annotated with part-of-speech tags,
morphological analysis and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information.
Version 1.0 was developed in co-operation between Gunnel
Källgren at Stockholm University and Eva Ejerhed at Umeå University
and was made available in 1997 by the department of linguistics at
Stockholm University. Version 2.0 was made available in 2006 by
Sofia Gustafson-Capková and Britt Hartmann at the department of
linguistics at Stockholm University. It contains the same texts as SUC
1.0 but is extended with some annotation. Additionally, SUC 2.0
contains bonus materials. TigerSUC is SUC 2.0 converted to TIGER-XML
by Martin Volk. StorSUC is additional SUC material of four million
words. Version 3.0 has been available since 2012. It contains
improved annotations, and unannotated texts with seven million
words. (For the TigerXML-version, Suc2c, Suc2d, and the DTDs we still
refer to version 2.0.)
Additional information about the compilation and annotation of SUC can be found in the SUC 2.0 manual [PDF].
Språkbanken distributes SUC 2.0 and SUC 3.0 in two variants.
The official corpus
SUC is freely available for research, but requires that every user
signs an individual license with the department of linguistics at
Stockholm University. Since December 1st 2008, SUC licensing has been
delegated to Språkbanken Text at the University of Gothenburg.
Appendix 3 of the SUC license [PDF] needs to be printed, signed, and sent to:
SUC-licens
Språkbanken
Institutionen för svenska, flerspråkighet och språkteknologi
Göteborgs universitet
Box 200
405 30 Göteborg
When we have received and registered the signed license, we will contact you with a download link.
Scrambled corpus with additional automatic annotation
A second variant can be downloaded directly under the open license CC BY-SA, below. The order of the sentences in this version has been scrambled, and extra annotation has been added automatically by Språkbanken's processing pipeline. The corpus is distributed in Språkbanken's default XML format.
The following annotation is taken from the official version:
Part of speech (pos attributes of word elements)
Morphology (msd attributes)
Lemma (lemma attributes)
Named entity (SUC 3.0 only; tags, not the tags)
All other annotation, like the linking against Saldo, the dependency parses, and alternative named entity annotation ( tags), was created automatically
by Sparv.
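As a rough illustration of how the attributes listed above might be read from the scrambled XML variant, here is a minimal Python sketch. It rests on assumptions that are not confirmed by this page: the token element name (`w` here, passed as a parameter so it can be changed), the surrounding `corpus` and `sentence` elements, and the sample values are all hypothetical. Only the attribute names `pos`, `msd` and `lemma` come from the description above. Check the downloaded file for the actual element names before relying on this.

```python
import xml.etree.ElementTree as ET

# Invented two-token sample. The element names ("corpus", "sentence", "w")
# and the attribute values are assumptions for illustration only; the real
# file may use different element names and value formats.
SAMPLE = """<corpus>
  <sentence>
    <w pos="DT" msd="DT.UTR.SIN.IND" lemma="en">En</w>
    <w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="bok">bok</w>
  </sentence>
</corpus>"""


def read_tokens(xml_text, token_tag="w"):
    """Return one dict per token element with its form, pos, msd and lemma.

    token_tag is the assumed name of the word element; adjust it to match
    the file you actually downloaded.
    """
    root = ET.fromstring(xml_text)
    return [
        {
            "form": (el.text or "").strip(),
            "pos": el.get("pos"),      # part of speech
            "msd": el.get("msd"),      # morphological description
            "lemma": el.get("lemma"),  # lemma
        }
        for el in root.iter(token_tag)
    ]


if __name__ == "__main__":
    for token in read_tokens(SAMPLE):
        print(token["form"], token["pos"], token["msd"], token["lemma"])
```

For a full corpus file, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole document into memory at once.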
It is this variant of SUC that can be explored in Korp.
