Einzeltreffer — DigiBib

The Croatian web corpus MaCoCu-hr 2.0 was built by crawling the ".hr" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted into cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the largest 1,500 domains were manually checked and bad domains, such as machine-translated domains, were removed. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs that are accompanied by metadata on the information whether a paragraph is a heading or not, metadata on the paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext) and fluency (score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and information whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). As opposed to the previous version, this version has more accurate metadata on languages of the texts, which was achieved by using ...

Titel:	Croatian web corpus MaCoCu-hr 2.0
Autor/in / Beteiligte Person:	Bañón, Marta ; Chichirau, Malina ; Esplà-Gomis, Miquel ; Forcada, Mikel L. ; Galiano-Jiménez, Aarón ; García-Romero, Cristian ; Kuzman, Taja ; Ljubešić, Nikola ; van Noord, Rik ; Pla Sempere, Leopoldo ; Ramírez-Sánchez, Gema ; Rupnik, Peter ; Suchomel, Vít ; Toral, Antonio ; Zaragoza-Bernabeu, Jaume
Link:	https://hdl.handle.net/11370/685514a8-947e-44f9-83cf-90356c5f1684 http://hdl.handle.net/11356/1516 Volltext
Zeitschrift:	https://macocu.eu/, 2023
Veröffentlichung:	Jožef Stefan Institute ; Prompsit ; Rijksuniversiteit Groningen ; Universitat d'Alacant, 2023
Medientyp:	unknown
Schlagwort:	web corpus
Sonstiges:	Nachgewiesen in: BASE Sprachen: Croatian Collection: Linguistic Data and NLP Tools (CLARIN - Common Language Resources and Technology Infrastructure, Slovenia) Document Type: other/unknown material File Description: text/plain; charset=utf-8; application/octet-stream; application/zip; downloadable_files_count: 2 Language: Croatian Rights: CC0-No Rights Reserved ; https://creativecommons.org/publicdomain/zero/1.0/ ; PUB

Klicken Sie ein Format an und speichern Sie dann die Daten oder geben Sie eine Empfänger-Adresse ein und lassen Sie sich per Email zusenden.

BibTeX Citavi, JabRef, u.a.
(Literaturverwaltung)

PDF kein Volltext!
(Merkzettel, Notizen)

RIS Endnote, Citavi u.a.
(Literaturverwaltung)

MODS
(XML zur Weiterverarbeitung)

oder

Wählen Sie das für Sie passende Zitationsformat und kopieren Sie es dann in die Zwischenablage, lassen es sich per Mail zusenden oder speichern es als PDF-Datei.

Gewünschter Zitations-Stil:

oder

Bitte prüfen Sie, ob die Zitation formal korrekt ist, bevor Sie sie in einer Arbeit verwenden. Benutzen Sie gegebenenfalls den "Exportieren"-Dialog, wenn Sie ein Literaturverwaltungsprogramm verwenden und die Zitat-Angaben selbst formatieren wollen.