Croatian web corpus MaCoCu-hr 2.0
In: https://macocu.eu/, 2023
Online
The Croatian web corpus MaCoCu-hr 2.0 was built by crawling the ".hr" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the 1,500 largest domains were manually checked, and bad domains, such as machine-translated domains, were removed. The dataset is characterized by extensive metadata, which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, the paragraph quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool, https://corpus.tools/wiki/Justext), fluency (a score between 0 and 1, assigned with the Monocleaner tool, https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and whether the paragraph contains sensitive information (identified via the Biroamer tool, https://github.com/bitextor/biroamer).
As opposed to the previous version, this version has more accurate metadata on languages of the texts, which was achieved by using ...
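The paragraph-level metadata described above lends itself to simple filtering. The following is a minimal sketch only: the element names (`doc`, `p`), the attribute names (`quality`, `lm_score`, `lang`, `sensitive`, `heading`) and the inline sample are assumptions made for illustration and should be checked against the actual corpus files before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names are assumed, not taken
# from the corpus documentation.
SAMPLE = """<corpus>
<doc id="d1" title="Primjer" url="https://example.hr/a" domain="example.hr" lm_score="0.91">
<p id="d1.1" heading="yes" quality="short" lm_score="0.95" lang="hr" sensitive="no">Naslov</p>
<p id="d1.2" heading="no" quality="good" lm_score="0.88" lang="hr" sensitive="no">Ovo je primjer dobrog odlomka.</p>
<p id="d1.3" heading="no" quality="good" lm_score="0.35" lang="en" sensitive="no">Some low fluency text.</p>
<p id="d1.4" heading="no" quality="good" lm_score="0.90" lang="hr" sensitive="yes">Kontakt: ime@example.hr</p>
</doc>
</corpus>"""


def select_paragraphs(xml_text, min_fluency=0.5, lang="hr"):
    """Return (url, text) pairs for paragraphs that pass all filters."""
    root = ET.fromstring(xml_text)
    kept = []
    for doc in root.iter("doc"):
        for p in doc.iter("p"):
            # Keep only paragraphs labelled "good" by the quality label.
            if p.get("quality") != "good":
                continue
            # Drop paragraphs whose fluency score is below the threshold.
            if float(p.get("lm_score", "0")) < min_fluency:
                continue
            # Keep only paragraphs identified as the target language.
            if p.get("lang") != lang:
                continue
            # Skip paragraphs flagged as containing sensitive information.
            if p.get("sensitive") == "yes":
                continue
            kept.append((doc.get("url"), (p.text or "").strip()))
    return kept


if __name__ == "__main__":
    for url, text in select_paragraphs(SAMPLE):
        print(url, text)
```

The threshold of 0.5 is arbitrary; a suitable value depends on the intended use of the corpus.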
Title: | Croatian web corpus MaCoCu-hr 2.0 |
---|---|
Author / Contributor: | Bañón, Marta ; Chichirau, Malina ; Esplà-Gomis, Miquel ; Forcada, Mikel L. ; Galiano-Jiménez, Aarón ; García-Romero, Cristian ; Kuzman, Taja ; Ljubešić, Nikola ; van Noord, Rik ; Pla Sempere, Leopoldo ; Ramírez-Sánchez, Gema ; Rupnik, Peter ; Suchomel, Vít ; Toral, Antonio ; Zaragoza-Bernabeu, Jaume |
Link: | |
Journal: | https://macocu.eu/, 2023 |
Publication: | Jožef Stefan Institute ; Prompsit ; Rijksuniversiteit Groningen ; Universitat d'Alacant, 2023 |
Media type: | unknown |