Challenges in web corpus construction for low-resource languages in a post-BootCaT world
In: Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference, Less Resourced Languages special track, Dec 2013, Poznan, Poland, pp. 69-73 (2013-12-07)
Software available under an open-source license: FLUX (Filtering and Language-identification for URL Crawling Seeds), https://github.com/adbar/flux-toolchain
International audience.
State-of-the-art tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, this querying process has become very slow or impossible to perform on a low budget. In order to find reliable data sources for Indonesian, I perform a case study of different kinds of URL sources and crawling strategies. First, I classify URLs extracted from the Open Directory Project and Wikipedia for Indonesian, Malay, Danish, and Swedish in order to enable comparisons. Then I perform web crawls focusing on Indonesian, using these sources as start URLs. My scouting approach, which uses open-source software, results in a URL database with metadata that can replace, or at least complement, the BootCaT approach.
Title: | Challenges in web corpus construction for low-resource languages in a post-BootCaT world |
---|---|
Author / Contributor: | Barbaresi, Adrien ; Interactions, Corpus, Apprentissages, Représentations (ICAR) ; École normale supérieure de Lyon (ENS de Lyon)-Université Lumière - Lyon 2 (UL2)-INRP-Ecole Normale Supérieure Lettres et Sciences Humaines (ENS LSH)-Centre National de la Recherche Scientifique (CNRS) ; This work has been partially funded by an internal grant of the FU Berlin, COW (COrpora from the Web) project at the German Grammar Department. ; The authors thank the LABEX ASLAN (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the 'Investissements d'Avenir' program (ANR-11-IDEX-0007) of the French State, managed by the Agence Nationale de la Recherche (ANR). ; École normale supérieure - Lyon (ENS Lyon)-Université Lumière - Lyon 2 (UL2)-INRP-Ecole Normale Supérieure Lettres et Sciences Humaines (ENS LSH)-Centre National de la Recherche Scientifique (CNRS) |
Source: | Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference, Less Resourced Languages special track, Dec 2013, Poznan, Poland, pp. 69-73 (2013-12-07) |
Publication: | HAL CCSD, 2013 |
Media type: | unknown |