Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec
In: Information Sciences, vol. 477 (2019-03-01), pp. 15-29
The purpose of document classification is to assign the most appropriate label to a given document. The main challenges in document classification are insufficient label information and the unstructured, sparse format of documents. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas considering multiple document representation schemes can resolve the latter. Co-training is a popular SSL method that attempts to exploit different perspectives on the same example, expressed as distinct feature subsets. In this paper, we propose multi-co-training (MCT) to improve the performance of document classification. To increase the variety of feature sets for classification, we transform each document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and a neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
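As a rough illustration of the co-training idea described in the abstract, the following Python sketch runs a generic co-training loop over several views of the same documents. It is not the authors' MCT procedure. The nearest-centroid classifier, the margin-based confidence score, the function names (`fit_centroids`, `predict`, `multi_co_train`), the parameters `rounds` and `per_round`, and the toy vectors standing in for TF–IDF, LDA, and Doc2Vec features are all assumptions made for illustration.

```python
from math import dist


def fit_centroids(X, y):
    """Hypothetical stand-in classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(centroids, x):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    ranked = sorted((dist(x, c), label) for label, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin


def multi_co_train(views, labels, rounds=5, per_round=1):
    """Generic co-training over several views (assumed scheme, not the paper's).

    views:  list of feature matrices, one per representation, aligned by document.
    labels: one entry per document; None marks an unlabeled document.
    """
    labels = list(labels)
    for _ in range(rounds):
        unlabeled = [i for i, l in enumerate(labels) if l is None]
        if not unlabeled:
            break
        labeled = [i for i, l in enumerate(labels) if l is not None]
        proposals = {}
        for X in views:
            # Train one classifier per view on the current labeled pool.
            model = fit_centroids([X[i] for i in labeled],
                                  [labels[i] for i in labeled])
            scored = []
            for i in unlabeled:
                label, margin = predict(model, X[i])
                scored.append((margin, i, label))
            scored.sort(reverse=True)
            # Each view pseudo-labels its most confident documents;
            # if two views pick the same document, the first proposal is kept.
            for _, i, label in scored[:per_round]:
                proposals.setdefault(i, label)
        # Pseudo-labels join the shared pool used by every view next round.
        for i, label in proposals.items():
            labels[i] = label
    return labels


# Toy vectors standing in for three representations of six documents.
view_a = [[0, 0], [0.1, 0.2], [0.2, 0.1], [1, 1], [0.9, 0.8], [0.8, 0.9]]
view_b = [[0], [0.2], [0.1], [1], [0.7], [0.9]]
view_c = [[5, 0], [4.8, 0.3], [4.9, 0.1], [0, 5], [0.2, 4.9], [0.3, 4.7]]
seed_labels = ["a", None, None, "b", None, None]

result = multi_co_train([view_a, view_b, view_c], seed_labels)
print(result)
```

In practice, each view would be produced by a real feature extractor and the per-view classifier would be a stronger model; the loop only shows how confident predictions from one view can enlarge the labeled pool that the other views train on.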
Title: | Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec |
---|---|
Author / Contributor: | Kim, Donghwa ; Cho, Suhyoun ; Kang, Pilsung ; Seo, Deokseong |
Journal: | Information Sciences, vol. 477 (2019-03-01), pp. 15-29 |
Publication: | Elsevier BV, 2019 |
ISSN: | 0020-0255 (print) |
DOI: | 10.1016/j.ins.2018.10.006 |