Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Kim, Donghwa ; Cho, Suhyoun ; et al.

In: Information Sciences, vol. 477 (2019-03-01), pp. 15-29

The purpose of document classification is to assign the most appropriate label to a given document. The main challenges in document classification are insufficient label information and the unstructured, sparse format of text data. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas considering multiple document representation schemes can address the latter. Co-training is a popular SSL method that exploits several perspectives on the same example, each expressed as a different feature subset. In this paper, we propose multi-co-training (MCT) to improve the performance of document classification. To increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
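
The abstract does not spell out the MCT procedure, so the following is only a minimal sketch of the general co-training idea with three views, not the authors' algorithm. It assumes a simplified majority-vote variant: one classifier per view is trained on the labeled documents, and an unlabeled document receives a pseudo-label when most views agree with sufficient average confidence. The nearest-centroid classifier, the function names, the `threshold` and `per_round` parameters, and the toy vectors standing in for TF–IDF, LDA, and Doc2Vec features are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def fit_centroids(X, y):
    """Fit a nearest-centroid model for one view: {label: mean vector}."""
    sums, counts = {}, Counter(y)
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, vec):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {}
    for label, centroid in model.items():
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, centroid)))
        weights[label] = math.exp(-dist)
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def multi_co_train(views, labels, n_rounds=5, per_round=2, threshold=0.6):
    """Pseudo-label unlabeled documents using one classifier per view.

    views:  list of feature matrices, one per representation, same row order.
    labels: list holding a class label, or None for an unlabeled document.
    """
    labels = list(labels)  # work on a copy so the caller's list is untouched
    for _ in range(n_rounds):
        labeled = [i for i, lab in enumerate(labels) if lab is not None]
        unlabeled = [i for i, lab in enumerate(labels) if lab is None]
        if not unlabeled:
            break
        y = [labels[i] for i in labeled]
        models = [fit_centroids([V[i] for i in labeled], y) for V in views]
        candidates = []
        for i in unlabeled:
            preds = [predict(m, V[i]) for m, V in zip(models, views)]
            label, n_votes = Counter(p[0] for p in preds).most_common(1)[0]
            # Views that disagree with the majority contribute zero confidence.
            conf = sum(c for lab, c in preds if lab == label) / len(views)
            if n_votes > len(views) / 2 and conf >= threshold:
                candidates.append((conf, i, label))
        if not candidates:
            break
        # Only the most confident pseudo-labels are added in each round.
        for _conf, i, label in sorted(candidates, reverse=True)[:per_round]:
            labels[i] = label
    return labels

# Toy stand-ins for the three representations (assumed values, not real features).
tfidf_view = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
lda_view = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.3, 0.7]]
doc2vec_view = [[2.0, -1.0], [-1.0, 2.0], [1.5, -0.5], [-0.5, 1.5], [1.0, 0.0], [0.0, 1.0]]
initial = ["sports", "tech", None, None, None, None]

result = multi_co_train([tfidf_view, lda_view, doc2vec_view], initial)
print(result)
```

In this sketch only the two most confident documents are pseudo-labeled per round, so the labeled pool grows gradually and later rounds retrain on it. In practice each view would be produced by a real feature extractor and a stronger classifier than nearest centroid.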

Title:	Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec
Author / Contributor:	Kim, Donghwa ; Cho, Suhyoun ; Kang, Pilsung ; Seo, Deokseong
Link:	https://doi.org/10.1016/j.ins.2018.10.006
Journal:	Information Sciences, vol. 477 (2019-03-01), pp. 15-29
Publication:	Elsevier BV, 2019
Media type:	unknown
ISSN:	0020-0255 (print)
DOI:	10.1016/j.ins.2018.10.006
Keywords:	Information Systems and Management; Computer science; 02 engineering and technology; Semi-supervised learning; computer.software_genre; Latent Dirichlet allocation; Theoretical Computer Science; symbols.namesake; Artificial Intelligence; 0202 electrical engineering, electronic engineering, information engineering; Feature (machine learning); tf–idf; Co-training; Document classification; 05 social sciences; 050301 education; Computer Science Applications; ComputingMethodologies_PATTERNRECOGNITION; Control and Systems Engineering; Benchmark (computing); symbols; Embedding; 020201 artificial intelligence & image processing; Data mining; 0503 education; computer; Software
Other:	Indexed in: OpenAIRE; Rights: CLOSED
