
Metadata for Efficient Management of Digital News Articles in Multilingual News Archives

Khan, Muzammil ; Alharbi, Yasser ; et al.
In: SAGE Open, Vol. 13 (2023-10-01)
Online academicJournal

Preserving and managing digital news in low-resource languages is challenging, especially in vast collections. Well-defined attributes make it possible to identify individual digital objects uniquely and to manage them efficiently in terms of access, retrieval, preservation, usability, and transformability. A metadata element set is required to capture as many of the attributes of the digital objects as possible. Creating a comprehensive metadata set that contains all the necessary attributes and data about digital news objects is more challenging and complicated when the archive contains articles in low-resource and morphologically complex languages such as Urdu and Arabic, which are difficult for machines to understand. The study presents the challenges of low-resource languages (LRL) and the related research challenges. The metadata helps to link news articles, based on similarity, with other news articles stored in the digital news stories archive (DNSA) and ensures accessibility. In this study, we introduce a set of 28 metadata elements for the digital news stories preservation (DNSP) framework, of which 16 are explicit and 12 are implicit. The paper presents how the DNSA is enhanced into a multilingual archive and discusses the digital news stories extractor, which addresses major issues in implementing low-resource languages and facilitates migration to a normalized format. The extraction results are presented in detail for a high-resource language (HRL), that is, English, and for low-resource languages, that is, Urdu and Arabic. The LRL encountered a higher error rate during preservation than the HRL: 10% and 3%, respectively. The metadata extraction results show that HRL sources support all metadata elements, whereas LRL sources have good support for explicit metadata elements but low extraction percentages for many implicit metadata elements. The LRL need a more detailed study for accurate news content extraction and archiving for future access.

Keywords: metadata; low-resource language; high-resource language; challenges in preservation; multilingual archive; information systems

Introduction

The Internet is the leading information resource and hosts a variety of sources covering every aspect of human life, such as weather forecasts, travel deals, events happening locally and worldwide, and so on. This information can be accessed via the World Wide Web and web services ([19]). Web information is growing exponentially and is expected to exceed the capacity of the world's living brains in 2025; it is currently measured in exabytes (10^18 bytes) and zettabytes (10^21 bytes) ([8]; [34]).

Though the WWW is a fast-growing source of information, it is fragile in nature. This fragility causes valuable scholarly, cultural, and scientific information to vanish and become inaccessible to future generations. Therefore, there is a need to preserve the information available in different forms.

Newspapers have been a main source of information for centuries. They cover different aspects of human life and report events happening locally and worldwide, such as acts of parliament, events of political importance for countries, proceedings of courts in important cases, births, deaths, marriages, sports, science, technology, and so on. Newspapers reflect the social life, behaviors, and cultural values of different communities, and hence they are vital scholarly information for individuals and for the community as a whole. This information must remain available to future generations. For example, a prime minister's address to the assembly after winning an election, or the announcement of an imminent foreign invasion of a country, becomes as valuable to future generations as historical manuscripts are today. According to the UNESCO declaration on archives, archives play a vital role in the development of societies by safeguarding the contributions of individuals and communities (UNESCO: Universal Declaration on Archives, adopted at the ICA Annual General Meeting in Malta; [36]). The only way to safeguard this published information is to preserve it and make it available to forthcoming generations. Several initiatives have been taken, and numerous newspaper archives have been created, to preserve this published information. Most curators report that they or their organizations preserve several newspapers and maintain digital newspaper collections. Generally, newspapers are digitized either in-house or by vendors, and some are managed as born-digital content obtained either directly from publishers or by harvesting the web ([35]).

The state-of-the-art review of newspaper archives shows that various approaches have been adopted for newspaper preservation and that most newspapers are digitized as a single digital record. Generally, the curated digitized records are scanned from microfilm (photographs significantly reduced in size, which are convenient for storage and are magnified for reading) to PDF, GIF, JPG, or other graphical formats. Newspaper archives can be divided into old and newer newspaper archives. The old newspaper archives are hard to index into a full-text corpus with Optical Character Recognition (OCR) technology and are primarily available in graphical format. In contrast, the newer newspaper archives are fully indexed and allow full-text searching.

The digital news stories preservation (DNSP) framework was introduced to create a digital archive of news articles linked together based on some criteria for future use ([21]). Recently, the DNSP framework was enriched to create a multilingual, multi-source digital news stories archive that will preserve digital news articles for the long term and for future generations. Two low-resource languages, Urdu and Arabic, were added to the framework. Several challenges specific to low-resource languages make it hard to simply include these sources in the digital news stories archive (DNSA). The study discusses challenges related to volume, variety, and velocity when building the archival information package, technical challenges during the creation of the archive, and challenges related to the dissemination of archived content. The absence of basic resources for low-resource languages, such as efficient tokenizers and dictionaries, calls for heavy preprocessing during the preservation process.

The section "Preservation Challenges in Low-resource Languages" and its subsections differentiate low-resource languages from high-resource languages, outline the challenges in LRL and the role of metadata in information dissemination, and provide a brief overview of the Urdu and Arabic languages. The section "Why We Need Metadata for DNSP Framework" presents details about the digital news stories preservation framework initiative and discusses the importance of preservation, research challenges, the DNSP framework enhancement, the multilingual archive and its structure, and major issues in enhancing the extraction tool. The section "News Extraction Results" discusses the quantification of extraction comprehensively. The section "Proposed Metadata Element Set for DNSA" presents the proposed metadata element set for the DNSP framework, explicit and implicit metadata, extraction results, and discussion. The last section concludes the findings of the study.

Preservation Challenges in Low-Resource Languages

The natural language processing (NLP) tools underwent a significant change in the 1990s, transitioning from rule-based techniques to statistical-based approaches, which marked the beginning of a new era of artificial intelligence. Since then, the primary focus has been on English as an international language, with only about 20 languages out of the 7,000 languages spoken around the world being considered ([13]).

Natural languages are classified into two broad categories, that is, low-resource languages (LRL) and high-resource languages (HRL). Many data resources exist for high-resource languages that help machines to learn and understand natural languages, for example, English. English is by far the best-resourced language compared with the other most spoken languages. Many West-European languages are well resourced, and Chinese, Japanese, and Russian are also considered high-resource languages. In contrast, low-resource languages are languages with very few or no resources available. Low-resource languages can be defined as less studied, resource-scarce, less computerized, less privileged, less commonly taught, or low-density languages ([5]; [29]). Many languages are difficult to preserve because they are mostly oral, and the very few written resources that exist are in physical form, not in electronic format ([10]). There are different types of resources for natural language processing and the development of language-based systems:

  • Collection of Text in various forms, such as research papers, books, email collections, social media contents collections, and so on.
  • Lexical, syntactical, semantic resources, such as a bag of words, dictionaries, semantic databases (e.g., wordnet), organized dependency tree corpora, and so on.
  • Task-specific resources, such as part-of-speech tags, corpora for machine translation, annotated text, named entity recognition resources, and so on.

Many language resources are costly to produce, which is why the economic inequalities between countries are reflected in the availability of language resources and in the lack of research. Hence, protecting these languages from being lost faces many challenges.

  • Alignment or Projection technique (three levels of alignment, document, sentence and word) is a common technique for annotation. It is difficult to adopt the projection technique from HRL to LRL because of the lack of resources and different structures of target and source languages ([29]).
  • Creating a bag of words, dataset, and raw text collection for LRL is difficult, which is necessary for any natural language processing (NLP) task and mapping techniques ([29]).
  • The most important resource for any language is its lexicon. Many NLP tasks depend heavily on textual material, which is lacking in LRL, so producing an efficient lexicon is a challenging task.
  • The morphology of an LRL evolves and its vocabulary is easily extended. Developing a comprehensive framework for morphological pattern recognition is difficult because of multiple roots ([7]).
  • The major applications of NLP, such as question-answer systems, sentiment analysis, image-to-text mapping, machine translation, and named entity recognition systems, are very difficult to implement in low-resource languages.
  • The basic NLP tasks are also difficult in low-resource languages, such as stopwords identification and removal, tokenization, part-of-speech tagging, sentence parsing, lemmatization, stemming, and so on.
  • NLP systems for LRL are comparatively time-consuming and less efficient because of a lack of resources, and developing a machine learning system is even more difficult ([13]).
  • There are many languages that are mostly oral, for which very few written resources exist (physical and digital formats). For some, there are written documents but not even a basic resource like a dictionary.
  • Integrated and customized systems are always a huge challenge for multilingual systems.

Dealing with all the challenges faced by low-resource languages needs extensive research in different dimensions. Urdu and Arabic are two major languages that need a lot of research focus.

Urdu Language

Urdu is a popular South Asian language with about 70 million native speakers and more than 164 million speakers around the world ([2]; [32]). Urdu is the official literary language of Pakistan; it is spoken and understood in many countries, such as India and Bangladesh, and is closely related to Hindi. Urdu periodicals offer a wide range of work on imperative issues of South Asia spread over the 19th and 20th centuries, making their conservation precious for researchers of the idiom ([31]).

Arabic Language

Arabic is the third most spoken language after English and Chinese. Around 292 million people speak Arabic as their first and official language in 27 states worldwide, and many more can understand it as a second language ([38]). Arabic is one of the six official languages of the United Nations, alongside English, French, Spanish, Russian, and Chinese. Arabic is also becoming a popular language to learn in the Western world, and other languages have borrowed words from Arabic because of its historical significance. Its grammar is sometimes tough to learn for native speakers of Indo-European languages, and it is likewise a challenge for machines to correctly interpret and understand the Arabic language ([18]; [37]).

Accessing Via Metadata

Metadata is commonly known as data about data or termed as information about the information ([33]). Metadata helps to organize electronic resources in archives or repositories. From most information fields' perspectives, Meta means an underlying definition or description. Information about structure, history, evolution, authenticity, availability, accessibility, digital signature, copyright, reproduction, and so on, is also metadata ([6]). Considering the scope of data it applies, from archaeological resources, document files, images, and videos to spreadsheets and webpages, or simply the big data, it's not surprising that understanding and managing metadata has become a high priority ([11]).

Metadata is essential in managing digital objects in libraries, archives, or digital collections. Some important roles of metadata are:

  • Resources discovery from huge collection ([12]).
  • Organizing electronic resources in digital libraries and collections ([14]).
  • Enabling interoperability, that is, the ability of different systems to exchange and jointly use information without losing content and functionality ([33]).
  • Certifying the authenticity, reliability, integrity, and provenance of digital objects ([15]).
  • Metadata also stores information about the physical characteristics and documents the behavior so that it can be emulated in future technologies ([33]).
  • During the object development phase, multi-versions of the same object may be created for preservation and dissemination.
  • Re-using data requires careful preservation and documentation of the metadata.

Newspaper Archive Sources

A number of archives are maintained by different organizations (government and non-government) with different scopes; they can be classed as small, medium, or large according to the number of newspapers archived and the time period covered. Many sources list these digital archives in alphabetical order or by category. For example, the de facto national library of the United States (US), "The Library of Congress" [https://www.loc.gov/], provides a list of newspaper archives, indexes, and morgues through its "Newspaper and Current Periodical Reading Room" ([28]). The International Coalition on Newspapers (ICON) is a multi-institutional effort (comprising universities, colleges, and independent research libraries) that promotes the accessibility and preservation of newspaper collections from all over the world. It is supported by [4], which is a Global Resources Network [http://icon.crl.edu/], and provides a list of newspaper digitization projects ([4]). Similarly, "Phillips Library of Mount St. Mary's University" ([30]) and "The Ancestor Hunt" ([1]) are other known sources that maintain comprehensive lists of newspaper archives. A common problem in all these lists is that many links are broken or dead, or the parent links of the archives have changed. Even the Wikipedia list contains many archives without any parent link, and many archives are individual newspaper archives covering a very short period.

Low-resource languages such as Arabic have very limited digital collections, and Urdu has no such digital collection. The British Library maintains both Arabic ([3]) and Urdu ([17]) collections, which contain very few books. Similarly, Harvard Library maintains the Middle East and Islamic Studies library resources to safeguard the culture and heritage of the Islamic world ([16]). Old newspapers are mostly digitized and preserved by converting them into digital images to protect their culture and heritage for future generations. A study that comprehensively discussed the contributions to preservation by different countries and organizations shows that higher education organizations in developed and technologically advanced countries contributed more than those in developing countries ([23]). Low-resource languages are mostly associated with developing countries, where the preservation of cultural assets receives very little attention. Urdu is one such language: it has no newspaper archive, and very little of its content has been preserved by international archives like "The Internet Archive." The access mechanisms of the available archives are not sophisticated, the contents are not easy to manipulate, and they remain inaccessible most of the time.

Why We Need Metadata for DNSP Framework?

The World Wide Web is continuously expanding because an ever-increasing number of sources provide information at almost any time, which makes repositories highly dynamic and in need of continuous periodic updating and preservation. Information on the internet is much more volatile and fragile than information in hard form and can vanish or be altered if it is not handled and archived smartly and efficiently. It is not enough to upload and add this information to a repository; the repository should also provide efficient access and other services. Descriptive, technical, and administrative information is needed to ensure access to the archived digital objects ([19]).

News is also one of the most visited and reliable kinds of information in today's world. People watch online news channels and read newspapers and other articles on the internet. Various applications and gadgets are in use, and different sources continuously contribute news. All these sources offer different forms and types of content, which cannot be handled by the traditional methods and strategies used for information archival. Once archives have been created and content preserved, the information also needs to be disseminated. News articles are lost after some time because of technological changes, hardware and software incompatibilities, or a failure to preserve the technical and content information, that is, the metadata. Older news disappears at the end of its lifespan, which may be a week, a month, or longer. The news domain therefore needs a specific way of archiving that preserves news for the long term and for future generations. Archiving news articles requires specific preservation strategies that cover all technical and administrative aspects. The Digital News Stories Preservation (DNSP) framework was initiated to archive digital news from multiple sources in an organized form and to create the DNSA. Metadata is created and collected because it enables and improves the use of archived news articles ([21]).

Many metadata standards exist; some are generic and widely used as a base for other evolving standards. Generic metadata standards have limitations and do not work effectively in some specific repositories. In contrast, domain-specific metadata standards are designed for a particular domain and do not perform well elsewhere. News repositories face the same problem. News must therefore be preserved with news-specific metadata that enables both efficient preservation of the contents and efficient access. The focus here is on better access to and retrieval of news from the DNSA of the DNSP framework. Even a very well-organized archive is of no use if it lacks an efficient access mechanism and fails to satisfy user queries. For this purpose, the metadata elements should be rich enough to answer the user's questions and to support searching for the information the user requires.

DNSP Framework Enhancement

The primary purpose of the DNSP framework is to create a multilingual, multi-source digital news stories archive that will preserve digital news articles for the long term and for future generations. The framework is enriched with two low-resource languages, that is, Urdu and Arabic. The challenges presented in the previous sections make it hard to simply include sources in these languages. The absence of efficient tokenizers, dictionaries, and other basic resources calls for heavy preprocessing during preservation in the framework. The workflow and main components of the enhanced version of the DNSP framework are presented in Figure 1.

Graph: Figure 1. Enhanced digital news story preservation framework for low-resourced language "Arabic" ([20]).

Multilingual News Archive

This section briefly introduces the Digital News Stories Archive (DNSA). The core initiative of the Digital News Stories Preservation (DNSP) framework was presented at the International Conference on Asian Digital Libraries 2015 (ICADL-2015) ([21]). The following are the significant contributions to the framework:

  • A generic systematic approach was proposed as a web preservation model. The model contained ten steps for different types of projects of web preservation after analyzing 120 news archives worldwide ([23]; [22]).
  • The study created the Digital News Stories Archives (DNSA) to preserve news articles from multiple online news sources ([19]).
  • A news extractor tool, that is, Digital News Stories Extractor (DNSE), is designed for the extraction of news contents and for the creation of DNSA ([26]).
  • Based on different features, a few content-based linking mechanisms are introduced during preservation to ensure the accessibility of the archived contents in the DNSA. These use text-processing techniques such as the Common Ratio Measure for Similarity (CRMS) ([25]), the role of named entities in linking ([20]), and so on.
  • A comprehensive study is performed in the field of recommendation systems to understand the utility of similarity measures and refine the techniques in the DNSP framework. The framework can be enhanced in different directions to improve its utility (a few are discussed in future work) ([9]).
  • The technique "Common Ratio Measure for Similarity (CRMS)" is modified for news headings to reduce extra computation for the terms appearing in the news body for linking English news articles during preservation ([25]).
  • The technique "Common Ratio Measure for Similarity (CRMS)" is updated for linking Urdu-language news articles with English-language news articles, and the DNSA is also converted to a dual lingual archive ([24]).
  • A heading-based technique is introduced for linking English news articles for efficient linkage in the DNSA in the DNSP framework ([20]).
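
To make the idea of heading-based linking concrete, the following is a minimal sketch that scores two headings by term overlap. It uses plain Jaccard similarity, which is a generic stand-in and not the CRMS formula or the heading-based technique cited above. The function names and the 0.3 threshold are hypothetical, and the simple tokenizer is an assumption that a real multilingual archive would need to replace with language-aware preprocessing.

```python
import re


def heading_terms(heading):
    """Lowercase a heading and split it into a set of word tokens."""
    return set(re.findall(r"\w+", heading.lower()))


def heading_similarity(a, b):
    """Jaccard similarity: shared terms divided by all distinct terms."""
    terms_a, terms_b = heading_terms(a), heading_terms(b)
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def link_candidates(new_heading, archived_headings, threshold=0.3):
    """Return archived headings whose similarity meets the threshold.

    The threshold is a hypothetical value chosen only for illustration.
    """
    return [h for h in archived_headings
            if heading_similarity(new_heading, h) >= threshold]
```

Comparing headings only, rather than full news bodies, keeps the number of terms per comparison small, which is the motivation the list above gives for the heading-based variants.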

The digital news stories archive (DNSA) is a news article archive created offline from multiple online sources that preserves news stories in three (3) different languages, that is, English, Urdu, and Arabic. Currently, the DNSA archives digital news from three (3) local news television network websites and seven (7) local online newspapers ([26]) in English, five (5) Urdu news sources, and three (3) Arabic online news sources. The archive is created offline locally and preserves more than 1,000 new news articles in each extraction from the specified news sources after removing duplicate and old URLs.

The high-level system architecture of the DNSP framework is presented in Figure 2. The figure shows the ingestion package, two functional mediators, the archive, and the search and retrieval module. The ingestion module extracts new news URLs from the selected news sources, the mediators extract news contents and metadata and preserve the news articles, and the search module will help to disseminate the archived contents in the future. Together, these steps create the Archival Information Package (AIP) shown in Figure 3.

Graph: Figure 2. High-level system architecture.

Graph: Figure 3. Archival information package (AIP) of DNSA ([19], [20]).

News readers read about a story from different sources to get a diverse and broader perspective and to authenticate the information. It is challenging to navigate a huge collection without linking mechanisms and metadata, which help to retrieve relevant news from a multilingual archive for better understanding. Sophisticated linking mechanisms, well-defined metadata elements, and indexing approaches are required to create and manage such a diverse collection.

Enhancing Digital News Story Extractor (DNSE)

The Digital News Story Extractor (DNSE) is a Java-based tool for extracting digital news stories from different online news websites using the jsoup and POI libraries. The DNSE was initially developed for English news sources ([26]), then enhanced for Urdu news articles, and has now been enhanced with Arabic news sources and some features for quantification. The DNSE extracts news stories from online sources, extracts meta information, that is, metadata, and normalizes both the news content and the related metadata into XML format for preservation in the DNSA. However, the enhancement encountered the following problems, briefly discussed below:

  • Non-Uniform Web Structure: There are many platforms and technologies for developing web-based applications: front-end technologies such as HTML, CSS, Java, and JavaScript and its frameworks, and back-end technologies such as PHP, ASP.NET, XML, and many others. Because different technologies are used, the web structure varies, which makes extracting the desired information a challenging task.
  • Recency or Maintenance of Fresh Content: The contents of dynamic web applications, such as blogs and news websites, are mostly updated instantly and frequently. Keeping the news content recent must be managed efficiently, considering access frequency and network traffic issues.
  • Rise of Anti-scraping Tools: The biggest challenge in the extraction of news content is the rise of anti-scraping tools, for example, CAPTCHA, which differentiates between bots and humans. The extractor gets stuck when anti-scraping tools are implemented.
  • Unknown Host Issue: An unreliable internet connection leads to an unknown host issue, and restarting the extraction of news after the interruption is time-consuming.
  • Socket Timeout: Most websites temporarily block or suspend their services for a specific period when their contents are accessed too frequently during preservation. The websites assume that a bot is sending unnecessary requests and overloading the server, and they start blocking access.
  • Garbage Collection: Inconsistency in development approaches leads to erroneous extraction, in which unwanted data such as in-text links, tags, or other code is collected during news extraction.
  • Identifying and Preprocessing of Low-resource Languages: The DNSE tool deploys different libraries for the identification and preprocessing of low-resource languages, and the preprocessing is computationally expensive.
  • Firewall Blocking: A few online news sources are protected from extraction by a firewall.

Extraction is important for any digital archive and becomes challenging when preserving low-resource contents. The enhanced DNSE is able to deal with the above challenges efficiently.
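
One common way to soften the unknown host and socket timeout problems listed above is to retry a failed request after an increasing delay. The sketch below illustrates that idea in Python using only the standard library. It is an assumption about how such handling could look, not a description of how the Java-based DNSE handles these errors; the function name, retry count, and delays are hypothetical.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, fetch=None, retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch a URL, retrying on timeouts and host-resolution failures.

    `fetch` and `sleep` are injectable so the retry logic can be exercised
    without network access. All defaults are illustrative values.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as response:
                return response.read()
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError:
            # The server answered (for example with 404); retrying will not help.
            raise
        except (socket.timeout, socket.gaierror, urllib.error.URLError):
            if attempt == retries:
                raise
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

Waiting longer between successive attempts also lowers the request rate seen by the news source, which may reduce the chance of being treated as an abusive bot.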

News Extraction Results

The "DNSA" is enriched with two low-resource languages, that is, Urdu and Arabic, with five sources that provide Urdu news articles and three online sources that publish news in Arabic. The news sources for all three languages are summarized in Table 1 below.

Table 1. News Sources in DNSA (Abbr for Abbreviation).

No | News source | Abbr | Language
01 | DAWN News | DN | English
02 | The Tribune | TT | English
03 | The News | TN | English
04 | Geo News | GN | English
05 | Pakistan Observer | PO | English
06 | Pakistan Today | PT | English
07 | ARY News | AN | English
08 | Samaa News | SN | English
09 | Voice of Journalist | VJ | English
10 | Time of Pakistan | TP | English
11 | Express | Ex | Urdu
12 | Daily Pakistan | DP | Urdu
13 | Samaa Urdu | SU | Urdu
14 | Geo Urdu | GU | Urdu
15 | Dawn News | DU | Urdu
16 | Al-Jazirah Online | AO | Arabic
17 | Al-Riaz | AR | Arabic
18 | Okaz | OK | Arabic

The DNSP framework is being enhanced gradually; because of the lack of resources and of sufficient financial support, research progress is slow. Initially, three local English newspapers, that is, Dawn News, The Tribune, and The News, were selected for testing the DNSE tool ([26]).

The new extraction/crawling results obtained after the DNSE was enriched with two low-resource languages, that is, Urdu and Arabic, are closely analyzed for shortcomings of the DNSP framework and the DNSE tool.

In Figure 4, the extraction results are visualized for all ten sources of the high-resource language, that is, English. The results show that a few of the news sources do not update their online news frequently and can be replaced by other sources for efficient utilization of the DNSP framework.

Graph: Figure 4. Average new news story extraction for high-resource language "English."

Assessing the frequency of extraction of new stories is important as the news stream is continuous and not periodic like printed media.

The extraction process was performed daily or after waiting for some days. The average numbers of extracted URLs and unique URLs are presented in Figure 5. The figure shows that the number of new news URLs extracted is almost equal to the number of new news stories, for both online newspapers and online news channels.

Graph: Figure 5. Average total URLs extraction and unique URLs extraction for HRL.
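
The difference between total and unique URLs can be illustrated with a short sketch that drops URLs already present in the archive and repeated URLs within one extraction run. The normalization rules and function names are assumptions made for illustration; the source does not describe how the DNSE decides that two URLs are duplicates.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def new_unique_urls(extracted, archived):
    """Keep the first occurrence of each URL that is not already archived."""
    seen = {normalize_url(u) for u in archived}
    fresh = []
    for url in extracted:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh


extracted = [
    "https://Example.com/news/1/",
    "https://example.com/news/1",       # same story, different spelling
    "https://example.com/news/2#top",
    "https://example.com/news/3",       # already archived
]
archived = ["https://example.com/news/3"]
print(len(extracted), len(new_unique_urls(extracted, archived)))
```

Here four extracted URLs reduce to two unique new ones, which is the kind of gap between total and unique counts that the figure reports.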

The processing of low-resource languages is expensive in terms of time complexity and accuracy. The main problems with the implementation of the DNSE for LRLs are non-uniform web structure, unknown host issues, and garbage collection. Figures 6 and 7 present the average extraction of new news articles and of unique URLs, respectively.

Graph: Figure 6. Average new news story extraction for low-resource languages "Urdu" and "Arabic."

Graph: Figure 7. Average TOTAL URLs extraction and unique URLs extraction for LRL from online news sources.

Table 2 and Figure 8 present the error rate of URL and story extraction during preservation for both the high-resource language and the low-resource languages. The LRLs have a larger error rate because of non-uniform web structure, unknown host issues, maintenance of fresh content, anti-scraping tools, and garbage collection.

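Table 2. Error Rate in Both HRL (English) and LRLs (Urdu and Arabic) During Extraction.

Day | HRL sources | HRL error rate | HRL percentage | LRLs sources | LRLs error rate | LRLs percentage
01 | 1,572 | 81 | 05 | 948 | 122 | 13
02 | 712 | 26 | 04 | 469 | 55 | 12
03 | 781 | 31 | 04 | 472 | 49 | 10
04 | 746 | 25 | 03 | 480 | 64 | 13
05 | 716 | 19 | 03 | 457 | 46 | 10
06 | 745 | 21 | 03 | 493 | 42 | 09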

Graph: Figure 8. Error rate comparison in both HRL (English) and LRLs (Urdu and Arabic) during extraction.
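
The percentages in Table 2 can be reproduced with simple arithmetic. The sketch below assumes that the "sources" columns hold the number of items extracted per day and that the "error rate" columns hold the number of failed items; that reading is an assumption, although it is consistent with the percentages printed in the table.

```python
# (day, HRL items, HRL errors, LRL items, LRL errors), transcribed from Table 2.
TABLE_2 = [
    (1, 1572, 81, 948, 122),
    (2, 712, 26, 469, 55),
    (3, 781, 31, 472, 49),
    (4, 746, 25, 480, 64),
    (5, 716, 19, 457, 46),
    (6, 745, 21, 493, 42),
]


def error_percentage(errors, total):
    """Share of failed extractions, as a percentage of all extractions."""
    return 100.0 * errors / total


for day, hrl_total, hrl_err, lrl_total, lrl_err in TABLE_2:
    hrl_pct = error_percentage(hrl_err, hrl_total)
    lrl_pct = error_percentage(lrl_err, lrl_total)
    print(f"Day {day:02d}: HRL {hrl_pct:.1f}%  LRL {lrl_pct:.1f}%")
```

Rounded to whole numbers, each daily value matches the corresponding percentage column of the table.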

Proposed Metadata Element Set for DNSA

Metadata is as essential as the content itself because digital content is useful only when it is accessible. Metadata is structured information that helps to locate a digital object in the digital archive. Some metadata may not be available explicitly with news stories but may be extracted from the text of news articles. The metadata extractor module is extended with a sub-module that collects metadata from the text of news stories.

This metadata helps link multi-lingual news articles based on similarity with other news articles stored in the archive. In DNSP, 28 explicit and implicit metadata elements are extracted from the source and from the news article (if any), which are used as descriptive metadata and administrative metadata, as shown in Tables 3 and 4, respectively.

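Table 3. Explicit Metadata Element Set for DNSA.

| No | DNSP elements | Description | Repeated | Optional | Nature |
|---|---|---|---|---|---|
| 1 | Title | Heading of news article | No | No | Explicit |
| 2 | Subject/topic | Context of the news | No | Yes | Explicit |
| 3 | Creator/author | Author or reporter of the news | No | Yes | Explicit |
| 4 | Description | Brief news description | No | Yes | Explicit |
| 5 | Publisher | Publisher of news (in case of multiple sources) | | | Explicit |
| 6 | Date | Date of news publication | | | Explicit |
| 7 | Identifier | Reference to the news (number) | | | Explicit |
| 8 | Language | Language of the news | No | No | Explicit |
| 9 | Coverage/scope | Spatial or temporal topic of the resource (National or International) | No | No | Explicit |
| 10 | Rights | Information about rights | No | Yes | Explicit |
| 11 | News/story | Actual or detail news | Yes | No | Explicit |
| 12 | Category | Specify category of the news | No | No | Explicit |
| 13 | URL | URL of the news | No | No | Explicit |
| 14 | Newspaper Name | Newspaper or news source name | No | No | Explicit |
| 15 | Day | News publication day | No | No | Explicit |
| 16 | Time | News publication time | No | No | Explicit |

Note: for Publisher and Identifier, the source gives a single "Yes" whose column (Repeated or Optional) is not recoverable; for Date, neither value is given.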

Table 4. Implicit Metadata Element Set for DNSA.

| No | DNSP elements | Description | Repeated | Optional | Nature |
|---|---|---|---|---|---|
| 17 | Country | Country of news publication or orientation | No | No | Explicit/implicit |
| 18 | XML-MD | Related XML file containing metadata | No | No | Implicit |
| 19 | Associated News (Same language) | Links of associated/related news of same language | Yes | No | Implicit |
| 20 | Associated News (Other languages) | Links of associated/related news of different language | Yes | Yes | Implicit |
| 21 | Associated Images | Links of associated images | Yes | Yes | Implicit |
| 22 | Associated audio/video | Links of associated audio/video | Yes | Yes | Implicit |
| 23 | Term Frequency (TF) file | Term Frequency of the news story | No | No | Implicit |
| 24 | Topic | Topic of the news story | Yes | Yes | Implicit |
| 25 | Topic Priority | Topic may assign priority | No | Yes | Implicit |
| 26 | Named Entities | Named Entities (NE) related to the news | Yes | Yes | Implicit |
| 27 | NE Priority | NE may assign priority | Yes | Yes | Implicit |
| 28 | CRMS Value | CRMS measure value | No | No | Implicit |
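
To illustrate how the element set might be applied to a single news story, the sketch below checks a record for the explicit elements marked as non-optional in Table 3 and serializes it to an XML metadata file (the XML-MD element of Table 4). The XML layout (`news_article` and `element` tags) and the function names are hypothetical; the paper does not specify the XML-MD format.

```python
import xml.etree.ElementTree as ET

# Explicit elements marked "Optional: No" in Table 3. Publisher, Date, and
# Identifier are left out because their flags are incomplete in the table.
REQUIRED_EXPLICIT = [
    "Title", "Language", "Coverage/scope", "News/story", "Category",
    "URL", "Newspaper Name", "Day", "Time",
]


def missing_required(record: dict) -> list:
    """Return the required element names that are absent or empty."""
    return [name for name in REQUIRED_EXPLICIT if not record.get(name)]


def to_xml_md(record: dict) -> str:
    """Serialize a metadata record to an XML string (hypothetical layout)."""
    missing = missing_required(record)
    if missing:
        raise ValueError(f"missing required elements: {missing}")
    root = ET.Element("news_article")
    for name, value in record.items():
        # Repeated elements are passed as lists and written once per value.
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, "element", name=name)
            child.text = str(item)
    return ET.tostring(root, encoding="unicode")
```

Element names are stored in a `name` attribute rather than as tag names because names such as "Coverage/scope" are not valid XML tags.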

Metadata Extraction Results

Tables 5 and 6 present the metadata extraction results obtained during news story preservation in the DNSA for both low-resource and high-resource languages. Well-organized news websites normally provide all the explicit metadata, with only a few descriptive elements left blank. The implicit metadata is extracted from the news stories themselves, so almost all the meta elements are extracted, as shown in the respective tables.

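Table 5. DNSP Explicit Metadata Element Set Extraction.

| No | DNSP elements | No of HRL news | Percentage (%) | No of LRL news | Percentage (%) |
|---|---|---|---|---|---|
| 1 | Title | 12,831 | 100 | 3,318 | 100 |
| 2 | Subject/topic | 11,690 | 91 | 2,696 | 81 |
| 3 | Creator/author | 7,242 | 56 | 1,541 | 46 |
| 4 | Description | 4,645 | 36 | 1,023 | 30 |
| 5 | Publisher | 5,589 | 44 | 3,318 | 100 |
| 6 | Date | 11,581 | 90 | 3,318 | 100 |
| 7 | Identifier | 3,211 | 25 | 685 | 20 |
| 8 | Language | 12,831 | 100 | 3,318 | 100 |
| 9 | Coverage/scope | 12,831 | 100 | 2,320 | 69 |
| 10 | Rights | 2,451 | 19 | 627 | 19 |
| 11 | News/story | 12,831 | 100 | 3,318 | 100 |
| 12 | Category | 12,831 | 100 | 3,318 | 100 |
| 13 | URL | 12,831 | 100 | 3,318 | 100 |
| 14 | Newspaper name | 12,831 | 100 | 3,318 | 100 |
| 15 | Day | 12,831 | 100 | 3,318 | 100 |
| 16 | Time | 12,831 | 100 | 3,318 | 100 |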

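Table 6. DNSP Implicit Metadata Element Set Extraction.

| No | DNSP elements | No of HRL news | Percentage (%) | No of LRL news | Percentage (%) |
|---|---|---|---|---|---|
| 1 | Country | 12,831 | 100 | 3,318 | 100 |
| 2 | XML-MD | 12,831 | 100 | 3,311 | 100 |
| 3 | Associated news (same language) | 12,831 | 100 | 1,934 | 58 |
| 4 | Associated news (other languages) | 12,831 | 100 | 1,801 | 54 |
| 5 | Associated images | 5,246 | 41 | 741 | 22 |
| 6 | Associated audio/video | 842 | 7 | 74 | 2 |
| 7 | TF | 12,831 | 100 | 2,185 | 65 |
| 8 | Topic | 12,831 | 100 | 2,044 | 61 |
| 9 | Topic priority | 6,449 | 50 | | |
| 10 | Named entities (NE) | 12,831 | 100 | | |
| 11 | NE priority | 9,811 | 76 | | |
| 12 | CRMS value | 12,831 | 100 | 2,199 | 66 |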

Extracting explicit metadata is straightforward in terms of accuracy and computation for both high-resource and low-resource languages. In contrast, implicit meta-element extraction in low-resource languages is computationally expensive and less accurate than in high-resource languages. The twelve implicit meta elements in the proposed metadata element set are not straightforward to extract for LRLs because of the morphologically complex features of Urdu and Arabic. The DNSE tool is enhanced to extract all meta elements for HRL and LRL, except "Topic Priority," "Named Entities," and "Named Entities Priority," which are not yet extracted for LRLs.
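
The difference in support between HRL and LRL can be expressed as a gap in percentage points per element. The sketch below uses a few rows from Tables 5 and 6 and assumes that the totals are the "Title" counts (12,831 HRL and 3,318 LRL stories), since the title is extracted for every story. The rounding convention used in the published tables is not stated, so the computed values may differ from the printed percentages by one point.

```python
# Totals assumed from the "Title" row, which is extracted for every story.
HRL_TOTAL = 12_831
LRL_TOTAL = 3_318

# (element, HRL count, LRL count) for a few rows of Tables 5 and 6.
ROWS = [
    ("Subject/topic", 11_690, 2_696),
    ("Creator/author", 7_242, 1_541),
    ("Associated images", 5_246, 741),
    ("TF", 12_831, 2_185),
]


def coverage(count: int, total: int) -> float:
    """Share of stories, in percent, for which an element was extracted."""
    return 100.0 * count / total


def coverage_gaps(rows):
    """Map each element to its HRL-minus-LRL coverage gap in percentage points."""
    return {
        name: coverage(hrl, HRL_TOTAL) - coverage(lrl, LRL_TOTAL)
        for name, hrl, lrl in rows
    }


gaps = coverage_gaps(ROWS)
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap:.1f} percentage points")
```

For the rows chosen here, the largest gap is on the implicit TF element, in line with the observation that implicit elements are harder to extract for LRLs.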

Discussion

The existing news archives can be classified into two types, that is, graphical formats and partially indexed archives. It is difficult to manipulate the contents of these archives, especially to access particular news about an event, because doing so involves many challenges, such as:

  • Vast Archive Collections: an archive created from many sources,
  • Various Sources: having different platforms,
  • Multi-lingual Archive: an archive created from multiple languages, that is, Urdu, Arabic, and English,
  • Low-resource Language: access becomes more complicated for news articles in low-resource languages, such as Urdu. Because sophisticated tools are required, the preprocessing overhead is large compared to high-resource languages, such as English.

Besides these, there are many difficulties in digital news preservation, such as:

  • Extraction of news from many diverse sources and technologically different platforms,
  • Extraction of implicit and explicit metadata,
  • Similarity value computation among news articles,
  • Transformation of news articles to a specific standard format for future integration and access, and so on.
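
As an illustration of similarity value computation among news articles, the sketch below computes cosine similarity over term-frequency vectors. This is a generic stand-in and not the CRMS measure used in the DNSP framework, whose definition is not given in this section. The whitespace tokenizer is a simplifying assumption; Urdu and Arabic text would need a language-aware tokenizer.

```python
import math
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count terms using naive whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a: Counter, tf_b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors, in [0, 1]."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty article is treated as similar to nothing
    return dot / (norm_a * norm_b)


a = term_frequencies("flood warning issued for the northern districts")
b = term_frequencies("northern districts receive flood warning")
print(f"{cosine_similarity(a, b):.2f}")
```

Two articles with no terms in common score 0, and two articles with identical term distributions score 1.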

Developing an efficient extractor for multi-lingual content is also challenging, although it may seem easy. The development of the Digital News Stories Extractor faces many challenges:

  • Non-uniform web structure: web platforms are developed with different technologies, such as HTML, CSS, Java, JavaScript, PHP, XML, and many others, together with their frameworks.
  • The web platforms use different data structures and formats to provide news content. Preservation therefore needs more versatility and AI features to crawl content from these different resources.
  • Maintaining fresh content: news web sources refresh their content continuously, so the archive needs to be updated in real time while balancing user access and avoiding excess network traffic.
  • Broken links keep the extractor busy or lead to the extraction of irrelevant content.
  • The rise of anti-scraping tools hinders the extraction of news content: when the extractor tries to access a website, the access may be treated as a DoS (Denial of Service) or DDoS (Distributed Denial of Service) attack, and the extractor gets stuck.
  • The extraction needs a stable internet connection to avoid unknown host issues and to keep track of logs.
  • Continuous access to a source generates socket timeouts, which block services temporarily; they are avoided by a URL shuffling policy and a periodic wait policy.
  • Garbage collection is a serious issue for targeted content extraction, especially in low-resource languages, because of diverse platforms, different technologies, inconsistent formats, new symbols, and so on, which need a sophisticated extractor.
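
The URL shuffling and wait policies mentioned in the list above can be sketched as follows. This is one possible reading of those policies and not the DNSE implementation: the URLs are visited in random order so that a single source is less likely to be hit in long consecutive runs, and a fixed pause is inserted between requests. The function name, the parameters, and the injected `fetch` callable are hypothetical.

```python
import random
import time


def crawl_with_policies(urls, fetch, wait_seconds=1.0, sleep=time.sleep, rng=None):
    """Fetch each URL once, in shuffled order, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content.
    Timeouts and unknown-host errors are OSError subclasses in Python, so
    they are caught, logged in `failed`, and do not stop the crawl.
    """
    rng = rng or random.Random()
    order = list(urls)
    rng.shuffle(order)  # shuffling policy (assumed: plain random order)

    pages, failed = {}, []
    for index, url in enumerate(order):
        if index > 0:
            sleep(wait_seconds)  # wait policy between consecutive requests
        try:
            pages[url] = fetch(url)
        except OSError as exc:
            failed.append((url, repr(exc)))
    return pages, failed
```

Passing `fetch` and `sleep` as parameters keeps the sketch independent of any particular HTTP client and makes the policy easy to exercise without network access.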

The DNSA is an effort to create a full-text archive that benefits from new technologies and approaches. The DNSP framework is developed to overcome the different challenges facing multi-lingual digital archives, including low-resource languages; ongoing research directions are briefly outlined in the future work of this paper.

Conclusions and Future Work

The preservation of news and the creation of news archives are challenging. They become even more complicated when the archive contains articles from low-resource and morphologically complex languages like Urdu and Arabic. The study introduced a multi-lingual news archive for Urdu, Arabic, and English news articles collected from eighteen news publishing platforms. The digital news stories extractor is enhanced to address major issues in implementing low-resource languages and to facilitate normalized format migration. The extraction results of the proposed 28 meta elements, comprising sixteen explicit and twelve implicit elements, are presented in detail for the high-resource language, that is, English, and the low-resource languages, that is, Urdu and Arabic. The results showed that seventeen and ten meta elements are extracted at one hundred percent for HRL and LRL, respectively. The LRLs encountered a higher error rate during preservation than the HRL, 10% and 3%, respectively. The framework preserved, on average, 879 news stories from ten HRL sources and 553 news stories from eight LRL sources in the digital news stories archive.

The study presents details of how the framework is enhanced; a more detailed study is needed for accurate news content extraction and archiving for future access. The framework can be extended in different dimensions in the future, such as:

  • A standard user interface is required to enable access to the archived contents of the DNSA.
  • The DNSE tool needs to be developed to professional standards.
  • The meta attributes can be developed for multi-lingual archives and other languages, such as Urdu, Arabic, Pashto, and so on.
  • More implicit meta elements can be added to the proposed set after comprehensively reviewing individual sources.
  • Language-dependent metadata attributes can be added to the meta-set.

I would like to express my sincere gratitude to the University of Hail for their generous support and funding throughout the course of my research at the University of Swat under Project Title "Enhancing the Digital News Stories Preservation Framework for ARABIC Language" and Project ID: RG-21 090. This project would not have been possible without their financial assistance, which allowed us to conduct the necessary experiments, gather data, and analyze results.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research has been funded by the Scientific Research Deanship at the University of Ha'il, Saudi Arabia, through project number RG-21 090.

ORCID iD: Muzammil Khan https://orcid.org/0000-0003-4656-1041

By Muzammil Khan; Yasser Alharbi; Ali Alferaidi; Talal Saad Alharbi and Kusum Yadav


Title: Metadata for Efficient Management of Digital News Articles in Multilingual News Archives
Authors / contributors: Khan, Muzammil ; Alharbi, Yasser ; Alferaidi, Ali ; Talal Saad Alharbi ; Yadav, Kusum
Journal: SAGE Open, Vol. 13 (2023-10-01)
Publication: SAGE Publishing, 2023
Media type: academicJournal
ISSN: 2158-2440 (print)
DOI: 10.1177/21582440231201368
Keywords:
  • History of scholarship and learning. The humanities
  • AZ20-999
  • Social Sciences
Other:
  • Indexed in: Directory of Open Access Journals
  • Languages: English
  • Collection: LCC:History of scholarship and learning. The humanities ; LCC:Social Sciences
  • Document Type: article
  • File Description: electronic resource
