The preservation and management of digital news in low-resource languages are challenging tasks, especially in vast collections. Individual digital objects can be uniquely identified through well-defined attributes, which in turn support efficient management: access, retrieval, preservation, usability, and transformability. A metadata element set is required to capture as many attributes of the digital objects as possible, and creating a comprehensive set that contains all the necessary attributes and data about digital news objects is itself a challenge. The task becomes more complicated when the archive contains articles in low-resource, morphologically complex languages like Urdu and Arabic, which are difficult for machines to process. The study presents the challenges of low-resource languages (LRL) and the associated research challenges. The proposed metadata helps to link news articles, based on similarity, with other news articles stored in the digital news stories archive (DNSA) and ensures accessibility. In this study, we introduce a set of 28 metadata elements for the digital news stories preservation (DNSP) framework, of which 16 are explicit and 12 are implicit. The paper presents how the DNSA is enhanced into a multilingual archive and discusses the digital news stories extractor, which addresses major issues in handling low-resource languages and facilitates migration to a normalized format. The extraction results are presented in detail for a high-resource language (HRL), that is, English, and for low-resource languages, that is, Urdu and Arabic. The LRL sources encountered a higher error rate during preservation than the HRL sources, 10% and 3%, respectively. The metadata extraction results show that HRL sources support all metadata elements, whereas LRL sources support the explicit meta elements well but yield low extraction percentages for many implicit meta elements. LRL need a more detailed study for accurate news content extraction and archiving for future access.
Keywords: metadata; low-resource language; high-resource language; challenges in preservation; multilingual archive; information systems
The Internet is the leading source of information and hosts a variety of sources covering every aspect of human life, such as weather forecasts, travel deals, events happening locally and worldwide, and so on. This information can be accessed via the World Wide Web and web services ([
Although the WWW is a fast-growing source of information, it is fragile in nature. This fragility causes valuable scholarly, cultural, and scientific information to vanish and become inaccessible to future generations. Therefore, there is a need to preserve the information available in different forms.
The newspaper has been a main source of information for centuries. Newspapers cover different aspects of human life and report on events happening locally and worldwide, such as acts of parliament, events of political importance for countries, court proceedings in important cases, births, deaths, marriages, sports, science, technology, and so on. Newspapers reflect the social life, behaviors, and cultural values of different communities, and hence they are vital scholarly information for individuals and for the community as a whole. This information must be preserved so that it remains available to future generations. For example, the prime minister's address to the assembly after winning an election, or the measures announced in the face of an imminent foreign invasion of a country, become as valuable to future generations as historical manuscripts are today. According to the UNESCO declaration on archives, archives play a vital role in the development of societies by safeguarding the contributions of individuals and communities (UNESCO: Universal declaration on archives. In Adopted at the ICA Annual General Meeting in Malta ([
A review of state-of-the-art newspaper archives shows that various approaches have been adopted for newspaper preservation and that most newspapers are digitized as a single digital record. Generally, the curated digitized records are scanned from microfilm (photographs that are greatly reduced in size for storage and magnified for reading) into PDF, GIF, JPG, or other graphical formats. Newspaper archives can be divided into old newspaper archives and newer newspaper archives. The old newspaper archives are hard to convert into a full-text corpus using Optical Character Recognition (OCR) technology and are primarily available in graphical format. In contrast, the newer newspaper archives are fully indexed and allow full-text searching.
The digital news stories preservation (DNSP) framework is introduced to create a digital archive of news articles linked together based on some criteria for future use ([
The section "Preservation Challenges in Low-resource Languages" and its subsections differentiate low-resource languages from high-resource languages, outline the challenges in LRL and the role of metadata in information dissemination, and provide a brief overview of the Urdu and Arabic languages. The section "Why We Need Metadata for DNSP Framework" presents details about the digital news stories preservation framework initiative and discusses the importance of preservation, research challenges, the DNSP framework enhancement, the multilingual archive and its structure, and the major issues in enhancing the extraction tool. In the section "News Extraction Results," extraction quantification is comprehensively discussed. The section "Proposed Metadata Element Set for DNSP" presents the proposed metadata element set for the DNSP framework, explicit and implicit metadata, extraction results, and discussion. The last section concludes the findings of the study.
The natural language processing (NLP) tools underwent a significant change in the 1990s, transitioning from rule-based techniques to statistical-based approaches, which marked the beginning of a new era of artificial intelligence. Since then, the primary focus has been on English as an international language, with only about 20 languages out of the 7,000 languages spoken around the world being considered ([
Natural languages are classified into two broad categories, that is, low-resource languages (LRL) and high-resource languages (HRL). Many data resources exist for high-resource languages, for example, English, that help machines to learn and understand them. English is by far the best-resourced language compared to the other most widely spoken languages. Many West-European languages are well resourced, and languages such as Chinese, Japanese, and Russian are also considered high-resource languages. In contrast, low-resource languages are languages with very few or no resources available. Low-resource languages can be defined as less studied, resource-scarce, less computerized, less privileged, less commonly taught, or low-density languages ([
- Collection of Text in various forms, such as research papers, books, email collections, social media contents collections, and so on.
- Lexical, syntactical, semantic resources, such as a bag of words, dictionaries, semantic databases (e.g., wordnet), organized dependency tree corpora, and so on.
- Task-specific resources, such as part-of-speech tags, corpora for machine translation, annotated text, named entity recognition resources, and so on.
Many language resources are costly to produce, which is why the economic inequalities between countries are reflected in the available language resources and in the lack of research. Hence, protecting these languages from being lost faces many challenges.
- Alignment or projection (at three levels: document, sentence, and word) is a common annotation technique. It is difficult to project from HRL to LRL because of the lack of resources and the different structures of the source and target languages ([29]).
- Creating a bag of words, a dataset, or a raw text collection, which is necessary for any natural language processing (NLP) task and for mapping techniques, is difficult for LRL ([29]).
- The most important resource for any language is its lexicon. Many NLP tasks depend heavily on textual material, which is lacking in LRL, so producing an efficient lexicon is challenging.
- The morphology of LRL is still evolving, and their vocabulary is easily extended. Developing a comprehensive framework for morphological pattern recognition is difficult because of multiple roots ([7]).
- The major applications of NLP, such as question-answering systems, sentiment analysis, image-to-text mapping, machine translation, and named entity recognition, are very difficult to implement for low-resource languages.
- Even basic NLP tasks, such as stopword identification and removal, tokenization, part-of-speech tagging, sentence parsing, lemmatization, and stemming, are difficult in low-resource languages.
- NLP systems for LRL are comparatively time-consuming and less efficient because of the lack of resources, and developing a machine learning system is even more difficult ([13]).
- Many languages are mostly oral, with very few written resources (in physical or digital form). For some, written documents exist but not even a basic resource such as a dictionary.
- Integrated and customized systems are always a huge challenge for multilingual systems.
Dealing with all the challenges faced by low-resource languages needs extensive research in different dimensions. Urdu and Arabic are two widely spoken languages that need a lot of research focus.
Urdu is a popular South Asian language, with about 70 million native speakers and more than 164 million speakers around the world ([
Arabic is the third (3rd) most spoken language after English and Chinese. Around 292 million people speak Arabic as their first and official language in 27 states worldwide, and many more can understand it as a second language ([
Metadata is commonly known as data about data or termed as information about the information ([
Metadata is essential in managing digital objects in libraries, archives, or digital collections. Some important roles of metadata are:
- Resource discovery in huge collections ([12]).
- Organizing electronic resources in digital libraries and collections ([14]).
- Enabling interoperability, that is, the ability of different systems to exchange and use information together without losing content or functionality ([33]).
- Certifying the authenticity, reliability, integrity, and provenance of digital objects ([15]).
- Storing information about physical characteristics and documenting behavior so that an object can be emulated in future technologies ([33]).
- Managing the multiple versions of the same object that may be created for preservation and dissemination during the object development phase.
- Supporting the re-use of data, which requires careful preservation and documentation of the metadata.
A number of archives are maintained by different organizations (government and non-government) with different scopes, ranging from small to large depending on the number of newspapers archived and the time span covered. Many sources list these digital archives in alphabetical order or by category. For example, the de facto national library of the United States (US), "The Library of Congress" [https://www.loc.gov/], provides a list of newspaper archives, indexes, and morgues in its "Newspaper and Current Periodical Reading Room" ([
Low-resource languages, such as Arabic, have very limited digital collections, and Urdu has no such digital collection. The British Library maintains both Arabic ([
The World Wide Web is continuously expanding because an ever-increasing number of sources provide information at almost any time, which makes repositories highly dynamic and in need of continuous periodic updating and preservation. Information on the internet is much more volatile and fragile than information in hard form and can vanish or be altered if it is not handled and archived smartly and efficiently. This information does not only need to be uploaded and added to a repository; the repository should also provide efficient access and other services. Descriptive, technical, and administrative information is needed to ensure access to the archived digital objects ([
News is also one of the most visited and relied-upon kinds of information in today's world. People watch online news channels and read newspapers and other articles on the internet. Various applications and gadgets are in use, and different sources continuously contribute news. These sources offer different forms and types of content, which cannot be handled by the traditional methods and strategies used for information archival. Dissemination of information is needed after preservation and the creation of archives. News articles are lost after some time because of technological changes, hardware and software incompatibilities, or a lack of preservation of the technical and content information, that is, metadata. Older news disappears at the end of its lifespan, which may be a week, a month, or longer. The news domain therefore needs a specific way of archiving that ensures the preservation of news for the long term and for future generations, with strategies that cover all technical and administrative aspects. The Digital News Stories Preservation (DNSP) framework was initiated to archive digital news from multiple sources in an organized form and to create the DNSA. Metadata is created and collected because it enables and improves the use of archived news articles ([
Many metadata standards exist; some are generic and widely used as a base for other evolving standards. Generic standards have limitations and do not work effectively in some specific repositories. In contrast, domain-specific metadata standards are designed for one particular domain and do not perform well elsewhere. News repositories face the same problem: they must be preserved with news-specific metadata that enables both efficient preservation of the contents and efficient access. The focus here is on better access to, and retrieval of, news from the DNSA created by the DNSP framework. An archive that is very well organized but has no efficient access mechanism is of no use if it fails to satisfy user queries. For this purpose, the metadata elements should be rich enough to answer the user's questions and to support searching for the information the user requires.
The primary purpose of the DNSP framework is to create a multilingual, multi-source digital news stories archive that preserves digital news articles for the long term and for future generations. The framework is enriched with two low-resource languages, that is, Urdu and Arabic. The challenges presented in previous sections make it hard to include sources in these languages directly. The absence of efficient tokenizers, dictionaries, and other basic resources calls for heavy preprocessing during preservation in the framework. The workflow and main components of the enhanced version of the DNSP framework are presented in Figure 1.
Graph: Figure 1. Enhanced digital news story preservation framework for low-resourced language "Arabic" ([
This section briefly introduces the Digital News Stories Archive (DNSA). The core initiative of the Digital News Stories Preservation (DNSP) framework was presented at the International Conference on Asian Digital Libraries 2015 (ICADL-2015) ([
- A generic systematic approach was proposed as a web preservation model. The model contains ten steps for different types of web preservation projects and was derived after analyzing 120 news archives worldwide ([23]; [22]).
- The Digital News Stories Archive (DNSA) was created to preserve news articles from multiple online news sources ([19]).
- A news extractor tool, the Digital News Stories Extractor (DNSE), was designed to extract news content and create the DNSA ([26]).
- A few content-based linking mechanisms, based on different features, were introduced during preservation to ensure the accessibility of the archived contents in the DNSA. These include text-processing techniques such as the Common Ratio Measure for Similarity (CRMS) ([25]) and the use of named entities in linking ([20]).
- A comprehensive study of recommendation systems was performed to understand the utility of similarity measures and refine the techniques in the DNSP framework ([9]). The framework can be enhanced in different directions to improve its utility (a few are discussed in future work) ([9]).
- CRMS was modified for news headings to reduce the extra computation for terms appearing in the news body when linking English news articles during preservation ([25]).
- CRMS was updated for linking Urdu-language news articles with English-language news articles, and the DNSA was converted to a bilingual archive ([24]).
- A heading-based technique was introduced for efficiently linking English news articles in the DNSA ([20]).
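The heading-based linking idea in the list above can be illustrated with a small sketch. CRMS itself is not defined in this section, so the sketch below deliberately substitutes the well-known Jaccard coefficient as a stand-in similarity measure; the function names, the threshold of 0.3, and the sample headings are all assumptions made for illustration and are not part of the DNSP framework.

```python
import re


def heading_terms(heading):
    """Lowercase a heading and split it into a set of word tokens."""
    return set(re.findall(r"\w+", heading.lower()))


def jaccard(a, b):
    """Jaccard coefficient of two term sets (a stand-in, NOT the CRMS measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_candidates(new_heading, archived, threshold=0.3):
    """Return (article_id, score) pairs at or above the threshold, best first.

    `archived` maps a hypothetical article id to its heading.
    """
    new_terms = heading_terms(new_heading)
    scored = [
        (article_id, jaccard(new_terms, heading_terms(heading)))
        for article_id, heading in archived.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    archive = {
        "a1": "Prime minister addresses the assembly",
        "a2": "Cricket team wins series",
    }
    print(link_candidates("Prime minister addresses parliament", archive))
```

Comparing headings only keeps the term sets small, which is in the spirit of the heading-based variants described above; whether a set-overlap measure behaves like CRMS on real news headings would need to be checked against the cited work.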
The digital news stories archive (DNSA) is a news article archive created offline from multiple online sources that preserved news stories in Three (
The high-level system architecture of the DNSP framework is presented in Figure 2. The figure shows the ingestion package, two functional mediators, the archive, and the search and retrieval module. The ingestion module extracts new news URLs from the selected news sources. The mediators extract news contents and metadata and preserve the news articles, creating the Archival Information Package (AIP) shown in Figure 3. The search module will help to disseminate the archived contents in the future.
Graph: Figure 2. High-level system architecture.
Graph: Figure 3. Archival information package (AIP) of DNSA ([
News readers read about a story from different sources to get a diverse and broader perspective and to authenticate the information. It is challenging to navigate through a huge collection without linking mechanisms and metadata, which help to retrieve relevant news from a multi-lingual archive for better understanding. Sophisticated linking mechanisms, well-defined meta-elements, and indexing approaches are required to create and manage such a diverse collection.
The Digital News Story Extractor (DNSE) is a Java-based tool for extracting digital news stories from different online news websites using the jsoup and Apache POI libraries. Initially, the DNSE was developed for English news sources ([
- Non-Uniform Web Structure: There are many platforms and technologies for developing web-based applications, including front-end technologies such as HTML, CSS, Java, JavaScript and its frameworks, and back-end technologies such as PHP, ASP.NET, XML, and many others. Because different technologies are used, the web structure varies, which makes extracting the desired information a challenging task.
- Recency or Maintenance of Fresh Content: The contents of dynamic web applications, such as blogs and news websites, are updated instantly and frequently. Keeping news content fresh is very important and must be done efficiently, considering access frequency and network traffic.
- Rise of Anti-scraping Tools: The biggest challenge in the extraction of news content is the rise of anti-scraping tools, for example, CAPTCHA, which differentiates between bots and humans. The extractor gets stuck when anti-scraping tools are in place.
- Unknown Host Issue: An unreliable internet connection leads to an unknown host issue, and restarting the extraction of news after the interruption is time-consuming.
- Socket Timeout: Most websites temporarily block or suspend their services when their contents are accessed frequently over a specific time period during preservation. The websites assume that a bot is sending unnecessary requests and overloading the server, and start blocking access.
- Garbage Collection: Inconsistency in development approaches leads to erroneous extraction, in which unwanted data, such as in-text links, tags, or other code, is collected during news extraction.
- Identifying and Preprocessing of Low-resource Languages: The DNSE tool uses different libraries for the identification and preprocessing of low-resource languages, and the preprocessing is computationally expensive.
- Firewall Blocking: A few online news sources are protected from extraction by a firewall.
Extraction is important for any digital archive and becomes challenging when preserving low-resource content. The enhanced DNSE is designed to deal with the above challenges efficiently.
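The language identification step listed among the issues above can be approximated, for the three languages in the archive, by looking at Unicode scripts. The following is a minimal sketch under stated assumptions, not the libraries the DNSE actually uses: it treats text that is mostly in the Arabic Unicode block as Urdu when it contains at least one letter that Urdu uses and standard Arabic does not, and as Arabic otherwise. The letter set, the 0.5 ratio, and the function names are assumptions chosen for illustration.

```python
# Letters used in Urdu but not in standard Arabic orthography (assumed set):
# tteh, ddal, rreh, noon ghunna, yeh barree, heh doachashmee.
URDU_ONLY = set("\u0679\u0688\u0691\u06BA\u06D2\u06BE")


def in_arabic_block(ch):
    """True if the character lies in the Unicode Arabic block (U+0600-U+06FF)."""
    return "\u0600" <= ch <= "\u06FF"


def guess_language(text):
    """Return 'urdu', 'arabic', 'latin', or 'unknown' from script statistics."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    arabic_script = [ch for ch in letters if in_arabic_block(ch)]
    if len(arabic_script) / len(letters) < 0.5:
        return "latin"  # e.g. English sources; other scripts are not separated
    if any(ch in URDU_ONLY for ch in arabic_script):
        return "urdu"
    return "arabic"


if __name__ == "__main__":
    print(guess_language("Prime minister addresses the assembly"))
    print(guess_language("\u06C1\u06D2"))  # an Urdu word containing yeh barree
    print(guess_language("\u0627\u0644\u0623\u062E\u0628\u0627\u0631"))  # Arabic
```

A heuristic like this is cheap but can mislabel short Urdu headlines that happen to contain none of the distinguishing letters, so a production extractor would likely combine it with the source's declared language or a trained identifier.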
The DNSA is enriched with two low-resource languages, that is, Urdu and Arabic, with five online sources that publish news articles in Urdu and three that publish news in Arabic. The news sources for all three languages are summarized in Table 1.
Table 1. News Sources in DNSA (Abbr for Abbreviation).

| No | News source | Abbr | Language |
|----|-------------|------|----------|
| 01 | DAWN News | DN | English |
| 02 | The Tribune | TT | English |
| 03 | The News | TN | English |
| 04 | Geo News | GN | English |
| 05 | Pakistan Observer | PO | English |
| 06 | Pakistan Today | PT | English |
| 07 | ARY News | AN | English |
| 08 | Samaa News | SN | English |
| 09 | Voice of Journalist | VJ | English |
| 10 | Time of Pakistan | TP | English |
| 11 | Express | Ex | Urdu |
| 12 | Daily Pakistan | DP | Urdu |
| 13 | Samaa Urdu | SU | Urdu |
| 14 | Geo Urdu | GU | Urdu |
| 15 | Dawn News | DU | Urdu |
| 16 | Al-Jazirah Online | AO | Arabic |
| 17 | Al-Riaz | AR | Arabic |
| 18 | Okaz | OK | Arabic |
The DNSP framework is being enhanced gradually; owing to the lack of resources and of sufficient financial support, research progress is slow. Initially, three local English newspapers, that is, Dawn News, The Tribune, and The News, were selected for testing the DNSE tool ([
The new extraction/crawling results, obtained after the DNSE was enriched with two low-resource languages, that is, Urdu and Arabic, are closely analyzed for shortcomings of the DNSP framework and the DNSE tool.
In Figure 4, the extraction results are visualized for all ten sources in the high-resource language, that is, English. The results show that a few of the news sources do not update their news online frequently and could be replaced by other sources for more efficient utilization of the DNSP framework.
Graph: Figure 4. Average new news story extraction for high-resource language "English."
Assessing the frequency of extraction of new stories is important as the news stream is continuous and not periodic like printed media.
The extraction process was performed either daily or after waiting for some days. The average numbers of extracted URLs and unique URLs are presented in Figure 5. The figure shows that the number of new news URLs extracted is almost equal to the number of new news stories, for online newspapers and online news channels alike.
Graph: Figure 5. Average total URLs extraction and unique URLs extraction for HRL.
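The distinction between total and unique URLs in Figure 5 can be made concrete with a short sketch of how an ingestion step might filter out URLs it has already seen. This is an illustrative assumption, not the DNSE implementation: the normalization rules (lowercasing scheme and host, dropping fragments and trailing slashes) and the function names are hypothetical, and the URLs use the reserved example.com domain.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def new_unique_urls(extracted, already_archived):
    """Return normalized URLs from `extracted` not seen before, in order."""
    seen = {normalize_url(u) for u in already_archived}
    fresh = []
    for url in extracted:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


if __name__ == "__main__":
    extracted = [
        "https://Example.com/news/1/",
        "https://example.com/news/1#comments",
        "https://example.com/news/2",
    ]
    archived = ["https://example.com/news/2"]
    print(new_unique_urls(extracted, archived))
```

Here three extracted URLs collapse to a single new one, which is the kind of gap between total and unique counts that the figure reports.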
The processing of low-resource languages is expensive in terms of time and is less accurate. The main problems in running the DNSE on LRLs are non-uniform web structure, unknown host issues, and garbage collection. Figures 6 and 7 present the average extraction of new news articles and of unique URLs, respectively.
Graph: Figure 6. Average new news story extraction for low-resource languages "Urdu" and "Arabic."
Graph: Figure 7. Average total URLs extraction and unique URLs extraction for LRL from online news sources.
Table 2 and Figure 8 present the error rates of URL and story extraction during preservation for both the high-resource language and the low-resource languages. The LRLs have a larger error rate because of non-uniform web structure, unknown host issues, maintenance of fresh content, anti-scraping tools, and garbage collection.
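Table 2. Error Rate in Both HRL (English) and LRLs (Urdu and Arabic) During Extraction.

| Day | HRL news items | HRL errors | HRL error rate (%) | LRL news items | LRL errors | LRL error rate (%) |
|-----|----------------|------------|--------------------|----------------|------------|--------------------|
| 01 | 1,572 | 81 | 5 | 948 | 122 | 13 |
| 02 | 712 | 26 | 4 | 469 | 55 | 12 |
| 03 | 781 | 31 | 4 | 472 | 49 | 10 |
| 04 | 746 | 25 | 3 | 480 | 64 | 13 |
| 05 | 716 | 19 | 3 | 457 | 46 | 10 |
| 06 | 745 | 21 | 3 | 493 | 42 | 9 |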
Graph: Figure 8. Error rate comparison in both HRL (English) and LRLs (Urdu and Arabic) during extraction.
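The percentages in Table 2 appear to be the number of errors divided by the number of items extracted on each day, rounded to the nearest integer. The sketch below recomputes them from the table's counts under that assumption; the variable and function names are illustrative, and the pooled rate over all six days is an additional summary that the table itself does not report.

```python
# Rows of Table 2: (day, hrl_items, hrl_errors, lrl_items, lrl_errors)
TABLE_2 = [
    (1, 1572, 81, 948, 122),
    (2, 712, 26, 469, 55),
    (3, 781, 31, 472, 49),
    (4, 746, 25, 480, 64),
    (5, 716, 19, 457, 46),
    (6, 745, 21, 493, 42),
]


def error_rate(errors, items):
    """Error rate as a percentage of extracted items."""
    return 100.0 * errors / items


def daily_rates(rows):
    """Return (day, hrl_percent, lrl_percent), rounded like the table."""
    return [
        (day, round(error_rate(h_err, h_items)), round(error_rate(l_err, l_items)))
        for day, h_items, h_err, l_items, l_err in rows
    ]


def pooled_rate(rows, items_index, errors_index):
    """Error rate over all days pooled together (total errors / total items)."""
    total_items = sum(row[items_index] for row in rows)
    total_errors = sum(row[errors_index] for row in rows)
    return error_rate(total_errors, total_items)


if __name__ == "__main__":
    for day, hrl, lrl in daily_rates(TABLE_2):
        print(f"day {day}: HRL {hrl}%  LRL {lrl}%")
    print(f"pooled HRL: {pooled_rate(TABLE_2, 1, 2):.2f}%")
    print(f"pooled LRL: {pooled_rate(TABLE_2, 3, 4):.2f}%")
```

Recomputing the columns this way is a quick consistency check on the table; the pooled figure weights busy days more heavily than a simple average of the daily percentages would.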
Metadata is as essential as the content itself because digital content is useful only when it is accessible. Metadata is structured information that helps to locate a digital object in the digital archive. Some metadata may not be available explicitly with news stories but may be extracted from the text of news articles. The metadata extractor module is extended to include a sub-module to collect metadata from the text of news stories.
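As an illustration of metadata derived from the text itself, the sketch below computes a term-frequency (TF) profile of a story, one of the implicit elements proposed later in this section. It is a minimal sketch under stated assumptions: the regular-expression tokenizer and the optional stopword set are illustrative choices, not the preprocessing used by the DNSE, and simple tokenization of this kind is known to be crude for morphologically complex languages such as Urdu and Arabic.

```python
import re
from collections import Counter


def term_frequencies(text, stopwords=frozenset()):
    """Count how often each word token occurs, ignoring the given stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


if __name__ == "__main__":
    story = "The assembly met. The assembly voted."
    tf = term_frequencies(story, stopwords={"the"})
    for term, count in tf.most_common():
        print(term, count)
```

The resulting counts could be written to a per-story file and reused by similarity measures, so that term statistics do not have to be recomputed each time two stories are compared.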
This metadata helps link multi-lingual news articles based on similarity with other news articles stored in the archive. In DNSP, 28 explicit and implicit metadata elements are extracted from the source and from the news article (if any), which are used as descriptive metadata and administrative metadata, as shown in Tables 3 and 4, respectively.
Table 3. Explicit Metadata Element set for DNSA.

| No | DNSP elements | Description | Repeated / Optional | Nature |
|----|---------------|-------------|---------------------|--------|
| 1 | Title | Heading of news article | No / No | Explicit |
| 2 | Subject/topic | Context of the news | No / Yes | Explicit |
| 3 | Creator/author | Author or reporter of the news | No / Yes | Explicit |
| 4 | Description | Brief news description | No / Yes | Explicit |
| 5 | Publisher | Publisher of news (in case of multiple sources) | Yes | Explicit |
| 6 | Date | Date of news publication | | Explicit |
| 7 | Identifier | Reference to the news (number) | Yes | Explicit |
| 8 | Language | Language of the news | No / No | Explicit |
| 9 | Coverage/scope | Spatial or temporal topic of the resource (National or International) | No / No | Explicit |
| 10 | Rights | Information about rights | No / Yes | Explicit |
| 11 | News/story | Actual or detailed news | Yes / No | Explicit |
| 12 | Category | Category of the news | No / No | Explicit |
| 13 | URL | URL of the news | No / No | Explicit |
| 14 | Newspaper Name | Newspaper or news source name | No / No | Explicit |
| 15 | Day | News publication day | No / No | Explicit |
| 16 | Time | News publication time | No / No | Explicit |
Table 4. Implicit Metadata Element set for DNSA.

| No | DNSP elements | Description | Repeated | Optional | Nature |
|----|---------------|-------------|----------|----------|--------|
| 17 | Country | Country of news publication or orientation | No | No | Explicit/implicit |
| 18 | XML-MD | Related XML file containing metadata | No | No | Implicit |
| 19 | Associated News (Same language) | Links of associated/related news of the same language | Yes | No | Implicit |
| 20 | Associated News (Other languages) | Links of associated/related news of a different language | Yes | Yes | Implicit |
| 21 | Associated Images | Links of associated images | Yes | Yes | Implicit |
| 22 | Associated audio/video | Links of associated audio/video | Yes | Yes | Implicit |
| 23 | Term Frequency (TF) file | Term Frequency of the news story | No | No | Implicit |
| 24 | Topic | Topic of the news story | Yes | Yes | Implicit |
| 25 | Topic Priority | Topic may assign priority | No | Yes | Implicit |
| 26 | Named Entities | Named Entities (NE) related to the news | Yes | Yes | Implicit |
| 27 | NE Priority | NE may assign priority | Yes | Yes | Implicit |
| 28 | CRMS Value | CRMS measure value | No | No | Implicit |
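To show how an element set of this kind might be serialized into the XML-MD file listed in Table 4, the sketch below builds a small XML document from a record and enforces a few mandatory elements. Everything concrete here is an assumption for illustration: the XML tag names, the root element `NewsMetadata`, the chosen subset of mandatory elements, and the sample values are hypothetical and do not reproduce the actual DNSA schema.

```python
import xml.etree.ElementTree as ET

# A subset of the elements marked as non-optional in the tables (assumed tag names).
MANDATORY = ("Title", "Language", "Story", "Category", "URL", "NewspaperName")


def build_xml_md(record):
    """Serialize a metadata record to an XML string.

    List values are written as repeated elements, mirroring the
    "Repeated" column; missing mandatory elements raise ValueError.
    """
    missing = [name for name in MANDATORY if not record.get(name)]
    if missing:
        raise ValueError(f"missing mandatory elements: {missing}")
    root = ET.Element("NewsMetadata")
    for name, value in record.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, name).text = str(item)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    record = {
        "Title": "Sample heading",
        "Language": "English",
        "Story": "Sample body text.",
        "Category": "National",
        "URL": "https://example.com/news/1",
        "NewspaperName": "Example Daily",
        "NamedEntity": ["Islamabad", "Parliament"],  # a repeated element
    }
    print(build_xml_md(record))
```

Keeping the mandatory check separate from serialization makes it easy to report which elements a given source fails to provide, which is the kind of per-element coverage summarized in the extraction tables.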
Tables 5 and 6 present the metadata extraction results during the preservation of news stories in the DNSA for both low-resource and high-resource languages. Well-organized news websites normally provide all the explicit metadata, although a few descriptive elements are left blank. The implicit metadata is extracted from the news stories themselves, so almost all the meta elements are extracted, as shown in the respective tables.
Table 5. DNSP Explicit Metadata Element Set Extraction.

| No | DNSP elements | No of HRL news | Percentage | No of LRL news | Percentage |
|----|---------------|----------------|------------|----------------|------------|
| 1 | Title | 12,831 | 100 | 3,318 | 100 |
| 2 | Subject/topic | 11,690 | 91 | 2,696 | 81 |
| 3 | Creator/author | 7,242 | 56 | 1,541 | 46 |
| 4 | Description | 4,645 | 36 | 1,023 | 30 |
| 5 | Publisher | 5,589 | 44 | 3,318 | 100 |
| 6 | Date | 11,581 | 90 | 3,318 | 100 |
| 7 | Identifier | 3,211 | 25 | 685 | 20 |
| 8 | Language | 12,831 | 100 | 3,318 | 100 |
| 9 | Coverage/scope | 12,831 | 100 | 2,320 | 69 |
| 10 | Rights | 2,451 | 19 | 627 | 19 |
| 11 | News/story | 12,831 | 100 | 3,318 | 100 |
| 12 | Category | 12,831 | 100 | 3,318 | 100 |
| 13 | URL | 12,831 | 100 | 3,318 | 100 |
| 14 | Newspaper name | 12,831 | 100 | 3,318 | 100 |
| 15 | Day | 12,831 | 100 | 3,318 | 100 |
| 16 | Time | 12,831 | 100 | 3,318 | 100 |
Table 6. DNSP Implicit Metadata Element Set Extraction.

| No | DNSP elements | No of HRL news | Percentage | No of LRL news | Percentage |
|----|---------------|----------------|------------|----------------|------------|
| 1 | Country | 12,831 | 100 | 3,318 | 100 |
| 2 | XML-MD | 12,831 | 100 | 3,311 | 100 |
| 3 | Associated news (same language) | 12,831 | 100 | 1,934 | 58 |
| 4 | Associated news (other languages) | 12,831 | 100 | 1,801 | 54 |
| 5 | Associated images | 5,246 | 41 | 741 | 22 |
| 6 | Associated audio/video | 842 | 7 | 74 | 2 |
| 7 | TF | 12,831 | 100 | 2,185 | 65 |
| 8 | Topic | 12,831 | 100 | 2,044 | 61 |
| 9 | Topic priority | 6,449 | 50 | - | - |
| 10 | Named entities (NE) | 12,831 | 100 | - | - |
| 11 | NE priority | 9,811 | 76 | - | - |
| 12 | CRMS value | 12,831 | 100 | 2,199 | 66 |
Extracting explicit metadata is easy in terms of accuracy and computation in both high-resource and low-resource languages. In contrast, implicit meta-element extraction in low-resource languages is computationally expensive and less accurate than in high-resource languages. The twelve implicit meta elements in the proposed metadata element set are not straightforward to extract for LRL because of the morphologically complex features of Urdu and Arabic. The DNSE tool is enhanced to extract all meta elements for HRL and LRL except, in the case of LRL, "Topic Priority," "Named Entities," and "Named Entities Priority."
The existing news archives can be classified into two types, that is, archives in graphical formats and partially indexed archives. It is difficult to manipulate the contents of these archives, especially to access particular news about an event, because doing so involves many challenges, such as:
- Vast Archive Collections: an archive created from many sources,
- Various Sources: having different platforms,
- Multi-lingual Archive: an archive created from multiple languages, that is, Urdu, Arabic & English
- Low-resource Language: access becomes more complicated for news articles in low-resource languages, such as Urdu. Because sophisticated tools are required, the preprocessing overhead is large compared to high-resource languages, such as English.
Besides these, there are many difficulties in digital news preservation, such as:
- Extraction of news from many diverse sources and technologically different platforms,
- Extraction of implicit and explicit metadata,
- Similarity value computation among news articles,
- Transformation of news articles to a specific standard format for future integration and access, and so on.
Developing an efficient extractor for multi-lingual content is also challenging, although it may seem easy at first. The development of the Digital News Stories Extractor faces many challenges:
- Non-uniform web structure: web platforms are developed with different technologies and frameworks, such as HTML, CSS, Java, JavaScript, PHP, XML, and many others.
- Web platforms use different data structures and formats to provide news content. Preservation needs more versatility, and possibly AI features, to crawl content from these different resources.
- Maintaining fresh contents: news web sources refresh their contents continuously, so the archive needs to be updated in near real time while balancing user access and avoiding excess network traffic.
- Broken links keep the extractor busy or lead to the extraction of irrelevant content.
- Anti-scraping tools resist the extraction of news content: the extractor's repeated attempts to access a website can be treated as a DoS (Denial of Service) or DDoS (Distributed Denial of Service) attack, and the extractor gets stuck.
- The extraction needs a stable internet connection to avoid unknown host issues, and logs must be kept to track progress.
- Continuous access to a source generates socket timeouts, which block services temporarily; these are avoided by a URL shuffling policy and a periodic wait policy.
- Garbage collection is a serious issue for targeted content extraction, especially in low-resource languages, because of diverse platforms, different technologies, inconsistent formats, new symbols, and so on, which call for a sophisticated extractor.
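The URL shuffling and periodic wait policies mentioned in the list above are not specified in detail, so the following is only one plausible reading, sketched under stated assumptions: URLs from different sources are interleaved so that consecutive requests tend to hit different hosts, and the extractor pauses between requests and backs off after a timeout or unknown-host error. The function names, the wait times, the retry count, and the injected `fetch` and `sleep` callables are all hypothetical.

```python
import random
import time
from itertools import zip_longest


def shuffle_across_sources(urls_by_source, rng=None):
    """Interleave URLs so consecutive requests tend to target different sources."""
    rng = rng or random.Random()
    queues = []
    for urls in urls_by_source.values():
        queue = list(urls)
        rng.shuffle(queue)
        queues.append(queue)
    rng.shuffle(queues)
    interleaved = []
    for group in zip_longest(*queues):
        interleaved.extend(url for url in group if url is not None)
    return interleaved


def fetch_all(urls, fetch, wait_seconds=2.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL with a pause between requests and a growing wait on errors.

    `fetch` is any callable that returns content or raises OSError; socket
    timeouts and unknown-host errors are OSError subclasses in Python.
    """
    results, failed = {}, []
    for url in urls:
        for attempt in range(1, max_retries + 1):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                sleep(wait_seconds * attempt)  # wait longer after each failure
        else:
            failed.append(url)  # every retry failed; keep it for the log
        sleep(wait_seconds)  # periodic wait before the next request
    return results, failed


if __name__ == "__main__":
    sources = {"s1": ["s1/a", "s1/b"], "s2": ["s2/a", "s2/b"]}
    order = shuffle_across_sources(sources, random.Random(0))
    results, failed = fetch_all(order, lambda u: "content of " + u, sleep=lambda s: None)
    print(order, failed)
```

Passing `fetch` and `sleep` in as parameters keeps the policy testable without network access; the failed list corresponds to the log keeping that the text asks for.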
The DNSA is an effort to create a full-text archive that benefits from new technologies and approaches. The DNSP framework is developed to overcome the different challenges facing multi-lingual digital archives, including low-resource languages; ongoing research directions are briefly listed in the future work of this paper.
The preservation of news and the creation of news archives are challenging. They become even more complicated when the archive contains articles in low-resource and morphologically complex languages like Urdu and Arabic. The study introduced a multi-lingual news archive for Urdu, Arabic, and English news articles from eighteen news publishing platforms. The digital news stories extractor is enhanced to address major issues in handling low-resource languages and to facilitate migration to a normalized format. The extraction results for the proposed 28 meta-elements, including sixteen explicit and twelve implicit elements, are presented in detail for the high-resource language, that is, English, and the low-resource languages, that is, Urdu and Arabic. The results showed that seventeen and ten meta elements were fully (100%) extracted for HRL and LRL, respectively. The LRL sources encountered a higher error rate during preservation than the HRL sources, 10% and 3%, respectively. The framework preserved, on average, 879 news stories from ten HRL sources and 553 news stories from eight LRL sources in the digital news stories archive.
The study presents details of how the framework is enhanced; a more detailed study is still needed for accurate news content extraction and archiving for future access. The framework can be extended in different dimensions in the future, such as:
- A standard user interface is required to enable access to the archived contents of the DNSA.
- The DNSE tool needs to be developed to professional standards.
- The meta attributes can be developed for multi-lingual archives and other languages, such as Urdu, Arabic, Pashto, and so on.
- More implicit meta elements can be added to the proposed set after comprehensively reviewing individual sources.
- Language-dependent metadata attributes can be added to the meta-set.
I would like to express my sincere gratitude to the University of Hail for their generous support and funding throughout the course of my research at the University of Swat under Project Title "Enhancing the Digital News Stories Preservation Framework for ARABIC Language" and Project ID: RG-21 090. This project would not have been possible without their financial assistance, which allowed us to conduct the necessary experiments, gather data, and analyze results.
By Muzammil Khan; Yasser Alharbi; Ali Alferaidi; Talal Saad Alharbi and Kusum Yadav