Web preservation, and news preservation for future generations in particular, has received far more attention in the developed world than in the developing world. News published online is volatile because the technologies used to disseminate information and the formats used for publication change constantly. News preservation becomes more complicated and challenging when an archive contains articles in low-resource, morphologically complex languages such as Urdu and Arabic alongside English news articles. The digital news story preservation (DNSP) framework is enriched with eighteen news sources covering Urdu, Arabic, and English. This study presents the challenges of low-resource languages (LRLs), the research challenges, and details of how the framework is enhanced. In this paper, we introduce a multilingual news archive and discuss the digital news story extractor, which addresses major issues in handling low-resource languages and facilitates migration to a normalized format. The extraction results are presented in detail for the high-resource language, i.e., English, and the low-resource languages, i.e., Urdu and Arabic. LRLs encountered a higher error rate during preservation than high-resource languages (HRLs): 10% and 3%, respectively. The extraction results show that a few news sources are not regularly updated and release few new news stories online. LRLs require more detailed study for accurate news content extraction and archiving for future access. Both LRLs and HRLs enrich the DNSP framework. The Digital News Stories Archive (DNSA) preserves a large number of news articles from multiple news sources in LRLs and HRLs. This paper presents the research challenges encountered during the preservation of Urdu- and Arabic-language news articles to create a multilingual news archive.
The second part of the paper compares two bilingual linking mechanisms for Urdu-to-English-language news articles in the DNSA, the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW), against the cosine similarity measure (CSM) baseline technique. The experimental results show that the SMTW is more effective than the CRMDL and CSM for linking Urdu-to-English news articles. Precision improved from 46% (CSM) and 50% (CRMDL) to 60% (SMTW), and recall improved from 64% (CSM) and 67% (CRMDL) to 82% (SMTW), with an improved impact of common terms as well.
Keywords: low-resource languages; high-resource language; challenges in preservation; multilingual archive; bilingual news; information systems
The Internet serves as the primary and extensive hub of information, encompassing diverse sources that cover every aspect of human existence. It offers a wide range of data, including weather forecasts, travel deals, local and global events, and much more. This vast pool of information is accessible through the World Wide Web and various Web services [[
Despite its rapid growth, the World Wide Web (WWW) is inherently fragile, which poses a significant challenge. The fragility of information on the Web leads to the unfortunate disappearance and inaccessibility of valuable scholarly, cultural, and scientific resources for future generations. Consequently, it becomes imperative to prioritize the preservation of this diverse and valuable information, which exists in various forms.
For hundreds of years, newspapers have served as the primary source of information, covering a wide range of topics that encompass various aspects of human life. They provide valuable insights into local and global events, including parliamentary activities, politically significant occurrences, court proceedings, births, deaths, marriages, sports, science, technology, and more. Newspapers reflect societal life, capturing social dynamics, behaviors, and cultural values, thus serving as essential scholarly information for individuals and communities. Given the significance of preserving such information for future generations, efforts have been made to ensure its availability. For example, historical manuscripts now hold immense value, as do addresses made by prime ministers after election victories or announcements related to imminent foreign invasions. The UNESCO Declaration on Archives emphasizes the crucial role played by archives in societal development by safeguarding the contributions of individuals and communities [[
A literature review of newspaper archives reveals that diverse approaches are employed for the preservation of newspapers, with the majority being digitized as a single digital record. Typically, curated digitized records are created by scanning microfilm, which is a compact photograph that can be stored and enlarged for reading, and are then saved in formats such as PDF, GIF, JPG, or other graphical formats. Newspaper archives can be categorized into two main types: old and newer archives. Old newspaper archives pose challenges for optical character recognition (OCR) technology in indexing them into a full-text corpus. As a result, these archives are primarily available in graphical format, requiring visual inspection to access the content. Conversely, the newer newspaper archives have been extensively indexed, allowing for efficient full-text searching mechanisms to retrieve specific information.
The digital news story preservation (DNSP) framework was introduced to establish a digital archive of news articles interconnected based on specific criteria with the purpose of future utilization [[
Establishing linking mechanisms and metadata is crucial during the preservation process to ensure the efficient dissemination of archived news articles sourced from multiple languages and diverse sources [[
Section 2 and its subsections differentiate low-resource languages from high-resource languages and introduce challenges in LRLs, with a brief overview of the Urdu and Arabic languages. Section 3 presents details about the digital news story preservation framework initiative, the importance of preservation, research challenges, DNSP framework enhancement, the multilingual archive and its structure, and major issues encountered in enhancing the extraction tool. In Section 4, extraction quantification is comprehensively discussed. Section 6.3.2 compares the results of bilingual linking mechanisms, and in Section 7, the findings are summarized.
Natural languages are classified into two broad categories, i.e., low-resource languages (LRLs) and high-resource languages (HRLs). For high-resource languages, e.g., English, many data resources exist that help machines learn and understand the language. English is well resourced compared to other spoken languages. Many western European languages are also well resourced, as are Chinese, Japanese, and Russian. In contrast, low-resource languages have very few or no resources available. Low-resource languages can be defined as less studied, resource-scarce, less computerized, less privileged, less commonly taught, or low-density languages [[
- Collection of text in various forms, such as research papers, books, email collections, social media content collections, etc.;
- Lexical, syntactic, and semantic resources, such as a bag of words, dictionaries, semantic databases (e.g., WordNet), organized dependency tree corpora, etc.;
- Task-specific resources, such as part-of-speech tags, corpora for machine translation, annotated text, named entity recognition resources, etc.
Many language resources are costly to produce, which is why the economic inequalities between countries/languages are reflected in the amount (or absence) of language resources.
Natural language processing (NLP) tools changed drastically in the 1990s, shifting from rule-based techniques to statistical approaches, and a new era of artificial intelligence began. Since then, the focus has mainly been on English as an international language, and only about 20 of the 7000 languages of the world have been considered [[
Languages that need a lot of research are often referred to as low-resource languages and face many challenges, as briefly discussed below:
- Alignment or the projection technique (three levels of alignment, i.e., word, sentence, and document) is a common technique for annotation. It is difficult to adopt the projection technique from HRLs to LRLs because of a lack of resources and different structures of target and source languages [[8 ]];
- Creating a bag of words, dataset, and raw text collection for LRLs is difficult but necessary for any natural language processing (NLP) task or mapping technique [[8 ]];
- The most important resource for any language is the lexicon of that language; many NLP tasks heavily depend on the textual material available, which is lacking in LRLs, making it a challenging task to produce an efficient lexicon;
- The morphology of LRLs is constantly evolving, with vocabulary easily extended. Developing a comprehensive framework for morphological pattern recognition is difficult because of multiple roots [[11 ]];
- The major applications of NLP, such as question–answer systems, sentiment analysis, image-to-text mapping, machine translation, and named entity recognition-based systems, are very difficult to implement in low-resource languages;
- Basic NLP tasks such as stop-word identification and removal, tokenization, part-of-speech tagging, sentence parsing, lemmatization, stemming, etc., are also difficult in low-resource languages;
- The NLP systems of LRLs are time-consuming and comparatively less efficient as a result of a lack of resources, increasing the difficulty of developing a machine learning system [[10 ]];
- Many languages are mostly oral, for which very few written resources exist (physical and digital formats). For some, there are written documents but not even a basic resource like a dictionary;
- Integrated and customized systems are always a huge challenge for multilingual systems.
Dealing with all the challenges faced by low-resource languages requires extensive research in different dimensions. Urdu and Arabic are two widely spoken languages that need considerable research attention.
Urdu, a prominent South Asian language, boasts approximately 70 million native speakers and over 164 million speakers worldwide [[
Arabic is the third most widely spoken language globally, trailing behind English and French. Approximately 292 million individuals use Arabic as their primary and official language across twenty-seven countries, with a significant number of people also capable of understanding it as a second language. Alongside English, French, Spanish, Russian, and Chinese, Arabic holds the distinction of being one of the official languages of the United Nations. Notably, Arabic is gaining popularity as a language to learn in the Western world, and numerous other languages have borrowed words from Arabic due to its historical significance. The intricacies of Arabic grammar can pose a challenge, both for native speakers of Indo-European languages and for machines attempting to accurately interpret and comprehend the Arabic language [[
The "Digital News Story Preservation (DNSP) Framework" was initiated in 2015 [[
Preserving access to historical records is crucial for writers conducting research; however, these preserved documents may be at risk of disappearing. While digital news content is widely available today, it is more vulnerable than print and can be scattered across different media and storage systems [[
The importance of digital news story preservation is summarized as follows:
- Preservation and data backup are distinct concepts and should not be conflated. Digital media are susceptible to various failures, including file corruption, viruses, malware, damage, overwritten backups, server issues, and even natural disasters like earthquakes and tornadoes. These risks highlight the need for proper digital preservation, which encompasses a set of processes and activities specifically designed to ensure the long-term, sustained storage, access to, and interpretation of digital information. Preservation goes beyond mere data backup and focuses on maintaining the integrity and accessibility of digital content over time;
- According to a survey by Educopia conducted in 2012, out of 60 newspaper companies, fewer than half keep their data and content for more than five years [[16 ]]. Their news contents are also dispersed and distributed over multiple servers;
- The preservation of news is vital for the advancement of society, as it enables citizens to stay well-informed and up-to-date on a wide range of events and news through journalism. By preserving news content, individuals are empowered with knowledge, ensuring they are equipped to make informed decisions and actively participate in their communities;
- Preserving news is beneficial for writers and researchers, as it empowers them to craft relevant and contextualized stories. News preservation holds immense value as a historical record for society on a large scale, offering significant benefits;
- Researchers require information such as birth, death, marital status, business data, announcements, legal documents, property transaction papers, etc., with respect to the genealogical status of a person or even a community as a whole, which can be obtained from news or published document archives;
- Developed nations are actively engaged in archiving significant documents and newspapers and preserving news content to ensure future accessibility. When considering our social heritage, digital preservation becomes more important, especially during times when society relies on journalists to thoroughly investigate stories and produce impactful news;
- Urdu news preservation: To uphold the rich heritage of the Urdu language, it is crucial for Urdu speakers to have a sincere commitment to preserving its essence. Urdu periodicals contain a diverse array of literary works encompassing significant topics from South Asia during the nineteenth and twentieth centuries, making their preservation highly valuable for researchers interested in the language. The preservation of Urdu news stories should be particularly significant for the people of Pakistan, as it is a country where Urdu is widely spoken and holds a unique position in the world [[17 ]];
- Arabic news preservation: Arabic is gaining popularity as a language of interest in the Western world, attracting an increasing number of learners. Throughout history, other languages have borrowed words from Arabic due to its significant contributions. However, the grammar of Arabic poses a challenge for native speakers of Indo-European languages and even for machines attempting to accurately interpret and comprehend the language [[4 ]]. Arabic encompasses multiple versions of its script, including standard Arabic, classical Arabic, literary Arabic, and modernized Arabic. During the early Middle Ages, Arabic played a central role as a primary source for science, mathematics, culture, and philosophy. Preserving Arabic scripts is crucial, as they capture various grammatical changes that have occurred, reflecting the nuances found in colloquial variants.
The news encompasses a wide range of events that are directly or indirectly connected to our social lives. These events include parliamentary actions, significant political occurrences for countries, court proceedings, government announcements, deaths, births, marriages, sports, etc. In the coming years, the responsibility of preserving these comprehensive journalistic records primarily lies with news outlets and newspaper organizations, ensuring their availability for future generations. Online news publications are generated and updated instantly, following a non-linear format, which means that they can disappear and become inaccessible. Based on existing data, it has been observed that approximately 80% of web pages become inaccessible within a year, and around 13% of links, particularly web references in scholarly articles, cease to function after approximately 27 months [[
Even if a newspaper is backed up or archived by national archives and libraries, accessing specific information from multiple sources about a particular event may be challenging in the future. This challenge becomes even more complex when attempting to follow a story through an archive that comprises a vast collection generated from numerous news sources, each requiring different technologies to access the archived contents.
News archives are of two types, i.e., graphical formats and partially indexed archives, which makes it difficult to access particular news about an event because of many challenges, such as:
- Vast archive collections: an archive created from many sources;
- Various sources on different platforms;
- Multilingual archive: an archive created from multiple languages, i.e., Arabic, Urdu, and English;
- Low-resource language: Access becomes more complicated when searching for news articles in low-resource languages, such as Urdu.
There are many difficulties in digital news preservation, such as:
- Extraction of news from diverse sources and different technological platforms;
- Extraction of explicit and implicit metadata;
- Computing similarity values between news articles;
- Conversion of news articles to a specific standard format for future integration and access, etc.
There are many challenges in accessing preserved digital news stories in archives, such as:
- Locating and discovering a digital resource among a huge collection, such as a catalog or archive [[20 ]];
- The effectiveness of search mechanisms depends directly on how these objects are organized. Digital library management helps by providing support for identifying, describing, and locating resources;
- Interoperability is the ability of different systems to exchange and use information together without losing content and functionality, representing a huge challenge in archive management [[21 ]];
- Providing mechanisms for digital objects to hold the data that prove their reliability, integrity, authenticity, and provenance [[22 ]];
- Storing information about the physical characteristics and documenting behavior so that it can be emulated in future technologies [[21 ]]. For example, "the original XML instance of imported data is maintained to preserve all mappings and to be able to roundtrip the original" [[23 ]];
- During the object development phase, multiple versions of the same object may be created for preservation and dissemination. Thus, the same object may be present in multiple versions; metadata tracks all the information regarding different versions and changes in the object over time;
- When seeking to utilize data collected for a different project in their own work, individuals aim to locate and utilize data while placing a greater emphasis on trust and comprehension. Reusing data typically necessitates meticulous preservation and documentation of both the data content and the accompanying metadata.
The primary purpose of the DNSP framework is to create a multilingual, multisource digital news stories archive to preserve digital news articles for the long term and future generations. The framework is enriched with two low-resource languages, i.e., Urdu and Arabic. The challenges presented in previous sections regarding low-resource languages make it difficult to include these sources in a straightforward way. The absence of efficient tokenizers, dictionaries, and other basic resources requires heavy preprocessing during preservation in the framework. The workflow and main components of the enhanced version of the DNSP framework are presented in Figure 1.
The following section provides a brief introduction to the Digital News Stories Archive (DNSA). The primary concept behind the digital news story preservation (DNSP) framework was introduced at the International Conference on Asian Digital Libraries 2015 (ICADL-2015) [[
- After analyzing 120 news archives worldwide, a comprehensive and generic systematic approach was proposed as a model for Web preservation. This approach entails a step-by-step procedure to be followed in web preservation projects [[24 ]];
- A multisource web archive known as the Digital News Stories Archives (DNSA) was designed and developed to preserve online news articles originating from multiple sources [[1 ]];
- A digital news story extractor (DNSE) tool was specifically designed to extract news articles from diverse sources and compile them to form the DNSA. Its primary function is to collect and gather news articles from various online sources, ensuring their preservation within the DNSA [[26 ]];
- In the DNSA, we use content-based methods to link news articles during the preservation process. These methods rely on text features, such as the ratio of common terms based on their frequency [[27 ]], named entities [[28 ]], the position of terms, terms in the headline, the credibility of information, the distance between similar terms, etc. [[1 ]];
- Similarity measures were studied in the most relevant field of news recommender systems. A comprehensive study was performed on recommendation systems that can enhance the DNSP framework in different dimensions and improve its utility (a few of them are discussed in future work) [[29 ]];
- The CRMS technique was modified for news headings to reduce extra computation for the terms appearing in the news body for linking of English news articles during preservation [[27 ]];
- The CRMS technique was updated for linking of Urdu-language news articles with English-language news articles in the DNSA [[30 ]];
- A heading-based technique was introduced for linking of news articles in the DNSA during the preservation process in the DNSP framework [[31 ]].
The Digital News Stories Archive (DNSA) is a news archive created locally from multiple online sources that provide news in three languages, i.e., English, Urdu, and Arabic. Currently, the DNSA is archiving news articles from seven online newspapers and three local news television networks [[
The high-level system architecture is illustrated in Figure 2. The figure depicts the process flow, starting from the ingest phase, where two mediators are employed to extract and incorporate metadata into the news story archive. Once the metadata are added, the news stories are archived and safeguarded for future use by generating an archival information package (AIP), as depicted in Figure 3. Subsequently, the preserved contents can be accessed using the information dissemination package.
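The packaging step can be pictured with a small sketch. It assumes, purely for illustration, that an AIP can be represented as a dictionary holding the story text, descriptive metadata, and a fixity checksum. The field names and the functions `build_aip`, `verify_aip`, and `serialize_aip` are hypothetical and do not describe the DNSA's actual package layout.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_aip(story_id, headline, body, source, language, fetched_at=None):
    """Bundle one news story and its metadata into a dict standing in for an AIP."""
    content = body.encode("utf-8")
    fetched_at = fetched_at or datetime.now(timezone.utc).isoformat()
    return {
        "id": story_id,
        "content": body,
        # Descriptive metadata: hypothetical fields chosen for illustration.
        "descriptive_metadata": {
            "headline": headline,
            "source": source,
            "language": language,
        },
        # Preservation metadata: a checksum lets later audits detect corruption.
        "preservation_metadata": {
            "fetched_at": fetched_at,
            "sha256": hashlib.sha256(content).hexdigest(),
            "size_bytes": len(content),
        },
    }


def verify_aip(aip):
    """Recompute the checksum to confirm the stored content is intact."""
    digest = hashlib.sha256(aip["content"].encode("utf-8")).hexdigest()
    return digest == aip["preservation_metadata"]["sha256"]


def serialize_aip(aip):
    """Serialize to JSON; ensure_ascii=False keeps Urdu and Arabic text readable."""
    return json.dumps(aip, ensure_ascii=False, indent=2)
```

Storing a checksum alongside the content is one common way to separate preservation from plain backup: a periodic call to `verify_aip` can flag silently corrupted packages before they are disseminated.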
To help readers in accessing relevant news articles and enhance their understanding of various topics, the DNSA (Digital News Story Archive) requires an effective mechanism for linking digital stories and recommending them to readers. This mechanism, as discussed in previous studies [[
Without an efficient search functionality, a news archive would essentially amount to a mere collection of news articles, lacking the ability to serve as a truly valuable information repository. To transform it into an effective repository, it is essential to implement a robust search functionality, which necessitates the use of indexing approaches and the establishment of a clearly defined set of metadata elements.
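To illustrate what an indexing approach contributes, the following minimal sketch builds an inverted index over article text and answers conjunctive keyword queries. It is a generic illustration, not the DNSA's search implementation; `build_index` and `search` are hypothetical names, and the whitespace tokenization is an assumption that would be too crude for Urdu or Arabic text in practice.

```python
from collections import defaultdict


def build_index(articles):
    """Map each lowercased whitespace token to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for token in text.lower().split():
            index[token].add(article_id)
    return index


def search(index, query):
    """Return ids of articles containing every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

With such an index, a lookup touches only the posting sets of the query terms instead of scanning every archived article, which is what makes full-text search feasible on a large collection.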
The digital news story extractor (DNSE) is a Java-based tool for extracting digital news stories from online news websites using JSOUP and POI libraries. Initially, the DNSE was developed for English news sources [[
- Non-uniform Web structure: There are many platforms and technologies for developing Web-based applications. Front-end technologies include HTML, CSS, Java, and JavaScript and its frameworks, and back-end technologies include PHP, ASP.NET, and XML, among many others. Because different technologies are used, the Web structure varies; hence, extracting the desired information is challenging.
- Recency or maintenance of fresh content: The Web contents of dynamic Web applications, such as blogs and news websites, update instantly and frequently. The recency of news content is very important to maintain efficiently, considering access frequency and network traffic issues.
- Rise of anti-scraping tools: The biggest challenge in extracting news content is the rise of anti-scraping tools, e.g., CAPTCHA, which differentiate between bots and humans. The extractor stalls when a source deploys such tools.
- Unknown host issue: An unreliable Internet connection leads to an unknown host issue; the extraction of news can be restarted, but the interruption is time-consuming.
- Socket timeout: Many websites temporarily block or suspend their services when their content is requested too frequently within a given period during preservation. Such websites treat the extractor as a bot that sends unnecessary requests and overloads the server, and they start blocking access.
- Garbage collection: The inconsistency in development approaches leads to erroneous extraction by collecting unwanted data, such as in-text links, tags, or other code, during news extraction.
- Identifying and preprocessing of low-resource languages: The DNSE tool deploys different libraries for the identification and preprocessing of low-resource languages, and the preprocessing is computationally expensive.
- Firewall blocking: A few online news sources are protected from extraction by a firewall.
The developed extractor partially handles problems such as firewall blocking, the rise of anti-scraping applications, preprocessing, garbage collection, and non-uniform Web structure. It handles the maintenance of fresh news articles, unknown host issues, and socket timeout issues using different techniques and Web APIs. Extraction is important for any digital archive and is challenging when preserving low-resource-language content. The enhanced DNSE is able to deal with the above challenges efficiently.
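One common way to cope with the socket timeout and unknown host issues listed above is to retry a failed request after an increasing delay. The sketch below illustrates that idea only; the source does not specify which techniques the DNSE uses, and the DNSE itself is written in Java, so this Python version is an analogy. `fetch_with_retry`, its parameters, and the injected `fetch` callable are hypothetical.

```python
import socket
import time


def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a timeout or name-resolution failure, wait and retry.

    The delay doubles after each failed attempt (exponential backoff), which
    also lowers the request rate toward a server that has started throttling.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (socket.timeout, socket.gaierror):
            # socket.timeout ~ socket timeout; socket.gaierror ~ unknown host.
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
```

Passing the `fetch` and `sleep` callables in as arguments keeps the retry policy separate from the HTTP client, so the same wrapper could sit around any extraction call.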
The "DNSA" is enriched with five sources that provide Urdu news articles and three online sources that publish news in the Arabic language. The details of the included news articles from all three languages are summarized in Table 1.
The progress of research is hindered by the slow development of the DNSP framework, along with limited resources and insufficient financial support. Initially, three local English newspapers, namely Dawn News, The Tribune, and The News were chosen as the test subjects for the DNSE tool. A total of 86,545 URLs (with an average of 2791 URLs) were extracted, including duplicate URLs for news stories. Among these extracted URLs, there were 23,843 unique URLs representing individual news stories (with an average of 769 unique URLs). The extraction results from the newspaper websites of Dawn News, The Tribune, and The News accounted for 6457 news stories (with an average of 208), 4914 news stories (with an average of 158), and 5713 news stories (with an average of 184), respectively [[
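The gap between extracted URLs and unique URLs arises because the same story can be reached through several links. A plausible way to collapse such duplicates, not necessarily the one used by the DNSE, is to normalize each URL before comparing. In the sketch below, `normalize` and `unique_urls` are hypothetical helpers, and the normalization rules (drop the fragment, lowercase scheme and host, strip a trailing slash) are assumptions.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Drop the fragment, lowercase scheme and host, and strip a trailing slash."""
    url, _fragment = urldefrag(url.strip())
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )


def unique_urls(urls):
    """Return normalized URLs in first-seen order with duplicates removed."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize(url), url)
    return list(seen)
```

The ratio of total to unique URLs reported above (86,545 to 23,843) suggests that each story URL was encountered between three and four times on average, which is why deduplication precedes story extraction.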
The new extraction/crawling results obtained after the DNSE was enriched with two low-resource languages, i.e., Urdu and Arabic, are closely analyzed for shortcomings of the DNSP framework and DNSE tool. Table 2 presents the detailed extraction results of the DNSE for both the HRL and LRLs.
The extraction results are visualized in Figure 4 for all ten sources of the high-resource language, i.e., English. The results show that a few news sources do not update their news online frequently and can be replaced by other sources for efficient utilization of the DNSP framework.
Evaluating the frequency of new story extraction is crucial due to the continuous and non-periodic nature of the news stream, unlike printed media.
The extraction process was carried out daily or after waiting for a few days. Figure 5 presents the average count of extracted URLs and unique URLs. The figure illustrates that the number of newly extracted news URLs closely aligns with the count of new news stories obtained from online newspapers and various news channels.
The processing of low-resource languages is expensive in terms of time, complexity, and accuracy. The main problems with implementing the DNSE, including LRLs, are non-uniform web structure, unknown host issues, and garbage collection. Figure 6 and Figure 7 present the average extraction of new news articles and unique URLs, respectively.
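The garbage collection problem, in which tags, in-text links, or script code leak into the extracted story, can be illustrated with a small text extractor. This is a sketch using Python's standard `html.parser` module rather than the Java libraries the DNSE relies on; `StoryTextExtractor` and `clean_story` are hypothetical names, and real news pages would need site-specific rules on top of this.

```python
from html.parser import HTMLParser


class StoryTextExtractor(HTMLParser):
    """Collect visible text while skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def clean_story(html):
    """Return the visible text of an HTML fragment with markup and code removed."""
    parser = StoryTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the parser works on markup rather than on words, the same cleanup applies unchanged to English, Urdu, and Arabic pages; the language-specific cost arises later, in tokenization and preprocessing.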
Table 3 presents the error rate of URLs and story extraction during preservation for both high- and low-resource languages. The LRLs are associated with a large error rate because of non-uniform web structure, unknown host issues, maintenance of fresh content, anti-scraping tools, and garbage collection. It is observed that low-resource-language news sources are not as well maintained as high-resource-language news sources.
The number of online news articles necessitates the use of recommender system techniques to establish linkages between digital news stories during preservation. These techniques can be broadly categorized into collaborative filtering techniques and content-based techniques. Collaborative filtering techniques encounter several challenges, as they heavily rely on user opinions, demographics, and feedback to establish similarity [[
The common ratio measure for stories (CRMS) is a content-based similarity measure that has proven to be effective in linking digital news articles during preservation. The Digital News Story Archive (DNSA) ensures the future accessibility of related news articles by preserving and formatting linked news articles sourced from a vast corpus of news articles extracted from multiple sources. To determine the similarity among related news articles in the archive, a threshold value for CT/TT (a measure defined in the study) was established, representing the best indicator of similarity [[
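The exact definition of CT/TT is given in the cited study and is not reproduced here. As a rough illustration only, the sketch below assumes that CT is the number of distinct terms shared by two articles and TT is the number of distinct terms across both, which makes the ratio a Jaccard coefficient. It ignores term frequency, and the 0.3 threshold, the function names, and the whitespace tokenization are all hypothetical.

```python
def term_set(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop stop words."""
    return {t for t in text.lower().split() if t not in stop_words}


def common_term_ratio(text_a, text_b, stop_words=frozenset()):
    """Assumed reading of CT/TT: shared distinct terms over all distinct terms."""
    a, b = term_set(text_a, stop_words), term_set(text_b, stop_words)
    total = a | b
    return len(a & b) / len(total) if total else 0.0


def is_linked(text_a, text_b, threshold=0.3, stop_words=frozenset()):
    """Link two articles when the ratio reaches the (hypothetical) threshold."""
    return common_term_ratio(text_a, text_b, stop_words) >= threshold
```

A threshold on a ratio of this kind trades precision against recall: raising it links fewer, more closely related articles, while lowering it links more articles at the cost of spurious matches.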
Similarly, another content-based similarity measure is introduced based on the use of transliteration words known as the similarity measure based on transliteration words (SMTW). It is observed that the use of English transliteration words is common in Urdu manuscripts. This practice is anticipated to play a vital role in linking Urdu news articles with English news articles. These linking mechanisms link formatted news articles to ensure the future accessibility of related news articles from an enormous corpus of news articles extracted from multiple sources in the DNSA [[
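The intuition behind a transliteration-based measure can be sketched as follows. The code is not the SMTW algorithm itself; it assumes a small, hand-made lookup of English loan words written in Urdu script and measures how many of the loan words recognized in an Urdu article also occur in an English article. The lookup `LOAN_WORDS` and both function names are hypothetical.

```python
# Hypothetical lookup of English loan words written in Urdu script; a real
# system would need a far larger resource or a transliteration model.
LOAN_WORDS = {
    "کرکٹ": "cricket",
    "ٹیم": "team",
    "میچ": "match",
    "پولیس": "police",
    "بینک": "bank",
}


def transliterated_terms(urdu_text, lookup=LOAN_WORDS):
    """Return the English forms of known loan words found in an Urdu text."""
    return {lookup[token] for token in urdu_text.split() if token in lookup}


def transliteration_overlap(urdu_text, english_text, lookup=LOAN_WORDS):
    """Share of the Urdu article's recognized loan words present in the English article."""
    urdu_terms = transliterated_terms(urdu_text, lookup)
    if not urdu_terms:
        return 0.0
    english_terms = set(english_text.lower().split())
    return len(urdu_terms & english_terms) / len(urdu_terms)
```

The appeal of this approach for a low-resource language is that it sidesteps full machine translation: only the borrowed vocabulary has to be mapped, and that vocabulary tends to carry the topical words shared by both versions of a story.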
"Transliteration is a process of using the text of one script in another script or the process of converting text from one language to another". In the field of linguistics, the act of incorporating a word or a group of words from one language into another language's writing scripts is known as borrowing. These borrowed words are commonly referred to as loan words [[
Complexity analysis of an algorithm is a technique used to analyze or predict resources that are required to solve a problem of a given size [[
- Time complexity;
- Space complexity;
- Accuracy.
Time complexity (or time efficiency) is the measure of the amount of time taken by an algorithm to execute or run as a function of the input size. It is a way to describe how efficient an algorithm is in terms of the amount of time it takes to solve a problem. There are two known approaches:
- The empirical approach measures time complexity experimentally and has several limitations; for example, it depends on hardware resources, the software environment, and the implementation design;
- The analytical approach overcomes the limitations of the empirical approach and is independent of the computing hardware, programming language, and implementation details of the algorithm. The execution time is estimated by counting the primitive operations of each statement for the input values.
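The contrast between the two approaches can be made concrete with a toy example. The sketch below is not Algorithm 1 or Algorithm 2; it uses a hypothetical nested-loop term matcher, `count_common_terms`, and reports both a count of primitive comparisons (the analytical view) and a wall-clock time (the empirical view).

```python
import time


def count_common_terms(a, b):
    """Nested-loop matching; returns (matches, primitive comparisons performed)."""
    comparisons = 0
    matches = 0
    for x in a:
        for y in b:
            comparisons += 1
            if x == y:
                matches += 1
                break  # stop scanning b once x has been matched
    return matches, comparisons


def measure(n):
    """Worst case: two disjoint lists of n terms, so no inner loop exits early."""
    a = [f"a{i}" for i in range(n)]
    b = [f"b{i}" for i in range(n)]
    start = time.perf_counter()
    _, comparisons = count_common_terms(a, b)
    elapsed = time.perf_counter() - start
    return comparisons, elapsed
```

For two disjoint lists of n terms, the comparison count is exactly n × n on any machine, whereas the elapsed time returned by `measure` changes with the hardware and interpreter; this is the independence that the analytical approach offers.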
In this study, we analyze both linking algorithms, i.e., CRMDL (Algorithm 1) and SMTW (Algorithm 2), using the analytical approach. Table 4 presents the time complexity of CRMDL, followed by the time complexity of SMTW.
The total time complexity of input size n for CRMDL is:
where k lies between 0 and n, uw = n. For average-case complexity, the value of k = n/2. Therefore,
The average time complexity is simplified to the order-of-growth function that interests the researchers by ignoring the low-order terms and multiplicative constants.
T(n) =
Space complexity refers to the amount of memory space required by an algorithm or program to solve a specific problem. It is a measure of how much memory is needed to execute an algorithm or program, and it is typically expressed in terms of the number of bytes or bits of memory required [[
- Determining the space as a function of array size, which depends on the nature of data and is normally represented by n;
- Space required by the instruction or statements of the algorithm, which is constant and represented by 1.
The total space complexity of input size n for CRMDL or SMTW is:
S(n) =
The accuracy of an algorithm refers to how well it solves a particular problem or produces the correct output for a given input. In other words, it measures how closely the algorithm's output matches the desired or correct output. The accuracy of an algorithm can be measured quantitatively, such as through the use of performance metrics like precision, recall, and F1 score. The proposed content-based algorithms, CRMDL and SMTW, are extensively analyzed in [[
The DNSA encounters rapid growth in both high- and low-resource languages due to its continuous extraction of news articles from multiple sources. As an example, the DNSE collects approximately 400 Urdu news articles from five different sources, 180 Arabic news articles from three sources, and 700 English news articles from ten online sources on a daily basis.
To evaluate the proposed similarity measures, datasets are selected from the DNSA based on the heading or title of news articles, focusing on currently hot topics from a general pool. Table 5 provides a summarized overview of the datasets used in the evaluation process, and Table 6 is used to analyze similarity based on human observation.
To evaluate the comprehensive overall impact of the proposed similarity measures, a dataset consisting of 282 news articles is utilized. These news articles are sourced from two online television broadcasters, namely Geo and Samaa News, and are available in both English and Urdu languages. The dataset includes 152 Urdu news articles and 130 English news articles that were selected from a general pool. For further details on the news articles used in the empirical evaluation, please refer to Table 7.
This section focuses on how SMTW "outperformed CRMDL" in linking Urdu, a low-resource language, with English, a high-resource language, using content-based techniques. The improvement achieved by SMTW is discussed in detail, and the comparison is based on three evaluation parameters, which are:
- Result improvement: The outcomes of both CRMDL and SMTW techniques are compared, emphasizing the improved results achieved by SMTW and ranking them accordingly. The term "improvement" indicates that the result now includes all the relevant news articles within the top-five ranking or that the ranking of the relevant articles has been enhanced, bringing the most relevant news to the top. On the other hand, if a similar news article that was previously within the top five has been displaced, it is denoted as "dropped". The term "none" is used when both techniques yield the same results or when the new technique has no noticeable effect.
- Transliteration word impact: Given that English transliterated words are commonly used in Urdu scripts, it is expected that they would influence the frequency of shared terms. Therefore, an analysis is conducted to examine the impact of transliteration words on the results. This analysis specifically focuses on showcasing the effects of linking Urdu and English news articles and how the presence of transliterated words affects the outcomes.
- Result accuracy (precision and recall): To evaluate the effectiveness of the proposed similarity measure, the accuracy of the results is measured by precision and recall and compared with CSM for dual-lingual news articles to assess the overall feasibility.
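The "improvement", "dropped", and "none" labels defined in the first parameter can be expressed as a small decision rule. The sketch below is one possible reading of those definitions, not the authors' procedure: the function name `classify_change`, the tie-breaking by summed rank positions, and the article identifiers are hypothetical.

```python
def classify_change(old_ranking, new_ranking, relevant, k=5):
    """Label the effect of a new technique on a top-k ranking.

    old_ranking / new_ranking: article ids ordered by similarity (best first).
    relevant: set of article ids judged relevant by human observation.
    """
    old_top, new_top = old_ranking[:k], new_ranking[:k]
    old_hits = [a for a in old_top if a in relevant]
    new_hits = [a for a in new_top if a in relevant]

    # A relevant article that was in the old top-k has been displaced.
    if any(a not in new_top for a in old_hits):
        return "dropped"
    # More relevant articles now appear within the top-k.
    if len(new_hits) > len(old_hits):
        return "improvement"
    # Same relevant articles, but they moved closer to the top.
    if sum(new_top.index(a) for a in new_hits) < sum(old_top.index(a) for a in old_hits):
        return "improvement"
    return "none"


if __name__ == "__main__":
    relevant = {"a1", "a6"}                       # made-up article ids
    old = ["a10", "a1", "a6", "a7", "a3"]
    new = ["a1", "a10", "a6", "a7", "a3"]
    print(classify_change(old, new, relevant))    # relevant article a1 moved up
```

Checking for displaced articles first means that a result counts as "dropped" even if another relevant article moved up in the same ranking, which is a design choice of this sketch.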
Table 8 highlights the superiority and improved performance of SMTW over CRMDL and CSM in linking Urdu news articles with relevant English news articles during the presentation and development of the DNSA. Transliterated words play a crucial role in calculating similarity values among relevant news in multilingual archived news articles. The SMTW demonstrates a significant improvement in similarity, with a 22% increase (5 out of 23) compared to CRMDL. Within this improvement, the ranking of relevant news articles improves by 13%, and the overall results show a 9% enhancement. In 74% of the cases, the results remain unchanged, indicating consistency between the techniques. However, for the Urdu news article "ur6", there is a drop of 4% in the computed similarity value.
Similarly, the impact of transliteration words on the count of common terms and subsequent computation is substantial. The number of common terms is directly influenced by the length of Urdu news articles, and it is observed that there are five (05) transliterated words present in these articles. As a result, the inclusion of these transliterated words leads to a significant improvement of 22% in the results. This improvement is attributed to a 75% increase in the count of common terms, as shown.
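The effect of transliterated words on the count of common terms can be illustrated with a toy example. The sketch below is not the authors' Algorithm 2; the two small lexicons, the function name `common_terms`, and the sample tokens are hypothetical and serve only to show why matching transliterated words raises the number of shared terms.

```python
# Hypothetical toy lexicons; the actual DNSA lexicon is not reproduced here.
URDU_TO_ENGLISH = {"حکومت": "government", "تنخواہ": "salary"}
# Urdu-script spellings of English words (transliterations).
TRANSLITERATIONS = {"بجٹ": "budget", "کرکٹ": "cricket", "ٹیم": "team"}


def common_terms(urdu_tokens, english_tokens, use_transliteration):
    """Return the English terms shared by an Urdu and an English article."""
    english = set(english_tokens)
    mapped = set()
    for token in urdu_tokens:
        if token in URDU_TO_ENGLISH:
            # Ordinary dictionary translation.
            mapped.add(URDU_TO_ENGLISH[token])
        elif use_transliteration and token in TRANSLITERATIONS:
            # Transliterated English word written in Urdu script.
            mapped.add(TRANSLITERATIONS[token])
    return mapped & english


if __name__ == "__main__":
    urdu = ["حکومت", "بجٹ", "تنخواہ", "ٹیم"]
    english = ["government", "budget", "salary", "pension"]
    print(len(common_terms(urdu, english, use_transliteration=False)))
    print(len(common_terms(urdu, english, use_transliteration=True)))
```

With the transliteration lookup enabled, the word for "budget" is also matched, so the shared-term count for this toy pair rises from two to three.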
SMTW is more effective than CRMDL in linking news articles in two languages in the DNSA. Table 9 shows that SMTW works well on large datasets. The study also finds that sports news contains more English words in Urdu and provides better results, whereas Urdu news is not considerably influenced by English words. The results improved by 20% (6 out of 30), worsened by 04%, and stayed the same for 76% of stories. The percentage of English words in Urdu news articles varies from 20–30% depending on the type and length of the news articles.
Figure 8 and Figure 9 depict the precision and recall results for all datasets of news articles. These figures demonstrate that the proposed similarity measure, SMTW, achieves higher precision and recall than CRMDL and CSM in linking dual-language news articles within the DNSA. These results further emphasize the superiority of SMTW in effectively linking and aligning news articles across multiple languages.
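For reference, precision and recall for a single linked result set can be computed as in the following minimal sketch. The function names and the sample article identifiers are illustrative assumptions; the values reported in the figures come from the authors' own evaluation, not from this code.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    retrieved = ["a1", "a2", "a3", "a4", "a5"]    # top-five linked articles (made up)
    relevant = ["a1", "a3", "a5", "a9"]           # human-judged relevant articles (made up)
    p, r = precision_recall(retrieved, relevant)
    print(p, r, round(f1_score(p, r), 3))
```

In this toy case, three of the five retrieved articles are relevant (precision 0.6) and three of the four relevant articles are retrieved (recall 0.75).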
In the preservation process, employing a "similarity measure based on transliteration words (SMTW)" appears to be a viable approach for calculating content-based similarity and linking Urdu-to-English news articles. The SMTW measure is more effective with lengthy news articles than with shorter ones, and it proved particularly suitable for sports news. By utilizing the SMTW measure, the Digital News Stories Archive ensures the preservation of linked and properly formatted news articles. This approach supports the future accessibility of related news articles from a vast corpus of news articles extracted from multiple sources. If transliteration words appear frequently in a given type of manuscript or in a language script, the SMTW performs better.
The preservation of news stories is of great significance for multiple reasons. These stories offer in-depth information about events that encompass our culture and heritage, making them invaluable resources, and preserving news stories ensures their availability for long-term research purposes. However, the news stories published online are in danger of being lost because of constant changes in the technologies used by online publishing sources and the formats used by platforms. The preservation of news and the creation of news archives is challenging. It becomes even further complicated when an archive contains articles from a low-resourced and morphologically complex language like Urdu or Arabic. This study introduces a multilingual news archive for Urdu, Arabic, and English news article sources published online on eighteen news publishing platforms. The digital news stories extractor is enhanced to address major issues in implementing low-resource languages and facilitates normalized format migration. The extraction results are presented in detail for a high-resource language, i.e., English, and low-resource languages, i.e., Urdu and Arabic. The LRLs encountered a high error rate during preservation compared to the HRL: 10% and 03%, respectively. The extraction results show that two of the news sources are not regularly updated and release very few new news stories online. The Digital News Stories Archive framework successfully preserved an average of 879 news articles from ten high-resource-language (HRL) sources and 553 news articles from eight low-resource-language (LRL) news sources. In the context of the DNSA, we compared two bilingual linking mechanisms, namely the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW) for linking of Urdu-to-English language news articles. The SMTW demonstrated superior results compared to the CRMDL technique and CSM. 
It was observed that approximately 78% of Urdu news articles contained transliterated words. The precision improved from 46% and 55% to 60%, while the recall improved from 64% and 67% to 82%. The impact of common terms also exhibited improvement. Notably, the SMTW was proven effective and feasible for sports news.
This study highlights the challenges faced in dealing with low-resource languages (LRLs) and outlines research challenges. It also provides insights into how the framework can be enhanced and emphasizes the need for a more detailed investigation to ensure accurate extraction and archiving of news content for future access. The framework holds potential for further expansion and exploration in various dimensions.
This research highlights challenges encountered in low-resource languages (LRLs) and explores the associated research challenges. It also presents the improvements made to the framework and emphasizes the necessity of a comprehensive investigation to ensure precise extraction and archiving of news content for future retrieval. Furthermore, the framework holds potential for extension across various aspects, such as:
- Thorough analysis of Arabic script, which is necessary to facilitate multilingual linking;
- To provide access to the archived contents of the DNSA, the implementation of a standardized user interface is essential;
- The DNSE tool should be developed to meet professional standards;
- Meta attributes can be expanded to accommodate multilingual archives and include languages like Urdu, Arabic, Pashto, and other languages;
- Implicit meta elements can be added to the proposed set after comprehensively reviewing individual sources;
- We are working on enhancing the structure of the Urdu-to-English lexicon and optimizing the bag of Urdu words to improve processing efficiency;
- Advanced content-based similarity measures should be developed, utilizing various features, such as weighted terms, named entities, term position, and contextual information from news articles;
- The DNSA needs crosslingual techniques for linking of multilingual archived news;
- Metadata elements need to be proposed for the digital news story preservation framework for efficient archive management and information dissemination;
- A more comprehensive set of generic elements for well-structured and well-populated online sources is required.
This study presents details of the framework's enhancements and emphasizes the need for a more comprehensive investigation into the accurate extraction and archiving of news content for future retrieval. The framework holds potential for future expansion in various aspects, such as:
- Thorough analysis of the Arabic script to facilitate multilingual linking;
- Development of a standardized user interface to facilitate access to archived content in the DNSA;
- Professional-level development of the DNSE tool;
- Creation of meta attributes for multilingual archives, encompassing languages such as Urdu, Arabic, Pashto, etc.;
- Addition of implicit meta elements to the proposed set after comprehensive evaluation of individual sources;
- Ongoing efforts to enhance the structure of the Urdu-to-English lexicon and the bag of Urdu words to improve processing efficiency;
- The design of more advanced content-based similarity measures incorporating diverse features such as weighted terms, named entities, term position, and contextual information within news articles;
- Incorporation of crosslingual techniques in the DNSA for linking of multilingual archived news.
Graph: Figure 1 Enhanced digital news story preservation framework.
Graph: Figure 2 High-Level System Architecture.
Graph: Figure 3 Archival information package (AIP) of DNSA [[
Graph: Figure 4 Average new news story extraction for high-resource language "English" from different sources.
Graph: Figure 5 Average URL extraction and unique URL extraction for HRLs.
Graph: Figure 6 Unique URL extraction and new URL extraction comparison.
Graph: Figure 7 Comparison of news URL extraction from online newspapers.
Graph: Figure 8 Comparison of the precision of SMTW, CRMDL, and CSM.
Graph: Figure 9 Comparison of the recall of SMTW, CRMDL, and CSM.
Table 1 News sources in the DNSA.
| No. | News Source | Abbreviation | Language |
|---|---|---|---|
| 01 | DAWN News | DN | English |
| 02 | The Tribune | TT | English |
| 03 | The News | TN | English |
| 04 | Geo News | GN | English |
| 05 | Pakistan Observer | PO | English |
| 06 | Pakistan Today | PT | English |
| 07 | ARY News | AN | English |
| 08 | Samaa News | SN | English |
| 09 | Voice of Journalist | VJ | English |
| 10 | Time of Pakistan | TP | English |
| 11 | Express | Ex | Urdu |
| 12 | Daily Pakistan | DP | Urdu |
| 13 | Samaa Urdu | SU | Urdu |
| 14 | Geo Urdu | GU | Urdu |
| 15 | Dawn News | DU | Urdu |
| 16 | Al-Jazirah Online | AO | Arabic |
| 17 | Al-Riaz | AR | Arabic |
| 18 | Okaz | OK | Arabic |
Table 2 Average of Six Days Extraction Results of DNSE for both the HRL and LRLs for All Sources.
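| No. | News Source | Extracted URLs | Unique URLs | New URLs |
|---|---|---|---|---|
| 01 | DN | 1277 | 304 | 166 |
| 02 | TT | 711 | 243 | 111 |
| 03 | TN | 816 | 230 | 131 |
| 04 | GN | 306 | 95 | 66 |
| 05 | PO | 405 | 136 | 108 |
| 06 | PT | 490 | 170 | 151 |
| 07 | AN | 223 | 86 | 62 |
| 08 | SN | 178 | 65 | 47 |
| 09 | VJ | 159 | 39 | 19 |
| 10 | TP | 127 | 49 | 18 |
| 11 | Ex | 295 | 173 | 99 |
| 12 | DP | 202 | 123 | 75 |
| 13 | SU | 270 | 144 | 91 |
| 14 | GU | 197 | 99 | 55 |
| 15 | DU | 175 | 93 | 45 |
| 16 | AO | 211 | 101 | 63 |
| 17 | AR | 154 | 83 | 50 |
| 18 | OK | 192 | 110 | 75 |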
Table 3 Error rate in both HRL (English) and LRLs (Urdu and Arabic) during extraction.
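| Day | HRL Sources | HRL Error Rate | HRL Percentage | LRL Sources | LRL Error Rate | LRL Percentage |
|---|---|---|---|---|---|---|
| 01 | 1572 | 81 | 5% | 948 | 122 | 13% |
| 02 | 712 | 26 | 4% | 469 | 55 | 12% |
| 03 | 781 | 31 | 4% | 472 | 49 | 10% |
| 04 | 746 | 25 | 3% | 480 | 64 | 13% |
| 05 | 716 | 19 | 3% | 457 | 46 | 10% |
| 06 | 745 | 21 | 3% | 493 | 42 | 9% |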
Table 4 Time complexity of CRMDL.
| Statement | Unit Cost | Total Cost |
|---|---|---|
| UNA Pre-processing | n | n |
| T = {t, t, t, ..., t} | n | n |
| Remove stopwords (if any) | k | k |
| | n + 1 | n + 1 |
| Find tf for each term t from T | n | n |
| Update Map(UNA) | 1 | 1 |
| Map(UNA) = {(tf, w), (tf, w), (tf, w), ..., (tf, w)} | n | n |
| Identify the English meaning of each Urdu term in the dictionary | 1 | 1 |
| Identify multiple meanings for each Urdu word (if any) | k | k |
| | n + 1 | n + 1 |
| T = {t, t, t, ..., t} | n | n |
| Remove stopwords (if any) | k | k |
| | (m) | n ∗ (m) |
| Find tf for each term t from T | n | n |
| Update Map(ENA) | 1 | 1 |
| Map(ENA) = {(tf, w), (tf, w), (tf, w), ..., (tf, w)} | n | n |
| Map(ENA) = {(tf, w), (tf, w), (tf, w), ..., (tf, w)} | n | n |
| Map(UNA) = {(tf, w), (tf, w), (tf, w), ..., (tf, w)} | n | n |
| CT = (tf + tf)W + (tf + tf)W + ... + (tf + tf)W | (tf + tf)W | (tf + tf)W |
| UT = (tf ∨ tf)W + (tf ∨ tf)W + ... + (tf ∨ tf)W | (tf ∨ tf)W | (tf ∨ tf)W |
| TT = UT + CT | 1 | 1 |
| CRMDL = CT/TT | 1 | 1 |
Table 5 Dataset overview for linking of bilingual news articles.
News Articles Observed Similarity 1 4 3 1 3 3 Yes Yes Yes 2 10 2 5 5 5 Yes Yes Yes 3 20 1 10 10 5 Yes Yes Yes 4 282 2 152 130 4 No Yes Yes
Table 6 Overview: dataset of 20 news articles.
Type of News News Articles News Articles About 3 1, 6, 10 PSL, Cricket Sports News 2 7, 9 WI tour, teams announcement 1 5 ICC president resigns 3 2, 6, 8 COAS, army General News 1 3 Trump travel ban 1 4 MQM leader
Table 7 News articles analyzed for similarity.
| Urdu Article (English Translation) | Description | Stats |
|---|---|---|
| Budget 2017–2018: Government employees were made happy | Having no exact match, much similar news, general news, and of average length | Six relevant news of 55 and no exact match |
| The Ramadan moon sighted, the first fast will be tomorrow | Having no exact match, much similar news, general news, and of short length | Nine relevant news of 55 and no exact match |
| Budget, 10% raise in salaries and pension | Having one exact match, much similar news, general news, and of average length | Eight relevant news of 74 and one exact match |
| Yonus Khan's all-time test captain is Imran Khan | Having one exact match, much similar news, sports news, and of average length | Seven relevant news of 74 and one exact match |
Table 8 Improved results of SMTW approach vs. CRMDL approach for 20 news article sets.
Urdu News CRMDL Rank SMTW Rank SMTW CT Impact eng1 eng1 ⯅ eng10 eng6 - urNews1 eng6 eng10 ⯅ eng7 eng7 ⯅ eng4 eng4 ⯅ eng2 eng2 ⯅ eng6 eng6 - urNews2 eng8 eng8 ⯅ eng5 eng5 - eng3 eng3 - urNews3 eng3 eng3 - eng4 eng4 - urNews4 eng4 eng4 ⯅ eng7 eng7 - urNews5 eng5 eng5 ⯅ eng1 eng1 ⯅ urNews6 eng6 eng6 ⯅ eng2 eng2 ⯅ eng1 eng8 ⯅ eng7 eng1 ⯅ eng10 eng7 ⯅ eng8 eng10 ⯅ urNews7 eng7 eng7 ⯅ eng1 eng10 ⯅ eng10 eng6 ⯅ eng3 eng1 ⯅ eng6 eng9 ⯅ urNews8 eng8 eng8 ⯅ eng3 eng6 ⯅ eng4 eng2 ⯅ eng1 eng4 ⯅ eng2 eng3 ⯅ urNews9 eng7 eng9 ⯅ eng9 eng7 ⯅ eng5 eng6 ⯅ urNews10 eng1 eng10 ⯅ eng10 eng6 ⯅ eng6 eng1 ⯅ eng7 eng7 ⯅
Table 9 Improved results of SMTW approach vs. CRMDL approach for one-day news article sets. ⯅ indicates improved results, and "-" represents "no change or no impact".
Eng1 49 75 ⯅ Eng2 22 31 ⯅ Eng3 22 31 ⯅ Eng4 25 36 ⯅ Eng5 26 34 ⯅ Eng6 12 20 ⯅ Eng1 14 18 ⯅ Eng2 12 14 ⯅ Eng3 06 06 - Eng4 04 04 - Eng5 07 07 - Eng6 06 06 - Eng7 06 06 - Eng8 11 11 - Eng9 02 02 - Eng1 82 121 ⯅ Eng2 83 115 ⯅ Eng3 162 219 ⯅ Eng4 87 106 ⯅ Eng5 66 97 ⯅ Eng6 55 83 ⯅ Eng7 56 86 ⯅ Eng8 42 71 ⯅ Eng1 53 122 ⯅ Eng2 65 176 ⯅ Eng3 37 81 ⯅ Eng4 27 69 ⯅ Eng5 24 103 ⯅ Eng6 19 50 ⯅ Eng7 13 71 ⯅
M.K.: conceptualization, methodology, experimentation, development, data collection, and manuscript writing; K.U.: conceptualization, methodology, experimentation, manuscript writing, and proofreading; Y.A.: conceptualization, methodology, proofreading, and supervision; A.A. (Ali Alferaidi): conceptualization, proofreading, and supervision; T.S.A.: conceptualization, methodology, proofreading, and supervision; K.Y.: conceptualization, methodology, and proofreading; N.A.: conceptualization, methodology, and proofreading; A.A. (Akash Ahmad): conceptualization, methodology, and proofreading. All authors have read and agreed to the published version of the manuscript.
Not applicable.
This article does not involve humans or animals.
Not applicable.
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| SMTW | Similarity measure based on transliteration words |
| CRMDL | Common ratio measure for dual languages |
| CSM | Cosine similarity measure |
| WWW | World Wide Web |
| DNSA | Digital News Stories Archive |
| DNSP | Digital news story preservation |
| DNSE | Digital news story extractor |
| CT | Common terms |
| TT | Total terms |
| UT | Uncommon terms |
| UrNews | Urdu news |
| EngN | English news |
| Ur | Urdu |
| Eng | English |
| ICADL | International Conference on Asian Digital Libraries |
| AI | Artificial intelligence |
| API | Application programming interface |
By Muzammil Khan; Kifayat Ullah; Yasser Alharbi; Ali Alferaidi; Talal Saad Alharbi; Kusum Yadav; Naif Alsharabi and Aakash Ahmad