A Unified Stemming Framework for Arabic-Script South Asian Languages: Sindhi, Urdu, and Persian

Mohsin Raza Shah; Amjad Ali Mahesar

Authors

Mohsin Raza Shah, Assistant Professor, Computer Science, College Education Department, Government of Sindh
Amjad Ali Mahesar, Lecturer, College Education Department, Government of Sindh

Abstract

Stemming is a fundamental preprocessing operation in Natural Language Processing (NLP) and Information Retrieval (IR) systems: it reduces morphologically inflected or derived words to their root or stem form. Sindhi, Urdu, and Persian are three widely spoken languages that share the Perso-Arabic script, exhibit overlapping morphological structures, and borrow heavily from one another's vocabulary (Ali, Khalid & Saleemi, 2019; Shah, 2016). Despite this shared linguistic substrate, all existing stemmers for these languages have been developed independently and in a language-specific manner, leading to redundant effort, incompatible resources, and limited cross-lingual applicability. This paper proposes UPASS (the Unified Perso-Arabic Script Stemmer), a modular, language-independent framework for stemming Sindhi, Urdu, and Persian using a shared rule architecture, a cross-lingual morpheme repository, and a unified algorithmic pipeline. The framework builds on a detailed comparative morphological analysis of the three languages, the rule-based stripping approach validated for Sindhi secondary words (Shah, 2016), infix-stemming advances for Urdu (Ali et al., 2019), and multi-phase suffix-prefix removal techniques for Persian (Estahbanati & Javidan, 2009). Experimental evaluation on standard corpora demonstrates that UPASS achieves a cumulative average Stemmed Error Rate (SER) of 11.28% and an average accuracy of 88.72% across the three languages, consistently outperforming language-specific baselines.

Keywords: Stemming, Information Retrieval, Natural Language Processing, Sindhi, Urdu, Persian, Perso-Arabic Script, Morphological Analysis, Cross-lingual NLP, Rule-Based Approach
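
To make the ideas in the abstract concrete, the following is a minimal sketch of rule-based suffix stripping driven by a shared, per-language suffix table, together with an error-rate calculation. It is not the UPASS implementation: the names SUFFIXES, MIN_STEM_LEN, strip_suffix, and stemmed_error_rate are hypothetical, the suffix lists are tiny hand-picked samples rather than the paper's morpheme repository, and SER is assumed here to be the percentage of words stemmed incorrectly, which is consistent with the reported SER and accuracy summing to 100%.

```python
# Illustrative sketch only; not the UPASS rule set or pipeline.
# The suffix lists below are small hand-picked samples (assumption),
# standing in for the cross-lingual morpheme repository.
SUFFIXES = {
    "fa": ["ها"],          # Persian plural marker
    "ur": ["یں", "وں"],    # Urdu plural endings
}

# Assumed guard against over-stripping very short words.
MIN_STEM_LEN = 2


def strip_suffix(word: str, lang: str) -> str:
    """Remove the longest matching suffix for `lang`, if one applies."""
    candidates = sorted(SUFFIXES.get(lang, []), key=len, reverse=True)
    for suffix in candidates:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word  # no rule matched; leave the word unchanged


def stemmed_error_rate(predicted, gold) -> float:
    """Percentage of words whose predicted stem differs from the gold stem
    (assumed definition of SER; accuracy is then 100 - SER)."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    errors = sum(p != g for p, g in zip(predicted, gold))
    return 100.0 * errors / len(gold)
```

Because each language contributes only a table entry, the same stripping function serves all three languages; a fuller system would add prefix and infix rules and exception lists on top of this skeleton.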
