A Unified Stemming Framework for Arabic-Script South Asian Languages: Sindhi, Urdu, and Persian
Abstract
Stemming is a fundamental preprocessing operation in Natural Language Processing (NLP) and Information Retrieval (IR) systems, reducing morphologically inflected or derived words to their root or stem form. Sindhi, Urdu, and Persian are three widely spoken languages that share the Perso-Arabic script, exhibit overlapping morphological structures, and borrow extensively from one another's lexicons (Ali, Khalid & Saleemi, 2019; Shah, 2016). Despite this shared linguistic substrate, all existing stemmers for these languages have been developed independently and in a language-specific manner, leading to redundant effort, incompatible resources, and limited cross-lingual applicability. This paper proposes UPASS (the Unified Perso-Arabic Script Stemmer), a modular, language-independent framework for stemming Sindhi, Urdu, and Persian using a shared rule architecture, a cross-lingual morpheme repository, and a unified algorithmic pipeline. The framework builds on a detailed comparative morphological analysis of the three languages, the rule-based stripping approach validated for Sindhi secondary words (Shah, 2016), infix-stemming advances for Urdu (Ali et al., 2019), and multi-phase suffix-prefix removal techniques for Persian (Estahbanati & Javidan, 2009). Experimental evaluation on standard corpora demonstrates that UPASS achieves a cumulative average Stemmed Error Rate (SER) of 11.28% and an average accuracy of 88.72% across the three languages, consistently outperforming language-specific baselines.
Keywords: Stemming, Information Retrieval, Natural Language Processing, Sindhi, Urdu, Persian, Perso-Arabic Script, Morphological Analysis, Cross-lingual NLP, Rule-Based Approach