TPTS: Text Pre-processing Techniques for Sindhi Language: Text Pre-processing Techniques


Ali Nawaz
Muhammad Nawaz
Noor Ahmed Shaikh
Samina Rajper
Junaid Baber
Muhammad Khalid

Abstract

The Internet is a significant source of textual data, with users generating vast amounts of information daily through social media and news agencies. Extracting meaningful information from such large datasets is challenging and costly. Text pre-processing is a crucial first step in any Natural Language Processing (NLP) task because it can affect the overall performance of the task. Its main objective is to transform unstructured text into a standard, linguistically meaningful form, which makes it easier to extract information for any text-processing task. This paper introduces TPTS, a model for text pre-processing in the Sindhi language. TPTS performs essential NLP tasks for Sindhi: tokenization, normalization, stop-word removal, stemming, and part-of-speech (POS) tagging. The Sindhi Text Corpus (STC), consisting of 1.5k Sindhi text documents collected from various online news websites, is used for experimentation. The TF-IDF approach is employed to identify high-frequency stop-words in the Sindhi language.
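
The abstract does not say how TF-IDF scores are turned into a stop-word list. One plausible reading is that terms occurring in nearly every document receive a very low inverse document frequency (IDF) and are therefore stop-word candidates. The Python sketch below illustrates that reading only; it is not the authors' implementation. The function names `tokenize` and `stop_word_candidates`, the punctuation set, and the `top_k` cut-off are all assumptions introduced for illustration.

```python
import math
from collections import Counter

# Assumed punctuation set: common Arabic-script and Latin marks.
PUNCTUATION = "،؛؟.!,:;\"'()[]{}«»۔"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def tokenize(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(_PUNCT_TABLE).split()


def stop_word_candidates(documents, top_k=10):
    """Rank terms by ascending IDF; the lowest-IDF terms are candidates.

    Ties are broken by higher corpus frequency, then alphabetically.
    """
    n_docs = len(documents)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for doc in documents:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    scored = []
    for term, df in doc_freq.items():
        idf = math.log(n_docs / df)  # 0.0 when the term is in every document
        scored.append((idf, -term_freq[term], term))
    scored.sort()
    return [term for _, _, term in scored[:top_k]]
```

In practice the candidate list would still need manual review, since a term that is frequent in a news corpus (for example a place name) is not necessarily a function word.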

Article Details

How to Cite
Nawaz, A., Nawaz, M., Shaikh, N. A., Rajper, S., Baber, J., & Khalid, M. (2023). TPTS: Text Pre-processing Techniques for Sindhi Language: Text Pre-processing Techniques. Pakistan Journal of Emerging Science and Technologies (PJEST), 4(3). https://doi.org/10.58619/pjest.v4i3.89
Section
Articles
Author Biographies

Muhammad Nawaz, Shah Abdul Latif University (SALU) Khairpur, Pakistan

Noor Ahmed Shaikh, aliuob15@gmail.com

Samina Rajper, Shah Abdul Latif University (SALU) Khairpur, Pakistan

Junaid Baber, University of Balochistan, Quetta, Pakistan

Muhammad Khalid, HITEC University, Taxila, Pakistan
