TPTS: Text Pre-processing Techniques for Sindhi Language

Ali Nawaz; Muhammad Nawaz; Noor Ahmed Shaikh; Samina Rajper; Junaid Baber; Muhammad Khalid

doi:10.58619/pjest.v4i3.89


Published: Jun 28, 2023


Keywords:

Text pre-processing, natural language processing (NLP), ROUGE evaluation, TF-IDF weighting technique, rule-based model, Sindhi language.

Ali Nawaz

03337891591

Muhammad Nawaz

Shah Abdul Latif University (SALU) Khairpur, Pakistan

Noor Ahmed Shaikh

aliuob15@gmail.com

Samina Rajper

Shah Abdul Latif University (SALU) Khairpur, Pakistan

Junaid Baber

University of Balochistan, Quetta, Pakistan

Muhammad Khalid

HITEC University, Taxila, Pakistan

Abstract

The Internet is a major source of textual data, with users generating vast amounts of information every day through social media and news agencies. Extracting meaningful information from such large datasets is challenging and costly. Text pre-processing is a crucial first step in any Natural Language Processing (NLP) task because it can affect the overall performance of the task. Its main objective is to transform unstructured text into a standard, linguistically meaningful form from which information is easier to extract. This paper introduces TPTS, a model for text pre-processing in the Sindhi language. TPTS performs the essential NLP tasks of tokenization, normalization, stop-word removal, stemming, and part-of-speech (POS) tagging for Sindhi. The Sindhi Text Corpus (STC), consisting of 1.5k Sindhi text documents collected from various online news websites, is used for experimentation. The TF-IDF approach is employed to identify high-frequency stop-words in the Sindhi language.
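The abstract does not say how TF-IDF statistics are turned into a stop-word list, so the Python sketch below is only one plausible reading, not the authors' implementation. It assumes that a stop-word candidate is a term with a low inverse document frequency (it occurs in almost every document) and a high collection frequency. The names `tokenize`, `stopword_candidates`, `max_idf`, and `top_k` are hypothetical, and the whitespace tokenizer is a simplification that only strips a few ASCII and Arabic-script punctuation marks.

```python
import math
import re
from collections import Counter

# Assumed punctuation set: Arabic comma, semicolon, question mark and
# full stop (U+060C, U+061B, U+061F, U+06D4) plus common ASCII marks.
_PUNCT = re.compile("[\u060C\u061B\u061F\u06D4.,;:!?()]")


def tokenize(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return _PUNCT.sub(" ", text).split()


def stopword_candidates(documents, max_idf=0.3, top_k=10):
    """Return up to top_k (term, collection_frequency, idf) tuples.

    A term qualifies when idf = log(N / df) is at most max_idf, i.e. it
    appears in nearly all N documents. Qualifying terms are ranked by
    collection frequency, highest first (ties broken alphabetically).
    The threshold and ranking rule are illustrative assumptions.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the term
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    scored = []
    for term, df in doc_freq.items():
        idf = math.log(n_docs / df)
        if idf <= max_idf:
            scored.append((term, term_freq[term], idf))

    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


if __name__ == "__main__":
    # Toy English stand-in; a Sindhi corpus such as STC would be passed
    # in the same way, one string per document.
    docs = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "a bird saw the cat",
    ]
    print(stopword_candidates(docs))
```

On a real corpus the `max_idf` threshold would need tuning, and the resulting candidates would normally be reviewed by a speaker of the language before being used as a stop-word list.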

How to Cite

Nawaz, A., Nawaz, M., Shaikh, N. A., Rajper, S., Baber, J., & Khalid, M. (2023). TPTS: Text Pre-processing Techniques for Sindhi Language. Pakistan Journal of Emerging Science and Technologies (PJEST), 4(3). https://doi.org/10.58619/pjest.v4i3.89

Issue

Vol. 4 No. 3 (2023)

Section

Articles

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Pakistan Journal of Emerging Science and Technologies (PJEST), in collaboration with Govt. Islamia Graduate College Civil Lines Lahore, Pakistan, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


