Tokenization of Sindhi Text on Information Retrieval Tool

Irum Naz Sodhar; Akhtar Hussain Jalbani; Abdul Hafeez Buller; Anam Naz Sodhar

doi:10.58619/pjest.v1i1.130

Published: Apr 15, 2021

Keywords: Artificial Intelligence; Natural Language Processing; Sindhi; Tokenization; Information Retrieval tool

Irum Naz Sodhar

Department of Information Technology, Shaheed Benazir Bhutto University, Sindh-Pakistan

Akhtar Hussain Jalbani

Department of Information Technology, Quaid-e-Awam University of Engineering Science & Technology, Nawabshah, Sindh-Pakistan

Abdul Hafeez Buller

Engineering Section, Quaid-e-Awam University of Engineering Science & Technology, Nawabshah, Sindh-Pakistan

Anam Naz Sodhar

Quaid-e-Awam University of Engineering Science & Technology, Nawabshah, Sindh-Pakistan

Abstract

Artificial intelligence (AI) is divided into subfields, and natural language processing (NLP) is one of them, covering many tasks performed on scripts. Tokenization is an important NLP task that breaks text into tokens; the process identifies the tokens in a text and counts them. This study focuses on tokenizing Sindhi sentences with a tool and on retrieving information from the tool's output. The corpus consists of sentences taken from the Sindhi-language Awami newspaper. Information retrieval is based on the tool's response and also helps users with simplification, satisfaction, filtration of text, and so on. Tokenization is considered a pre-processing task of NLP: given an input text, the information retrieval tokenization tool produces the tokens together with a token count. One hundred forty words of Sindhi text in eight sentences were used to obtain the results. In future work, NLP tasks will be performed on Sindhi text using supervised, semi-supervised, and unsupervised machine learning.
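As a rough illustration of the pre-processing step the abstract describes (input text in, tokens and a token count out), the following minimal Python sketch splits text on whitespace and treats each punctuation mark as its own token. It is not the tool used in the study: the function name tokenize, the punctuation set, and the sample sentence are assumptions made for this example, and the sentence is not taken from the study's corpus.

```python
import re

# Assumption: a small set of Arabic-script and ASCII punctuation marks.
# A real Sindhi tokenizer would likely need a fuller, language-specific set.
PUNCTUATION = "،؛؟۔.,!?:;()\"'"

# Match either one punctuation mark, or a run of characters that are
# neither whitespace nor punctuation (i.e. a word).
_PUNCT_CLASS = re.escape(PUNCTUATION)
_TOKEN_PATTERN = re.compile("[" + _PUNCT_CLASS + "]|[^\\s" + _PUNCT_CLASS + "]+")


def tokenize(text: str) -> list[str]:
    """Return the tokens of text: words and individual punctuation marks."""
    return _TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical sample sentence, used only to demonstrate the output shape.
    sentence = "سنڌي ٻولي تمام پراڻي آهي."
    tokens = tokenize(sentence)
    for index, token in enumerate(tokens, start=1):
        print(index, token)
    print("Token count:", len(tokens))
```

Whitespace splitting is only a first approximation for Sindhi, where word boundaries do not always coincide with spaces, so the counts from a sketch like this may differ from those reported by a dedicated tool.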

How to Cite

Sodhar, I. N., Jalbani, A. H., Buller, A. H., & Sodhar, A. N. (2021). Tokenization of Sindhi Text on Information Retrieval Tool. Pakistan Journal of Emerging Science and Technologies (PJEST), 1(1), 1–7. https://doi.org/10.58619/pjest.v1i1.130

Issue

Vol. 1 No. 1 (2020)

Section

Articles

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Pakistan Journal of Emerging Science and Technologies (PJEST), in collaboration with Govt. Islamia Graduate College Civil Lines Lahore, Pakistan, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

