Tokenization of Sindhi Text on Information Retrieval Tool


Irum Naz Sodhar
Akhtar Hussain Jalbani
Abdul Hafeez Buller
Anam Naz Sodhar

Abstract

Artificial intelligence (AI) is divided into subfields, one of which is natural language processing (NLP), which performs many tasks on written scripts. Tokenization is an important NLP task that breaks text into units. The tokenization process identifies the tokens in a text and counts them. This study focuses on tokenizing Sindhi sentences with a tool and retrieving information from the tool's output. The corpus consists of sentences taken from the Sindhi newspaper Awami. Information retrieval is based on the tool's response and helps users with simplification, satisfaction, filtering of text, and so on. Tokenization is considered a pre-processing task of NLP; it produces tokens and a token count based on the input text given to the information retrieval tokenization tool. One hundred forty words of Sindhi text and eight sentences were used to obtain the results. Future work will perform NLP tasks on Sindhi text using supervised, semi-supervised, and unsupervised machine learning.
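As a rough illustration of what tokenization with a token count can look like, the following is a minimal Python sketch. It is not the tool used in the study: the function names `tokenize` and `token_counts`, the punctuation set, and the sample sentence are all assumptions made for this example, and the sample sentence is not taken from the Awami corpus. The sketch simply separates punctuation from words and splits on whitespace.

```python
import re
from collections import Counter

# Assumed punctuation set: ASCII marks plus common Arabic-script marks
# (Arabic comma, Arabic semicolon, Arabic question mark, Arabic full stop).
PUNCT = ".,;:!?()[]{}\"'" + "\u060C\u061B\u061F\u06D4"
PUNCT_PATTERN = re.compile("([" + re.escape(PUNCT) + "])")

# Illustrative sentence only (hypothetical input, not from the study's corpus).
SAMPLE = "سنڌي ٻولي پاڪستان جي ٻولي آهي."


def tokenize(text):
    """Split text into word and punctuation tokens."""
    # Put spaces around each punctuation mark so it becomes its own token,
    # then split on any run of whitespace.
    spaced = PUNCT_PATTERN.sub(r" \1 ", text)
    return spaced.split()


def token_counts(text):
    """Return the token list, the total token count, and per-token frequencies."""
    tokens = tokenize(text)
    return tokens, len(tokens), Counter(tokens)


if __name__ == "__main__":
    tokens, total, freq = token_counts(SAMPLE)
    print(tokens)
    print("token count:", total)
    print(freq.most_common(3))
```

Whitespace splitting is a deliberate simplification here; Sindhi text in which word boundaries are not marked by spaces would need a dedicated word segmentation step.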

Article Details

How to Cite
Sodhar, I. N., Jalbani, A. H., Buller, A. H., & Sodhar, A. N. (2021). Tokenization of Sindhi Text on Information Retrieval Tool. Pakistan Journal of Emerging Science and Technologies (PJEST), 1(1), 1–7. https://doi.org/10.58619/pjest.v1i1.130

