Accepted Papers

Abstracts:

  • Md. Sadiqul Islam. HPSG Analysis of Arabic Verb Form Derivation
    Semitic languages exhibit rich nonconcatenative morphological operations, which can generate a myriad of derivational lexemes. In particular, the feature-rich, root-driven morphology of Arabic yields different triliteral verb stems, which are called verb forms. Although HPSG is a successful syntactic theory, it lacks a representation of complex nonconcatenative morphology. In this paper, I propose a novel HPSG representation of Arabic verb forms. I also present the lexical type hierarchy and derivational rules for generating these verb forms using the HPSG framework.

  • Rida Hijab Basit and Sarmad Hussain. Text Processing For Urdu TTS System
    Natural Language Processing (NLP) plays an important role in any Text to Speech (TTS) system. The raw text given as input to a TTS system may consist of numbers, dates, times, acronyms or symbols. NLP processes the raw text and converts it into a form that the TTS system can use to generate the corresponding speech. NLP consists of three parts: "Text Processing", "Text Annotation" and "Phonological Annotation". This paper enhances earlier work, details the text processing in NLP from the perspective of Urdu, and reports the results given by NLP.

  • Tania Habib. Spoken Dialog System: Direction Guide for Lahore City
    A Direction Guide Spoken Dialog System is a system that asks an Urdu-speaking user for the current and destination locations over a telephone and then gives the user directions from the user’s present location to the destination. A prototype end-to-end system has been developed as a distributed, hub-and-spoke, message-based system using the open-source GALAXY Communicator. The end-to-end system uses a telephony framework and a software-based infrastructure that comprises an Automatic Speech Recognizer, a Text-to-Speech Synthesizer, the RavenClaw Dialog Manager, a backend database and an Interaction Manager. Currently the end-to-end system works for a single session; future work includes handling multiple sessions.

  • Tania Habib. Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language
    This paper describes the development of an HMM-based speech synthesizer for Urdu using the HTS-toolkit. It describes the modifications needed to port the original HTS-Demo-scripts, which are currently available for English, Japanese and Portuguese, to Urdu. These include the generation of the full-context style labels and the creation of the Question file for the Urdu phone set. The development and structure of the utilities built for this purpose are discussed. In addition, a list of 200 high-frequency Urdu words is selected using a greedy search algorithm. Finally, these synthesized words are evaluated using naturalness and intelligibility scores.

  • Yuichiro Kobayashi. Computer-aided Error Analysis of L2 Spoken English: A Data Mining Approach
    Understanding learners’ errors is significant for language teachers, researchers, and learners. Computer learner corpora enable us to carry out computer-aided error analysis, which, compared to traditional error analysis, has an advantage in storing and processing enormous amounts of information about various aspects of learner language. The present study aims to explore the error patterns across proficiency levels in second language spoken English with data mining techniques. It also attempts to identify error types that can be used to discriminate between English learners at different proficiency levels. Spoken data for the present study were sourced from the NICT JLE Corpus, a computerized learner corpus annotated with 46 different error tags. The results of the present study indicate that there is a substantial difference in the frequencies of five types of errors, namely (a) article errors, (b) lexical verb errors, (c) normal lexical preposition errors, (d) noun number errors, and (e) tense errors, between lower- and upper-level learners. The findings will be useful for L2 learner profiling research and for the development of automated speech scoring systems.

  • Saad Irtza, Khawer Rehman and Sarmad Hussain. Urdu Keyword Spotting System using HMM
    This paper reports the development of an Urdu keyword spotting (KWS) system. The approach is based on filler models that account for non-keyword speech intervals. The impact of using different training datasets to develop the filler models has been explored. In addition, a phoneme recognizer (PR) based on an all-phone-model automatic speech recognition (ASR) system has been developed on keywords. Training and decoding parameters of the KWS system have been tuned for optimum performance. Finally, the KWS and PR systems are integrated, and a string matching algorithm is used to improve the performance of the Urdu keyword spotter. The overall system accuracy is 94.59% on the data set used.

  • Saba Urooj, Sana Shams, Sarmad Hussain, Farah Adeeba. Sense Tagged CLE Urdu Digest Corpus
    This paper presents the construction of an Urdu Sense Tagged corpus using four main lexical resources: an Urdu wordlist consisting of 5000 high-frequency content words, a 100K-word corpus annotated with part of speech (POS) tags, an Urdu WordNet with approximately 5058 senses, and an Urdu morphological analyzer. The paper also briefly presents the Urdu Word-Sense Annotation tool, a software tool developed to provide an easy interface for sense tagging, ensure tagging consistency and speed up annotation. In this version of the Urdu Sense Tagged corpus, 17,006 words have been sense tagged with 2285 unique senses. The final section discusses the linguistic and tool-specific challenges in the construction of a sense tagged corpus and describes future work in this context.

  • Mohammad Raees and Sehat Ullah. Alphabet Signs Recognition using Pixels-based Analysis
    Sign language is the language of gestures and postures used for non-verbal communication. This paper presents a novel vision-based approach for the detection of isolated signs of Pakistan Sign Language (PSL). The signs representing the alphabet of Urdu (the national language of Pakistan) are recognized by distinguishing fingers. The algorithm, following a model of seven phases, identifies each of the five fingers from their respective positions. After the fingers are recognized, signs are deduced from whether each finger is raised or down. For quick recognition, signs are categorized into three groups based on the thumb position. Five testers evaluated the system using a simple low-cost USB camera in a semi-controlled environment. The results obtained are encouraging, as the accuracy of the system exceeds 85.4%.

  • Neelam Mukhtar and Mohammad Abid Khan. Aspect based Opinion Mining, a review
    Opinion Mining (OM) is concerned with the retrieval or extraction of information and the discovery of knowledge from text collected through people's opinions, e.g. about different products and services, using Natural Language Processing and Data Mining techniques. Researchers are focusing on different areas of opinion mining or sentiment analysis (both terms will be used interchangeably in this review). One very popular area nowadays is aspect based opinion summarization. Researchers have introduced new methods and techniques or used already existing ones to achieve this goal. This paper provides a survey of such attempts, highlighting the use of different techniques including Natural Language Processing (NLP) techniques, feature discovery mining techniques for aspect/feature identification, learning and lexicon based methods for sentiment prediction, and the different kinds of summaries that are commonly used to generate an aspect based summary. A framework containing multiple approaches for opinion summarization is presented. Three common steps (i.e. aspect/feature identification, sentiment prediction and aspect/feature based summary) that are normally taken for aspect based opinion summarization are discussed in detail. An ontology is designed that provides a quick overview of all these phases. The paper concludes by highlighting the current limitations faced by researchers in this discipline, thus providing indications for future research.

  • Hala Abdelghany. Prosodic Phrasing and the Parsing of Modifier Attachment Ambiguity in Deep and Shallow Orthography
    This paper presents research that investigates the effect of prosodic phrasing on syntactic parsing. The focus is on the ambiguity of a modifier (relative clause or adjective phrase) in relation to the two nouns in a complex noun phrase in Arabic. Ambiguity resolution tendencies for this construction differ across languages. These effects have been shown to occur even in silent reading, so the suggestion is that the parser projects onto a text a default prosodic phrasing which then influences the final syntactic parse and semantic analysis of the sentence. The structure of Arabic permits use of a method for tapping into implicit prosodic boundaries. Liaison, a phonological process occurring across word boundaries, is sensitive to patterns of prosodic chunking in Arabic. These liaison phenomena make the phonological phrasing of Arabic sentences easy to detect in listening. They are also indicated by diacritics in the ‘vowelized’ version of Arabic orthography, which simulates the overt prosody. Implications for the use of this method in text/speech syntactic annotation, tree banking, and semantic analysis are discussed, as well as implications for building an interface ontology that characterizes the principled interaction of a prosodic and syntactic derivation in sentence parsing.

  • Afsheen, Saad Irtza, Mahwish Farooq and Sarmad Hussain. Accent Recognition of Punjabi, Urdu, Sindhi, Seraiki and Pashto languages
    Automatic Speech Recognition (ASR) is a key component in Human Computer Interaction (HCI) applications. The stability of ASR systems largely depends on the accent, gender and age of speakers, background noise and channel variations. In this paper, a study has been conducted to explore the differences between five different accents of Pakistan, i.e. Punjabi, Urdu, Sindhi, Seraiki and Pashto. Speech data has been collected from native speakers of these accents. The five accents have been analyzed using mel frequency cepstral coefficients (MFCCs) and feature formants of all the vowels in the vocabulary.

  • Benazir Mumtaz, Amen Hussain, Sarmad Hussain, Afia Mahmood, Rashida Bhatti, Mahwish Farooq, Sahar Rauf. Multitier Annotation of Urdu Speech Corpus
    This paper describes the multi-level annotation process of an Urdu speech corpus and its quality assessment using PRAAT. The speech corpus has been annotated at the phoneme, word, syllable and break index levels. Phoneme, word and break index level annotation has been done manually by trained linguists, whereas syllable-tier annotation has been done automatically using a template matching algorithm. On average, the accuracy achieved at the phoneme and break-index tiers is 79% and 89% respectively. The quality assessment of the word and syllable tiers is still under investigation.

  • Samabia Tehsin, Asif Masood and Sumaira Kausar. A Gestalt’s Theory Based Fuzzy Text Tracking Method
    Video text can play a vital role in content-based video indexing, retrieval and analysis. Text tracking can speed up the text extraction process for videos. In this paper, a text tracking methodology is proposed for efficient multimedia indexing and retrieval. The method applies fuzzy-logic-based text tracking to videos, using a multiple-input, single-output rule-based fuzzy inference engine. Experiments on a diverse dataset indicate that the proposed method is effective for text tracking in videos.

  • Qurat-Ul-Ain Akram, Sarmad Hussain, Farah Adeeba, Shafiq-Ur-Rehman and Mehreen Saeed. Framework of Urdu Nastalique Optical Character Recognition System
    The development of Urdu Nastalique OCR is a challenging task due to the cursive nature of Urdu, the complexities of the Nastalique writing style and the layouts of Urdu document images. In this paper, the framework of an Urdu Nastalique OCR system is presented. The presented system supports the recognition of Urdu Nastalique document images with font sizes between 14 and 44. The system achieves 86.15% ligature recognition accuracy when tested on 224 document images.

  • Nabil Khoufi, Chafik Aloulou and Lamia Hadrich Belguith. Arabic Text Chunking Using a CRF Model
    Chunking, or shallow syntactic parsing, is proving to be a task of interest to many natural language processing applications. The task is harder for Arabic because its specific features make it quite different from, and even more ambiguous than, other natural languages when processed. In this paper, we present a method for chunking Arabic texts based on supervised learning. We use the Conditional Random Fields algorithm and the Penn Arabic Treebank to train the model. For the experiments, we use more than 10,100 sentences as training data and 2,524 sentences for testing. The evaluation of the method consists of calculating the accuracy of the generated model, and the results are very encouraging.

  • Wajiha Habib, Rida Hijab Basit, Sarmad Hussain and Farah Adeeba. Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm
    Unit selection speech synthesis is one of the most widely used techniques for high quality text to speech (TTS) systems. A unit selection TTS system requires a large database of recorded and annotated speech which contains both phonetic and prosodic variations. Designing phonetically rich and balanced speech corpora with a minimum number of utterances is an intricate task. Several optimization methods are used for this purpose, and the greedy algorithm is one of them. This paper introduces a greedy algorithm which maximizes the coverage of high-frequency unigrams, bigrams and trigrams while selecting a minimal number of sentences from the input corpus. The algorithm has been applied to different corpora collected from different domains, and a speech corpus for an Urdu TTS system is designed. Significant coverage of tri-phones has also been achieved.

  • Ayesha Zafar, Afia Mahmood, Sana Shams and Sarmad Hussain. Structural Analysis of Linking Urdu WordNet to PWN 2.1
    Multiple cross language WordNets, such as Euro WordNet (EWN), Multi WordNet, Asian WordNet and Indo WordNet, have been developed that involve mapping Princeton WordNet (PWN) to the respective language WordNet [1,2,3,4,5]. The majority of these projects have employed the transfer-and-merge method developed during the construction of Euro WordNet for WordNet linkage. This paper discusses the process, challenges and results of linking the CLE Urdu WordNet to the Princeton WordNet Version 2.1 from a linguistic and lexicographic perspective. Based on the synset alignment experience, cross language (Urdu – English) linkage issues are highlighted, followed by a contextual strategy for their resolution. Urdu language concepts that could not be aligned with PWN 2.1 are also highlighted and discussed.

  • Muhammad Junaid and Kamran Ghani. An optimized Pashto Keyboard Layout based on Character Frequency
    Pashto is spoken by an estimated 40 to 50 million people in the world and has a very rich culture and literary history. However, on the computing side, it is not on par with other popular regional languages, and many areas remain to be explored. In this paper, we propose an optimized Pashto keyboard layout. Though many Pashto keyboard layouts are already in common use, no standard keyboard layout is available. Furthermore, none of these has been developed on a sound scientific basis. The proposed work is for Afghani Pashto and is based on similar work done for other languages. To evaluate the proposed layout, along with other factors, a scoring system is also suggested. Analysis shows that the proposed layout is superior to the existing layouts in many respects.

  • Tafseer Ahmed and Naila Ata. What's in a name? Automatic extraction of lexical and functional units of Pakistani names
    The paper describes a two-pass POS-tagging system for the extraction of the first name and surname from a Pakistani (full) name string. The full name in Pakistan does not follow a single fixed pattern. The order of its components is flexible, and the simple pattern of first-name middle-name last-name is not applicable. There are many peculiarities, e.g. in the absence of a family name, the middle name serves as the surname. To extract the first name and surname, two sets of POS tags are designed. The first tagset consists of personal-name, family-name, religious-middle-name, particle and title. The second tagset consists of first-name, surname, title and middle-name. The output of the first POS tagging subsystem is fed to the second subsystem. The evaluation gives 90+% accuracy using POS tagging.

  • Farah Adeeba, Qurat-Ul-Ain Akram, Hina Khalid and Sarmad Hussain. Urdu Books N-grams
    The paper presents the development of the first publicly available Urdu N-grams extracted from different books. For a better representation of N-grams, a large Urdu corpus is collected from books covering different domains. The automatic cleaning of the 37 million Urdu books corpus is discussed. Domain-wise N-grams are extracted, which can be used in different Natural Language Processing and Information Retrieval applications.

  • Tafseer Ahmed. Extracting Arguments and Collocations for Urdu Complex Predicates
    The paper presents the automated extraction of arguments and collocations for Noun+Verb (N+V), e.g. safAI kar- ‘cleaning.noun do’, and Adjective+Verb (A+V), e.g. sAf kar- ‘clean.adj do’, complex predicates (cp) of Urdu. An automatically POS-tagged corpus of 97 million words was processed, and the pseudo-relations of nouns and complex predicates were extracted by a devised algorithm (without using deep parsing or chunking). The words of the pseudo-relations are processed to suggest the collocations for each complex predicate. For a given cp, the commonly used words in subject, object, genitive modifier of N+V, non-canonical second argument (NCSA) and V+V light verbs are extracted, if the argument exists for (or is relevant to) that cp. In the absence of a large, freely available Urdu treebank, the paper describes an alternative method to get argument structures and collocations of complex predicates. The pseudo-relation extractor can also be used in information extraction tasks.
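
Illustrative sketch (not taken from any of the papers): the abstract "Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm" above describes selecting a minimal number of sentences that cover the high-frequency unigrams, bigrams and trigrams of an input corpus. The Python below shows one generic greedy set-cover formulation of that idea. The function names, the `min_freq` threshold, the use of whitespace-separated tokens as units, and counting frequency per sentence are all assumptions made for illustration; the authors' actual units, weighting and stopping criteria are not given in the abstract.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return the set of all 1..n_max-grams (as tuples) in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def greedy_select(sentences, min_freq=2, n_max=3):
    """Greedily pick sentences until every 'high-frequency' n-gram is covered.

    Assumption: an n-gram is high-frequency if it occurs in at least
    `min_freq` sentences of the input corpus.
    """
    grams_per_sentence = [ngrams(s.split(), n_max) for s in sentences]

    # Number of sentences in which each n-gram occurs.
    freq = Counter(g for grams in grams_per_sentence for g in grams)
    uncovered = {g for g, count in freq.items() if count >= min_freq}

    selected = []
    while uncovered:
        # Choose the sentence that covers the most still-uncovered n-grams
        # (ties go to the earliest sentence).
        best = max(
            range(len(sentences)),
            key=lambda i: len(grams_per_sentence[i] & uncovered),
        )
        gain = grams_per_sentence[best] & uncovered
        if not gain:
            break
        selected.append(sentences[best])
        uncovered -= gain
    return selected


if __name__ == "__main__":
    corpus = ["a b c", "a b", "c d", "x y"]
    print(greedy_select(corpus))
```

Greedy selection of this kind does not guarantee the smallest possible sentence set, but it is a common and inexpensive approximation for coverage problems.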