CLT14 Shared Task is the first in a series concerning the research
and development of human language technology for
Pakistani languages. The goal of this series is to provide common
and consistent task definitions, datasets and evaluation,
and most importantly to trigger research in this important area. This
Shared Task focuses on converting Free Roman Urdu to standard Urdu
script (the modified Arabic script used for Urdu). Free Roman Urdu
is Urdu written using the Roman/Latin alphabet. It
is extensively used in SMS, chat, emails, on social networking sites
such as Facebook and Twitter, and on discussion forums. It
does not follow any standard transliteration scheme.
An example of Roman Urdu is:
“ye roman urdu hay”. It will be converted to “یہ رومن اردو ہے ”.
Participants are invited to develop a Free Roman Urdu transliteration
system, generate transliteration output corresponding to the
test data, and submit the output and a short paper to the organizers.
Sample data of approximately 5,000 words will be provided to show the
scope and variety of the test data. However, because of its small size,
it is not recommended to use the sample as training data.
Hyperlinks to some websites containing Free Roman Urdu are given in the
linked text file. Developers can use these and any other resources to
understand the scope and variety of Free Roman Urdu. However, the list
is not exhaustive.
A third-party transliteration system may not be used within the developed
system. However, other resources and libraries may be used for
intermediate processing. The review of submitted papers will
assess the main innovation in the developed system.
Test data consisting of sentences totalling approximately 5,000
words will be provided to registered participants. Participants
will generate the output file (in the format defined in the sample data
files) and return it to the organizers.
The evaluation metric is: No. of matched words / Total no. of words
The evaluation system will have a list of alternate Urdu script spellings,
and all of those alternate spellings will be accepted. Moreover, a
team of language experts will inspect all transliterations that are
tagged as wrong by the evaluation software. If any tagged-as-wrong
transliteration is actually a legitimate alternate spelling (one that was
not present in the alternate spelling list), it will be re-tagged as correct.
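The metric and the alternate-spelling rule can be illustrated with a minimal sketch. This is not the organizers' evaluation software; the function name word_accuracy, the alternates dictionary and the assumption that system and reference words are already aligned one-to-one are all hypothetical choices made for illustration.

```python
def word_accuracy(system_words, reference_words, alternates=None):
    """Return (number of matched words) / (total number of words).

    alternates is a hypothetical dict mapping a reference word to a
    collection of other acceptable Urdu-script spellings.
    Assumes both word lists are aligned position by position.
    """
    if len(system_words) != len(reference_words):
        raise ValueError("system and reference must have the same number of words")
    if not reference_words:
        return 0.0
    alternates = alternates or {}
    matched = 0
    for produced, expected in zip(system_words, reference_words):
        # A word counts as matched if it equals the reference spelling
        # or any listed alternate spelling of it.
        accepted = {expected} | set(alternates.get(expected, ()))
        if produced in accepted:
            matched += 1
    return matched / len(reference_words)


# Usage with the example sentence; the last word is left untransliterated.
reference = ["یہ", "رومن", "اردو", "ہے"]
system = ["یہ", "رومن", "اردو", "hay"]
print(word_accuracy(system, reference))  # 3 of 4 words match -> 0.75
```

The manual re-tagging by language experts described above has no counterpart in this sketch; it would correspond to adding newly approved spellings to alternates and scoring again.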
Selected systems will be invited for demonstration at the conference.
Before the public demonstration, each system will be demonstrated
to a panel of software engineers.
A prize of 15,000 PKR* will be awarded to the best transliteration
system. Systems will be evaluated for precision as well as innovation.
The amount of the prize can be increased if sponsorship is obtained.
Input-Output Format & Sample Data:
Sample input and output files are: input.txt, output.txt
The input file has one sentence or text chunk in a line.
The output file has one word (or a word containing a space) per line.
Words are listed in pairs of lines: the first line has the word in Roman
script and the next line has the word in Urdu script.
A # is used in the output file to mark the end of each line of text
from the input file.
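As a rough illustration of this layout, the sketch below reads output-file text into one list of (Roman, Urdu) pairs per input line. It assumes that the # marker sits on a line of its own and that blank lines can be ignored; neither detail is confirmed by the description above, so the sample files should be checked first. The name parse_output is hypothetical.

```python
def parse_output(text):
    """Parse output-file text into a list of input lines, each a list of
    (roman_word, urdu_word) pairs.

    Assumptions (not confirmed by the task description): the '#'
    end-of-line marker is alone on its line, and blank lines are ignored.
    """
    sentences = []
    current = []
    pending_roman = None  # Roman word waiting for its Urdu counterpart
    for raw in text.splitlines():
        line = raw.strip()  # keeps spaces inside a multi-part word
        if not line:
            continue
        if line == "#":
            if pending_roman is not None:
                raise ValueError("Roman word without an Urdu line: " + pending_roman)
            sentences.append(current)
            current = []
        elif pending_roman is None:
            pending_roman = line
        else:
            current.append((pending_roman, line))
            pending_roman = None
    if pending_roman is not None:
        raise ValueError("Roman word without an Urdu line: " + pending_roman)
    if current:
        # Tolerate a final input line that lacks a closing '#'.
        sentences.append(current)
    return sentences


# Usage with the example sentence written in the assumed layout.
sample = "ye\nیہ\nroman\nرومن\nurdu\nاردو\nhay\nہے\n#\n"
for pairs in parse_output(sample):
    print(pairs)
```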
The sample data (Roman Urdu along with its Urdu script transcription) of
approximately 5,000 words is present in sample.txt. However,
because of its small size,
it is not recommended to use this sample as training data.
Release of Sample Data: 7th April,
Registration Deadline: 22nd July, 2014*
Release of Test Data (to registered participants): 2nd August, 2014*
Submission of 3 Page Short Paper (without results): 3rd August, 2014*
Submission of Output: 10th August, 2014*
Notification of Acceptance: 27th August, 2014*
Camera Ready Paper Due: 14th September, 2014*
* indicates that the date has been changed.
Shared Task Subcommittee:
For any query, contact email@example.com.
Kamran, Charles University, Prague
Mustafa, Centre of Language Engineering,
Muhammad Humayoun, COMSATS Institute of Information Technology, Lahore
University of Science and Technology, Rawalpindi
DHA Suffa University, Karachi