PAN Localization Project - Phase I
In Phase I, we delivered two electronic resources and two commercial-grade applications, all free for non-commercial use:
- Sinhala corpus of 10 million words
- Sinhala lexicon with translations to Tamil and English
- Commercial-grade Sinhala Text-To-Speech system
- Commercial-grade Sinhala OCR software
Corpus & Corpus Analysis Tool
The aim was to build an electronic corpus for various language processing tasks. It currently contains a large amount of Sinhala electronic text in Unicode format, drawn from a wide range of sources. The corpus, containing 10,000,000 words, can be obtained for research purposes through a written request to LTRL. At a later stage it will be enhanced into a balanced corpus.
Since the existing corpus analysis tools did not support Sinhala Unicode properly, the need for a tool with full Sinhala support was apparent. The tool we deliver is a Java-based, platform-independent solution that supports virtually any Unicode text corpus. This tool is also available with the corpus.
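To illustrate the kind of analysis such a tool performs, the following is a minimal sketch, in Java, of Unicode-aware word-frequency counting over a corpus file. It is not the delivered tool: the command-line path argument, the UTF-8 assumption, and the whitespace tokenizer are simplifications, since real Sinhala tokenization must also handle punctuation and zero-width joiner sequences.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: count word frequencies in a UTF-8 corpus file.
// The delivered corpus analysis tool is considerably richer than this.
public class FrequencyCount {
    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.UTF_8);
        Map<String, Integer> freq = new TreeMap<>();
        // Split on whitespace; real tokenization also strips punctuation.
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        freq.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```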
Since the data collected used different proprietary font encodings, a tool was developed to convert them into Unicode. This tool, too, is available under downloads for anyone who wishes to use it.
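To make the conversion concrete, here is a minimal sketch of mapping a legacy font encoding to Unicode. The two table entries are illustrative placeholders, not entries from any real font-encoding table, and real converters must also reorder characters, since legacy fonts often place pre-base vowel signs before the consonant while Unicode stores them after.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of legacy-font-to-Unicode conversion.
// The table entries below are placeholders, not a real encoding table.
public class LegacyToUnicode {
    private final Map<Character, String> table = new HashMap<>();

    public LegacyToUnicode() {
        // Hypothetical examples: legacy code points mapped to Sinhala letters.
        table.put('a', "\u0D85"); // U+0D85 SINHALA LETTER AYANNA
        table.put('k', "\u0D9A"); // U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA
    }

    public String convert(String legacyText) {
        StringBuilder out = new StringBuilder();
        for (char c : legacyText.toCharArray()) {
            // Characters without a mapping pass through unchanged.
            out.append(table.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }
}
```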
The corpus grew into its current state through the following steps:
- Collating government documents
- Negotiating and collecting publisher content
- Collecting archived web content
- Computerizing non-electronic content from the above sources in Unicode
- Identifying frequent "font encodings" for non-Unicode content
- Building mappings for converting these to Unicode
- Converting all electronic content to Unicode
- Compiling the corpus
Lexicon
The lexicon contains more than 25,000 Sinhala words together with some grammatical features. The features currently identified are part of speech, number, and gender, but the set may be extended as requirements arise. In addition, the lexicon contains English and Tamil translations of the Sinhala words, providing a resource for language translation work. This resource, too, is available for download.
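One plausible shape for an entry in such a lexicon is sketched below; the field names and sample values in the comments are assumptions made for illustration, not the actual schema of the delivered resource.

```java
// One plausible shape for a lexicon entry. The field names are
// illustrative assumptions, not the delivered resource's schema.
public class LexiconEntry {
    public final String sinhala;      // headword in Sinhala script
    public final String partOfSpeech; // e.g. "noun", "verb"
    public final String number;      // e.g. "singular", "plural"
    public final String gender;      // e.g. "masculine", "feminine"
    public final String english;     // English translation
    public final String tamil;       // Tamil translation

    public LexiconEntry(String sinhala, String partOfSpeech, String number,
                        String gender, String english, String tamil) {
        this.sinhala = sinhala;
        this.partOfSpeech = partOfSpeech;
        this.number = number;
        this.gender = gender;
        this.english = english;
        this.tamil = tamil;
    }
}
```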
The process of creating the lexical resource included:
- Collecting dictionary data in printed formats
- Computerizing this content in Unicode
- Collecting dictionary data in electronic formats
- Converting all electronic content to Unicode
- Extracting information by parsing the data (see the sketch after this list)
- Correcting errors and typos
- Compiling the lexicon
- Building interface applications to the lexicon
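The parsing step referenced above might look like the following sketch, which reuses the LexiconEntry class from the earlier sketch and assumes a hypothetical tab-separated source format; the actual source formats varied from dictionary to dictionary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch of the parsing step, assuming a hypothetical tab-separated format:
// headword <TAB> part-of-speech <TAB> number <TAB> gender <TAB> English <TAB> Tamil
public class LexiconParser {
    public static List<LexiconEntry> parse(String path) throws IOException {
        List<LexiconEntry> entries = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length != 6) {
                    continue; // malformed lines are set aside for manual correction
                }
                entries.add(new LexiconEntry(fields[0], fields[1], fields[2],
                                             fields[3], fields[4], fields[5]));
            }
        }
        return entries;
    }
}
```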
Text To Speech (TTS) System
While some experimental Sinhala TTS systems were already under development at the UCSC, the aim of this project was to produce one of commercial quality. To this end, considerable effort was spent on the quality aspects of this activity. Apart from identifying the phonetic alphabet of the language, recording the relevant words and sentences into the database, and building a text analysis component, the project also produced a synthesis engine capable of natural-sounding Sinhala speech. This application is available for download.
The basic methodology adopted is the diphone concatenation approach to TTS; the following components and procedures were involved in developing it (a sketch of the core concatenation step follows the list).
- Text analysis component:
- Studying types of non-textual content and how to convert them to text
- Defining the text analysis interface
- Building the text analysis component
- Phonetic component:
- Studying the phonology and phonetics of Sinhala
- Identifying the phonetic vocabulary
- Constructing words and sentences covering the most common diphones for recording
- Defining phonetic processor components
- Building the diphone database
- Building the phonetic processor
- Integrating all components and producing the TTS system
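As referenced above, the core concatenation step can be sketched as follows: the phoneme sequence is padded with silence, each adjacent pair of phonemes names a diphone, and the recorded units are joined. The waveform representation and database interface are assumptions; the actual engine also smooths the joins and applies prosody.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of diphone concatenation. The sample format and database
// layout are assumptions; a real engine smooths joins and applies prosody.
public class DiphoneSynthesizer {
    // Maps a diphone name such as "a-k" to its recorded waveform samples.
    private final Map<String, short[]> diphoneDb = new HashMap<>();

    public DiphoneSynthesizer(Map<String, short[]> recordedDiphones) {
        diphoneDb.putAll(recordedDiphones);
    }

    public short[] synthesize(List<String> phonemes) {
        // Pad with silence ("_") so utterance edges map to edge diphones.
        List<String> padded = new ArrayList<>();
        padded.add("_");
        padded.addAll(phonemes);
        padded.add("_");

        // Each adjacent phoneme pair names one diphone unit in the database.
        List<short[]> units = new ArrayList<>();
        int total = 0;
        for (int i = 0; i + 1 < padded.size(); i++) {
            String name = padded.get(i) + "-" + padded.get(i + 1);
            short[] unit = diphoneDb.get(name);
            if (unit == null) {
                throw new IllegalStateException("Missing diphone: " + name);
            }
            units.add(unit);
            total += unit.length;
        }

        // Naive concatenation; real systems crossfade at the joins.
        short[] out = new short[total];
        int pos = 0;
        for (short[] unit : units) {
            System.arraycopy(unit, 0, out, pos, unit.length);
            pos += unit.length;
        }
        return out;
    }
}
```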
Optical Character Recognition System
Previous work at the UCSC on OCR had concentrated on developing a technique best suited to recognizing printed Sinhala characters. This component of the work focused on turning that research into a real product by making it robust to variations in font size, particularly the sizes found in material read by the majority of people, including newspaper print and government publications. Later it will be developed into font-independent OCR software.
You can download this software from the download section.
The methodology used to construct the OCR consisted of the following steps (a sketch of the preprocessing stage follows the list):
- Preprocessing activities:
- Document scanning and skew detection
- Noise detection and removal
- Extraction of text characteristics and individual characters
- Data collection:
- Identification of representative texts
- Separation of training, validation and testing sets
- Feature extraction and pattern matching
- Testing of competing algorithms
- Optimization of algorithms
- Application development
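The preprocessing sketch referenced above shows fixed-threshold binarization and projection-profile skew estimation; the threshold, angle range, and step size are illustrative assumptions, and the production system is considerably more robust.

```java
// Sketch of two preprocessing steps: fixed-threshold binarization and
// projection-profile skew estimation. The threshold and angle search
// range are illustrative assumptions, not the production settings.
public class OcrPreprocess {
    // Binarize a grayscale image (0-255): true marks an ink pixel.
    public static boolean[][] binarize(int[][] gray, int threshold) {
        boolean[][] ink = new boolean[gray.length][gray[0].length];
        for (int y = 0; y < gray.length; y++) {
            for (int x = 0; x < gray[0].length; x++) {
                ink[y][x] = gray[y][x] < threshold;
            }
        }
        return ink;
    }

    // Estimate skew by trying candidate angles and keeping the one whose
    // horizontal projection profile has the highest variance, i.e. the
    // angle at which text rows are sharpest.
    public static double estimateSkewDegrees(boolean[][] ink) {
        double bestAngle = 0, bestVariance = -1;
        int h = ink.length, w = ink[0].length;
        for (double deg = -5; deg <= 5; deg += 0.5) {
            double rad = Math.toRadians(deg);
            int[] profile = new int[h];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    if (!ink[y][x]) continue;
                    // Row index the pixel falls on after rotating by -deg.
                    int row = (int) Math.round(y * Math.cos(rad) - x * Math.sin(rad));
                    if (row >= 0 && row < h) profile[row]++;
                }
            }
            double mean = 0;
            for (int v : profile) mean += v;
            mean /= h;
            double variance = 0;
            for (int v : profile) variance += (v - mean) * (v - mean);
            if (variance > bestVariance) {
                bestVariance = variance;
                bestAngle = deg;
            }
        }
        return bestAngle;
    }
}
```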