Advancing NLP: Transliterating Urdu Poetry to Roman Urdu with BiLSTM
Hello, NLP enthusiasts! I'm thrilled to
share my recent journey in building a Neural Machine Translation (NMT) model
for a unique challenge: transliterating Urdu poetry into Roman Urdu. This
project wasn’t just about coding a standard encoder-decoder model—it was an
exploration of low-resource, poetic text using a classic BiLSTM architecture.
Let’s dive into the details!
The Goal: Capturing the Essence of Urdu Poetry
The objective was to develop a
sequence-to-sequence model using a 2-layer bidirectional LSTM (BiLSTM) encoder
and a 4-layer LSTM decoder to transliterate Urdu Ghazals from the urdu_ghazals_rekhta
dataset into Roman Urdu. Urdu, a morphologically rich language written in
Arabic script, presents unique challenges, especially in its poetic form, where
figurative language and cultural nuances abound. The task was to map Urdu text
to its Roman Urdu equivalent, preserving phonetic accuracy while navigating
variations in Romanization.
Why is this challenging? Transliteration requires capturing the sound of Urdu words in Latin
script, handling inconsistencies in spelling (e.g., "khuda" vs.
"khudaa") and the poetic complexity of Ghazals, which often defy
standard linguistic patterns.
The Dataset: urdu_ghazals_rekhta
The urdu_ghazals_rekhta dataset formed
the backbone of this project. Organized by poet (e.g., poet1, poet2), it
includes subfolders with Urdu (ur), Roman Urdu (en), and Hindi (hi) versions of
Ghazals. The aligned Urdu and Roman Urdu files provided parallel sentence
pairs, eliminating the need for rule-based conversion—a common hurdle in
transliteration tasks.
Example Pair:
- Urdu (Source): میں اس کا ہوں جس کا نام ہے خدا
- Roman Urdu (Target): main us kaa hoon jis kaa naam hai khuda
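Because the dataset already ships aligned files, collecting parallel pairs reduces to walking the folder tree. The sketch below shows one way to do this; the exact folder and file names (`ur`/`en` subfolders, matching filenames per ghazal) follow the layout described above, and the function name is my own, not from the project code:

```python
from pathlib import Path

def load_parallel_pairs(root: str):
    """Collect aligned Urdu / Roman Urdu pairs from a Rekhta-style layout.

    Assumes root/<poet>/ur/<file> and root/<poet>/en/<file> are
    line-aligned text files with the same filenames.
    """
    pairs = []
    for poet_dir in sorted(Path(root).iterdir()):
        ur_dir, en_dir = poet_dir / "ur", poet_dir / "en"
        if not (ur_dir.is_dir() and en_dir.is_dir()):
            continue
        for ur_file in sorted(ur_dir.iterdir()):
            en_file = en_dir / ur_file.name
            if not en_file.exists():
                continue
            ur_lines = ur_file.read_text(encoding="utf-8").splitlines()
            en_lines = en_file.read_text(encoding="utf-8").splitlines()
            # Keep only non-empty lines that align one-to-one.
            for ur, en in zip(ur_lines, en_lines):
                if ur.strip() and en.strip():
                    pairs.append((ur.strip(), en.strip()))
    return pairs
```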
Step 1: Preprocessing – Setting the Stage
Clean, standardized data is critical for
effective model training. Here’s how I prepared the dataset:
- Cleaning:
Removed extra whitespace and ensured consistency. While the dataset was
relatively clean, I noted that broader character normalization (e.g.,
handling Urdu letter variations or punctuation) could enhance robustness
for messier datasets.
- Tokenization: Used SentencePiece with Byte-Pair Encoding
(BPE) to tokenize Urdu and Roman Urdu into subword units. This approach
reduces vocabulary size and handles out-of-vocabulary words effectively.
For example:
- Urdu: کتاب (book) → ['▁کت', 'اب']
- Roman Urdu: kitab → ['▁kit', 'ab']
The ▁ marker denotes the start of a word, aiding the model in learning word boundaries.
Step 2: Model Architecture – BiLSTM Encoder-Decoder
The core of the project was a
sequence-to-sequence model with a specific configuration mandated by the
coursework:
- Encoder: A
2-layer BiLSTM, capturing bidirectional context (past and future) for each
token in the Urdu input, producing a rich context vector.
- Decoder: A
4-layer LSTM, generating Roman Urdu tokens sequentially, conditioned on
the encoder’s context and its own prior outputs.
This architecture, while not as advanced
as modern Transformers, was ideal for exploring the limits of classical RNNs on
a low-resource, poetic task.
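In PyTorch, this encoder-decoder pairing can be sketched as below. Note that the dimensions and the way encoder states seed the decoder are my assumptions for illustration, not the project's exact code; conveniently, a 2-layer bidirectional encoder yields exactly 4 stacked hidden states (2 layers × 2 directions), which can serve as the 4-layer decoder's initial state:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """2-layer BiLSTM encoder + 4-layer LSTM decoder (minimal sketch;
    emb_dim/hidden defaults and the state bridge are assumptions)."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=4,
                               batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        _, (h, c) = self.encoder(self.src_emb(src))
        # Encoder states: (2 layers * 2 directions, B, H) = (4, B, H),
        # which matches the decoder's 4 layers exactly.
        dec_out, _ = self.decoder(self.tgt_emb(tgt),
                                  (h.contiguous(), c.contiguous()))
        return self.out(dec_out)  # (B, T, tgt_vocab) logits
```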
Step 3: Training & Experimentation – Optimizing Performance
Training involved splitting the dataset
(50% train, 25% validation, 25% test) and optimizing with Cross-Entropy Loss
and the Adam optimizer in PyTorch. The real excitement came from hyperparameter
tuning, where I explored:
- Embedding Dimension: 128, 256, 512
- Hidden Size: 256, 512
- Dropout Rate: 0.1, 0.3, 0.5
- Learning Rate: 1e-3, 5e-4, 1e-4
- Batch Size: 32, 64, 128
Insights: A lower learning rate (5e-4) often stabilized validation loss, while
higher rates (1e-3) risked instability. Lower dropout (0.1) accelerated
training but increased overfitting risk. These experiments revealed the
delicate balance of hyperparameters in optimizing model performance.
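The training step itself is standard teacher-forced sequence-to-sequence training. A minimal sketch, assuming a padded batch loader and a `PAD_ID` placeholder (both illustrative, not the project's actual names):

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding token id

def train_epoch(model, loader, optimizer, device="cpu"):
    # Ignore padding positions so they don't contribute to the loss.
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
    model.train()
    total = 0.0
    for src, tgt in loader:
        src, tgt = src.to(device), tgt.to(device)
        # Teacher forcing: feed tgt[:, :-1], predict tgt[:, 1:].
        logits = model(src, tgt[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)
```

With `torch.optim.Adam(model.parameters(), lr=5e-4)` this reproduces the most stable configuration from the experiments above.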
Step 4: Evaluation – Measuring Success
Evaluating transliteration required
metrics tailored to the task:
- BLEU Score: Assessed n-gram overlap between predicted
and reference Roman Urdu, adapted for transliteration quality.
- Perplexity: Measured the model’s ability to predict the
next token, with lower values indicating better generalization.
- Character Error Rate (CER): Calculated edit distance (insertions,
deletions, substitutions) between predicted and reference strings,
normalized by reference length, emphasizing character-level accuracy.
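CER is straightforward to compute from the Levenshtein edit distance. A self-contained sketch (the function name is mine; libraries like `jiwer` offer the same metric ready-made):

```python
def cer(pred: str, ref: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(pred), len(ref)
    # prev[j] = edit distance between pred[:i-1] and ref[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(n, 1)

# 3 edits (one inserted 'a', one 'u'->'o' substitution, one inserted 'o')
# over a 16-character reference:
print(cer("main us ka hun", "main us kaa hoon"))  # 0.1875
```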
Step 5: Qualitative Examples – The Model in Action
Quantitative metrics tell part of the
story, but qualitative outputs bring the model to life. Here are sample
predictions:
| Urdu Input | Ground Truth | Model Prediction |
| --- | --- | --- |
| کیا حال ہے؟ | kya haal hai? | kya haal hai? |
| السلام علیکم | assalam o alaikum | assalam o alaikum |
| میں اس کا ہوں جس کا نام ہے خدا | main us kaa hoon jis kaa naam hai khuda | main us ka hun jis ka naam hai khuda |
Note: Output
quality varied based on model configuration and training runs, highlighting
areas for further refinement.
Step 6: Deployment – Bringing It to Life
To make the model accessible, I deployed
it using Streamlit, creating an interactive web app where users can input Urdu
text and view real-time Roman Urdu transliterations. Try it here or explore the
code on GitHub. This deployment bridges the gap between research and practical
application.
Conclusion: Lessons from Low-Resource NLP
This BiLSTM-based NMT model for Urdu
Ghazals was a deep dive into low-resource NLP challenges. While Transformers
often dominate modern NLP, this project underscored the value of classical
architectures in understanding preprocessing, tokenization, and hyperparameter
tuning. Working with poetic text revealed unique complexities, from alignment
issues to phonetic variations, making this a rewarding exploration of
foundational NLP.
What are your experiences with
low-resource languages or poetic text in NLP? Share your thoughts below—I’d
love to connect and discuss!
Link: Urdu Roman - a Hugging Face Space by abdullahzayn
Developed by: Abdullah Zain