Diving Deep into Low-Resource NLP: Building a BiLSTM Seq2Seq Model for Urdu Ghazals

Advancing NLP: Transliterating Urdu Poetry to Roman Urdu with BiLSTM

Hello, NLP enthusiasts! I'm thrilled to share my recent journey in building a Neural Machine Translation (NMT) model for a unique challenge: transliterating Urdu poetry into Roman Urdu. This project wasn’t just about coding a standard encoder-decoder model—it was an exploration of low-resource, poetic text using a classic BiLSTM architecture. Let’s dive into the details!

The Goal: Capturing the Essence of Urdu Poetry

The objective was to develop a sequence-to-sequence model using a 2-layer bidirectional LSTM (BiLSTM) encoder and a 4-layer LSTM decoder to transliterate Urdu Ghazals from the urdu_ghazals_rekhta dataset into Roman Urdu. Urdu, a morphologically rich language written in Arabic script, presents unique challenges, especially in its poetic form, where figurative language and cultural nuances abound. The task was to map Urdu text to its Roman Urdu equivalent, preserving phonetic accuracy while navigating variations in Romanization.

Why is this challenging? Transliteration requires capturing the sound of Urdu words in Latin script, handling inconsistencies in spelling (e.g., "khuda" vs. "khudaa") and the poetic complexity of Ghazals, which often defy standard linguistic patterns.

The Dataset: urdu_ghazals_rekhta

The urdu_ghazals_rekhta dataset formed the backbone of this project. Organized by poet (e.g., poet1, poet2), it includes subfolders with Urdu (ur), Roman Urdu (en), and Hindi (hi) versions of Ghazals. The aligned Urdu and Roman Urdu files provided parallel sentence pairs, eliminating the need for rule-based conversion—a common hurdle in transliteration tasks.

Example Pair:

  • Urdu (Source): میں اس کا ہوں جس کا نام ہے خدا
  • Roman Urdu (Target): main us kaa hoon jis kaa naam hai khuda

Step 1: Preprocessing – Setting the Stage

Clean, standardized data is critical for effective model training. Here’s how I prepared the dataset:

  • Cleaning: Removed extra whitespace and ensured consistency. While the dataset was relatively clean, I noted that broader character normalization (e.g., handling Urdu letter variations or punctuation) could enhance robustness for messier datasets.
  • Tokenization: Used SentencePiece with Byte-Pair Encoding (BPE) to tokenize Urdu and Roman Urdu into subword units. This approach reduces vocabulary size and handles out-of-vocabulary words effectively. For example:
    • Urdu: کتاب (book) → ['▁کت', 'اب']
    • Roman Urdu: kitab → ['▁kit', 'ab']
      The ▁ marker denotes the start of a word, aiding the model in learning word boundaries (see the sketch just after this list).
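Here is a minimal sketch of how the SentencePiece BPE models could be trained and applied. The file names and vocabulary size are illustrative, not the exact values used in the project, and the printed splits depend on the trained model:

```python
import sentencepiece as spm

# Train one BPE model per side; file names and vocab size are illustrative.
spm.SentencePieceTrainer.train(
    input="urdu_sentences.txt",        # one Urdu sentence per line
    model_prefix="urdu_bpe",
    vocab_size=8000,
    model_type="bpe",
)
spm.SentencePieceTrainer.train(
    input="roman_sentences.txt",       # one Roman Urdu sentence per line
    model_prefix="roman_bpe",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained models and tokenize a sample pair.
sp_ur = spm.SentencePieceProcessor(model_file="urdu_bpe.model")
sp_ro = spm.SentencePieceProcessor(model_file="roman_bpe.model")

print(sp_ur.encode("کتاب", out_type=str))   # e.g. ['▁کت', 'اب']
print(sp_ro.encode("kitab", out_type=str))  # e.g. ['▁kit', 'ab']
```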

Step 2: Model Architecture – BiLSTM Encoder-Decoder

The core of the project was a sequence-to-sequence model with a specific configuration mandated by the coursework:

  • Encoder: A 2-layer BiLSTM, capturing bidirectional context (past and future) for each token in the Urdu input, producing a rich context vector.
  • Decoder: A 4-layer LSTM, generating Roman Urdu tokens sequentially, conditioned on the encoder’s context and its own prior outputs.

This architecture, while not as advanced as modern Transformers, was ideal for exploring the limits of classical RNNs on a low-resource, poetic task.
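To make the configuration concrete, here is a minimal PyTorch sketch of the two components. The embedding and hidden sizes are illustrative (they were tuned, as described in the next step), and the bridge function shows just one possible way to map the 2-layer bidirectional encoder state onto the 4-layer decoder:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """2-layer bidirectional LSTM encoder over Urdu subword IDs."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2,
                            bidirectional=True, batch_first=True, dropout=dropout)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, (h, c) = self.lstm(self.embedding(src))
        return outputs, (h, c)                   # h, c: (4, batch, hid_dim)

class Decoder(nn.Module):
    """4-layer unidirectional LSTM decoder producing Roman Urdu subwords."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=4,
                            batch_first=True, dropout=dropout)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, state):               # tgt: (batch, tgt_len)
        outputs, state = self.lstm(self.embedding(tgt), state)
        return self.out(outputs), state          # logits: (batch, tgt_len, vocab)

def bridge(enc_h, enc_c, num_dec_layers=4):
    """One possible bridge: sum the forward/backward directions of the top
    encoder layer and repeat that state for every decoder layer."""
    h = (enc_h[-2] + enc_h[-1]).unsqueeze(0).repeat(num_dec_layers, 1, 1)
    c = (enc_c[-2] + enc_c[-1]).unsqueeze(0).repeat(num_dec_layers, 1, 1)
    return (h, c)
```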

Step 3: Training & Experimentation – Optimizing Performance

Training involved splitting the dataset (50% train, 25% validation, 25% test) and optimizing with Cross-Entropy Loss and the Adam optimizer in PyTorch. The real excitement came from hyperparameter tuning, where I explored:

  • Embedding Dimension: 128, 256, 512
  • Hidden Size: 256, 512
  • Dropout Rate: 0.1, 0.3, 0.5
  • Learning Rate: 1e-3, 5e-4, 1e-4
  • Batch Size: 32, 64, 128

Insights: A lower learning rate (5e-4) often stabilized validation loss, while higher rates (1e-3) risked instability. Lower dropout (0.1) accelerated training but increased overfitting risk. These experiments revealed the delicate balance of hyperparameters in optimizing model performance.
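The sketch below shows what one teacher-forced training epoch could look like under this setup. It assumes the Encoder, Decoder, and bridge from the architecture sketch above, batches of padded (src, tgt) LongTensor pairs, and enc_opt/dec_opt as Adam optimizers (e.g., lr=5e-4):

```python
import torch
import torch.nn as nn

PAD_IDX = 0

def train_epoch(encoder, decoder, loader, enc_opt, dec_opt, device):
    """One epoch of teacher-forced training with cross-entropy loss."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    encoder.train(); decoder.train()
    total_loss = 0.0
    for src, tgt in loader:
        src, tgt = src.to(device), tgt.to(device)
        enc_opt.zero_grad(); dec_opt.zero_grad()

        _, (h, c) = encoder(src)
        state = bridge(h, c)                       # from the architecture sketch
        logits, _ = decoder(tgt[:, :-1], state)    # teacher forcing: feed gold prefix

        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         tgt[:, 1:].reshape(-1))   # predict the shifted target
        loss.backward()
        torch.nn.utils.clip_grad_norm_(list(encoder.parameters()) +
                                       list(decoder.parameters()), 1.0)
        enc_opt.step(); dec_opt.step()
        total_loss += loss.item()
    return total_loss / len(loader)
```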

Step 4: Evaluation – Measuring Success

Evaluating transliteration required metrics tailored to the task:

  • BLEU Score: Assessed n-gram overlap between predicted and reference Roman Urdu, adapted for transliteration quality.
  • Perplexity: Measured the model’s ability to predict the next token, with lower values indicating better generalization.
  • Character Error Rate (CER): Calculated edit distance (insertions, deletions, substitutions) between predicted and reference strings, normalized by reference length, emphasizing character-level accuracy.
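As an illustration of the CER metric, here is a minimal implementation based on the standard Levenshtein distance, normalized by the reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance between the two strings,
    normalised by the length of the reference."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("main us kaa hoon", "main us ka hun"))  # ≈ 0.19 (3 edits / 16 chars)
```

A CER of 0.19 on that pair reflects exactly the kind of Romanization variation ("kaa hoon" vs. "ka hun") discussed earlier: phonetically correct, but spelled differently from the reference.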

Step 5: Qualitative Examples – The Model in Action

Quantitative metrics tell part of the story, but qualitative outputs bring the model to life. Here are sample predictions:

Urdu Input | Ground Truth | Model Prediction
کیا حال ہے؟ | kya haal hai? | kya haal hai?
السلام علیکم | assalam o alaikum | assalam o alaikum
میں اس کا ہوں جس کا نام ہے خدا | main us kaa hoon jis kaa naam hai khuda | main us ka hun jis ka naam hai khuda
Note: Output quality varied based on model configuration and training runs, highlighting areas for further refinement.

Step 6: Deployment – Bringing It to Life

To make the model accessible, I deployed it using Streamlit, creating an interactive web app where users can input Urdu text and view real-time Roman Urdu transliterations. Try it here or explore the code on GitHub. This deployment bridges the gap between research and practical application.
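The Streamlit front-end itself can be very small. The sketch below assumes a hypothetical translate() helper that wraps the SentencePiece tokenizers, the trained encoder-decoder, and greedy decoding; the app would be launched with `streamlit run app.py`:

```python
import streamlit as st

def translate(urdu_text: str) -> str:
    # Hypothetical placeholder: load the SentencePiece models and the trained
    # encoder/decoder here, then greedily decode a Roman Urdu string.
    ...

st.title("Urdu → Roman Urdu Transliteration")

urdu_text = st.text_area("Enter Urdu text:")
if st.button("Transliterate") and urdu_text.strip():
    st.success(translate(urdu_text))
```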

Conclusion: Lessons from Low-Resource NLP

This BiLSTM-based NMT model for Urdu Ghazals was a deep dive into low-resource NLP challenges. While Transformers often dominate modern NLP, this project underscored the value of classical architectures in understanding preprocessing, tokenization, and hyperparameter tuning. Working with poetic text revealed unique complexities, from alignment issues to phonetic variations, making this a rewarding exploration of foundational NLP.

What are your experiences with low-resource languages or poetic text in NLP? Share your thoughts below—I’d love to connect and discuss!

 

Link: Urdu Roman - a Hugging Face Space by abdullahzayn

Developed by: Abdullah Zain



