Paraphrase Generation with Neural Machine Translation (NMT)

This project implements a sequence-to-sequence Neural Machine Translation (NMT) model in TensorFlow 2.x that generates paraphrases of English sentences. The model uses an encoder-decoder architecture with GRU units and a Luong attention mechanism, and is trained on the Parabank 100k dataset.

🎯 Features

📊 Model Performance

🏗️ Project Structure

paraphrase-neural-machine-translation/
├── 📁 data/                          # Dataset directory
│   └── parabank_100k.tsv            # Parabank dataset
├── 📁 logs/                          # Training logs and checkpoints
│   ├── training_checkpoints/        # Model checkpoints
│   └── scalars/                     # TensorBoard logs
├── 📁 models/                        # Saved models
├── 📁 pkl/                           # Preprocessed data pickles
├── 📁 src/                           # Source code
│   ├── __init__.py
│   ├── data.py                      # Data preprocessing pipeline
│   ├── models.py                    # Neural network architectures
│   ├── train.py                     # Training orchestration
│   ├── predict.py                   # Inference and prediction
│   ├── utils.py                     # Utility functions
│   └── tests/                       # Test suite
│       └── test_suite.py
├── 📄 config.py                      # Configuration management
├── 📄 requirements.txt               # Python dependencies
├── 📄 pytest.ini                     # Testing configuration
├── 📄 README.md                      # Project documentation
└── 📄 LICENSE                        # MIT-0 License

🚀 Quick Start

Prerequisites

Installation

  1. Clone the repository:

       git clone <repository-url>
       cd paraphrase-neural-machine-translation

  2. Create a virtual environment:

       python -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies:

       pip install -r requirements.txt

  4. Download the dataset:

       - Download the Parabank 100k dataset from [source]
       - Place parabank_100k.tsv in the data/ directory

Usage

Data Preparation

python src/data.py

Preprocesses the dataset, builds vocabulary, and saves pickled data.
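
The contents of src/data.py are not shown here, so the following is only a minimal sketch of what such a preprocessing step might look like. It assumes a two-column TSV (source sentence, paraphrase), which may not match the actual Parabank file layout, and the special tokens and function names (normalize, read_pairs, build_vocab, encode) are hypothetical. It uses only the standard library so it runs without TensorFlow.

```python
import csv
import io
import pickle
import re
from collections import Counter

# Hypothetical special tokens; the project's own choices may differ.
SPECIALS = ["<pad>", "<unk>", "<start>", "<end>"]


def normalize(sentence):
    """Lowercase, put spaces around basic punctuation, add start/end tokens."""
    s = sentence.lower().strip()
    s = re.sub(r"([?.!,])", r" \1 ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return f"<start> {s} <end>"


def read_pairs(tsv_file):
    """Yield (source, paraphrase) pairs from a tab-separated file object.

    Assumes the first two columns hold the sentence pair.
    """
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2:
            yield normalize(row[0]), normalize(row[1])


def build_vocab(pairs, min_count=1):
    """Map each token to an integer id, with special tokens first."""
    counts = Counter(tok for pair in pairs for sent in pair for tok in sent.split())
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, n in sorted(counts.items()):
        if n >= min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a normalized sentence to token ids, using <unk> for unknowns."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in sentence.split()]


# In-memory stand-in for data/parabank_100k.tsv.
sample = io.StringIO("How are you?\tHow do you do?\nThanks a lot.\tThank you very much.\n")
pairs = list(read_pairs(sample))
vocab = build_vocab(pairs)
encoded = [(encode(src, vocab), encode(tgt, vocab)) for src, tgt in pairs]

# Serialize as the pickling step would; writing to pkl/ is left out here.
blob = pickle.dumps({"vocab": vocab, "pairs": encoded})
```

Sorting tokens before assigning ids keeps the vocabulary deterministic across runs, which matters when pickled data is reused by a later training run.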

Training

python src/train.py

Trains the model with early stopping and checkpoint saving.
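
As a rough illustration of how early stopping with a patience counter can work, here is a small sketch in plain Python. It assumes the monitored quantity is a validation loss, which the source does not state, and the EarlyStopping class is hypothetical rather than the code in src/train.py.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop.

    patience corresponds to PATIENCE in config.py; min_delta is an
    illustrative extra parameter, not taken from the project.
    """

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def update(self, val_loss):
        """Return (improved, should_stop) for the latest validation loss."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
            return True, False
        self.wait += 1
        return False, self.wait >= self.patience


# Made-up validation losses, one per epoch, for demonstration only.
losses = [2.0, 1.5, 1.6, 1.55, 1.7, 1.4]
stopper = EarlyStopping(patience=3)
saved_at = []
stopped_epoch = None

for epoch, loss in enumerate(losses):
    improved, should_stop = stopper.update(loss)
    if improved:
        # A real loop would save a checkpoint here, for example under
        # CHECKPOINT_DIR; this sketch only records the epoch.
        saved_at.append(epoch)
    if should_stop:
        stopped_epoch = epoch
        break
```

Saving only on improvement means the latest checkpoint is always the best one seen so far, so a large EPOCHS value acts as an upper bound rather than the expected run length.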

Inference

python src/predict.py

Generates paraphrases for test sentences with attention visualization.

Monitoring Training

tensorboard --logdir logs/scalars

View training metrics at http://localhost:6006

🔧 Configuration

All parameters are centralized in config.py:

# Model parameters
EMBEDDING_DIM = 256
UNITS = 1024
BATCH_SIZE = 64

# Training parameters
EPOCHS = 10000
LEARNING_RATE = 1e-3
PATIENCE = 10

# File paths
DATA_PATH = "./data/parabank_100k.tsv"
CHECKPOINT_DIR = "./logs/training_checkpoints"

🧪 Testing

Run the comprehensive test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test file
pytest src/tests/test_suite.py

📈 Evaluation Metrics

The model is evaluated using:

  - BLEU Score: Measures n-gram overlap with reference paraphrases
  - Perplexity: Measures how confidently the model predicts the reference tokens (lower is better)
  - Attention Quality: Manual inspection of attention weights
  - Diversity Metrics: Measures paraphrase diversity
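
To make the automatic metrics above concrete, here is a hedged sketch of how they are commonly computed. It is not the project's evaluation code: simple_bleu is a simplified single-reference, sentence-level BLEU without smoothing, and distinct-n is one common diversity measure chosen here as an assumption, since the source does not name a specific one.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def distinct_n(sentences, n):
    """Fraction of n-grams across all sentences that are unique."""
    grams = [g for sent in sentences for g in ngrams(sent, n)]
    return len(set(grams)) / len(grams) if grams else 0.0


candidate = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()
bleu = simple_bleu(candidate, reference)
ppl = perplexity([math.log(0.5)] * 4)
diversity = distinct_n([["a", "b"], ["a", "b"]], 1)
```

For paraphrasing, a very high BLEU against the input sentence can indicate copying rather than rewording, which is one reason to report a diversity measure alongside it.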

🔍 Model Architecture

Encoder

Decoder

Attention Mechanism
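
The overview names Luong attention but does not say which scoring variant is used. As an illustration only, the sketch below assumes the dot-product variant and is written in plain Python rather than TensorFlow so it can be run on its own. The function names are hypothetical, and the later step that combines the context vector with the decoder state is omitted.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(decoder_state, encoder_outputs):
    """Dot-product attention for one decoder step.

    decoder_state: current decoder hidden state (list of floats).
    encoder_outputs: one hidden-state vector per source position.
    Returns (context_vector, attention_weights).
    """
    # Score each source position against the current decoder state.
    scores = [dot(decoder_state, h_s) for h_s in encoder_outputs]
    # Normalize scores into weights that sum to 1.
    weights = softmax(scores)
    # Context vector is the attention-weighted sum of encoder outputs.
    dim = len(decoder_state)
    context = [sum(w * h_s[i] for w, h_s in zip(weights, encoder_outputs))
               for i in range(dim)]
    return context, weights


# Toy example: two source positions with 2-dimensional hidden states.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
context, weights = luong_dot_attention([1.0, 0.0], encoder_outputs)
```

The returned weights are the per-position values that an attention visualization would plot for each generated token.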

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT-0 License - see the LICENSE file for details.

📚 Citation

If you use this code in your research, please cite:

@software{paraphrase_nmt_2025,
  title={Paraphrase Generation with Neural Machine Translation},
  author={Your Name},
  year={2025},
  url={https://github.com/waifuai/paraphrase-neural-machine-translation}
}

🆘 Troubleshooting

Common Issues

  1. Out of Memory: Reduce BATCH_SIZE in config.py
  2. Training Convergence Issues: Adjust LEARNING_RATE or UNITS
  3. Poor Paraphrase Quality: Increase model capacity or training data
  4. TensorBoard Not Working: Check if port 6006 is available

Getting Help

🔄 Changelog

Version 2.0.0


Made with ❤️ by WaifuAI