This project implements a sequence-to-sequence neural machine translation (NMT) model in TensorFlow 2.x that generates paraphrases of English sentences. The model uses an encoder-decoder architecture with GRU units and a Luong attention mechanism, and is trained on the Parabank 100k dataset.
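To make the attention step concrete, here is a minimal pure-Python sketch of one decoder step. It assumes the dot-product scoring variant of Luong attention; the README does not say which variant (dot, general, or concat) `src/models.py` uses, and the function name `luong_dot_attention` is hypothetical.

```python
import math


def luong_dot_attention(decoder_state, encoder_states):
    """Return (context_vector, attention_weights) for one decoder step.

    Assumption: dot-product scoring. The project's actual scoring
    function may differ.
    """
    # Score each encoder state by its dot product with the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    # Softmax over the scores, subtracting the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # The context vector is the attention-weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights


# Toy example: three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = luong_dot_attention(decoder_state, encoder_states)
```

In this toy input the first and third encoder states score equally against the decoder state, so they receive equal weight, and the second state receives the least.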
paraphrase-neural-machine-translation/
├── data/                       # Dataset directory
│   └── parabank_100k.tsv       # Parabank dataset
├── logs/                       # Training logs and checkpoints
│   ├── training_checkpoints/   # Model checkpoints
│   └── scalars/                # TensorBoard logs
├── models/                     # Saved models
├── pkl/                        # Preprocessed data pickles
├── src/                        # Source code
│   ├── __init__.py
│   ├── data.py                 # Data preprocessing pipeline
│   ├── models.py               # Neural network architectures
│   ├── train.py                # Training orchestration
│   ├── predict.py              # Inference and prediction
│   ├── utils.py                # Utility functions
│   └── tests/                  # Test suite
│       └── test_suite.py
├── config.py                   # Configuration management
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Testing configuration
├── README.md                   # Project documentation
└── LICENSE                     # MIT-0 License
Clone the repository
git clone <repository-url>
cd paraphrase-neural-machine-translation
Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install dependencies
pip install -r requirements.txt
Download the dataset
Place parabank_100k.tsv in the data/ directory.
python src/data.py
Preprocesses the dataset, builds vocabulary, and saves pickled data.
python src/train.py
Trains the model with early stopping and checkpoint saving.
python src/predict.py
Generates paraphrases for test sentences with attention visualization.
tensorboard --logdir logs/scalars
View training metrics at http://localhost:6006
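The preprocessing step (building a vocabulary from sentence pairs) can be sketched roughly as follows. This is an illustration only, not the code in `src/data.py`: the tab-separated row layout, the special tokens, and the helper names `tokenize`, `build_vocab`, and `encode` are all assumptions.

```python
from collections import Counter

# Hypothetical special tokens; the project's actual markers may differ.
START, END, PAD, UNK = "<start>", "<end>", "<pad>", "<unk>"


def tokenize(sentence):
    """Lowercase, split on whitespace, and wrap with start/end markers."""
    return [START] + sentence.lower().split() + [END]


def build_vocab(sentences, min_count=1):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    vocab = {PAD: 0, UNK: 1}
    for token, count in sorted(counts.items()):
        if count >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to ids, mapping unseen tokens to <unk>."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(sentence)]


# Assumed layout: one paraphrase pair per row, separated by a tab.
rows = [
    "the cat sat\tthe cat was sitting",
    "she left early\tshe departed early",
]
pairs = [tuple(row.split("\t")[:2]) for row in rows]
vocab = build_vocab([s for pair in pairs for s in pair])
ids = encode("the cat left", vocab)
```

A real pipeline would also pad sequences to a common length and pickle the result, as the step description above indicates.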
All parameters are centralized in config.py:
# Model parameters
EMBEDDING_DIM = 256
UNITS = 1024
BATCH_SIZE = 64
# Training parameters
EPOCHS = 10000
LEARNING_RATE = 1e-3
PATIENCE = 10
# File paths
DATA_PATH = "./data/parabank_100k.tsv"
CHECKPOINT_DIR = "./logs/training_checkpoints"
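With EPOCHS set very high, PATIENCE is effectively what ends training. The sketch below shows one common way a patience counter works, assuming the monitored quantity is a validation loss where lower is better; the class `EarlyStopping` and the function `run` here are hypothetical and are not taken from `src/train.py`.

```python
class EarlyStopping:
    """Stop once the loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def update(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best_loss:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def run(losses, patience, max_epochs):
    """Replay a list of per-epoch losses; return (last_epoch, best_loss)."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if stopper.update(loss):
            return epoch, stopper.best_loss
    return min(len(losses), max_epochs), stopper.best_loss


# Made-up losses for illustration; the best value occurs at epoch 3.
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.1]
stopped_at, best = run(losses, patience=3, max_epochs=10000)
```

Here training stops at epoch 6, after three consecutive epochs without improvement, so the later loss of 1.1 is never reached. A larger PATIENCE trades longer training for tolerance of such plateaus.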
Run the comprehensive test suite:
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test file
pytest src/tests/test_suite.py
The model is evaluated using:
- BLEU Score: Measures n-gram overlap with reference paraphrases
- Perplexity: Measures model confidence
- Attention Quality: Manual inspection of attention weights
- Diversity Metrics: Measures paraphrase diversity
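As a rough illustration of two of these metrics, the sketch below computes perplexity from per-token probabilities and the clipped n-gram precision that BLEU is built on. It is not the project's evaluation code; the function names are hypothetical, and full BLEU additionally combines several n-gram orders with a brevity penalty.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: the per-order building block of BLEU."""
    cand = Counter(
        tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)
    )
    ref = Counter(
        tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)
    )
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


# A model assigning probability 0.25 to every reference token
# is as uncertain as a uniform choice among four tokens.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])

bigram_p = ngram_precision("the cat sat".split(), "the cat sat down".split(), 2)
```

Lower perplexity means the model assigns higher probability to the reference tokens. Clipping prevents a candidate from scoring well by repeating a single matching word.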
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
This project is licensed under the MIT-0 License. See the LICENSE file for details.
If you use this code in your research, please cite:
@software{paraphrase_nmt_2025,
title={Paraphrase Generation with Neural Machine Translation},
author={Your Name},
year={2025},
url={https://github.com/waifuai/paraphrase-neural-machine-translation}
}
Adjust BATCH_SIZE in config.py
Adjust LEARNING_RATE or UNITS
Made with ❤️ by WaifuAI