paraphrase-gan

Project Description

API-Based Paraphrase Prompt Refinement (OpenRouter)

This project uses the OpenRouter API to iteratively generate paraphrases and to refine the generation prompt based on classification feedback.

Core Idea: A loop generates paraphrases for input phrases using a dynamic prompt, classifies them (human vs machine), and then refines the generator prompt for the next iteration based on the classification results.

Default Provider and Model Files:

  • ~/.model-openrouter: model id for OpenRouter (default: openrouter/free)

Provider: OpenRouter

Credentials:

  • OpenRouter: OPENROUTER_API_KEY or ~/.api-openrouter
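
The env-first lookup with file fallback described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the helper names (read_trimmed, resolve_api_key, resolve_model) and the optional home parameter are hypothetical and exist only for illustration. The variable name, file names, and default model id are taken from this README.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_MODEL = "openrouter/free"  # default model id named in this README


def read_trimmed(path: Path) -> Optional[str]:
    """Return the file's contents without surrounding whitespace, or None if missing or empty."""
    try:
        text = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        return None
    return text or None


def resolve_api_key(home: Optional[Path] = None) -> str:
    """Prefer OPENROUTER_API_KEY, then fall back to ~/.api-openrouter (hypothetical helper)."""
    key = os.environ.get("OPENROUTER_API_KEY", "").strip()
    if key:
        return key
    key_file = (home or Path.home()) / ".api-openrouter"
    file_key = read_trimmed(key_file)
    if file_key is None:
        raise FileNotFoundError(f"API key file not found at {key_file}")
    return file_key


def resolve_model(home: Optional[Path] = None) -> str:
    """Read the model id from ~/.model-openrouter, or use the default if the file is absent."""
    model_file = (home or Path.home()) / ".model-openrouter"
    return read_trimmed(model_file) or DEFAULT_MODEL
```

Passing home explicitly is only a convenience for testing against a temporary directory; in normal use both functions resolve paths under the current user's home directory.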

Project Structure

Setup and Installation

  1. Prerequisites:

    • Python 3.8 or higher
    • pip
    • Git
    • uv (optional, but recommended for environment management)
    • OpenRouter API Key: Obtain an API key from OpenRouter (https://openrouter.ai/).
  2. Clone the repository:

    git clone <repository_url>
    cd <repository_directory>

  3. Create and activate a virtual environment:

    ```bash
    # Using uv (recommended)
    # Ensure uv is installed (e.g., pip install uv or python -m pip install uv)
    python -m uv venv .venv
    # Activate:
    #   Linux/macOS: source .venv/bin/activate
    #   Windows: .venv\Scripts\activate

    # OR using standard venv
    python -m venv .venv
    # Activate:
    #   Linux/macOS: source .venv/bin/activate
    #   Windows: .venv\Scripts\activate
    ```

  4. Install dependencies:

    ```bash
    # The paths below are for Windows; on Linux/macOS the interpreter is at .venv/bin/python
    .venv/Scripts/python.exe -m uv pip install -r requirements.txt

    # For contributors/CI
    .venv/Scripts/python.exe -m uv pip install -r requirements-dev.txt
    ```

  5. Set up API Keys (env-first with file fallback):

    • OpenRouter: export OPENROUTER_API_KEY="***" or create ~/.api-openrouter with the key only.
  6. Select Model:

    • Model resolution: echo "openrouter/free" > ~/.model-openrouter. If the file is absent, the default is used.
  7. (Optional) Configure Loop Parameters: You can override default loop parameters using environment variables:

    export MAX_ITERATIONS=5          # Default: 10
    export MOCK_DATA_SAMPLES=100     # Default: 50
    export BATCH_SIZE=10             # Default: 5
    export SLEEP_BETWEEN_BATCHES=2   # Default: 1 (seconds)
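
As an illustration of how these overrides might be read, here is a minimal sketch that falls back to the defaults listed above when a variable is unset. The LoopSettings class and load_loop_settings function are hypothetical names, not the project's actual configuration code; only the variable names and default values come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LoopSettings:
    """Loop parameters with the defaults documented in this README."""
    max_iterations: int = 10
    mock_data_samples: int = 50
    batch_size: int = 5
    sleep_between_batches: float = 1.0  # seconds


def load_loop_settings(env: Mapping[str, str] = os.environ) -> LoopSettings:
    """Build settings from environment variables, using the defaults for unset ones."""
    defaults = LoopSettings()
    return LoopSettings(
        max_iterations=int(env.get("MAX_ITERATIONS", defaults.max_iterations)),
        mock_data_samples=int(env.get("MOCK_DATA_SAMPLES", defaults.mock_data_samples)),
        batch_size=int(env.get("BATCH_SIZE", defaults.batch_size)),
        sleep_between_batches=float(
            env.get("SLEEP_BETWEEN_BATCHES", defaults.sleep_between_batches)
        ),
    )
```

Taking the environment as a parameter keeps the function easy to test with a plain dictionary instead of the real process environment.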

Usage

Running the Prompt Refinement Loop

The main script (src/main.py) orchestrates the workflow:

  1. Initialization:

    • Creates necessary directories.
    • Loads the OpenRouter API key.
    • Configures the API client.
    • Loads input phrases from data/raw/mock_input_phrases.tsv or generates mock data if the file doesn't exist.
  2. Prompt Refinement Loop:

    • Starts with the initial generator prompt defined in src/config.py.
    • Iterates for a configured number of times (loop_control.max_iterations).
    • In each iteration (run_prompt_refinement_iteration in src/prompt_loop.py):
      • Processes input phrases in batches.
      • For each phrase:
        • Calls the OpenRouter API using the current generator prompt to generate a paraphrase (generate_paraphrase).
        • If generation succeeds, calls the OpenRouter API using the classification prompt to classify the paraphrase as 'human' or 'machine' (classify_paraphrase).
      • Collects all results (input, generated text, classification).
      • Filters the results to get pairs classified as 'human'.
      • Saves the selected pairs to a TSV file in data/processed/selected/.
      • Calculates and logs summary metrics for the iteration (generation rate, selection rate, etc.). Saves summary to a JSON file in data/processed/.
      • Saves the generator prompt used for the current iteration to data/processed/prompts/.
      • Calls the refine_generator_prompt function (from src/config.py) to potentially modify the generator prompt based on the iteration's results.
      • Uses the (potentially) refined prompt for the next iteration.
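
The control flow of the steps above can be sketched as follows. This is a simplified outline under stated assumptions, not the code in src/prompt_loop.py: the API calls are passed in as plain callables whose signatures are invented for illustration, the result keys are hypothetical, and file saving and metrics are omitted.

```python
from typing import Callable, Dict, List, Optional

Result = Dict[str, Optional[str]]
Generate = Callable[[str, str], Optional[str]]  # (prompt, phrase) -> paraphrase or None
Classify = Callable[[str], str]                 # paraphrase -> "human" or "machine"
Refine = Callable[[str, List[Result]], str]     # (prompt, results) -> next prompt


def run_iteration(prompt: str, phrases: List[str], generate: Generate,
                  classify: Classify, batch_size: int = 5) -> List[Result]:
    """Generate and classify one paraphrase per phrase, working through batches."""
    results: List[Result] = []
    for start in range(0, len(phrases), batch_size):
        for phrase in phrases[start:start + batch_size]:
            text = generate(prompt, phrase)
            # Classification is attempted only when generation succeeded.
            label = classify(text) if text else None
            results.append({"input_text": phrase,
                            "generated_text": text,
                            "classification": label})
    return results


def run_loop(prompt: str, phrases: List[str], generate: Generate, classify: Classify,
             refine: Refine, max_iterations: int = 10) -> List[str]:
    """Run the refinement loop and return the prompt used in each iteration."""
    prompts_used: List[str] = []
    for _ in range(max_iterations):
        prompts_used.append(prompt)
        results = run_iteration(prompt, phrases, generate, classify)
        # The refined prompt, possibly unchanged, feeds the next iteration.
        prompt = refine(prompt, results)
    return prompts_used
```

Injecting generate, classify, and refine as callables is a design choice made for this sketch so that the loop can be exercised with stubs and without network access.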

To run the loop, execute the following command from the project root directory:

# Ensure your virtual environment is activated
python -m src.main

The script will run for the configured number of iterations, making calls to the OpenRouter API. Monitor your API usage and costs. Stop with Ctrl+C if needed.

Example Usage

  1. Basic Run with Default Settings:

    python -m src.main

  2. Quick Test with Fewer Iterations:

    export MAX_ITERATIONS=2
    export MOCK_DATA_SAMPLES=10
    python -m src.main

  3. Using Custom Input Data:

    • Place your TSV file with an 'input_text' column in data/raw/
    • The file should be tab-separated with a header row
    • Example: data/raw/custom_phrases.tsv
  4. Analyzing Results: After running, check:

    • data/processed/selected/: Selected human-like paraphrases
    • data/processed/prompts/: Evolution of generator prompts
    • data/processed/loop_results_*.json: Metrics per iteration
    • logs/: Detailed execution logs
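
To check that a custom input file has the expected shape before a run, something like the following may help. It is a minimal sketch using only the standard library: load_input_phrases is a hypothetical helper, not a function from this project, and it assumes only what is stated above (tab-separated, header row, an 'input_text' column).

```python
import csv
from pathlib import Path
from typing import List


def load_input_phrases(path: Path) -> List[str]:
    """Read the 'input_text' column from a tab-separated file with a header row."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or "input_text" not in reader.fieldnames:
            raise ValueError(f"{path} has no 'input_text' column in its header row")
        phrases: List[str] = []
        for row in reader:
            # Skip rows where the column is missing or blank.
            text = (row.get("input_text") or "").strip()
            if text:
                phrases.append(text)
        return phrases
```

For example, load_input_phrases(Path("data/raw/custom_phrases.tsv")) would return the non-empty phrases in file order, or raise ValueError if the header lacks the required column.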

Understanding Outputs

  1. Selected Paraphrases (selected_paraphrases_*.tsv)

    • Contains input phrases and their human-classified paraphrases
    • Use this data to evaluate prompt effectiveness
    • Higher selection rates indicate better prompt performance
  2. Loop Results (loop_results_*.json)

    • total_processed: Total input phrases processed
    • total_generated: Successfully generated paraphrases
    • generation_rate: Success rate of API generation calls
    • selection_rate_of_generated: Percentage of generated text classified as human
    • total_selected_human: Final count of human-classified paraphrases
  3. Prompt Evolution (generator_prompt_*.txt)

    • Shows how the generator prompt changes over iterations
    • Use to understand refinement strategy effectiveness
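
To make the relationship between the Loop Results fields concrete, here is one way they could be derived from per-phrase results. This is a sketch under assumptions: the record keys generated_text and classification are hypothetical, and the rates are expressed as fractions between 0 and 1, which may differ from how the project reports them.

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Optional[str]]


def summarize_iteration(results: List[Record]) -> Dict[str, Union[int, float]]:
    """Compute the summary fields listed above from per-phrase results."""
    total_processed = len(results)
    total_generated = sum(1 for r in results if r.get("generated_text"))
    total_selected_human = sum(1 for r in results if r.get("classification") == "human")
    return {
        "total_processed": total_processed,
        "total_generated": total_generated,
        # Share of input phrases for which generation succeeded.
        "generation_rate": total_generated / total_processed if total_processed else 0.0,
        # Share of generated paraphrases classified as human.
        "selection_rate_of_generated": (
            total_selected_human / total_generated if total_generated else 0.0
        ),
        "total_selected_human": total_selected_human,
    }
```

Note that selection_rate_of_generated is measured against the generated paraphrases, not against all processed phrases, so failed generations lower generation_rate without affecting the selection rate.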

Security Notes

Testing

The src/tests/ directory contains unit tests. Mocking API calls is essential so that the tests run reliably without real API usage.

To run the tests:

# Ensure your virtual environment is activated
python -m pytest src/tests/
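
As an illustration of testing without real API usage, the sketch below replaces the generation and classification calls with unittest.mock.Mock objects. The function under test, paraphrase_then_classify, is a hypothetical stand-in written for this example; the project's own functions and their signatures may differ.

```python
from typing import Callable, Dict, Optional
from unittest.mock import Mock


def paraphrase_then_classify(phrase: str,
                             generate: Callable[[str], Optional[str]],
                             classify: Callable[[str], str]) -> Optional[Dict[str, str]]:
    """Hypothetical unit under test: classify only when generation succeeds."""
    text = generate(phrase)
    if not text:
        return None
    return {"input_text": phrase, "generated_text": text, "classification": classify(text)}


def test_skips_classification_when_generation_fails():
    generate = Mock(return_value=None)      # simulates a failed API call
    classify = Mock(return_value="human")
    assert paraphrase_then_classify("hello", generate, classify) is None
    classify.assert_not_called()


def test_classifies_generated_text():
    generate = Mock(return_value="hi there")
    classify = Mock(return_value="machine")
    result = paraphrase_then_classify("hello", generate, classify)
    assert result == {"input_text": "hello",
                      "generated_text": "hi there",
                      "classification": "machine"}
    classify.assert_called_once_with("hi there")
```

Because the mocks stand in for every network call, tests written this way are deterministic and free to run, and pytest collects them through the test_ name prefix.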

Core Components

Troubleshooting

Common Issues

  1. API Key Not Found Error

    FileNotFoundError: API key file not found at ~/.api-openrouter

    Solutions:

    • Ensure your API key file exists: ls -la ~/.api-openrouter
    • Create the file: echo "your-api-key-here" > ~/.api-openrouter
    • Or set environment variable: export OPENROUTER_API_KEY="***"
  2. Permission Denied Error

    PermissionError: Permission denied for API key file

    Solutions:

    • Fix file permissions: chmod 600 ~/.api-openrouter
    • Ensure the file is readable by the current user
  3. API Rate Limiting

    Symptom: Getting 429 errors or requests timing out

    Solutions:

    • Increase SLEEP_BETWEEN_BATCHES environment variable
    • Reduce BATCH_SIZE to process fewer items at once
    • Check your API provider's rate limits and quota
  4. Empty Generated Text

    Symptom: Paraphrases are empty strings

    Solutions:

    • Check the generator prompt template in src/config.py
    • Ensure the model has sufficient context to generate meaningful text
    • Try different models via the model files
  5. Classification Always Returns 'machine'

    Symptom: All paraphrases are classified as machine-generated

    Solutions:

    • Review the classification prompt template
    • Try adjusting the prompt to be more specific about what constitutes "human-like" text
    • Consider using a different model for classification
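
For the rate-limiting case, the following sketch shows why the two settings matter: a smaller batch size and a longer pause between batches both reduce how many requests are sent per unit of time. The process_in_batches function is hypothetical and only approximates how the loop may throttle itself. In particular, pausing between batches but not after the last one is an assumption.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_in_batches(items: Sequence[T], handle: Callable[[T], R],
                       batch_size: int = 5, sleep_seconds: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> List[R]:
    """Apply handle to every item, pausing between consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    outputs: List[R] = []
    for start in range(0, len(items), batch_size):
        if start > 0:
            # Pause before every batch except the first.
            sleep(sleep_seconds)
        outputs.extend(handle(item) for item in items[start:start + batch_size])
    return outputs
```

With five items and batch_size=2, this runs three batches separated by two pauses; raising sleep_seconds stretches the same number of requests over a longer period.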

Debug Mode

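To keep a detailed record of a run, set PYTHONPATH and send both stdout and stderr through tee, which prints the output and saves it to debug.log: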

export PYTHONPATH=src
python -m src.main 2>&1 | tee debug.log

Checking API Usage

Monitor your API costs by checking the logs:

  • Generation calls are logged with input/output details
  • Classification calls are tracked separately
  • Total counts are shown in iteration summaries

Resetting the Loop

To start fresh:

rm -rf data/processed/*
rm -f logs/*.log

Further Improvements

Source Code

GitHub repository