Prompting & Workflows Beginner Project 12 of 12

Build a Prompt A/B Tester with AI

Create a tool to compare AI prompts side-by-side with scoring and performance tracking.

What You'll Build

Side-by-side prompt comparison interface
1-10 scoring system for both outputs
Test result history with timestamps
Performance analytics and win rates
Export functionality for test data
Error handling and debugging features

What You'll Need

Claude or ChatGPT

We'll use either AI tool to build our A/B testing interface. Both work equally well for this project.

Text Editor

Any text editor or IDE to save and test your HTML file. VS Code, Sublime Text, or even Notepad works fine.

1

Create the Basic Layout Structure

We'll start by building the main interface with two side-by-side panels for prompt comparison. This combines the workflow concepts from our chaining tutorial with a clean, functional design.

Create a complete HTML page for a Prompt A/B Testing tool with the following structure:

1. Header with title "Prompt A/B Tester" and a subtitle explaining the tool
2. Two side-by-side input sections labeled "Prompt A" and "Prompt B" with large textareas
3. A "Run Test" button centered between them
4. Two output sections below the inputs to display results
5. Scoring controls under each output (1-10 scale with radio buttons)
6. A "Save Results" button and results history section at the bottom
7. Use a dark theme with modern styling
8. Make it responsive for mobile devices
9. Include placeholder text that explains how to use each section

Use semantic HTML and inline CSS. Make it look professional and easy to use.

Look for a complete HTML structure with proper semantic elements, responsive design, and clear visual hierarchy. The AI should create distinct areas for inputs, outputs, scoring, and history.

2

Add the Testing Workflow Logic

Now we'll implement the JavaScript that handles the A/B testing workflow. This applies our prompt chaining knowledge to create a systematic testing process.

Add JavaScript functionality to the HTML page for the following features:

1. "Run Test" button that simulates sending both prompts to an AI
2. Generate realistic sample outputs for both prompts (different responses)
3. Enable the scoring radio buttons only after a test is run
4. Validate that both prompts are entered before running
5. Clear previous results when starting a new test
6. Show a loading state during the "test" (2-3 second delay)
7. Add error handling for empty prompts with helpful messages
8. Make the interface interactive and responsive to user actions

For the sample outputs, create varied responses that would realistically come from testing different prompts. Include the JavaScript directly in the HTML file using script tags.

The AI should add comprehensive JavaScript that handles user interactions, validates inputs, and simulates the A/B testing process with realistic mock responses.
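The exact code the AI produces will vary, but the validation and mock-run logic usually takes a shape like the sketch below. The function names (validatePrompts, runMockTest) are illustrative placeholders, not something your AI is guaranteed to generate:

```javascript
// Check both prompts before a test run; returns an error message or null.
function validatePrompts(promptA, promptB) {
  if (!promptA.trim() || !promptB.trim()) {
    return "Please enter both prompts before running a test.";
  }
  return null;
}

// Simulate sending both prompts to an AI, faking a network round trip
// with a delay that matches the 2-3 second loading state.
function runMockTest(promptA, promptB, onDone) {
  const error = validatePrompts(promptA, promptB);
  if (error) {
    onDone({ error });
    return;
  }
  setTimeout(() => {
    onDone({
      outputA: `Sample response to: "${promptA.slice(0, 30)}..."`,
      outputB: `Sample response to: "${promptB.slice(0, 30)}..."`,
    });
  }, 2000);
}
```

If the generated page's Run Test button does nothing, comparing its logic against this shape (validate first, then delay, then render) is a quick way to spot where it went wrong.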

3

Build the Scoring and Comparison System

Let's add the scoring mechanism and comparison logic that tracks which prompts perform better over time. This creates a complete evaluation workflow.

Enhance the JavaScript to add a complete scoring and comparison system:

1. Capture scores from both 1-10 radio button sets
2. Calculate which prompt won (higher score) or if it's a tie
3. Show a clear winner indication with visual feedback (colors, icons)
4. "Save Results" button that stores the test data including:
   - Both original prompts (first 50 characters)
   - Both scores
   - Winner determination
   - Timestamp
5. Display saved results in a history table showing all past tests
6. Add summary statistics: total tests, Prompt A wins, Prompt B wins, ties
7. Include a "Clear History" button with confirmation
8. Ensure scores are required before saving results

Make the scoring visual and intuitive. Show the winner clearly with color coding or badges.

Expect robust scoring logic, data persistence in localStorage, a comprehensive history display, and clear visual indicators for test winners and overall performance.
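The heart of the scoring system is small enough to sketch. Assuming the data shape described in the step above (truncated prompts, two scores, a winner, a timestamp), the logic looks roughly like this; the helper names are hypothetical:

```javascript
// Decide the winner from two 1-10 scores.
function determineWinner(scoreA, scoreB) {
  if (scoreA > scoreB) return "A";
  if (scoreB > scoreA) return "B";
  return "Tie";
}

// Build the record that "Save Results" would persist.
// Prompts are truncated to 50 characters, as specified above.
function buildResult(promptA, promptB, scoreA, scoreB) {
  return {
    promptA: promptA.slice(0, 50),
    promptB: promptB.slice(0, 50),
    scoreA,
    scoreB,
    winner: determineWinner(scoreA, scoreB),
    timestamp: new Date().toISOString(),
  };
}

// Summary statistics for the history section.
function summarize(results) {
  return {
    total: results.length,
    aWins: results.filter(r => r.winner === "A").length,
    bWins: results.filter(r => r.winner === "B").length,
    ties: results.filter(r => r.winner === "Tie").length,
  };
}
```

Keeping the winner logic in a pure function like determineWinner also makes it easy to verify in the browser console if the badges ever look wrong.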

4

Add Error Handling and Debugging Features

Now we'll implement comprehensive error handling and debugging capabilities, applying what we learned about fixing issues systematically.

Add comprehensive error handling and debugging features:

1. Input validation with specific error messages:
   - Empty prompts
   - Prompts that are too short (less than 10 characters)
   - Missing scores before saving
2. Error display area that shows helpful, specific messages
3. Console logging for debugging:
   - Log when tests start and complete
   - Log scoring actions and data saves
   - Log any errors with context
4. Add a "Debug Mode" toggle that shows:
   - Current application state
   - Last test details
   - localStorage contents
5. Try-catch blocks around data operations
6. Graceful handling of localStorage errors
7. Clear error states when starting new actions
8. Add tooltips or help text for unclear interface elements

Make errors user-friendly and actionable. Include developer-friendly console output for troubleshooting.

Look for detailed error handling, informative console logs, user-friendly error messages, and a debug mode that provides insight into the application's internal state.
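A good way to check the AI's error handling is against a sketch like this: storage access wrapped in try-catch so a blocked or full localStorage degrades gracefully instead of crashing the page. The storage object is passed in as a parameter (an assumption made here so the helpers also run outside a browser); in the generated page it would simply be window.localStorage:

```javascript
// Save a value, returning a status object instead of throwing.
function safeSave(storage, key, value, debug = false) {
  try {
    storage.setItem(key, JSON.stringify(value));
    if (debug) console.log(`[debug] saved ${key}`);
    return { ok: true };
  } catch (err) {
    console.error("[debug] save failed:", err.message);
    return { ok: false, error: "Could not save results. Storage may be blocked or full." };
  }
}

// Load a value, falling back to a default on any error.
function safeLoad(storage, key, fallback) {
  try {
    const raw = storage.getItem(key);
    return raw ? JSON.parse(raw) : fallback;
  } catch (err) {
    console.error("[debug] load failed:", err.message);
    return fallback;
  }
}
```

The status object returned by safeSave is what the error display area would render, while the console.error calls cover the developer-facing logging.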

5

Enhance with Collaborative Features

Finally, let's add collaborative features inspired by document editing workflows, allowing users to save, share, and export their testing data.

Add collaborative and export features to complete the tool:

1. Export functionality:
   - "Export CSV" button that downloads test history as CSV file
   - "Export JSON" for technical users
   - Include all test data: prompts, scores, winners, timestamps
2. Import functionality:
   - "Import Data" button to load previously exported JSON
   - Merge imported data with existing history
   - Show import success/error messages
3. Sharing features:
   - "Copy Test Link" that encodes current prompts in URL
   - Ability to load shared prompts from URL parameters
   - "Copy Results" button for sharing individual test outcomes
4. Template system:
   - "Save as Template" for frequently used prompts
   - Template dropdown to quickly load saved prompts
   - Template management (delete, rename)
5. Add success notifications for all actions
6. Improve the overall user experience with smooth interactions

Make the export files properly formatted and the import process robust with error handling.

The AI should implement full import/export capabilities, URL-based sharing, a template system, and polished user feedback for all collaborative features.
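The CSV export is the piece most likely to go subtly wrong, because prompts often contain commas and quotes. A correct export escapes those fields; the sketch below shows the shape to look for (function names are illustrative):

```javascript
// Quote a field if it contains commas, quotes, or newlines,
// doubling any embedded quotes per the usual CSV convention.
function csvEscape(field) {
  const s = String(field);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Turn saved results into CSV text with a header row.
function toCSV(results) {
  const header = ["promptA", "promptB", "scoreA", "scoreB", "winner", "timestamp"];
  const rows = results.map(r => header.map(h => csvEscape(r[h])).join(","));
  return [header.join(","), ...rows].join("\n");
}

// In the browser, the generated page would wrap this in a Blob download:
// const url = URL.createObjectURL(new Blob([toCSV(history)], { type: "text/csv" }));
```

If exported files open scrambled in a spreadsheet, missing escaping in a function like csvEscape is the first place to look.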

Common Issues

Scoring buttons not working

Make sure each radio group has its own name attribute (one name for Prompt A's scores, a different one for Prompt B's) so the two sets don't interfere, and that the JavaScript listens for change events on them.

History not saving

Check that localStorage is available and not blocked. Add try-catch around localStorage operations for better error handling.

Export files not downloading

Ensure the blob creation and download link logic is correct. Test in different browsers as download behavior can vary.

Mobile layout issues

If the side-by-side layout breaks on mobile, ask the AI to improve the responsive design with better breakpoints.

What You Learned

Workflow Design

Built a systematic prompt testing workflow that chains multiple steps together for comprehensive evaluation.

Error Handling

Implemented robust debugging and error handling techniques to make the tool reliable and user-friendly.

Data Management

Created a complete data persistence system with import/export capabilities and collaborative features.

User Experience

Designed an intuitive interface that guides users through complex workflows with clear feedback and validation.

Tips for Going Further

Add real AI integration: Connect to actual AI APIs like OpenAI or Anthropic to test prompts with live responses instead of mock data.

Build prompt libraries: Create categories for different types of prompts (creative writing, code generation, analysis) with specialized scoring criteria.

Add statistical analysis: Include confidence intervals, significance testing, and trend analysis to make your A/B tests more scientifically rigorous.

Create team features: Build user accounts, team sharing, and collaborative testing where multiple people can score the same prompt pairs.

Automate testing: Set up batch testing capabilities where you can run multiple prompt variations against a set of test cases automatically.