Use isolated database environments to test and track AI agent behavior and performance. Create separate database schemas for testing different agent configurations and compare results using natural language queries.

How it works

GibsonAI provides database environments where you can test different agent configurations, track their performance, and analyze behavior patterns. Create isolated databases for each test scenario and use natural language queries to analyze results.
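
For example, once test data lives in an isolated environment, a single natural language query against the data API returns the analysis. The sketch below is illustrative: the /query endpoint and Bearer-token header mirror the examples later on this page, and the API key and question are placeholders.

import requests

# Minimal sketch: ask a natural language question against an isolated test database.
# The endpoint and auth mirror the examples later on this page; values are placeholders.
API_KEY = "your-test-environment-api-key"
BASE_URL = "https://api.gibsonai.com/v1/-"

response = requests.post(
    f"{BASE_URL}/query",
    json={"query": "Show average response time per agent configuration for the last 10 tests"},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(response.json())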

Key Features

Isolated Testing Environments

  • Separate Databases: Create isolated databases for each agent test
  • Independent Schemas: Give each experiment its own database schema
  • Safe Testing: Test agent behavior without affecting production data
  • Environment Comparison: Compare results across different test environments

Agent Performance Tracking

  • Behavior Logging: Track agent actions and decisions in a structured format
  • Performance Metrics: Store and analyze agent performance data
  • Response Tracking: Log agent responses and their effectiveness
  • Error Monitoring: Track errors and failure patterns

Natural Language Analysis

  • Query Testing Results: Use natural language to analyze test results
  • Performance Comparison: Compare agent performance across different scenarios
  • Behavior Analysis: Analyze agent behavior patterns and trends
  • Results Reporting: Generate reports on agent testing outcomes

Use Cases

Agent Development

Perfect for:

  • Testing new agent features and capabilities
  • Validating agent behavior in different scenarios
  • Comparing different agent configurations
  • Debugging agent issues

Performance Optimization

Enables:

  • Identifying performance bottlenecks
  • Testing different optimization strategies
  • Measuring impact of configuration changes
  • Validating performance improvements

Behavior Validation

Supports:

  • Ensuring agent responses are appropriate
  • Testing edge cases and error handling
  • Validating decision-making logic
  • Confirming compliance with requirements

Implementation Examples

Setting Up Agent Testing Environment

# Using the Gibson CLI to create the agent testing database

# Create the agent testing tables
gibson modify agent_tests "Create agent_tests table with id, test_name, agent_config, environment, created_at"
gibson modify agent_actions "Create agent_actions table with id, test_id, action_type, input_data, output_data, timestamp, duration"
gibson modify agent_metrics "Create agent_metrics table with id, test_id, metric_name, value, timestamp"
gibson modify test_results "Create test_results table with id, test_id, result_type, data, success, error_message"

# Generate models and apply the changes
gibson code models
gibson merge
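
For reference, the rest of this guide writes records shaped roughly like this into those tables. Field names follow the gibson modify descriptions above; the exact generated schema may differ.

# Illustrative record shapes for the tables created above (not the generated models themselves)
agent_test = {
    "test_name": "Conservative Response Test",
    "agent_config": {"response_style": "conservative", "confidence_threshold": 0.8},
    "environment": "test",
    "created_at": "2024-01-01T12:00:00",
}

agent_action = {
    "test_id": 1,
    "action_type": "user_interaction",
    "input_data": "I need help with my order",
    "output_data": {"response": "Let me look that up for you.", "action": "order_lookup"},
    "timestamp": "2024-01-01T12:00:01",
    "duration": 0.42,
}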

Agent Testing Framework

import requests
import json
from datetime import datetime
import time

class AgentTester:
    def __init__(self, api_key, environment="test"):
        self.api_key = api_key
        self.environment = environment
        self.base_url = "https://api.gibsonai.com/v1/-"
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def create_test(self, test_name, agent_config):
        """Create a new agent test"""
        test_data = {
            "test_name": test_name,
            "agent_config": agent_config,
            "environment": self.environment,
            "created_at": datetime.now().isoformat()
        }

        response = requests.post(
            f"{self.base_url}/agent-tests",
            json=test_data,
            headers=self.headers
        )

        if response.status_code == 201:
            test_record = response.json()
            print(f"Created test: {test_name}")
            return test_record["id"]
        else:
            print(f"Failed to create test: {response.status_code}")
            return None

    def log_agent_action(self, test_id, action_type, input_data, output_data, duration):
        """Log an agent action during testing"""
        action_data = {
            "test_id": test_id,
            "action_type": action_type,
            "input_data": input_data,
            "output_data": output_data,
            "timestamp": datetime.now().isoformat(),
            "duration": duration
        }

        response = requests.post(
            f"{self.base_url}/agent-actions",
            json=action_data,
            headers=self.headers
        )

        if response.status_code == 201:
            return response.json()
        else:
            print(f"Failed to log action: {response.status_code}")
            return None

    def record_metric(self, test_id, metric_name, value):
        """Record a performance metric"""
        metric_data = {
            "test_id": test_id,
            "metric_name": metric_name,
            "value": value,
            "timestamp": datetime.now().isoformat()
        }

        response = requests.post(
            f"{self.base_url}/agent-metrics",
            json=metric_data,
            headers=self.headers
        )

        if response.status_code == 201:
            return response.json()
        else:
            print(f"Failed to record metric: {response.status_code}")
            return None

    def log_test_result(self, test_id, result_type, data, success, error_message=None):
        """Log test result"""
        result_data = {
            "test_id": test_id,
            "result_type": result_type,
            "data": data,
            "success": success,
            "error_message": error_message
        }

        response = requests.post(
            f"{self.base_url}/test-results",
            json=result_data,
            headers=self.headers
        )

        if response.status_code == 201:
            return response.json()
        else:
            print(f"Failed to log result: {response.status_code}")
            return None
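
A quick usage sketch for the class above. The API key is a placeholder, and the IDs returned depend on your environment.

# Example usage of AgentTester (placeholder API key; IDs come from your environment)
tester = AgentTester(api_key="your-api-key", environment="test")

test_id = tester.create_test(
    "Baseline Response Test",
    {"response_style": "conservative", "confidence_threshold": 0.8},
)

if test_id:
    tester.log_agent_action(
        test_id,
        "user_interaction",
        "I need help with my order",
        {"response": "Let me look that up for you.", "action": "order_lookup"},
        0.42,
    )
    tester.record_metric(test_id, "response_time", 0.42)
    tester.log_test_result(test_id, "smoke_test", {"note": "initial check"}, True)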

Testing Different Agent Configurations

class AgentBehaviorTester:
    def __init__(self, api_key):
        self.tester = AgentTester(api_key)

    def test_response_configurations(self):
        """Test different agent response configurations"""

        # Test Configuration A: Conservative responses
        config_a = {
            "response_style": "conservative",
            "confidence_threshold": 0.8,
            "escalation_enabled": True
        }

        test_a_id = self.tester.create_test("Conservative Response Test", config_a)

        # Test Configuration B: Assertive responses
        config_b = {
            "response_style": "assertive",
            "confidence_threshold": 0.6,
            "escalation_enabled": False
        }

        test_b_id = self.tester.create_test("Assertive Response Test", config_b)

        # Run tests with same scenarios
        test_scenarios = [
            {"user_input": "I need help with my order", "expected_action": "order_lookup"},
            {"user_input": "I want to cancel my subscription", "expected_action": "cancellation_process"},
            {"user_input": "This product is defective", "expected_action": "refund_process"}
        ]

        for scenario in test_scenarios:
            # Test Configuration A
            self.run_test_scenario(test_a_id, scenario, config_a)

            # Test Configuration B
            self.run_test_scenario(test_b_id, scenario, config_b)

    def run_test_scenario(self, test_id, scenario, config):
        """Run a single test scenario"""
        start_time = time.time()

        # Simulate agent processing
        try:
            # Mock agent response based on configuration
            if config["response_style"] == "conservative":
                response = self.generate_conservative_response(scenario["user_input"])
            else:
                response = self.generate_assertive_response(scenario["user_input"])

            duration = time.time() - start_time

            # Log the action
            self.tester.log_agent_action(
                test_id,
                "user_interaction",
                scenario["user_input"],
                response,
                duration
            )

            # Record metrics
            self.tester.record_metric(test_id, "response_time", duration)
            self.tester.record_metric(test_id, "confidence_score", response.get("confidence", 0))

            # Log result
            success = response.get("action") == scenario["expected_action"]
            self.tester.log_test_result(
                test_id,
                "scenario_test",
                {"scenario": scenario, "response": response},
                success
            )

        except Exception as e:
            # Log error
            self.tester.log_test_result(
                test_id,
                "scenario_test",
                {"scenario": scenario, "error": str(e)},
                False,
                str(e)
            )

    def generate_conservative_response(self, user_input):
        """Generate conservative agent response"""
        # Mock conservative response logic
        return {
            "response": "I'd be happy to help you with that. Let me connect you with a specialist.",
            "action": "escalate",
            "confidence": 0.9
        }

    def generate_assertive_response(self, user_input):
        """Generate assertive agent response"""
        # Mock assertive response logic
        return {
            "response": "I can help you with that right away. Let me process your request.",
            "action": "direct_action",
            "confidence": 0.7
        }
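
Running the comparison is then a single call; each scenario is logged under both tests so the results can be compared with the analyzer shown next. The API key is a placeholder.

# Run the conservative vs. assertive comparison (placeholder API key)
behavior_tester = AgentBehaviorTester(api_key="your-api-key")
behavior_tester.test_response_configurations()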

Analyzing Test Results

class TestResultAnalyzer:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.gibsonai.com/v1/-"
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def compare_test_performance(self, test_a_name, test_b_name):
        """Compare performance between two tests"""
        query_request = {
            "query": f"Compare average response time and success rate between tests named '{test_a_name}' and '{test_b_name}'"
        }

        response = requests.post(
            f"{self.base_url}/query",
            json=query_request,
            headers=self.headers
        )

        if response.status_code == 200:
            results = response.json()
            print("Test Performance Comparison:")
            for result in results:
                print(f"  {result}")
            return results
        else:
            print(f"Analysis failed: {response.status_code}")
            return None

    def analyze_agent_behavior_patterns(self, test_id):
        """Analyze agent behavior patterns in a test"""
        query_request = {
            "query": f"Analyze action types and response patterns for test ID {test_id}"
        }

        response = requests.post(
            f"{self.base_url}/query",
            json=query_request,
            headers=self.headers
        )

        if response.status_code == 200:
            results = response.json()
            print(f"Behavior Analysis for Test {test_id}:")
            for result in results:
                print(f"  {result}")
            return results
        else:
            print(f"Analysis failed: {response.status_code}")
            return None

    def get_error_analysis(self, test_id):
        """Get error analysis for a test"""
        query_request = {
            "query": f"Show all errors and failure patterns for test ID {test_id}"
        }

        response = requests.post(
            f"{self.base_url}/query",
            json=query_request,
            headers=self.headers
        )

        if response.status_code == 200:
            results = response.json()
            print(f"Error Analysis for Test {test_id}:")
            for result in results:
                print(f"  {result}")
            return results
        else:
            print(f"Analysis failed: {response.status_code}")
            return None

    def generate_test_report(self, test_name):
        """Generate comprehensive test report"""
        query_request = {
            "query": f"Generate a comprehensive report for test '{test_name}' including performance metrics, success rates, and error analysis"
        }

        response = requests.post(
            f"{self.base_url}/query",
            json=query_request,
            headers=self.headers
        )

        if response.status_code == 200:
            results = response.json()
            print(f"Test Report for {test_name}:")
            for result in results:
                print(f"  {result}")
            return results
        else:
            print(f"Report generation failed: {response.status_code}")
            return None
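
For example, after the behavior tests above have run, the analyzer can be pointed at them by name or ID. The API key and the test ID are placeholders.

# Analyze the tests created earlier (placeholder API key and test ID)
analyzer = TestResultAnalyzer(api_key="your-api-key")

analyzer.compare_test_performance("Conservative Response Test", "Assertive Response Test")
analyzer.analyze_agent_behavior_patterns(test_id=1)  # use an ID returned by create_test
analyzer.get_error_analysis(test_id=1)
analyzer.generate_test_report("Conservative Response Test")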

A/B Testing Example

class ABTestingFramework:
    def __init__(self, api_key):
        self.tester = AgentTester(api_key)
        self.analyzer = TestResultAnalyzer(api_key)

    def run_ab_test(self, test_name, config_a, config_b, scenarios):
        """Run A/B test with two configurations"""

        # Create tests for both configurations
        test_a_id = self.tester.create_test(f"{test_name}_A", config_a)
        test_b_id = self.tester.create_test(f"{test_name}_B", config_b)

        # Run scenarios for both configurations
        for scenario in scenarios:
            # Test Configuration A
            self.run_scenario_test(test_a_id, scenario, config_a)

            # Test Configuration B
            self.run_scenario_test(test_b_id, scenario, config_b)

        # Analyze results
        print(f"\nA/B Test Results for {test_name}:")
        self.analyzer.compare_test_performance(f"{test_name}_A", f"{test_name}_B")

        return test_a_id, test_b_id

    def run_scenario_test(self, test_id, scenario, config):
        """Run a single scenario test"""
        start_time = time.time()

        try:
            # Simulate agent processing based on configuration
            response = self.simulate_agent_response(scenario, config)
            duration = time.time() - start_time

            # Log action
            self.tester.log_agent_action(
                test_id,
                "scenario_test",
                scenario,
                response,
                duration
            )

            # Record metrics
            self.tester.record_metric(test_id, "response_time", duration)
            self.tester.record_metric(test_id, "confidence_score", response.get("confidence", 0))

            # Determine success
            success = response.get("error") is None

            # Log result
            self.tester.log_test_result(
                test_id,
                "ab_test_scenario",
                {"scenario": scenario, "response": response},
                success,
                response.get("error")
            )

        except Exception as e:
            # Log error
            self.tester.log_test_result(
                test_id,
                "ab_test_scenario",
                {"scenario": scenario, "error": str(e)},
                False,
                str(e)
            )

    def simulate_agent_response(self, scenario, config):
        """Simulate agent response based on configuration"""
        # Mock agent response logic
        if config.get("response_style") == "detailed":
            return {
                "response": "I'll provide detailed help with your request...",
                "confidence": 0.85,
                "action": "detailed_response"
            }
        else:
            return {
                "response": "I can help with that.",
                "confidence": 0.75,
                "action": "brief_response"
            }
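
Putting it together, an A/B run might look like the sketch below. The configurations and scenarios are illustrative, and the API key is a placeholder.

# Example A/B run (illustrative configurations and scenarios, placeholder API key)
framework = ABTestingFramework(api_key="your-api-key")

config_a = {"response_style": "detailed", "confidence_threshold": 0.8}
config_b = {"response_style": "brief", "confidence_threshold": 0.6}

scenarios = [
    {"user_input": "I need help with my order"},
    {"user_input": "I want to cancel my subscription"},
]

test_a_id, test_b_id = framework.run_ab_test("response_style_test", config_a, config_b, scenarios)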

Benefits for AI Agent Testing

Comprehensive Testing

  • Isolated Environments: Test different configurations without interference
  • Structured Data: Organized test data for easy analysis
  • Natural Language Analysis: Query test results using natural language
  • Performance Tracking: Track agent performance over time

Data-Driven Insights

  • Behavior Analysis: Analyze agent behavior patterns and trends
  • Performance Comparison: Compare different agent configurations
  • Error Identification: Identify and analyze error patterns
  • Optimization Guidance: Data-driven insights for agent improvement

Scalable Testing

  • Multiple Environments: Test multiple configurations simultaneously
  • Flexible Schema: Adapt database schema to different testing needs
  • API Integration: Easy integration with existing testing workflows
  • Automated Analysis: Automated analysis and reporting capabilities

Best Practices

Test Design

  • Clear Objectives: Define clear testing objectives and success criteria
  • Realistic Scenarios: Use realistic test scenarios that match production usage
  • Controlled Variables: Control variables to isolate the impact of changes
  • Comprehensive Coverage: Test edge cases and error scenarios

Data Management

  • Consistent Logging: Log all relevant data consistently across tests
  • Data Quality: Ensure high-quality test data for accurate analysis
  • Version Control: Track changes to test configurations and scenarios
  • Data Retention: Implement appropriate data retention policies

Analysis and Reporting

  • Regular Analysis: Regularly analyze test results for insights
  • Comparative Analysis: Compare results across different configurations
  • Trend Analysis: Track performance trends over time
  • Actionable Insights: Focus on actionable insights for improvement

Getting Started

  1. Design Test Schema: Define your agent testing database schema
  2. Create Test Environment: Set up isolated database for testing
  3. Implement Testing Framework: Create framework for logging test data
  4. Run Tests: Execute tests with different agent configurations
  5. Analyze Results: Use natural language queries to analyze results

Gibson CLI Commands

# Create agent testing schema
gibson modify table_name "description of testing table"
gibson code models
gibson merge

# Generate models for testing integration
gibson code models
gibson code schemas

# Reset testing environment
gibson forget last
gibson build datastore

Ready to set up database environments for testing AI agent behavior? Get started with GibsonAI.