# Phase 1: User Validation - Implementation Guide

## Overview

This guide describes a **hybrid approach** to validating electricity company customers over a phone call. The system uses:

- **Python State Machine**: Controls conversation flow deterministically
- **LLM (via vLLM)**: Extracts data (phone, name) and generates natural Vietnamese responses
- **VieNeu TTS**: Synthesizes Vietnamese speech
- **Faster-ASR**: Transcribes Vietnamese speech

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Voice Agent System                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌──────────────┐    ┌─────────────┐       │
│  │   ASR    │───▶│  State       │───▶│    LLM      │       │
│  │ (Speech  │    │  Machine     │    │  Functions  │       │
│  │ to Text) │    │  (Python)    │    │             │       │
│  └──────────┘    └──────────────┘    └─────────────┘       │
│                          │                    │              │
│                          ▼                    ▼              │
│                  ┌──────────────┐    ┌─────────────┐        │
│                  │  Customer    │    │  Response   │        │
│                  │  Database    │    │  Generator  │        │
│                  └──────────────┘    └─────────────┘        │
│                                              │               │
│                                              ▼               │
│                                      ┌─────────────┐         │
│                                      │    TTS      │         │
│                                      │  (VieNeu)   │         │
│                                      └─────────────┘         │
└─────────────────────────────────────────────────────────────┘
```

## State Flow

```
┌──────────────┐
│   GREETING   │ "Xin chào, đây là tổng đài điện lực..."
└──────┬───────┘
       │
       ▼
┌────────────────────────┐
│ AWAIT_PHONE_REQUEST    │ "Xin cho biết số điện thoại..."
└──────┬─────────────────┘
       │ [LLM extracts phone]
       ▼
┌────────────────────────┐
│  COLLECTING_PHONE      │ "Số điện thoại là 0901234567, đúng không?"
└──────┬─────────────────┘
       │ [User confirms/rejects]
       ▼
┌────────────────────────┐
│ AWAIT_NAME_REQUEST     │ "Xin cho biết họ và tên..."
└──────┬─────────────────┘
       │ [LLM extracts name]
       ▼
┌────────────────────────┐
│  COLLECTING_NAME       │ "Tên là Nguyễn Văn An, đúng không?"
└──────┬─────────────────┘
       │ [User confirms/rejects]
       ▼
┌────────────────────────┐
│  VALIDATING_USER       │ [Python checks database]
└──────┬─────────────────┘
       │
       ▼
┌────────────────────────┐
│ VALIDATION_COMPLETE    │ 
│                        │
│ Outcomes:              │
│ • Existing (verified)  │ "Xin chào anh/chị [Name]..."
│ • Existing (mismatch)  │ "Số này đăng ký với tên [DB Name]..."
│ • New customer         │ "Chưa có thông tin, muốn đăng ký không?"
└────────────────────────┘
```
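
The three outcomes at the bottom of the flow can be sketched as a small pure function. This is a minimal sketch, not the project's actual code: `classify_customer` and the labels `new_customer` and `existing_mismatch` are hypothetical names, while `existing_verified` and the 0.8 threshold are taken from the logging and troubleshooting sections of this guide.

```python
from typing import Optional

# Assumed default, taken from the troubleshooting note on name matching.
MATCH_THRESHOLD = 0.8


def classify_customer(db_name: Optional[str], similarity: float) -> str:
    """Map a database lookup plus a name-similarity score to an outcome."""
    if db_name is None:
        # Phone number not found in the database.
        return "new_customer"  # hypothetical label
    if similarity > MATCH_THRESHOLD:
        return "existing_verified"  # label seen in the logged context
    # Phone number found, but the spoken name does not match the record.
    return "existing_mismatch"  # hypothetical label
```

Keeping this decision in plain Python means the outcome never depends on what the LLM generates.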

## Key Components

### 1. ValidationStateMachine

**Responsibility**: Control state transitions (deterministic)

```python
class ValidationStateMachine:
    states = [
        GREETING,
        AWAIT_PHONE_REQUEST,
        COLLECTING_PHONE,
        AWAIT_NAME_REQUEST,
        COLLECTING_NAME,
        VALIDATING_USER,
        VALIDATION_COMPLETE
    ]
```

**Key methods**:
- `start()` - Begin validation flow
- `process(user_speech)` - Process user input, transition states
- `is_complete()` - Check whether validation is done
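
One way to make the transitions deterministic is a lookup table keyed by `(state, event)`. The sketch below is illustrative only: `MiniStateMachine`, `fire()`, and the event names are assumptions, and sending a rejected confirmation back to the matching `AWAIT_*` state is a guess at the intended behaviour.

```python
from enum import Enum, auto


class State(Enum):
    GREETING = auto()
    AWAIT_PHONE_REQUEST = auto()
    COLLECTING_PHONE = auto()
    AWAIT_NAME_REQUEST = auto()
    COLLECTING_NAME = auto()
    VALIDATING_USER = auto()
    VALIDATION_COMPLETE = auto()


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.GREETING, "started"): State.AWAIT_PHONE_REQUEST,
    (State.AWAIT_PHONE_REQUEST, "extracted"): State.COLLECTING_PHONE,
    (State.COLLECTING_PHONE, "confirmed"): State.AWAIT_NAME_REQUEST,
    (State.COLLECTING_PHONE, "rejected"): State.AWAIT_PHONE_REQUEST,
    (State.AWAIT_NAME_REQUEST, "extracted"): State.COLLECTING_NAME,
    (State.COLLECTING_NAME, "confirmed"): State.VALIDATING_USER,
    (State.COLLECTING_NAME, "rejected"): State.AWAIT_NAME_REQUEST,
    (State.VALIDATING_USER, "validated"): State.VALIDATION_COMPLETE,
}


class MiniStateMachine:
    def __init__(self) -> None:
        self.state = State.GREETING

    def fire(self, event: str) -> State:
        # Unknown (state, event) pairs leave the state unchanged,
        # so no input can skip a step in the sequence.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state

    def is_complete(self) -> bool:
        return self.state is State.VALIDATION_COMPLETE
```

In this design the LLM only produces events (for example, whether extraction succeeded); it never chooses the next state.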

### 2. LLMFunctions

**Responsibility**: Data extraction and response generation

**Functions**:
- `extract_phone_number(speech)` → ExtractionResult
  - Returns: phone, confidence, needs_confirmation
  - Handles: speech artifacts, hesitation, multiple numbers
  
- `extract_name(speech)` → ExtractionResult
  - Returns: name, confidence, needs_confirmation
  - Handles: Vietnamese names with diacritics, titles
  
- `generate_response(state, context)` → str
  - Generates natural Vietnamese responses per state
  - Handles: retry scenarios, confirmations
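
Whatever the LLM returns should be checked in Python before it reaches the state machine. The sketch below assumes the model is prompted to reply with JSON such as `{"phone": "...", "confidence": 0.9}`; that reply format, the `parse_phone_reply` helper, and the 10-digit check are assumptions for illustration, not the project's confirmed behaviour.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionResult:
    phone: Optional[str]
    confidence: float
    needs_confirmation: bool


def parse_phone_reply(raw: str, threshold: float = 0.7) -> ExtractionResult:
    """Parse an assumed JSON reply from the LLM into an ExtractionResult."""
    try:
        data = json.loads(raw)
        # Keep digits only, so "090 123 4567" and "0901234567" are equal.
        phone = re.sub(r"\D", "", str(data.get("phone") or ""))
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, TypeError, AttributeError):
        # Not JSON, not an object, or a non-numeric confidence.
        return ExtractionResult(None, 0.0, True)
    # Assumed format check: 10 digits starting with 0, as in 0901234567.
    if not re.fullmatch(r"0\d{9}", phone):
        return ExtractionResult(None, 0.0, True)
    return ExtractionResult(phone, confidence, confidence < threshold)
```

A failed parse is reported as "nothing extracted", which lets the retry logic handle it like any other unclear answer.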

### 3. CustomerDatabase

**Responsibility**: Customer lookup and name matching

```python
from typing import Optional

class CustomerDatabase:
    def lookup(self, phone: str) -> Optional["Customer"]: ...
    def fuzzy_match_name(self, input_name: str, db_name: str) -> float: ...
```

**Name matching**: Handles Vietnamese diacritics, spacing variations
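
A minimal sketch of diacritic-insensitive matching, using only the standard library. It is one possible approach, not necessarily the one the project uses: `normalize_name` is a hypothetical helper, and `difflib.SequenceMatcher` stands in for whatever similarity measure the real `fuzzy_match_name` applies.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, and collapse repeated whitespace."""
    text = unicodedata.normalize("NFD", name.lower())
    # Drop combining marks (tone and vowel marks) left by NFD decomposition.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # "đ" is a separate letter with no decomposition, so map it by hand.
    text = text.replace("đ", "d")
    return " ".join(text.split())


def fuzzy_match_name(input_name: str, db_name: str) -> float:
    """Return a similarity score between 0.0 and 1.0."""
    a, b = normalize_name(input_name), normalize_name(db_name)
    return SequenceMatcher(None, a, b).ratio()
```

Normalizing both sides first means ASR output without tone marks can still match a database record that has them.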

### 4. VoiceAgentWithValidation

**Responsibility**: Integrate all components

- ASR (speech → text)
- State machine (process flow)
- TTS (text → speech)
- Database (validation)

## Installation

### Prerequisites

```bash
# 1. Install dependencies
pip install websockets loguru python-dotenv openai scipy numpy

# 2. Install ASR
# (your faster_asr module)

# 3. Install TTS
pip install vieneu

# 4. Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model ./models/Qwen/Qwen2.5-7B-Instruct \
    --port 8000
```

### Configuration

Create a `.env` file:

```bash
# vLLM
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_MODEL=./models/Qwen/Qwen2.5-7B-Instruct

# VieNeu TTS
VIENEU_MODEL_DIR=vieneu-0.3B
# VIENEU_VOICE_ID=ngoc_huyen  # Optional: specific voice

# Audio
SAMPLE_RATE=16000
CHUNK_DURATION=5

# LLM Settings
MAX_TOKENS=150
TEMPERATURE=0.3

# Validation
MAX_RETRY_PHONE=3
MAX_RETRY_NAME=3
PHONE_CONFIDENCE_THRESHOLD=0.7
NAME_CONFIDENCE_THRESHOLD=0.6

# Database
CUSTOMER_DB_FILE=./data/customers.csv

# WebSocket
WS_HOST=0.0.0.0
WS_PORT=8765
```
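
These values could be read into a typed object at startup. The sketch below is an assumption about how that might look: the `Config` fields and `load_config` are hypothetical, and only three of the settings are shown. In the real application, calling `load_dotenv()` from `python-dotenv` beforehand would copy the `.env` entries into `os.environ`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    vllm_base_url: str
    max_retry_phone: int
    phone_confidence_threshold: float


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config from environment variables, with the defaults above."""
    return Config(
        vllm_base_url=env.get("VLLM_BASE_URL", "http://localhost:8000/v1"),
        # Environment values are always strings, so convert explicitly.
        max_retry_phone=int(env.get("MAX_RETRY_PHONE", "3")),
        phone_confidence_threshold=float(
            env.get("PHONE_CONFIDENCE_THRESHOLD", "0.7")
        ),
    )
```

Converting types in one place makes a malformed value fail at startup rather than in the middle of a call.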

### Directory Structure

```
.
├── voice_agent_validation.py  # Main implementation
├── data/
│   └── customers.csv          # Customer database
├── vieneu-0.3B/               # TTS model
├── model_ct2_fp16/            # ASR model
└── .env                       # Configuration
```

## Running

```bash
# Start the agent
python voice_agent_validation.py
```

Expected output:
```
======================================================================
🎙️  ELECTRICITY CALL CENTER - PHASE 1: VALIDATION
======================================================================
Language:     Vietnamese
LLM Server:   http://localhost:8000/v1
TTS Model:    VieNeu (vieneu-0.3B)
Customer DB:  ./data/customers.csv
WebSocket:    ws://0.0.0.0:8765
======================================================================

✅ ASR loaded
✅ LLM client connected
✅ VieNeu loaded with default voice
✅ Loaded 10 customers from database
✅ Voice agent with validation initialized
✅ Voice agent listening on ws://0.0.0.0:8765
Press Ctrl+C to stop
```

## Testing Scenarios

### Scenario 1: Existing Customer (Verified)

**Flow**:
```
Agent: "Xin chào, đây là tổng đài điện lực. Tôi có thể giúp gì cho quý khách?"

Agent: "Để phục vụ quý khách tốt hơn, xin cho biết số điện thoại của quý khách ạ?"
User: "Số tôi là 0901234567"

Agent: "Số điện thoại của quý khách là 0901234567. Đúng không ạ?"
User: "Đúng"

Agent: "Xin cho biết họ và tên của quý khách ạ?"
User: "Tôi là Nguyễn Văn An"

Agent: "Tên quý khách là Nguyễn Văn An. Đúng không ạ?"
User: "Vâng"

Agent: "Xin chào anh/chị Nguyễn Văn An. Hệ thống đã xác nhận thông tin. Quý khách cần hỗ trợ gì ạ?"
```

**Result**: ✅ Existing customer verified (account: KH001)

### Scenario 2: Existing Customer (Name Mismatch)

**Flow**:
```
User: "Số 0901234567"
Agent: "Đúng không?"
User: "Đúng"

Agent: "Xin cho biết họ và tên?"
User: "Tôi là Trần Văn Bình"  [Database has: Nguyễn Văn An]

Agent: "Số điện thoại này đã đăng ký với tên Nguyễn Văn An. Quý khách có phải là người đăng ký không ạ?"
```

**Result**: ⚠️ Existing customer with name mismatch (requires clarification)

### Scenario 3: New Customer

**Flow**:
```
User: "0999888777"  [Not in database]
Agent: "Đúng không?"
User: "Đúng"

Agent: "Xin tên?"
User: "Nguyễn Thị Mai"

Agent: "Hệ thống chưa có thông tin của quý khách. Quý khách muốn đăng ký dịch vụ mới hay tra cứu thông tin ạ?"
```

**Result**: 🆕 New customer (ready for registration)

### Scenario 4: Speech Issues (Retry Logic)

**Flow**:
```
Agent: "Xin số điện thoại?"
User: "Uh... tôi... ờ... không nhớ"

[LLM extraction fails]

Agent: "Xin lỗi, tôi chưa nghe rõ số điện thoại. Quý khách vui lòng nói lại số điện thoại ạ?"
User: "Cho tôi xem... 0901234567"

[Success on retry]
```

**Result**: ✅ Recovered after retry

### Scenario 5: Escalation to Human

**Flow**:
```
Agent: "Xin số điện thoại?"
User: [garbled speech]

Agent: "Xin lỗi, chưa nghe rõ. Nói lại?"
User: [garbled speech]

Agent: "Vui lòng nói từng số?"
User: [still unclear]

Agent: "Xin lỗi quý khách. Để được hỗ trợ tốt hơn, tôi sẽ chuyển máy cho nhân viên."
```

**Result**: ⚡ Escalated after 3 failed attempts
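
The retry-then-escalate behaviour in Scenarios 4 and 5 can be sketched as a small counter. `RetryTracker` and the `"retry"` / `"escalate"` return values are hypothetical names; the limit of 3 mirrors `MAX_RETRY_PHONE`.

```python
from dataclasses import dataclass


@dataclass
class RetryTracker:
    max_attempts: int = 3
    failures: int = 0

    def record_failure(self) -> str:
        """Count one failed attempt and say what the agent should do next."""
        self.failures += 1
        # Escalate once the number of failures reaches the limit.
        return "escalate" if self.failures >= self.max_attempts else "retry"
```

With the default limit, the first two failures return `"retry"` and the third returns `"escalate"`, matching the three attempts in Scenario 5.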

## Hybrid Approach Benefits

### What Python Controls (Deterministic)

- ✅ State transitions (GREETING → AWAIT_PHONE → ...)
- ✅ Retry logic (max 3 attempts)
- ✅ Database validation (lookup, fuzzy match)
- ✅ Error handling and escalation
- ✅ Context tracking

### What LLM Handles (Flexible)

- ✅ Extract phone from messy speech ("uh... 090... 0901234567")
- ✅ Extract Vietnamese names with diacritics
- ✅ Handle speech artifacts and hesitations
- ✅ Generate natural Vietnamese responses
- ✅ Adapt to user's speaking style

### Why This Works

| Challenge | Solution |
|-----------|----------|
| **LLM might skip states** | Python enforces sequence |
| **User speaks unclearly** | LLM extracts, Python retries |
| **Need exact validation** | Python does DB lookup (no hallucination) |
| **Vietnamese name matching** | Python fuzzy match (controllable) |
| **Natural conversation** | LLM generates responses |
| **Debugging** | Clear state logs in Python |

## Extending to Phase 2

After validation completes:

```python
if validation_sm.is_complete():
    context = validation_sm.get_context()
    
    # Transition to intent detection
    intent_detector = IntentDetector(llm_client)
    intent = await intent_detector.detect(user_speech, context)
    
    if intent == "BILLING_INQUIRY":
        handler = BillingHandler(context)
        await handler.process(user_speech)
    # ... other intents
```

## Logging

The system logs state transitions and context:

```
📍 State: AWAIT_PHONE_REQUEST | User: số tôi là 0901234567
✅ Extracted phone: 0901234567 (confidence: 0.95)
📍 State: AWAIT_NAME_REQUEST | User: tôi là Nguyễn Văn An
✅ Extracted name: Nguyễn Văn An (confidence: 0.92)
🔍 Validating: 0901234567 | Nguyễn Văn An
📊 Name similarity: 0.98
✅ Customer verified: KH001

📊 Context:
{
  "phone": "0901234567",
  "phone_confirmed": true,
  "name": "Nguyễn Văn An",
  "name_confirmed": true,
  "status": "existing_verified",
  "retries": {
    "phone": 0,
    "name": 0
  }
}
```
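
When dumping the context as JSON, Vietnamese text stays readable only if non-ASCII escaping is turned off. A minimal sketch, assuming the context is a plain dictionary (`format_context` is a hypothetical helper):

```python
import json


def format_context(context: dict) -> str:
    """Serialize the validation context for a log line."""
    # ensure_ascii=False keeps "Nguyễn" as-is instead of \uXXXX escapes.
    return json.dumps(context, ensure_ascii=False, indent=2)
```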

## Customization

### Add Custom Voices

```python
# In Config
VIENEU_VOICE_ID = "my_custom_voice"

# TTS will use this voice for all responses
```

### Adjust Confidence Thresholds

```python
# In Config
PHONE_CONFIDENCE_THRESHOLD = 0.8  # Higher = more confirmations
NAME_CONFIDENCE_THRESHOLD = 0.7
```

### Modify Retry Limits

```python
# In Config
MAX_RETRY_PHONE = 5  # More patient with phone number
MAX_RETRY_NAME = 2   # Less patient with name
```

### Change Database Format

```python
class CustomerDatabase:
    def _load_database(self):
        # Replace the CSV loader with SQL, MongoDB, etc.
        # Only lookup() has to keep working for the rest of the system.
        raise NotImplementedError
```
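
As an illustration of swapping the backend, here is a sketch backed by SQLite from the standard library. The class name, the table layout (`phone`, `account_id`, `name`), and the tuple return type are all assumptions; the real `lookup()` returns a `Customer` object whose fields this guide does not specify.

```python
import sqlite3
from typing import Optional, Tuple


class SqliteCustomerDatabase:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def lookup(self, phone: str) -> Optional[Tuple[str, str]]:
        """Return (account_id, name) for a phone number, or None."""
        # A parameterized query avoids SQL injection from spoken input.
        return self.conn.execute(
            "SELECT account_id, name FROM customers WHERE phone = ?",
            (phone,),
        ).fetchone()


# In-memory database with one sample row, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (phone TEXT PRIMARY KEY, account_id TEXT, name TEXT)"
)
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("0901234567", "KH001", "Nguyễn Văn An"),
)
db = SqliteCustomerDatabase(conn)
```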

## Troubleshooting

### Issue: LLM extraction always fails

**Check**:
- vLLM server is running: `curl http://localhost:8000/v1/models`
- Model path is correct in `.env`
- LLM supports Vietnamese (Qwen models work well)

### Issue: Name matching too strict/loose

**Adjust**:
```python
# Where the score from CustomerDatabase.fuzzy_match_name() is compared
if similarity > 0.9:  # Default threshold is 0.8; raise it for stricter matching, lower it for looser
    self.context.customer_status = "existing_verified"
```

### Issue: TTS voice quality poor

**Try**:
- Different VieNeu voice: Set `VIENEU_VOICE_ID` in `.env`
- Check that the sample rates match: TTS output must be converted from 22050 Hz to 16000 Hz

### Issue: ASR transcribes incorrectly

**Check**:
- Audio quality (sample rate, bit depth)
- VAD settings (might cut off speech)
- Language setting: `language="vi"`

## Performance

Typical latencies (on RTX 3090):

- ASR transcription: 0.5-1s (5s audio)
- LLM extraction: 0.3-0.5s
- Database lookup: <0.01s
- TTS synthesis: 0.5-1s
- **Total round-trip**: ~2-3s
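
To reproduce these numbers on other hardware, each stage can be wrapped in a timer. This is a generic sketch; `timed` and the `timings` dictionary are hypothetical and not part of the existing code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Stage name -> elapsed seconds for the most recent run.
timings: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
```

Usage would look like `with timed("asr"): ...` around each pipeline step, after which `timings` can be logged alongside the state transitions.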

## Next Steps

1. ✅ **Phase 1 Complete**: User validation
2. 🔄 **Phase 2**: Intent detection (5 intents)
3. 🔄 **Phase 3**: Per-intent handlers (billing, payment, outage, etc.)
4. 🔄 **Phase 4**: Multi-intent routing
5. 🔄 **Phase 5**: Production deployment (load balancing, monitoring)

## Contact

For questions about this implementation, refer to the design docs in `DESIGN.md`.
