# Troubleshooting Guide

## Common Issues and Solutions
### Installation and Setup Issues

#### Database Connection Problems

Symptom: the application fails to start with database connection errors:

```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server
```

Solutions:

1. Check the database service:

   ```bash
   # Check PostgreSQL status
   sudo systemctl status postgresql

   # Start PostgreSQL if it is stopped
   sudo systemctl start postgresql
   ```

2. Verify the connection string:

   ```bash
   # Test the connection manually
   psql -h localhost -U race_user -d race_console

   # Check the environment variable
   echo $DATABASE_URL
   ```

3. Check database permissions:

   ```bash
   # Connect as the postgres user
   sudo -u postgres psql
   ```

   ```sql
   -- Check user permissions
   \du race_user

   -- Grant the necessary permissions
   GRANT ALL PRIVILEGES ON DATABASE race_console TO race_user;
   ```

4. Check the network configuration:

   ```bash
   # Check whether the database is listening
   netstat -ln | grep 5432

   # Check firewall rules
   sudo ufw status
   ```
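Before digging into permissions or firewalls, it can help to confirm that `DATABASE_URL` itself parses to the host, port, and database you expect. A minimal stdlib sketch (the example URL is illustrative):

```python
from urllib.parse import urlparse

def describe_db_url(url):
    """Break a SQLAlchemy-style database URL into its connection parts."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,              # e.g. postgresql or postgresql+psycopg2
        "host": parts.hostname or "localhost",
        "port": parts.port or 5432,          # PostgreSQL default port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
    }

info = describe_db_url("postgresql://race_user:secret@localhost:5432/race_console")
print(info["host"], info["port"], info["database"])  # localhost 5432 race_console
```

Comparing this output against the `psql` flags used above quickly exposes host or database-name mismatches.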
#### Python Dependencies Issues

Symptom: import errors or missing module errors:

```
ModuleNotFoundError: No module named 'flask'
```

Solutions:

1. Verify the virtual environment:

   ```bash
   # Check which interpreter is active
   which python

   # Activate the virtual environment
   source venv/bin/activate
   ```

2. Reinstall dependencies:

   ```bash
   # Clear the pip cache
   pip cache purge

   # Reinstall requirements
   pip install -r requirements.txt --force-reinstall
   ```

3. Check for Python version issues:

   ```bash
   # Check the Python version
   python --version

   # Create a new virtual environment with a specific Python version
   python3.11 -m venv venv_new
   ```
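If version drift is a recurring problem, the application can check the interpreter at startup rather than failing later with a confusing import error. A small sketch; the 3.11 floor is an assumption, so adjust it to whatever the project actually requires:

```python
import sys

MIN_PYTHON = (3, 11)  # assumed minimum; set to the project's real requirement

def check_python_version(current=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    current = current or sys.version_info[:2]
    return tuple(current) >= tuple(minimum)

print(check_python_version((3, 12)))  # True: newer than the floor
print(check_python_version((3, 9)))   # False: too old
```

Calling this early in `app` startup turns a vague `ModuleNotFoundError` into an explicit version message.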
### Configuration Issues

#### API Configuration Problems

Symptom: CONNECT API authentication failures:

```
requests.exceptions.HTTPError: 401 Client Error: Unauthorized
```

Solutions:

1. Verify credentials:

   ```bash
   # Test credentials with curl
   curl -X POST "https://datahub.connect.aveva.com/api/v1/tenants/YOUR_TENANT/auth/clientcredentials" \
     -H "Content-Type: application/json" \
     -d '{"client_id": "YOUR_CLIENT_ID", "client_secret": "YOUR_CLIENT_SECRET"}'
   ```

2. Check region settings:

   - US: https://datahub.connect.aveva.com
   - EU: https://euno.datahub.connect.aveva.com
   - AP: https://apac.datahub.connect.aveva.com

3. Validate the configuration:

   ```python
   # In a Flask shell
   from services.aveva_client import AVEVAClient

   client = AVEVAClient()
   client.test_connection()
   ```
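A wrong-region base URL is a common cause of 401s. The region table above can be captured in a small lookup helper so misconfigured region codes fail loudly instead of authenticating against the wrong endpoint (the helper and its name are illustrative; the URLs are the ones listed above):

```python
# Map CONNECT region codes to their API base URLs (from the list above).
CONNECT_REGIONS = {
    "us": "https://datahub.connect.aveva.com",
    "eu": "https://euno.datahub.connect.aveva.com",
    "ap": "https://apac.datahub.connect.aveva.com",
}

def connect_base_url(region):
    """Return the CONNECT base URL for a region code, or raise a clear error."""
    try:
        return CONNECT_REGIONS[region.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown CONNECT region {region!r}; expected one of {sorted(CONNECT_REGIONS)}"
        )

print(connect_base_url("EU"))  # https://euno.datahub.connect.aveva.com
```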
#### Environment Variable Issues

Symptom: configuration values are not loading correctly.

Solutions:

1. Check the environment file:

   ```bash
   # Verify the .env file exists and is readable
   ls -la .env
   cat .env
   ```

2. Load the environment variables:

   ```bash
   # Manual loading for testing
   source .env
   echo $FLASK_SECRET_KEY
   ```

3. Check environment variable priority:

   ```python
   # Check the loaded configuration
   from app import app

   print(app.config['SECRET_KEY'])
   ```
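When a value looks wrong, the usual culprit is precedence: a stale process environment variable silently overriding the `.env` file. A sketch of the typical resolution order (the helper is hypothetical; the real loader may differ):

```python
import os

def resolve_setting(name, dotenv_values, default=None, environ=None):
    """Resolve one config value with the common precedence:
    process environment > .env file > built-in default."""
    environ = os.environ if environ is None else environ
    if name in environ:
        return environ[name]
    if name in dotenv_values:
        return dotenv_values[name]
    return default

dotenv = {"FLASK_SECRET_KEY": "from-dotenv"}
# No process override: the .env value wins.
print(resolve_setting("FLASK_SECRET_KEY", dotenv, environ={}))
# A process override shadows the .env file.
print(resolve_setting("FLASK_SECRET_KEY", dotenv,
                      environ={"FLASK_SECRET_KEY": "from-env"}))
```

If the second case matches what you are seeing, run `env | grep FLASK` to find the shadowing variable.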
### Runtime Issues

#### High Memory Usage

Symptom: the application is consuming excessive memory.

Solutions:

1. Monitor memory usage:

   ```bash
   # Check process memory
   ps aux | grep gunicorn

   # Monitor memory over time
   top -p $(pgrep -f gunicorn)
   ```

2. Optimize database connections:

   ```python
   # Reduce the connection pool size
   SQLALCHEMY_ENGINE_OPTIONS = {
       'pool_size': 5,
       'max_overflow': 10,
       'pool_recycle': 3600
   }
   ```

3. Limit background jobs:

   ```python
   # Reduce concurrent monitoring
   MAX_CONCURRENT_STREAMS = 20
   MONITORING_INTERVAL = 60  # Increase the polling interval
   ```
#### Performance Issues

Symptom: slow page loads and API responses.

Solutions:

1. Optimize database queries:

   ```sql
   -- Add missing indexes
   CREATE INDEX idx_rule_events_active_time ON rule_events(is_active, start_time);
   CREATE INDEX idx_monitored_streams_name ON monitored_streams(stream_name);

   -- Analyze query performance
   EXPLAIN ANALYZE SELECT * FROM rule_events WHERE is_active = true;
   ```

2. Enable query caching:

   ```python
   # Add caching to frequent queries
   from flask_caching import Cache

   cache = Cache(app)

   @cache.memoize(timeout=300)
   def get_active_events():
       return RuleEvent.query.filter_by(is_active=True).all()
   ```

3. Optimize background processing:

   ```python
   # Reduce monitoring frequency for less critical streams
   MONITORING_INTERVALS = {
       'critical': 15,
       'normal': 30,
       'low_priority': 60
   }
   ```
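The tiered intervals above need a safe lookup so a stream with an unrecognized priority does not crash the scheduler. A minimal sketch (the fallback of 30 seconds is an assumption):

```python
MONITORING_INTERVALS = {"critical": 15, "normal": 30, "low_priority": 60}

def interval_for(priority, intervals=MONITORING_INTERVALS, fallback=30):
    """Pick the polling interval (seconds) for a stream's priority tier,
    falling back to the 'normal' cadence for unknown tiers."""
    return intervals.get(priority, fallback)

print(interval_for("critical"))      # 15
print(interval_for("unknown-tier"))  # 30 (fallback)
```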
### AI Provider Issues

#### OpenAI Connection Problems

Symptom: OpenAI API calls failing:

```
openai.error.AuthenticationError: Incorrect API key provided
```

Solutions:

1. Verify the API key (the snippets below use the legacy pre-1.0 `openai` SDK, matching the `openai.error` module in the error above):

   ```python
   # Test the API key
   import openai

   openai.api_key = "your-api-key"
   openai.Model.list()
   ```

2. Check rate limits:

   ```python
   # Simple retry after a rate-limit error
   import time
   from openai.error import RateLimitError

   try:
       response = openai.ChatCompletion.create(...)
   except RateLimitError:
       time.sleep(60)  # Wait before retrying
   ```

3. Monitor usage:

   ```bash
   # Check API usage logs
   tail -f /var/log/race-console/api_usage.log
   ```
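The fixed 60-second wait above can be generalized into a retry decorator with exponential backoff that works with any SDK version, since the exception type is passed in. A self-contained sketch (the demo uses `RuntimeError` and an injected no-op sleep so it runs instantly; swap in the real rate-limit exception):

```python
import functools
import time

def retry_with_backoff(exceptions, tries=3, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Retry a callable on the given exceptions, multiplying the delay
    by `factor` after each failed attempt; re-raise after the last try."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries - 1:
                        raise
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator

# Demonstration: a function that fails twice, then succeeds.
calls = {"n": 0}

@retry_with_backoff(RuntimeError, tries=3, sleep=lambda _: None)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(flaky())  # "ok" after two retried failures
```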
#### Function Calling Issues

Symptom: AI function calls are not executing properly.

Solutions:

1. Validate function schemas:

   ```python
   # Check the function definition format
   def validate_function_schema(schema):
       required_keys = ['name', 'description', 'parameters']
       return all(key in schema for key in required_keys)
   ```

2. Debug function execution:

   ```python
   # Add detailed logging around dispatch
   import logging

   logger = logging.getLogger(__name__)

   def execute_function(function_name, arguments):
       logger.info(f"Executing function: {function_name} with args: {arguments}")
       try:
           result = function_registry[function_name](**arguments)
           logger.info(f"Function result: {result}")
           return result
       except Exception as e:
           logger.error(f"Function execution failed: {e}")
           raise
   ```

3. Check context data:

   ```python
   # Verify context data availability
   from services.context_extractor import ContextExtractor

   extractor = ContextExtractor()
   context = extractor.get_context_data('events', '24h', 50)
   print(f"Context data size: {len(context)}")
   ```
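The schema check above can be extended to validate a call's arguments against the declared required parameters before dispatch, which catches malformed calls with a useful message instead of a `KeyError`. A sketch modeled on OpenAI-style function definitions (the example schema is hypothetical):

```python
def validate_function_call(schema, arguments):
    """Check a function-call payload against its declared schema.

    Returns (ok, problems). Assumes `parameters` is a JSON-Schema-style
    object whose `required` list names mandatory arguments.
    """
    problems = []
    for key in ("name", "description", "parameters"):
        if key not in schema:
            problems.append(f"schema missing {key!r}")
    required = schema.get("parameters", {}).get("required", [])
    for param in required:
        if param not in arguments:
            problems.append(f"missing required argument {param!r}")
    return (not problems, problems)

schema = {
    "name": "get_active_events",
    "description": "List active rule events",
    "parameters": {
        "type": "object",
        "properties": {"window": {"type": "string"}},
        "required": ["window"],
    },
}
print(validate_function_call(schema, {"window": "24h"}))  # (True, [])
print(validate_function_call(schema, {}))                 # (False, [...])
```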
### Monitoring and Events Issues

#### Stream Monitoring Failures

Symptom: streams showing as inactive or not updating.

Solutions:

1. Check the stream configuration:

   ```python
   # Verify the stream exists in CONNECT
   from services.aveva_client import AVEVAClient

   client = AVEVAClient()
   streams = client.discover_streams()
   print([s for s in streams if 'StreamName' in s])
   ```

2. Monitor background jobs:

   ```bash
   # Check scheduler logs
   tail -f /var/log/race-console/scheduler.log

   # Check active jobs
   python -c "
   from app import app
   from services.monitoring_engine import monitoring_engine
   with app.app_context():
       print(monitoring_engine.get_job_status())
   "
   ```

3. Check network connectivity:

   ```bash
   # Test CONNECT API connectivity
   curl -v https://datahub.connect.aveva.com/api/v1/health

   # Check DNS resolution
   nslookup datahub.connect.aveva.com
   ```
#### Rule Evaluation Problems

Symptom: rules not triggering, or triggering incorrectly.

Solutions:

1. Debug the rule logic:

   ```python
   # Test rule conditions manually
   from services.rule_engine import RuleEngine

   engine = RuleEngine()
   condition = {'attribute': 'Status', 'operator': 'equals', 'value': 'Running'}
   result = engine._evaluate_condition(condition, 'Running')
   print(f"Condition result: {result}")
   ```

2. Check placeholder mappings:

   ```python
   # Verify placeholder resolution
   from services.placeholder_resolver import PlaceholderResolver

   resolver = PlaceholderResolver()
   instance_id = 1
   mappings = resolver.get_placeholder_mappings(instance_id)
   print(f"Placeholder mappings: {mappings}")
   ```

3. Validate stream data:

   ```python
   # Check the stream data format
   from services.monitoring_engine import MonitoringEngine

   engine = MonitoringEngine()
   stream_name = "Wonderbrew.Roaster022.Status"
   value = engine.get_stream_value(stream_name)
   print(f"Stream {stream_name} value: {value}")
   ```
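When `_evaluate_condition` gives surprising results, it helps to reproduce the dispatch in isolation. A minimal condition evaluator in the same shape as the dictionaries above (the operator names beyond `equals` are assumptions; the real `RuleEngine` may support a different set):

```python
import operator

# Assumed operator vocabulary; check the real RuleEngine for the full set.
_OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
}

def evaluate_condition(condition, observed_value):
    """Apply a rule condition like {'operator': 'equals', 'value': 'Running'}
    to an observed stream value."""
    op = _OPERATORS.get(condition["operator"])
    if op is None:
        raise ValueError(f"Unsupported operator: {condition['operator']!r}")
    return op(observed_value, condition["value"])

cond = {"attribute": "Status", "operator": "equals", "value": "Running"}
print(evaluate_condition(cond, "Running"))  # True
print(evaluate_condition(cond, "Stopped"))  # False
```

Running both the real engine and a reference evaluator like this over the same condition quickly shows whether the bug is in the condition data or in the engine.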
### UI and Frontend Issues

#### JavaScript Errors

Symptom: UI components not working; errors in the browser console.

Solutions:

1. Check the browser console:

   ```javascript
   // Common error: Feather icons not loading
   // Solution: ensure feather.replace() is called after DOM updates
   setTimeout(() => {
       if (typeof feather !== 'undefined') {
           feather.replace();
       }
   }, 100);
   ```

2. Verify static file loading:

   ```bash
   # Check static file permissions
   ls -la static/js/
   ls -la static/css/

   # Test static file access
   curl -I http://localhost:5000/static/js/main.js
   ```

3. Check browser compatibility:

   ```javascript
   // Check for modern browser features
   if (!window.fetch) {
       console.error('Browser does not support fetch API');
   }
   ```
#### Chart and Visualization Issues

Symptom: charts not rendering, or displaying incorrectly.

Solutions:

1. Check Chart.js loading:

   ```javascript
   // Verify Chart.js availability
   if (typeof Chart === 'undefined') {
       console.error('Chart.js not loaded');
   }
   ```

2. Validate the data format:

   ```javascript
   // Validate the chart data format
   function validateChartData(data) {
       return data && data.labels && data.datasets &&
           Array.isArray(data.labels) && Array.isArray(data.datasets);
   }
   ```

3. Check for canvas context issues:

   ```javascript
   // Ensure the canvas element exists before drawing
   const canvas = document.getElementById('chart-canvas');
   if (!canvas) {
       console.error('Chart canvas element not found');
   }
   ```
### Security Issues

#### Session Management Problems

Symptom: users getting logged out frequently, or session errors.

Solutions:

1. Check the session configuration:

   ```python
   # Verify session settings
   from datetime import timedelta

   app.config['PERMANENT_SESSION_LIFETIME'] = timedelta(hours=24)
   app.config['SESSION_COOKIE_SECURE'] = True  # Only for HTTPS
   app.config['SESSION_COOKIE_HTTPONLY'] = True
   ```

2. Check for secret key issues:

   ```python
   # Ensure the secret key is consistent.
   # Don't change SECRET_KEY in production: doing so invalidates all sessions.
   print(len(app.config['SECRET_KEY']))  # Should be > 32 characters
   ```

3. Check for cookie domain issues:

   ```python
   # Set the correct cookie domain for subdomains
   app.config['SESSION_COOKIE_DOMAIN'] = '.yourdomain.com'
   ```
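The length check above can be automated, and a strong key can be generated with the stdlib. A sketch (the 32-character floor mirrors the guidance above; `token_urlsafe(48)` is one reasonable choice, not the project's mandated method):

```python
import secrets

MIN_KEY_LENGTH = 32  # matches the "> 32 characters" guidance above

def is_strong_secret_key(key):
    """Reject short or non-string Flask secret keys."""
    return isinstance(key, str) and len(key) >= MIN_KEY_LENGTH

def generate_secret_key():
    """Generate a suitable key once, then store it outside version control."""
    return secrets.token_urlsafe(48)  # 64 URL-safe characters

key = generate_secret_key()
print(is_strong_secret_key(key))        # True
print(is_strong_secret_key("dev-key"))  # False
```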
#### HTTPS and SSL Issues

Symptom: SSL certificate errors or mixed content warnings.

Solutions:

1. Check certificate validity:

   ```bash
   # Test the SSL certificate
   openssl s_client -connect yourdomain.com:443 -servername yourdomain.com

   # Check certificate expiration
   openssl x509 -in certificate.crt -text -noout | grep "Not After"
   ```

2. Force an HTTPS redirect:

   ```python
   # Ensure the HTTPS redirect is working
   @app.before_request
   def force_https():
       if not request.is_secure and app.config.get('FORCE_HTTPS'):
           return redirect(request.url.replace('http://', 'https://'))
   ```

3. Fix mixed content by loading every asset over HTTPS:

   ```html
   <!-- Use https:// (or site-relative) URLs for all scripts and styles -->
   <script src="https://cdn.example.com/library.min.js"></script>
   <link rel="stylesheet" href="/static/css/main.css">
   ```
## Diagnostic Tools

### Health Check Endpoint

Create a comprehensive health check:

```python
from datetime import datetime

from flask import jsonify
from sqlalchemy import text

from app import app, db

@app.route('/health')
def health_check():
    """Comprehensive health check endpoint"""
    health_status = {
        'status': 'healthy',
        'timestamp': datetime.utcnow().isoformat(),
        'checks': {}
    }

    # Database check
    try:
        db.session.execute(text('SELECT 1'))
        health_status['checks']['database'] = 'healthy'
    except Exception as e:
        health_status['checks']['database'] = f'unhealthy: {str(e)}'
        health_status['status'] = 'unhealthy'

    # CONNECT API check
    try:
        from services.aveva_client import AVEVAClient
        client = AVEVAClient()
        client.test_connection()
        health_status['checks']['connect_api'] = 'healthy'
    except Exception as e:
        health_status['checks']['connect_api'] = f'unhealthy: {str(e)}'
        health_status['status'] = 'degraded'

    # Background jobs check
    try:
        from services.monitoring_engine import monitoring_engine
        if monitoring_engine.is_running():
            health_status['checks']['background_jobs'] = 'healthy'
        else:
            health_status['checks']['background_jobs'] = 'stopped'
            health_status['status'] = 'degraded'
    except Exception as e:
        health_status['checks']['background_jobs'] = f'unhealthy: {str(e)}'

    status_code = 200 if health_status['status'] == 'healthy' else 503
    return jsonify(health_status), status_code
```
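The endpoint's roll-up logic can be factored into a pure function, which makes it unit-testable and guarantees that an unhealthy critical check is never downgraded by a later merely-degraded one. A sketch (treating only the database as critical is an assumption mirroring the endpoint above):

```python
def overall_status(checks, critical=("database",)):
    """Roll individual check results up into one status string.

    A failed critical check yields 'unhealthy'; any other failure yields
    'degraded'; all-passing yields 'healthy'.
    """
    status = "healthy"
    for name, result in checks.items():
        if result == "healthy":
            continue
        if name in critical:
            return "unhealthy"
        status = "degraded"
    return status

print(overall_status({"database": "healthy", "connect_api": "healthy"}))
print(overall_status({"database": "healthy", "connect_api": "unhealthy: 401"}))
print(overall_status({"database": "unhealthy: refused", "connect_api": "healthy"}))
```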
### Log Analysis Script

```bash
#!/bin/bash
# scripts/analyze_logs.sh

echo "=== RACE Console Log Analysis ==="

# Check application logs for errors
echo "Recent Errors:"
grep -i error /var/log/race-console/app.log | tail -10

# Check for database connection issues
echo -e "\nDatabase Issues:"
grep -i "database\|connection" /var/log/race-console/app.log | tail -5

# Check for API call failures
echo -e "\nAPI Failures:"
grep -i "api.*error\|failed.*request" /var/log/race-console/app.log | tail -5

# Check memory usage
echo -e "\nMemory Usage:"
ps aux | grep -E "(gunicorn|python)" | awk '{print $6, $11}' | sort -nr | head -5

# Check disk space
echo -e "\nDisk Usage:"
df -h | grep -E "(Filesystem|/dev/)"
```
### Database Maintenance Script

```python
#!/usr/bin/env python3
# scripts/db_maintenance.py

import logging
from datetime import datetime, timedelta

from app import app, db
from models import RuleEvent, ConversationSession, MonitoringEvent

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def cleanup_old_data():
    """Clean up old data to maintain performance"""
    with app.app_context():
        # Clean up old monitoring events (older than 30 days)
        cutoff_date = datetime.utcnow() - timedelta(days=30)
        old_events = MonitoringEvent.query.filter(
            MonitoringEvent.timestamp < cutoff_date
        ).count()
        if old_events > 0:
            MonitoringEvent.query.filter(
                MonitoringEvent.timestamp < cutoff_date
            ).delete()
            logger.info(f"Deleted {old_events} old monitoring events")

        # Clean up old conversation sessions (older than 90 days)
        session_cutoff = datetime.utcnow() - timedelta(days=90)
        old_sessions = ConversationSession.query.filter(
            ConversationSession.started_at < session_cutoff
        ).count()
        if old_sessions > 0:
            ConversationSession.query.filter(
                ConversationSession.started_at < session_cutoff
            ).delete()
            logger.info(f"Deleted {old_sessions} old conversation sessions")

        db.session.commit()
        logger.info("Database cleanup completed")

if __name__ == '__main__':
    cleanup_old_data()
```
### Performance Monitoring Script

```python
#!/usr/bin/env python3
# scripts/performance_monitor.py

import json
import time
from datetime import datetime

import psutil

def collect_metrics():
    """Collect system performance metrics"""
    metrics = {
        'timestamp': datetime.utcnow().isoformat(),
        'cpu_percent': psutil.cpu_percent(interval=1),
        'memory_percent': psutil.virtual_memory().percent,
        'disk_usage': psutil.disk_usage('/').percent,
        'load_average': psutil.getloadavg(),
        'process_count': len(psutil.pids())
    }

    # Get process-specific metrics (name can be None for some processes)
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_percent']):
        name = proc.info['name'] or ''
        if 'gunicorn' in name or 'python' in name:
            metrics[f"process_{proc.info['pid']}"] = {
                'name': name,
                'cpu_percent': proc.info['cpu_percent'],
                'memory_percent': proc.info['memory_percent']
            }
    return metrics

def main():
    """Main monitoring loop"""
    while True:
        metrics = collect_metrics()

        # Log metrics
        with open('/var/log/race-console/performance.log', 'a') as f:
            f.write(json.dumps(metrics) + '\n')

        # Alert on high resource usage
        if metrics['cpu_percent'] > 80:
            print(f"HIGH CPU USAGE: {metrics['cpu_percent']}%")
        if metrics['memory_percent'] > 85:
            print(f"HIGH MEMORY USAGE: {metrics['memory_percent']}%")

        time.sleep(60)  # Collect metrics every minute

if __name__ == '__main__':
    main()
```
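The two hard-coded alert checks in the loop above can be generalized into a threshold table, so adding a new metric means one dictionary entry rather than another `if`. A sketch (the thresholds are the same 80% and 85% used above):

```python
THRESHOLDS = {"cpu_percent": 80, "memory_percent": 85}  # from the script above

def alerts_for(metrics, thresholds=THRESHOLDS):
    """Return an alert message for every metric over its threshold."""
    return [f"HIGH {name.upper()}: {metrics[name]}%"
            for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

print(alerts_for({"cpu_percent": 91.0, "memory_percent": 40.0}))
```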
## Emergency Procedures

### Service Recovery

#### Quick Service Restart

```bash
#!/bin/bash
# scripts/emergency_restart.sh

echo "Starting emergency service restart..."

# Stop services
sudo systemctl stop race-console
sudo systemctl stop nginx

# Kill any remaining processes
pkill -f gunicorn
pkill -f "python.*main.py"

# Clear any stale PID files
rm -f /var/run/race-console/*.pid

# Start services
sudo systemctl start race-console
sudo systemctl start nginx

# Check status
sleep 5
sudo systemctl status race-console
sudo systemctl status nginx

echo "Emergency restart completed"
```
#### Database Recovery

```bash
#!/bin/bash
# scripts/database_recovery.sh

echo "Starting database recovery..."

# Stop the application
sudo systemctl stop race-console

# Restart PostgreSQL
sudo systemctl restart postgresql

# Check database integrity
sudo -u postgres psql -d race_console -c "SELECT pg_database_size('race_console');"

# Vacuum and analyze
sudo -u postgres psql -d race_console -c "VACUUM ANALYZE;"

# Start the application
sudo systemctl start race-console

echo "Database recovery completed"
```
### Rollback Procedures

#### Configuration Rollback

```bash
#!/bin/bash
# scripts/rollback_config.sh

BACKUP_DIR="/backups/config"
DATE=$(date +%Y%m%d)

# Restore configuration from backup
if [ -f "$BACKUP_DIR/config_$DATE.tar.gz" ]; then
    echo "Restoring configuration from $DATE"
    tar -xzf "$BACKUP_DIR/config_$DATE.tar.gz" -C /
    sudo systemctl restart race-console
else
    echo "No configuration backup found for $DATE"
    exit 1
fi
```
## Contact Information

For critical issues requiring immediate attention:

- System Administrator: [admin@yourcompany.com]
- Database Administrator: [dba@yourcompany.com]
- Development Team: [dev-team@yourcompany.com]
- Emergency Hotline: [+1-XXX-XXX-XXXX]

### Escalation Procedures

1. Level 1: Check logs and restart services
2. Level 2: Database recovery and configuration rollback
3. Level 3: Contact the system administrator
4. Level 4: Emergency hotline for critical production issues