Monitoring and Observability¶

Comprehensive monitoring setup for Navigator in production environments, including metrics, logging, alerting, and performance monitoring.

Quick Start¶

# 1. Enable structured logging
export LOG_LEVEL=info

# 2. Set up health check endpoint
curl http://localhost:3000/up

# 3. Monitor with systemd
sudo journalctl -u navigator -f

# 4. Basic metrics collection
ps aux | grep navigator

Health Monitoring¶

Health Check Endpoint¶

Navigator applications typically expose a health check endpoint:

# Basic health check
curl http://localhost:3000/up

# With timeout and failure detection
curl -f --max-time 5 http://localhost:3000/up || echo "Health check failed"

Rails setup (add to config/routes.rb):

Rails.application.routes.draw do
  get '/up', to: 'rails/health#show', as: :rails_health_check
end

Process Health Monitoring¶

#!/bin/bash
# /usr/local/bin/navigator-health.sh

# Check Navigator process
if ! pgrep -f navigator > /dev/null; then
    echo "CRITICAL: Navigator process not running"
    exit 2
fi

# Check port binding
if ! netstat -tlnp | grep -q ":3000.*navigator"; then
    echo "CRITICAL: Navigator not listening on port 3000"
    exit 2
fi

# Check HTTP response
if ! curl -f -s --max-time 5 http://localhost:3000/up > /dev/null; then
    echo "WARNING: Navigator health check failed"
    exit 1
fi

# Check Rails processes
rails_count=$(pgrep -f "rails server" | wc -l)
if [ "$rails_count" -eq 0 ]; then
    echo "WARNING: No Rails processes running"
    exit 1
fi

echo "OK: Navigator healthy, $rails_count Rails processes"
exit 0

Logging¶

Structured Logging Configuration¶

Navigator uses Go's slog package for structured logging:

# Set log level
export LOG_LEVEL=info    # debug, info, warn, error

# Run Navigator with structured logging
navigator config.yml

Log format example:

2024-09-02T17:20:42Z INFO Starting Navigator listen=:3000
2024-09-02T17:20:42Z INFO Process started app=main port=4001 pid=12345
2024-09-02T17:20:45Z DEBUG Request routed path=/users method=GET app=main
2024-09-02T17:20:45Z WARN Process idle timeout app=main idle_time=300s

Log Aggregation¶

systemd Journal Integration¶

# View Navigator logs
sudo journalctl -u navigator -f

# Search logs
sudo journalctl -u navigator | grep ERROR

# Export logs for analysis
sudo journalctl -u navigator --since yesterday --output json > navigator.log

rsyslog Configuration¶

/etc/rsyslog.d/navigator.conf

# Separate Navigator logs
:programname, isequal, "navigator" /var/log/navigator.log
& stop

Log Rotation¶

/etc/logrotate.d/navigator

/var/log/navigator.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 navigator navigator
    postrotate
        /usr/bin/systemctl reload navigator
    endscript
}

Metrics Collection¶

System Metrics¶

#!/bin/bash
# /usr/local/bin/navigator-metrics.sh

# Process metrics
echo "# Navigator process metrics"
echo "navigator_processes $(pgrep -f navigator | wc -l)"
echo "navigator_rails_processes $(pgrep -f 'rails server' | wc -l)"

# Memory usage (in bytes)
navigator_memory=$(ps -o pid,rss -p $(pgrep -f navigator) | tail -n +2 | awk '{sum+=$2} END {print sum*1024}')
echo "navigator_memory_bytes ${navigator_memory:-0}"

# CPU usage
navigator_cpu=$(ps -o pid,pcpu -p $(pgrep -f navigator) | tail -n +2 | awk '{sum+=$2} END {print sum}')
echo "navigator_cpu_percent ${navigator_cpu:-0}"

# Connection count
connections=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
echo "navigator_connections $connections"

# Port usage (4000-4099 range for Rails processes)
rails_ports=$(netstat -tlnp | grep -E ':40[0-9][0-9]' | wc -l)
echo "navigator_rails_ports_used $rails_ports"

Application Metrics¶

Monitor Rails application performance:

#!/bin/bash
# Rails application metrics from logs

# Request rate (requests per minute)
requests_per_min=$(tail -n 1000 /var/log/navigator.log | grep "$(date '+%Y-%m-%dT%H:%M')" | grep -c 'method=GET\|method=POST')
echo "rails_requests_per_minute $requests_per_min"

# Response time analysis
tail -n 1000 /var/log/navigator.log | grep "completed" | awk '{print $NF}' | sed 's/ms//' | awk '
{
    sum+=$1; 
    count++; 
    if($1>max) max=$1; 
    if(min=="" || $1<min) min=$1
} 
END {
    print "rails_response_time_avg", (count>0 ? sum/count : 0)
    print "rails_response_time_max", (max ? max : 0)
    print "rails_response_time_min", (min ? min : 0)
}'

Prometheus Integration¶

Metrics Export¶

/usr/local/bin/navigator-prometheus.sh

#!/bin/bash
# Export Navigator metrics in Prometheus format

# Write metrics to file for node_exporter textfile collector
METRICS_FILE="/var/lib/prometheus/node-exporter/navigator.prom"

{
    echo "# HELP navigator_up Navigator process status"
    echo "# TYPE navigator_up gauge"
    if pgrep -f navigator > /dev/null; then
        echo "navigator_up 1"
    else
        echo "navigator_up 0"
    fi

    echo "# HELP navigator_processes Number of Navigator processes"
    echo "# TYPE navigator_processes gauge"
    echo "navigator_processes $(pgrep -f navigator | wc -l)"

    echo "# HELP navigator_rails_processes Number of Rails processes"
    echo "# TYPE navigator_rails_processes gauge"
    echo "navigator_rails_processes $(pgrep -f 'rails server' | wc -l)"

    echo "# HELP navigator_memory_bytes Navigator memory usage in bytes"
    echo "# TYPE navigator_memory_bytes gauge"
    memory=$(ps -o pid,rss -p $(pgrep -f navigator) | tail -n +2 | awk '{sum+=$2} END {print sum*1024}')
    echo "navigator_memory_bytes ${memory:-0}"

    echo "# HELP navigator_connections_total Active connections"
    echo "# TYPE navigator_connections_total gauge"
    connections=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
    echo "navigator_connections_total $connections"
} > "$METRICS_FILE.tmp" && mv "$METRICS_FILE.tmp" "$METRICS_FILE"

# Run metrics collection every minute
echo "* * * * * navigator /usr/local/bin/navigator-prometheus.sh" | sudo crontab -u navigator -

Prometheus Configuration¶

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'navigator-node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'navigator-health'
    metrics_path: '/up'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 30s

Application Performance Monitoring¶

New Relic Integration¶

Navigator configuration

applications:
  global_env:
    NEW_RELIC_LICENSE_KEY: "${NEW_RELIC_LICENSE_KEY}"
    NEW_RELIC_APP_NAME: "Navigator Production"
    NEW_RELIC_DISTRIBUTED_TRACING_ENABLED: "true"

Rails: config/newrelic.yml

production:
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: Navigator Rails App
  distributed_tracing:
    enabled: true
  transaction_tracer:
    enabled: true
  error_collector:
    enabled: true

Honeybadger Error Tracking¶

Navigator configuration

applications:
  global_env:
    HONEYBADGER_API_KEY: "${HONEYBADGER_API_KEY}"
    HONEYBADGER_ENV: "production"

Custom Rails Monitoring¶

Rails: config/initializers/navigator_monitoring.rb

# Custom middleware for Navigator-specific metrics
class NavigatorMonitoring
  def initialize(app)
    @app = app
  end

  def call(env)
    start_time = Time.current
    status, headers, response = @app.call(env)
    duration = (Time.current - start_time) * 1000

    # Log request metrics in Navigator-compatible format
    Rails.logger.info({
      event: 'request_completed',
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      status: status,
      duration_ms: duration.round(2),
      process_id: Process.pid
    }.to_json)

    [status, headers, response]
  rescue => e
    Rails.logger.error({
      event: 'request_error',
      error: e.class.name,
      message: e.message,
      path: env['PATH_INFO']
    }.to_json)
    raise
  end
end

Rails.application.config.middleware.use NavigatorMonitoring

Alerting¶

Basic Shell Script Alerts¶

/usr/local/bin/navigator-alerts.sh

#!/bin/bash
# Basic alerting script

ALERT_EMAIL="admin@example.com"
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

send_alert() {
    local severity=$1
    local message=$2

    # Email alert
    echo "Navigator Alert [$severity]: $message" | mail -s "Navigator Alert" "$ALERT_EMAIL"

    # Slack webhook
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"Navigator Alert [$severity]: $message\"}" \
        "$WEBHOOK_URL"
}

# Check Navigator health
if ! /usr/local/bin/navigator-health.sh > /dev/null; then
    send_alert "CRITICAL" "Navigator health check failed"
fi

# Check memory usage
memory_usage=$(ps -o pid,pmem -p $(pgrep -f navigator) | tail -n +2 | awk '{sum+=$2} END {print sum}')
if (( $(echo "$memory_usage > 80" | bc -l) )); then
    send_alert "WARNING" "Navigator memory usage high: ${memory_usage}%"
fi

# Check log for errors
error_count=$(journalctl -u navigator --since "5 minutes ago" -p err | wc -l)
if [ "$error_count" -gt 0 ]; then
    send_alert "WARNING" "Navigator logged $error_count errors in last 5 minutes"
fi

systemd Service Monitoring¶

/etc/systemd/system/navigator-monitor.service

[Unit]
Description=Navigator Monitoring
Requires=navigator.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/navigator-alerts.sh

[Install]
WantedBy=multi-user.target

/etc/systemd/system/navigator-monitor.timer

[Unit]
Description=Run Navigator monitoring every 5 minutes
Requires=navigator-monitor.service

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target

# Enable monitoring
sudo systemctl enable navigator-monitor.timer
sudo systemctl start navigator-monitor.timer

Dashboard Setup¶

Grafana Dashboard¶

navigator-dashboard.json

{
  "dashboard": {
    "title": "Navigator Monitoring",
    "panels": [
      {
        "title": "Navigator Status",
        "type": "stat",
        "targets": [
          {
            "expr": "navigator_up",
            "legendFormat": "Navigator Up"
          }
        ]
      },
      {
        "title": "Active Processes",
        "type": "graph",
        "targets": [
          {
            "expr": "navigator_processes",
            "legendFormat": "Navigator Processes"
          },
          {
            "expr": "navigator_rails_processes", 
            "legendFormat": "Rails Processes"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "navigator_memory_bytes",
            "legendFormat": "Memory Usage"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "graph",
        "targets": [
          {
            "expr": "navigator_connections_total",
            "legendFormat": "Connections"
          }
        ]
      }
    ]
  }
}

Simple HTML Dashboard¶

/var/www/monitor/index.html

<!DOCTYPE html>
<html>
<head>
    <title>Navigator Status</title>
    <meta http-equiv="refresh" content="30">
</head>
<body>
    <h1>Navigator Status Dashboard</h1>

    <div id="status">
        <script>
            fetch('/cgi-bin/navigator-status.sh')
                .then(response => response.text())
                .then(data => {
                    document.getElementById('status').innerHTML = '<pre>' + data + '</pre>';
                });
        </script>
    </div>
</body>
</html>

Performance Monitoring¶

Response Time Monitoring¶

#!/bin/bash
# Monitor Navigator response times

measure_response_time() {
    local url=$1
    local name=$2

    time=$(curl -o /dev/null -s -w '%{time_total}\n' "$url")
    echo "response_time_seconds{endpoint=\"$name\"} $time"
}

# Monitor different endpoints
measure_response_time "http://localhost:3000/up" "health"
measure_response_time "http://localhost:3000/" "home"
measure_response_time "http://localhost:3000/api/users" "api"

Load Testing Integration¶

#!/bin/bash
# Automated load testing with monitoring

# Run load test
ab -n 1000 -c 10 http://localhost:3000/ > /tmp/load_test.out

# Extract key metrics
requests_per_second=$(grep "Requests per second" /tmp/load_test.out | awk '{print $4}')
mean_time=$(grep "Time per request" /tmp/load_test.out | head -1 | awk '{print $4}')

# Log results
echo "load_test_rps $requests_per_second"
echo "load_test_mean_time $mean_time"

# Alert if performance degrades
if (( $(echo "$requests_per_second < 50" | bc -l) )); then
    echo "WARNING: Low request rate: $requests_per_second RPS"
fi

Troubleshooting Monitoring¶

Common Issues¶

No Metrics Being Collected¶

# Check if scripts are executable
ls -la /usr/local/bin/navigator-*.sh

# Verify cron jobs
crontab -l -u navigator

# Test metric collection manually
/usr/local/bin/navigator-metrics.sh

Health Checks Failing¶

# Test health check manually
curl -v http://localhost:3000/up

# Check Navigator process
ps aux | grep navigator

# Verify port binding
netstat -tlnp | grep :3000

High Memory Usage Alerts¶

# Check actual memory usage
ps aux --sort=-%mem | head -10

# Monitor Rails process memory
ps aux | grep rails | awk '{print $6}' | sort -nr

# Check for memory leaks
while true; do
    ps -o pid,rss,cmd -p $(pgrep -f rails)
    sleep 60
done

Debug Logging¶

# Enable debug logging for troubleshooting
export LOG_LEVEL=debug
systemctl restart navigator

# Watch debug logs
journalctl -u navigator -f | grep DEBUG

Best Practices¶

1. Monitoring Strategy¶

Monitor both Navigator and Rails processes
Track system resources (CPU, memory, disk)
Set up both technical and business metrics
Use multiple monitoring tools for redundancy

2. Alerting Guidelines¶

Alert on symptoms, not just causes
Use different severity levels appropriately
Avoid alert fatigue with proper thresholds
Include runbook information in alerts

3. Performance Monitoring¶

Establish baseline performance metrics
Monitor end-to-end response times
Track error rates and types
Set up synthetic monitoring

4. Log Management¶

Use structured logging consistently
Implement proper log rotation
Centralize logs for analysis
Include correlation IDs for tracing

Integration Examples¶

CloudWatch (AWS)¶

# Install CloudWatch agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U amazon-cloudwatch-agent.rpm

# Configure custom metrics
aws logs create-log-group --log-group-name navigator-logs

Datadog Integration¶

# Add to Navigator environment
applications:
  global_env:
    DD_API_KEY: "${DATADOG_API_KEY}"
    DD_SITE: "datadoghq.com"
    DD_SERVICE: "navigator"
    DD_VERSION: "1.0.0"