Monitoring and Observability

This guide covers monitoring Navigator in production: health checks, logging, metrics collection, alerting, and performance monitoring.

Quick Start

# 1. Set the log level for structured logging
export LOG_LEVEL=info

# 2. Check the health endpoint
curl http://localhost:3000/up

# 3. Follow logs with systemd
sudo journalctl -u navigator -f

# 4. Confirm the process is running
pgrep -a navigator

Health Monitoring

Health Check Endpoint

Navigator applications typically expose a health check endpoint:

# Basic health check
curl http://localhost:3000/up

# With timeout and failure detection
curl -f --max-time 5 http://localhost:3000/up || echo "Health check failed"
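
A single failed request does not always mean the service is down: a restart or a slow first response can cause one miss. The sketch below retries a check a few times before reporting failure. The function name `retry_check` and its arguments are illustrative and not part of Navigator.

```bash
# retry_check ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 if every attempt fails.
retry_check() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final attempt
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry_check 3 2 curl -f -s --max-time 5 http://localhost:3000/up || echo "Health check failed"` reports a failure only after three consecutive misses.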

Rails setup (add to config/routes.rb; applications generated with Rails 7.1 or later already include this route):

Rails.application.routes.draw do
  get '/up', to: 'rails/health#show', as: :rails_health_check
end

Process Health Monitoring

#!/bin/bash
# /usr/local/bin/navigator-health.sh

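# Check Navigator process
# -x matches the exact process name; -f would also match this script's own path
if ! pgrep -x navigator > /dev/null; then
    echo "CRITICAL: Navigator process not running"
    exit 2
fi

# Check port binding (run as root: netstat -p only shows process names to root)
if ! netstat -tlnp | grep -q ":3000.*navigator"; then
    echo "CRITICAL: Navigator not listening on port 3000"
    exit 2
fi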

# Check HTTP response
if ! curl -f -s --max-time 5 http://localhost:3000/up > /dev/null; then
    echo "WARNING: Navigator health check failed"
    exit 1
fi

# Check Rails processes
rails_count=$(pgrep -f "rails server" | wc -l)
if [ "$rails_count" -eq 0 ]; then
    echo "WARNING: No Rails processes running"
    exit 1
fi

echo "OK: Navigator healthy, $rails_count Rails processes"
exit 0
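
The script above exits with 0, 1, or 2, which matches the common Nagios plugin convention (OK, WARNING, CRITICAL). A minimal sketch of a wrapper that turns those codes back into a severity label is shown below; the function names `severity_for_exit_code` and `run_check` are hypothetical.

```bash
# Map a health-check exit code to a severity label.
# 0/1/2 are the codes used by the script above; anything else is UNKNOWN.
severity_for_exit_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# run_check COMMAND [ARGS...]
# Runs a check, prints "<SEVERITY>: <output>", and returns the check's exit code.
run_check() {
    local output code=0
    output=$("$@" 2>&1) || code=$?
    echo "$(severity_for_exit_code "$code"): $output"
    return "$code"
}
```

Calling `run_check /usr/local/bin/navigator-health.sh` from a scheduler gives one line per run that carries both the severity and the script's message.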

Logging

Structured Logging Configuration

Navigator uses Go's slog package for structured logging:

# Set log level
export LOG_LEVEL=info    # debug, info, warn, error

# Run Navigator with structured logging
navigator config.yml

Log format example:

2024-09-02T17:20:42Z INFO Starting Navigator listen=:3000
2024-09-02T17:20:42Z INFO Process started app=main port=4001 pid=12345
2024-09-02T17:20:45Z DEBUG Request routed path=/users method=GET app=main
2024-09-02T17:20:45Z WARN Process idle timeout app=main idle_time=300s
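
Because each line carries a level followed by key=value attributes, standard text tools can summarize the log. The sketch below assumes the layout shown in the sample (timestamp, level, message, attributes); the helper names `count_by_level` and `log_field` are illustrative.

```bash
# Count log lines per level. Assumes the level is the second
# whitespace-separated field, as in the sample output above.
count_by_level() {
    awk 'NF >= 2 { counts[$2]++ }
         END { for (level in counts) print level, counts[level] }' | LC_ALL=C sort
}

# Print the value of one key=value attribute from every line that has it.
# Quoted values that contain spaces are not handled by this sketch.
log_field() {
    local key=$1
    awk -v key="$key" '{
        for (i = 1; i <= NF; i++) {
            if (index($i, key "=") == 1) print substr($i, length(key) + 2)
        }
    }'
}
```

For example, `sudo journalctl -u navigator -o cat | count_by_level` prints one count per level, and piping the same output to `log_field app` lists the application named on each line.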

Log Aggregation

systemd Journal Integration

# View Navigator logs
sudo journalctl -u navigator -f

# Search logs
sudo journalctl -u navigator | grep ERROR

# Export logs for analysis
sudo journalctl -u navigator --since yesterday --output json > navigator.log

rsyslog Configuration

/etc/rsyslog.d/navigator.conf
# Separate Navigator logs
:programname, isequal, "navigator" /var/log/navigator.log
& stop

Log Rotation

/etc/logrotate.d/navigator
/var/log/navigator.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 navigator navigator
    postrotate
        /usr/bin/systemctl reload navigator
    endscript
}

Metrics Collection

System Metrics

#!/bin/bash
# /usr/local/bin/navigator-metrics.sh

# Process metrics
# pgrep -x matches the exact process name, so this script's own path is not counted
echo "# Navigator process metrics"
echo "navigator_processes $(pgrep -x navigator | wc -l)"
echo "navigator_rails_processes $(pgrep -f 'rails server' | wc -l)"

# Memory usage (in bytes); ps reports RSS in KiB
navigator_pids=$(pgrep -d, -x navigator)
navigator_memory=$(ps -o rss= -p "$navigator_pids" 2>/dev/null | awk '{sum+=$1} END {print sum*1024}')
echo "navigator_memory_bytes ${navigator_memory:-0}"

# CPU usage
navigator_cpu=$(ps -o pcpu= -p "$navigator_pids" 2>/dev/null | awk '{sum+=$1} END {print sum+0}')
echo "navigator_cpu_percent ${navigator_cpu:-0}"

# Connection count
connections=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
echo "navigator_connections $connections"

# Port usage (4000-4099 range for Rails processes)
rails_ports=$(netstat -tlnp | grep -E ':40[0-9][0-9]' | wc -l)
echo "navigator_rails_ports_used $rails_ports"

Application Metrics

Monitor Rails application performance:

#!/bin/bash
# Rails application metrics from logs

# Request rate (requests per minute)
requests_per_min=$(tail -n 1000 /var/log/navigator.log | grep "$(date '+%Y-%m-%dT%H:%M')" | grep -c 'method=GET\|method=POST')
echo "rails_requests_per_minute $requests_per_min"

# Response time analysis
tail -n 1000 /var/log/navigator.log | grep "completed" | awk '{print $NF}' | sed 's/ms//' | awk '
{
    sum+=$1; 
    count++; 
    if($1>max) max=$1; 
    if(min=="" || $1<min) min=$1
} 
END {
    print "rails_response_time_avg", (count>0 ? sum/count : 0)
    print "rails_response_time_max", (max ? max : 0)
    print "rails_response_time_min", (min ? min : 0)
}'
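
Averages hide slow outliers, so a percentile is often a useful addition to the minimum, maximum, and mean. The sketch below computes a nearest-rank percentile from one number per line; the function name `percentile` is hypothetical, and the input is assumed to be durations already stripped of their unit.

```bash
# percentile P: read one number per line on stdin and print the
# P-th percentile (integer P, nearest-rank method). Prints 0 for empty input.
percentile() {
    local p=$1
    sort -n | awk -v p="$p" '
        { values[NR] = $1 }
        END {
            if (NR == 0) { print 0; exit }
            # Nearest rank: ceil(p * N / 100), computed with integer arithmetic
            rank = int((p * NR + 99) / 100)
            if (rank < 1) rank = 1
            print values[rank]
        }'
}
```

It can be fed from the same pipeline used for the response time analysis, for example by replacing the final awk stage with `percentile 95`.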

Prometheus Integration

Metrics Export

#!/bin/bash
# /usr/local/bin/navigator-prometheus.sh
# Export Navigator metrics in Prometheus format

# Write metrics to file for node_exporter textfile collector
METRICS_FILE="/var/lib/prometheus/node-exporter/navigator.prom"

{
    echo "# HELP navigator_up Navigator process status"
    echo "# TYPE navigator_up gauge"
    # pgrep -x matches the exact process name; -f would also match this script's path
    if pgrep -x navigator > /dev/null; then
        echo "navigator_up 1"
    else
        echo "navigator_up 0"
    fi

    echo "# HELP navigator_processes Number of Navigator processes"
    echo "# TYPE navigator_processes gauge"
    echo "navigator_processes $(pgrep -x navigator | wc -l)"

    echo "# HELP navigator_rails_processes Number of Rails processes"
    echo "# TYPE navigator_rails_processes gauge"
    echo "navigator_rails_processes $(pgrep -f 'rails server' | wc -l)"

    echo "# HELP navigator_memory_bytes Navigator memory usage in bytes"
    echo "# TYPE navigator_memory_bytes gauge"
    memory=$(ps -o rss= -p "$(pgrep -d, -x navigator)" 2>/dev/null | awk '{sum+=$1} END {print sum*1024}')
    echo "navigator_memory_bytes ${memory:-0}"

    echo "# HELP navigator_connections_total Active connections"
    echo "# TYPE navigator_connections_total gauge"
    connections=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
    echo "navigator_connections_total $connections"
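} > "$METRICS_FILE.tmp" && mv "$METRICS_FILE.tmp" "$METRICS_FILE"

# Run metrics collection every minute.
# A per-user crontab entry has no user field (that field belongs to /etc/cron.d files).
# Note: "crontab -" replaces the navigator user's entire crontab.
echo "* * * * * /usr/local/bin/navigator-prometheus.sh" | sudo crontab -u navigator -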
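
A malformed line in the textfile can cause the collector to reject the file, so it can help to check the output before relying on it. The sketch below is a rough format check only, not a full parser of the exposition format: it accepts comment lines and `name value` or `name{labels} value` lines with a numeric value. The function name `check_prom_file` is hypothetical.

```bash
# check_prom_file FILE
# Prints every line that is neither a comment nor a "name[{labels}] number"
# sample, and returns non-zero if any such line was found.
# Special values such as NaN or +Inf are not accepted by this sketch.
check_prom_file() {
    awk '
        /^#/ || /^$/ { next }
        !/^[a-zA-Z_:][a-zA-Z0-9_:]*([{][^}]*[}])? [-+]?[0-9][0-9.eE+-]*$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Where Prometheus tooling is installed, `promtool check metrics < navigator.prom` performs a more thorough check.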

Prometheus Configuration

prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'navigator-node'
    static_configs:
      - targets: ['localhost:9100']

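  # Note: Prometheus can only parse responses in its text exposition format.
  # If /up returns the default Rails health page (HTML), this job reports the
  # target as down; probe the endpoint with blackbox_exporter in that case.
  - job_name: 'navigator-health'
    metrics_path: '/up'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 30s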

Application Performance Monitoring

New Relic Integration

Navigator configuration
applications:
  global_env:
    NEW_RELIC_LICENSE_KEY: "${NEW_RELIC_LICENSE_KEY}"
    NEW_RELIC_APP_NAME: "Navigator Production"
    NEW_RELIC_DISTRIBUTED_TRACING_ENABLED: "true"

Rails: config/newrelic.yml
production:
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: Navigator Rails App
  distributed_tracing:
    enabled: true
  transaction_tracer:
    enabled: true
  error_collector:
    enabled: true

Honeybadger Error Tracking

Navigator configuration
applications:
  global_env:
    HONEYBADGER_API_KEY: "${HONEYBADGER_API_KEY}"
    HONEYBADGER_ENV: "production"

Custom Rails Monitoring

Rails: config/initializers/navigator_monitoring.rb
# Custom middleware for Navigator-specific metrics
class NavigatorMonitoring
  def initialize(app)
    @app = app
  end

  def call(env)
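    # Use the monotonic clock so wall-clock adjustments cannot skew the duration
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, response = @app.call(env)
    duration = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000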

    # Log request metrics in Navigator-compatible format
    Rails.logger.info({
      event: 'request_completed',
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      status: status,
      duration_ms: duration.round(2),
      process_id: Process.pid
    }.to_json)

    [status, headers, response]
  rescue => e
    Rails.logger.error({
      event: 'request_error',
      error: e.class.name,
      message: e.message,
      path: env['PATH_INFO']
    }.to_json)
    raise
  end
end

Rails.application.config.middleware.use NavigatorMonitoring

Alerting

Basic Shell Script Alerts

#!/bin/bash
# /usr/local/bin/navigator-alerts.sh
# Basic alerting script

ALERT_EMAIL="admin@example.com"
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

send_alert() {
    local severity=$1
    local message=$2

    # Email alert
    echo "Navigator Alert [$severity]: $message" | mail -s "Navigator Alert" "$ALERT_EMAIL"

    # Slack webhook
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"Navigator Alert [$severity]: $message\"}" \
        "$WEBHOOK_URL"
}

# Check Navigator health
if ! /usr/local/bin/navigator-health.sh > /dev/null; then
    send_alert "CRITICAL" "Navigator health check failed"
fi

# Check memory usage
# pgrep -x avoids matching this script's own path; "sum+0" prints 0 when no process is found
memory_usage=$(ps -o pmem= -p "$(pgrep -d, -x navigator)" 2>/dev/null | awk '{sum+=$1} END {print sum+0}')
if (( $(echo "$memory_usage > 80" | bc -l) )); then
    send_alert "WARNING" "Navigator memory usage high: ${memory_usage}%"
fi

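# Check log for errors
# The level is part of the message text, so match on it; -q suppresses
# journalctl's informational lines so they are not counted
error_count=$(journalctl -u navigator --since "5 minutes ago" -q --no-pager | grep -c "ERROR")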
if [ "$error_count" -gt 0 ]; then
    send_alert "WARNING" "Navigator logged $error_count errors in last 5 minutes"
fi
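
Because the script runs on a schedule, a persistent problem would send the same alert on every run. One way to limit this is a per-alert cooldown. The sketch below is an assumption-laden example, not part of Navigator: the function name `should_alert`, the `ALERT_STATE_DIR` variable, and its default path are all hypothetical.

```bash
# Directory holding one timestamp file per alert key (hypothetical default)
ALERT_STATE_DIR="${ALERT_STATE_DIR:-/tmp/navigator-alerts}"

# should_alert KEY COOLDOWN_SECONDS
# Returns 0 (and records the time) if no alert with this key was sent
# during the last COOLDOWN_SECONDS; returns 1 otherwise.
should_alert() {
    local key=$1 cooldown=$2
    local stamp="$ALERT_STATE_DIR/$key.last"
    local now last
    now=$(date +%s)
    mkdir -p "$ALERT_STATE_DIR"
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
        if [ $((now - last)) -lt "$cooldown" ]; then
            return 1
        fi
    fi
    echo "$now" > "$stamp"
    return 0
}
```

A call such as `should_alert health 1800 && send_alert "CRITICAL" "Navigator health check failed"` would then send at most one health alert every 30 minutes.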

systemd Service Monitoring

/etc/systemd/system/navigator-monitor.service
[Unit]
Description=Navigator Monitoring
Requires=navigator.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/navigator-alerts.sh

[Install]
WantedBy=multi-user.target

/etc/systemd/system/navigator-monitor.timer
[Unit]
Description=Run Navigator monitoring every 5 minutes
Requires=navigator-monitor.service

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target

# Reload unit files, then enable and start the timer
sudo systemctl daemon-reload
sudo systemctl enable --now navigator-monitor.timer

Dashboard Setup

Grafana Dashboard

navigator-dashboard.json
{
  "dashboard": {
    "title": "Navigator Monitoring",
    "panels": [
      {
        "title": "Navigator Status",
        "type": "stat",
        "targets": [
          {
            "expr": "navigator_up",
            "legendFormat": "Navigator Up"
          }
        ]
      },
      {
        "title": "Active Processes",
        "type": "graph",
        "targets": [
          {
            "expr": "navigator_processes",
            "legendFormat": "Navigator Processes"
          },
          {
            "expr": "navigator_rails_processes", 
            "legendFormat": "Rails Processes"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "navigator_memory_bytes",
            "legendFormat": "Memory Usage"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "graph",
        "targets": [
          {
            "expr": "navigator_connections_total",
            "legendFormat": "Connections"
          }
        ]
      }
    ]
  }
}

Simple HTML Dashboard

/var/www/monitor/index.html
<!DOCTYPE html>
<html>
<head>
    <title>Navigator Status</title>
    <meta http-equiv="refresh" content="30">
</head>
<body>
    <h1>Navigator Status Dashboard</h1>

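    <div id="status">Loading...</div>
    <script>
        // Insert the response as text, not HTML, so the output cannot inject markup
        fetch('/cgi-bin/navigator-status.sh')
            .then(response => response.text())
            .then(data => {
                const pre = document.createElement('pre');
                pre.textContent = data;
                document.getElementById('status').replaceChildren(pre);
            });
    </script>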
</body>
</html>

Performance Monitoring

Response Time Monitoring

#!/bin/bash
# Monitor Navigator response times

measure_response_time() {
    local url=$1
    local name=$2
    local elapsed

    # --max-time keeps a hung endpoint from blocking the script
    elapsed=$(curl -o /dev/null -s --max-time 10 -w '%{time_total}\n' "$url")
    echo "response_time_seconds{endpoint=\"$name\"} $elapsed"
}

# Monitor different endpoints
measure_response_time "http://localhost:3000/up" "health"
measure_response_time "http://localhost:3000/" "home"
measure_response_time "http://localhost:3000/api/users" "api"
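
curl reports `time_total` as decimal seconds, which the shell's integer tests such as `[ -gt ]` cannot compare. A small sketch of a decimal-safe threshold check using awk follows; the function name `exceeds_threshold` and the 0.5 second limit in the usage note are illustrative.

```bash
# exceeds_threshold VALUE LIMIT
# Returns 0 if VALUE is numerically greater than LIMIT (decimals allowed).
exceeds_threshold() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}
```

For example, `exceeds_threshold "$elapsed" 0.5 && echo "WARNING: slow response"` flags any measurement above half a second.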

Load Testing Integration

#!/bin/bash
# Automated load testing with monitoring

# Run load test
ab -n 1000 -c 10 http://localhost:3000/ > /tmp/load_test.out

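# Extract key metrics
requests_per_second=$(grep "Requests per second" /tmp/load_test.out | awk '{print $4}')
mean_time=$(grep "Time per request" /tmp/load_test.out | head -1 | awk '{print $4}')

# Stop if ab produced no result (for example, when the server is unreachable)
if [ -z "$requests_per_second" ]; then
    echo "ERROR: load test produced no results"
    exit 1
fi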

# Log results
echo "load_test_rps $requests_per_second"
echo "load_test_mean_time $mean_time"

# Alert if performance degrades
if (( $(echo "$requests_per_second < 50" | bc -l) )); then
    echo "WARNING: Low request rate: $requests_per_second RPS"
fi

Troubleshooting Monitoring

Common Issues

No Metrics Being Collected

# Check if scripts are executable
ls -la /usr/local/bin/navigator-*.sh

# Verify cron jobs (reading another user's crontab requires root)
sudo crontab -l -u navigator

# Test metric collection manually
/usr/local/bin/navigator-metrics.sh

Health Checks Failing

# Test health check manually
curl -v http://localhost:3000/up

# Check Navigator process
ps aux | grep navigator

# Verify port binding
netstat -tlnp | grep :3000

High Memory Usage Alerts

# Check actual memory usage
ps aux --sort=-%mem | head -10

# Monitor Rails process memory
ps aux | grep rails | awk '{print $6}' | sort -nr

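# Check for memory leaks: RSS that keeps growing between samples suggests a leak
# (press Ctrl-C to stop; -d, joins the PIDs into the comma-separated list ps expects)
while true; do
    ps -o pid,rss,cmd -p "$(pgrep -d, -f rails)"
    sleep 60
done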
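
Reading growth off a scrolling list of samples is error-prone. As a rough sketch, the helper below summarizes a series of RSS samples (one value in KiB per line, oldest first) into first, last, and delta; the function name `rss_growth` is hypothetical.

```bash
# rss_growth: read one RSS sample per line (oldest first) and print
# the first value, the last value, and the difference between them.
rss_growth() {
    awk 'NR == 1 { first = $1 }
         { last = $1 }
         END { if (NR > 0) print "first=" first, "last=" last, "delta=" (last - first) }'
}
```

A steadily positive delta across long sampling windows is a hint worth investigating, not proof of a leak, since Rails processes commonly grow for a while after boot.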

Debug Logging

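# Enable debug logging for troubleshooting.
# An export in your shell does not reach a systemd-managed service,
# so set the variable on the unit with a drop-in override containing:
#   [Service]
#   Environment=LOG_LEVEL=debug
sudo systemctl edit navigator
sudo systemctl restart navigator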

# Watch debug logs
journalctl -u navigator -f | grep DEBUG

Best Practices

1. Monitoring Strategy

  • Monitor both Navigator and Rails processes
  • Track system resources (CPU, memory, disk)
  • Set up both technical and business metrics
  • Use multiple monitoring tools for redundancy

2. Alerting Guidelines

  • Alert on symptoms, not just causes
  • Use different severity levels appropriately
  • Avoid alert fatigue with proper thresholds
  • Include runbook information in alerts

3. Performance Monitoring

  • Establish baseline performance metrics
  • Monitor end-to-end response times
  • Track error rates and types
  • Set up synthetic monitoring

4. Log Management

  • Use structured logging consistently
  • Implement proper log rotation
  • Centralize logs for analysis
  • Include correlation IDs for tracing

Integration Examples

CloudWatch (AWS)

# Install CloudWatch agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U amazon-cloudwatch-agent.rpm

# Create a log group for Navigator logs
aws logs create-log-group --log-group-name navigator-logs

Datadog Integration

# Add to Navigator environment
applications:
  global_env:
    DD_API_KEY: "${DATADOG_API_KEY}"
    DD_SITE: "datadoghq.com"
    DD_SERVICE: "navigator"
    DD_VERSION: "1.0.0"

See Also