
Resilience, Event-Driven Architecture + DevOps: Building Apps That Never Go Down

A beginner-friendly deep dive into three pillars of production-grade applications: resilience patterns (rate limiting, circuit breakers, bulkheads), event-driven messaging with Kafka and Spring Events, and DevOps observability with Prometheus, Grafana, ELK, and Zipkin.

April 8, 2026

Why Your App Needs to Be Tough

Imagine you run a pizza shop. One day, the oven breaks. If you have only one oven, nobody gets pizza. But if you have a backup oven, a plan for when ovens break, and a way to know the oven is about to break before it happens — that is resilience.

In software, resilience means your application keeps working even when things go wrong — a database goes slow, a partner service crashes, or a million users show up at once. This guide covers three pillars that make apps nearly unbreakable: Resilience Patterns, Event-Driven Messaging, and DevOps Observability.


Section 1 — Resilience: Protecting Your App from Overload and Failure

What Is Rate Limiting?

Rate limiting is like a theme park ride — only 100 people can ride at a time, and the rest wait in line. Without a line, everyone rushes the ride at once and it breaks. Rate limiting controls how many requests a user or service can make in a given time window.

Rate Limiting Algorithms

There are four main algorithms. Think of each one as a different way to manage that theme park line.

1. Fixed Window

Divide time into equal chunks (say, 1-minute windows). Each window allows a fixed number of requests. When the window resets, the counter goes back to zero.

Analogy: A candy store gives each kid 5 candies per hour. At the start of every hour, the count resets — even if you ate all 5 in the first minute.

Timeline:
|--- Window 1 (00:00-00:59) ---|--- Window 2 (01:00-01:59) ---|
  Request 1 ✓  (count: 1)         Request 6 ✓  (count: 1)
  Request 2 ✓  (count: 2)         Request 7 ✓  (count: 2)
  Request 3 ✓  (count: 3)         ...
  Request 4 ✓  (count: 4)
  Request 5 ✓  (count: 5)
  Request 6 ✗  (REJECTED — limit 5 reached)

Problem: "Boundary burst" — 5 requests at 00:59 + 5 at 01:00 = 10 in 2 seconds!
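
The windowed counter above can be sketched in a few lines of Java. This is a minimal, illustrative limiter (class and method names are hypothetical, and stale per-window counters are never cleaned up here):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal fixed-window limiter: one counter per (client, window) pair.
public class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    public FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public boolean allow(String clientId, long nowMillis) {
        long window = nowMillis / windowMillis;   // which window are we in?
        String key = clientId + ":" + window;     // a new window means a new counter
        int count = counters.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        return count <= limit;
    }
}
```

Note how the counter "resets" simply because each window gets its own key, which is exactly what enables the boundary burst.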

2. Sliding Window

Instead of fixed chunks, the window slides with you. It always looks at the last 60 seconds from right now. This fixes the boundary burst problem.

Analogy: Instead of resetting candy every hour on the dot, the store always looks at "how many candies did this kid eat in the last 60 minutes?" — no matter what time it is.

Current time: 01:30
Look back 60 seconds: 00:30 to 01:30
Count all requests in that range.
If count >= limit → REJECT
If count < limit  → ALLOW

Result: Smooth, consistent rate limiting with no burst at boundaries.
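
One common implementation is the sliding log variant: keep the timestamps of recent requests and evict the ones older than the window. A minimal single-client sketch (names hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window log: remember request timestamps, drop ones outside the window.
public class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean allow(long nowMillis) {
        // Evict timestamps that fell out of the last windowMillis
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= limit) return false;
        timestamps.addLast(nowMillis);
        return true;
    }
}
```

This is the "Medium" memory cost in the comparison table below it: one stored timestamp per recent request, in exchange for no boundary burst.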

3. Token Bucket

Imagine a bucket that holds tokens. Every second, a new token drops in. To make a request, you take a token out. If the bucket is empty, you wait. The bucket has a maximum size, so tokens don't pile up forever.

Analogy: A gumball machine refills one gumball every 10 seconds. You can grab one anytime there is a gumball. If the machine is empty, you wait. But it never holds more than 10 gumballs, even if nobody uses it for a while.

Bucket: max 10 tokens, refill 1 token/second

Time 0s:  Bucket = 10 tokens
Time 0s:  Burst of 10 requests → all succeed, bucket = 0
Time 1s:  1 token refilled → bucket = 1 → 1 request succeeds
Time 2s:  1 token refilled → bucket = 1 → 1 request succeeds
Time 10s: If idle, bucket = 10 again (back to full)

Key benefit: Allows short bursts while maintaining average rate.
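
A common trick is to refill lazily: instead of a background thread dropping a token in every second, compute how many tokens accrued since the last call. A minimal sketch (names hypothetical):

```java
// Token bucket: refill at a fixed rate up to capacity; each request takes one token.
public class TokenBucket {
    private final int capacity;
    private final double refillPerMillis;
    private double tokens;
    private long lastRefill;

    public TokenBucket(int capacity, double tokensPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMillis = tokensPerSecond / 1000.0;
        this.tokens = capacity;   // start full, so an initial burst is allowed
        this.lastRefill = nowMillis;
    }

    public synchronized boolean tryAcquire(long nowMillis) {
        // Lazily add the tokens accrued since the last call, capped at capacity
        tokens = Math.min(capacity, tokens + (nowMillis - lastRefill) * refillPerMillis);
        lastRefill = nowMillis;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

Starting the bucket full is what allows the burst of 10 at time 0 in the timeline above.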

4. Leaky Bucket

Requests pour in at the top like water. The bucket "leaks" (processes) them at a fixed rate from the bottom. If water pours in too fast, the bucket overflows and extra requests are dropped.

Analogy: A funnel on a water bottle — no matter how fast you pour water in, it drips out at the same steady pace. Pour too fast and it overflows.

Leaky Bucket: capacity 10, leak rate 2/second

Incoming: 5 requests arrive at once
Bucket:   [■ ■ ■ ■ ■ · · · · ·]  (5/10 — all queued)
Output:   2 requests processed per second

Incoming: 8 more requests arrive
Bucket:   [■ ■ ■ ■ ■ ■ ■ ■ ■ ■]  (10/10 — full!)
Next req: DROPPED (bucket overflow)
Output:   Still processing at steady 2/second

Key benefit: Perfectly smooth output rate, no bursts at all.
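
The leaky bucket is essentially a bounded queue drained at a fixed rate. A minimal sketch (names hypothetical; the scheduler that calls leak() at the leak rate, e.g. twice per second, is omitted):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Leaky bucket as a bounded queue: requests queue up, a drain step
// processes them at a fixed rate; arrivals beyond capacity are dropped.
public class LeakyBucket {
    private final int capacity;
    private final Queue<String> queue = new ArrayDeque<>();

    public LeakyBucket(int capacity) {
        this.capacity = capacity;
    }

    // Returns false (request dropped) when the bucket is full
    public synchronized boolean offer(String request) {
        if (queue.size() >= capacity) return false;
        return queue.add(request);
    }

    // Called by a scheduler at the leak rate; null when the bucket is empty
    public synchronized String leak() {
        return queue.poll();
    }
}
```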

Algorithm Comparison Table

| Algorithm      | Burst Handling            | Memory | Accuracy | Best For                          |
|----------------|---------------------------|--------|----------|-----------------------------------|
| Fixed Window   | Boundary burst issue      | Low    | Medium   | Simple APIs, low traffic          |
| Sliding Window | No burst issues           | Medium | High     | Production APIs needing precision |
| Token Bucket   | Allows controlled bursts  | Low    | High     | APIs that need burst tolerance    |
| Leaky Bucket   | No bursts — smooth output | Low    | High     | Steady-rate processing (queues)   |

Circuit Breaker Pattern

A circuit breaker works exactly like the one in your house. If too much electricity flows through, the breaker trips and cuts the power to prevent a fire. In software, if a service you call keeps failing, the circuit breaker "trips" and stops calling it — so your whole system doesn't crash waiting for a dead service.

Circuit Breaker States:
┌────────┐    failures >= threshold    ┌──────────┐
│ CLOSED │ ─────────────────────────→ │   OPEN   │
│(normal)│                             │(blocking)│
└────────┘                             └──────────┘
     ↑                                      │
     │         wait timeout expires         │
     │                                      ▼
     │                              ┌───────────────┐
     └────── success ───────────────│   HALF-OPEN   │
                                    │(testing 1 req)│
                                    └───────────────┘
                                            │
                               failure → back to OPEN

CLOSED — Everything works normally. Requests pass through. The breaker counts failures.

OPEN — Too many failures! The breaker blocks all requests immediately (fast fail). No more waiting 30 seconds for a timeout from a dead service.

HALF-OPEN — After a wait period, the breaker lets ONE request through to test if the service recovered. If it succeeds, go back to CLOSED. If it fails, go back to OPEN.
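
The three-state machine can be hand-rolled in a few lines to make the transitions concrete (in practice you would use a library such as Resilience4j, covered below). A minimal sketch with hypothetical names:

```java
// Hand-rolled circuit breaker state machine, for illustration only.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt;

    public SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    public synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;  // let one probe request through
                return true;
            }
            return false;                 // fast fail, no waiting on a dead service
        }
        return true;                      // CLOSED, or the HALF_OPEN probe itself
    }

    public synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED;             // probe succeeded: service recovered
    }

    public synchronized void recordFailure(long nowMillis) {
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;           // trip (or re-trip after a failed probe)
            openedAt = nowMillis;
            failures = 0;
        }
    }

    public synchronized State state() {
        return state;
    }
}
```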

Circuit Breaker vs Rate Limiter

| Aspect         | Circuit Breaker                          | Rate Limiter                            |
|----------------|------------------------------------------|-----------------------------------------|
| Purpose        | Protect from failing downstream services | Protect from too many incoming requests |
| Direction      | Outgoing calls (you → other service)     | Incoming calls (user → you)             |
| Triggers on    | Failure rate (errors, timeouts)          | Request count per time window           |
| Response       | Fast fail with fallback                  | HTTP 429 Too Many Requests              |
| Analogy        | House circuit breaker (prevents fire)    | Theme park ride line (controls crowd)   |
| Used together? | YES, circuit-break outgoing calls        | YES, rate-limit incoming calls          |

Resilience4j Configuration

Resilience4j is the go-to library for resilience patterns in Java/Spring Boot. Think of it as a toolbox with four tools: Circuit Breaker, Retry, Rate Limiter, and Bulkhead.

Circuit Breaker Config

# application.yml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 10          # Look at last 10 calls
        failureRateThreshold: 50       # Trip if 50% fail
        waitDurationInOpenState: 10s   # Wait 10s before testing
        permittedNumberOfCallsInHalfOpenState: 3  # Test with 3 calls
        slidingWindowType: COUNT_BASED
// PaymentClient.java
@Service
public class PaymentClient {

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResponse processPayment(PaymentRequest request) {
        return restTemplate.postForObject(
            "http://payment-service/api/payments", request, PaymentResponse.class
        );
    }

    // Fallback: runs when circuit is OPEN or call fails
    private PaymentResponse paymentFallback(PaymentRequest request, Throwable ex) {
        log.warn("Payment service down, queuing for retry: {}", ex.getMessage());
        return PaymentResponse.builder()
            .status("QUEUED")
            .message("Payment will be processed shortly")
            .build();
    }
}

Retry Config

# application.yml
resilience4j:
  retry:
    instances:
      inventoryService:
        maxAttempts: 3                       # Try 3 times total
        waitDuration: 500ms                  # Wait 500ms between retries
        retryExceptions:
          - java.io.IOException              # Retry on network errors
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.BusinessException    # Don't retry business errors
// InventoryClient.java
@Service
public class InventoryClient {

    @Retry(name = "inventoryService", fallbackMethod = "inventoryFallback")
    public InventoryResponse checkStock(String productId) {
        return restTemplate.getForObject(
            "http://inventory-service/api/stock/" + productId,
            InventoryResponse.class
        );
    }

    private InventoryResponse inventoryFallback(String productId, Throwable ex) {
        log.warn("Inventory check failed after retries for product {}: {}",
                 productId, ex.getMessage());
        return InventoryResponse.builder()
            .productId(productId)
            .available(false)
            .message("Stock status temporarily unavailable")
            .build();
    }
}

Rate Limiter Config

# application.yml
resilience4j:
  ratelimiter:
    instances:
      orderApi:
        limitForPeriod: 100              # 100 requests allowed
        limitRefreshPeriod: 1s           # Per 1 second
        timeoutDuration: 500ms           # Wait up to 500ms for a permit
// OrderController.java
@RestController
@RequestMapping("/api/orders")
public class OrderController {

    @RateLimiter(name = "orderApi")
    @PostMapping
    public ResponseEntity<OrderResponse> createOrder(@RequestBody OrderRequest request) {
        return ResponseEntity.ok(orderService.createOrder(request));
    }
    // If limit exceeded → RequestNotPermitted exception → HTTP 429
}

Bulkhead Pattern

A bulkhead is like the watertight compartments in a ship. If one compartment floods, the others stay dry and the ship stays afloat. In software, a bulkhead limits how many threads (or concurrent calls) one operation can use — so a slow service can't eat up all your threads and starve everything else.

# application.yml
resilience4j:
  bulkhead:
    instances:
      reportService:
        maxConcurrentCalls: 5          # Only 5 threads at a time
        maxWaitDuration: 100ms         # Wait max 100ms for a slot
// ReportController.java
@RestController
public class ReportController {

    @Bulkhead(name = "reportService", fallbackMethod = "reportFallback")
    @GetMapping("/api/reports/{id}")
    public ReportResponse getReport(@PathVariable String id) {
        // Even if report generation is slow, only 5 threads used at most
        return reportService.generate(id);
    }

    private ReportResponse reportFallback(String id, Throwable ex) {
        return ReportResponse.builder()
            .status("BUSY")
            .message("Report service is at capacity, please try again shortly")
            .build();
    }
}

Combining All Four Together

// You can stack annotations — they execute in this order:
// Retry → CircuitBreaker → RateLimiter → Bulkhead → Your method

@Retry(name = "orderService")
@CircuitBreaker(name = "orderService", fallbackMethod = "fallback")
@RateLimiter(name = "orderService")
@Bulkhead(name = "orderService")
public OrderResponse placeOrder(OrderRequest request) {
    return restTemplate.postForObject(
        "http://order-service/api/orders", request, OrderResponse.class
    );
}

Section 2 — Event-Driven Architecture & Messaging with Kafka

What Is Event-Driven Architecture?

In traditional apps, services call each other directly: "Hey order-service, I need the order details!" If order-service is down, the caller is stuck.

In event-driven architecture, services communicate by publishing events (messages). Nobody calls anyone directly. Instead, they drop a message in a shared mailbox and whoever is interested picks it up.

Analogy: Instead of calling your friend on the phone (synchronous — they must pick up), you send a letter through the post office (asynchronous — they read it when they can). If your friend is on vacation, the letter waits in their mailbox.

Kafka Core Concepts

Apache Kafka is the most popular message broker for event-driven systems. Think of Kafka as a giant post office. Here are its key parts:

1. Topic

A topic is a named mailbox. You create topics for different types of events: order-events, payment-events, inventory-alerts. Producers put messages IN, consumers take messages OUT.

2. Partition

Each topic is split into partitions (numbered 0, 1, 2, ...). Partitions allow parallel processing — multiple consumers can read from different partitions at the same time.

Topic: order-events (3 partitions)
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Partition 0  │  │ Partition 1  │  │ Partition 2  │
│ [msg0][msg3] │  │ [msg1][msg4] │  │ [msg2][msg5] │
│ [msg6]...    │  │ [msg7]...    │  │ [msg8]...    │
└──────────────┘  └──────────────┘  └──────────────┘

Messages with the same KEY always go to the same partition.
→ All events for order-123 go to the same partition (ordered!).

3. Consumer Group

A consumer group is a team of consumers that work together. Kafka assigns each partition to exactly ONE consumer in the group. This means each message is processed only once per group.

Consumer Group: "order-processor"
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Partition 0 │  │ Partition 1 │  │ Partition 2 │
│      ↓      │  │      ↓      │  │      ↓      │
│ Consumer A  │  │ Consumer B  │  │ Consumer C  │
└─────────────┘  └─────────────┘  └─────────────┘

If Consumer B crashes → Kafka re-assigns Partition 1 to A or C.
If you add Consumer D → Kafka rebalances (one consumer may go idle
if there are more consumers than partitions).

4. Offset

An offset is a bookmark. It tells Kafka: "I have read up to message #47 in this partition." If a consumer crashes and restarts, it picks up from the last committed offset. No messages are lost, though anything processed but not yet committed may be redelivered, so consumers should be idempotent.

Partition 0:  [msg0] [msg1] [msg2] [msg3] [msg4] [msg5]
                                     ↑
                              committed offset = 3
                    Consumer will read msg3 next after restart.

5. Key

The message key determines which partition a message goes to. Messages with the same key always land in the same partition. This guarantees ordering for related events.

Key = "order-123"  → hash("order-123") % 3 = Partition 1
Key = "order-456"  → hash("order-456") % 3 = Partition 0
Key = null          → Round-robin across partitions
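
The hash-then-modulo idea can be sketched directly. This is simplified: Kafka's default partitioner actually uses murmur2 hashing rather than String.hashCode, but the guarantee is the same, the same key always maps to the same partition:

```java
// Simplified partition selection for non-null keys, for illustration.
public class PartitionPicker {
    public static int partitionFor(String key, int numPartitions) {
        // (Null keys are handled differently: the client spreads them
        // round-robin / sticky across partitions.)
        return Math.abs(key.hashCode() % numPartitions);
    }
}
```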

Kafka Producer with KafkaTemplate

// KafkaProducerConfig.java
@Configuration
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.ACKS_CONFIG, "all");  // Wait for all replicas
        return new DefaultKafkaProducerFactory<>(config);
    }

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}
// OrderEventPublisher.java
@Service
@RequiredArgsConstructor
@Slf4j
public class OrderEventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final ObjectMapper objectMapper;

    public void publishOrderCreated(Order order) {
        try {
            OrderEvent event = OrderEvent.builder()
                .eventType("ORDER_CREATED")
                .orderId(order.getId())
                .customerId(order.getCustomerId())
                .totalAmount(order.getTotalAmount())
                .timestamp(Instant.now())
                .build();

            String payload = objectMapper.writeValueAsString(event);

            // Key = orderId → all events for same order go to same partition
            kafkaTemplate.send("order-events", order.getId(), payload)
                .whenComplete((result, ex) -> {
                    if (ex == null) {
                        log.info("Published ORDER_CREATED for order {}. Partition: {}, Offset: {}",
                            order.getId(),
                            result.getRecordMetadata().partition(),
                            result.getRecordMetadata().offset());
                    } else {
                        log.error("Failed to publish event for order {}: {}",
                            order.getId(), ex.getMessage());
                    }
                });
        } catch (JsonProcessingException e) {
            log.error("Failed to serialize order event: {}", e.getMessage());
        }
    }
}

Kafka Consumer with @KafkaListener

// KafkaConsumerConfig.java
@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-processor");
        config.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        config.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        config.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // Manual commit
        return new DefaultKafkaConsumerFactory<>(config);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String>
            kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        factory.getContainerProperties()
            .setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        return factory;
    }
}
// PaymentEventListener.java
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentEventListener {

    private final ObjectMapper objectMapper;
    private final PaymentService paymentService;

    @KafkaListener(
        topics = "order-events",
        groupId = "payment-processor",
        concurrency = "3"   // 3 threads → can consume from 3 partitions in parallel
    )
    public void handleOrderEvent(
            @Payload String message,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset,
            Acknowledgment acknowledgment) {

        try {
            OrderEvent event = objectMapper.readValue(message, OrderEvent.class);
            log.info("Received {} from partition {} at offset {}",
                event.getEventType(), partition, offset);

            if ("ORDER_CREATED".equals(event.getEventType())) {
                paymentService.initiatePayment(event.getOrderId(), event.getTotalAmount());
            }

            // Manually acknowledge — Kafka saves the offset
            acknowledgment.acknowledge();

        } catch (Exception e) {
            log.error("Failed to process message at partition {} offset {}: {}",
                partition, offset, e.getMessage());
            // Don't acknowledge — message will be redelivered
        }
    }
}

Event-Driven Patterns

1. Event Sourcing

Instead of storing just the current state (like "order total = $50"), you store every event that happened: "item added $20", "item added $30", "coupon applied -$5". The current state is rebuilt by replaying all events.

Analogy: Instead of looking at your bank balance ($500), you keep every transaction: +$1000 salary, -$200 rent, -$300 groceries. You can always recalculate the balance AND answer questions like "how much did I spend on groceries last month?"

// Events stored in order:
OrderCreated { orderId: "123", customerId: "456" }
ItemAdded    { orderId: "123", product: "Widget", price: 20.00 }
ItemAdded    { orderId: "123", product: "Gadget", price: 30.00 }
CouponApplied { orderId: "123", discount: 5.00 }
OrderConfirmed { orderId: "123", total: 45.00 }

// Replay events → current state:
Order { id: "123", items: [Widget, Gadget], total: 45.00, status: CONFIRMED }
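
Rebuilding state is just a fold over the event list. A simplified sketch with hypothetical record types, tracking only the running total:

```java
import java.math.BigDecimal;
import java.util.List;

// Replay a stream of events into current state (here: just the order total).
public class OrderProjection {
    public sealed interface Event permits ItemAdded, CouponApplied {}
    public record ItemAdded(String product, BigDecimal price) implements Event {}
    public record CouponApplied(BigDecimal discount) implements Event {}

    public static BigDecimal replayTotal(List<Event> events) {
        BigDecimal total = BigDecimal.ZERO;
        for (Event e : events) {
            // Each event type mutates the projected state in its own way
            if (e instanceof ItemAdded item) {
                total = total.add(item.price());
            } else if (e instanceof CouponApplied coupon) {
                total = total.subtract(coupon.discount());
            }
        }
        return total;
    }
}
```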

2. CQRS (Command Query Responsibility Segregation)

Use separate models for reading and writing. The write side handles commands ("create order", "update stock"). The read side is optimized for queries ("show me all orders this week").

Analogy: A restaurant has a kitchen (write side — creates food) and a menu (read side — shows what is available). The kitchen and menu are separate things, optimized for different purposes.

Write Side (Commands)                   Read Side (Queries)
┌──────────────────┐                    ┌──────────────────┐
│  OrderCommand    │                    │  OrderQuery      │
│  Handler         │ ──publish event──→ │  Handler         │
│                  │                    │                  │
│ Normalized DB    │                    │ Denormalized DB  │
│ (3rd normal form)│                    │ (flat, fast)     │
└──────────────────┘                    └──────────────────┘

Benefits:
- Write DB can be PostgreSQL (strong consistency)
- Read DB can be Elasticsearch (fast search)
- Scale reads and writes independently

3. Saga Pattern

A saga manages a transaction that spans multiple services. Instead of one big transaction (which is impossible across services), it runs a chain of local transactions. If one step fails, it runs compensating transactions to undo previous steps.

Analogy: Booking a vacation: reserve flight → reserve hotel → reserve car. If the car rental fails, you cancel the hotel, then cancel the flight — in reverse order.

Happy Path:
  order-service: Create Order (PENDING)
       ↓ event: OrderCreated
  payment-service: Charge Payment
       ↓ event: PaymentCompleted
  inventory-service: Reserve Stock
       ↓ event: StockReserved
  order-service: Confirm Order (CONFIRMED) ✓

Failure Path (stock unavailable):
  order-service: Create Order (PENDING)
       ↓ event: OrderCreated
  payment-service: Charge Payment
       ↓ event: PaymentCompleted
  inventory-service: Reserve Stock → FAILS!
       ↓ event: StockReservationFailed
  payment-service: REFUND Payment (compensating transaction)
       ↓ event: PaymentRefunded
  order-service: Cancel Order (CANCELLED) ✗
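
The compensation logic can be sketched as: run each step, remember its compensation, and on failure undo the completed steps in reverse. A minimal in-process sketch with hypothetical names (a real saga coordinates these steps via events across services, as in the diagrams above):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Orchestration-style saga: forward through steps, backward through compensations.
public class SagaRunner {
    public record Step(String name, Runnable action, Runnable compensation) {}

    public static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);   // remember for potential undo
            } catch (RuntimeException e) {
                // Undo in reverse order (car failed → cancel hotel → cancel flight)
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }
}
```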

4. Outbox Pattern

The outbox pattern solves a dangerous problem: what if your service saves to the database but crashes before sending the Kafka event? The database says "order created" but no event was published — the system is inconsistent.

Solution: write the event to an outbox table in the SAME database transaction as your business data. A separate process reads the outbox table and publishes events to Kafka.

// Step 1: Save order + outbox event in ONE transaction
@Transactional
public Order createOrder(OrderRequest request) {
    Order order = orderRepository.save(new Order(request));

    // Save event to outbox table (same transaction!)
    try {
        outboxRepository.save(OutboxEvent.builder()
            .aggregateId(order.getId())
            .eventType("ORDER_CREATED")
            .payload(objectMapper.writeValueAsString(order))
            .status("PENDING")
            .build());
    } catch (JsonProcessingException e) {
        // Rethrow unchecked so the transaction rolls back: order and event
        // must succeed or fail together
        throw new IllegalStateException("Could not serialize outbox event", e);
    }

    return order;  // Both saved atomically — or both fail
}

// Step 2: Background poller reads outbox and publishes to Kafka
@Scheduled(fixedDelay = 1000)
public void publishOutboxEvents() {
    List<OutboxEvent> pending = outboxRepository.findByStatus("PENDING");
    for (OutboxEvent event : pending) {
        kafkaTemplate.send("order-events", event.getAggregateId(), event.getPayload());
        event.setStatus("PUBLISHED");
        outboxRepository.save(event);
    }
}

Spring Events — In-Process Messaging

Not everything needs Kafka. For events within the same application, Spring has built-in event publishing. It is simpler and perfect for decoupling code inside one service.

// 1. Define an event
public record OrderCompletedEvent(String orderId, BigDecimal total, String customerEmail) {}

// 2. Publish the event
@Service
@RequiredArgsConstructor
public class OrderService {
    private final ApplicationEventPublisher eventPublisher;

    @Transactional
    public Order completeOrder(String orderId) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        order.setStatus("COMPLETED");
        orderRepository.save(order);

        // Publish event — listeners will handle side effects
        eventPublisher.publishEvent(new OrderCompletedEvent(
            order.getId(), order.getTotal(), order.getCustomerEmail()
        ));
        return order;
    }
}

// 3a. @EventListener — runs immediately (same thread, same transaction)
@Component
@Slf4j
public class InventoryListener {
    @EventListener
    public void onOrderCompleted(OrderCompletedEvent event) {
        log.info("Reducing stock for order {}", event.orderId());
        inventoryService.reduceStock(event.orderId());
    }
}

// 3b. @TransactionalEventListener — runs AFTER transaction commits
//     Safer for side effects like sending emails
@Component
@Slf4j
public class NotificationListener {
    @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
    public void onOrderCompleted(OrderCompletedEvent event) {
        log.info("Sending confirmation email for order {}", event.orderId());
        emailService.sendConfirmation(event.customerEmail(), event.orderId());
    }
}

// Why AFTER_COMMIT? If the transaction rolls back, you don't want
// to have already sent a "your order is confirmed!" email.

Section 3 — DevOps + Observability: Seeing Inside Your Running App

Spring Boot Actuator — Health & Monitoring Endpoints

Actuator gives your app built-in health check endpoints. It is like a dashboard in your car — speed, fuel level, engine temperature — but for your application.

# application.yml — Actuator configuration
management:
  endpoints:
    web:
      exposure:
        include: health, info, prometheus, loggers, metrics
  endpoint:
    health:
      show-details: always           # Show DB, Redis, Kafka health
      probes:
        enabled: true                # Enable Kubernetes probes
  health:
    livenessState:
      enabled: true
    readinessState:
      enabled: true

Key Actuator Endpoints

| Endpoint   | URL                        | Purpose                                                            |
|------------|----------------------------|--------------------------------------------------------------------|
| Health     | /actuator/health           | Overall app health (UP/DOWN) + dependencies                        |
| Liveness   | /actuator/health/liveness  | Is the app alive? (Kubernetes restarts if DOWN)                    |
| Readiness  | /actuator/health/readiness | Can the app handle traffic? (K8s stops sending requests if DOWN)   |
| Prometheus | /actuator/prometheus       | Metrics in Prometheus format (scraped on a set interval, e.g. 15s) |
| Loggers    | /actuator/loggers          | View/change log levels at runtime (no restart!)                    |
// Example: Change log level at runtime (no redeploy!)
// POST /actuator/loggers/com.example.orderservice
// Body: { "configuredLevel": "DEBUG" }

// Example health response:
{
  "status": "UP",
  "components": {
    "db": { "status": "UP", "details": { "database": "PostgreSQL" } },
    "redis": { "status": "UP", "details": { "version": "7.2.4" } },
    "kafka": { "status": "UP" },
    "diskSpace": { "status": "UP", "details": { "free": "42GB" } }
  }
}

Deployment Strategies

How do you update your app without users noticing? There are four common strategies. Think of it like renovating a restaurant — do you close for a week, or keep serving while you remodel one room at a time?

1. Rolling Update

Replace instances one at a time. Old and new versions run side by side briefly.

Time 0: [v1] [v1] [v1] [v1]    ← 4 instances running v1
Time 1: [v2] [v1] [v1] [v1]    ← Replace first instance
Time 2: [v2] [v2] [v1] [v1]    ← Replace second
Time 3: [v2] [v2] [v2] [v1]    ← Replace third
Time 4: [v2] [v2] [v2] [v2]    ← All running v2 ✓
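
In Kubernetes, this strategy is configured on the Deployment itself. A hypothetical fragment matching the four-instance example above (names and values illustrative):

```yaml
# deployment.yml — rolling update, 4 replicas (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during the update
      maxUnavailable: 0    # never drop below 4 serving pods
```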

2. Blue-Green Deployment

Run two identical environments: Blue (current) and Green (new). Deploy to Green, test it, then switch all traffic from Blue to Green instantly.

Step 1: Blue (v1) ← ALL TRAFFIC       Green (v2) testing...
Step 2: Blue (v1)                      Green (v2) ← ALL TRAFFIC (switch!)
Step 3: Blue (standby for rollback)    Green (v2) ← serving users

Rollback? Switch traffic back to Blue in seconds!

3. Canary Deployment

Send a small percentage of traffic (say 5%) to the new version. Monitor for errors. If everything looks good, gradually increase to 100%.

Step 1: v1 ← 100% traffic       v2 ← 0%
Step 2: v1 ← 95% traffic        v2 ← 5%   (canary — watching metrics)
Step 3: v1 ← 70% traffic        v2 ← 30%  (metrics look good!)
Step 4: v1 ← 0%                 v2 ← 100% (full rollout ✓)

If errors spike at 5%: kill the canary, 100% back to v1. Only 5% of users affected.

4. Recreate

Stop everything, deploy new version, start everything. Simple but causes downtime.

Step 1: [v1] [v1] [v1] ← running
Step 2: [  ] [  ] [  ] ← all stopped (DOWNTIME!)
Step 3: [v2] [v2] [v2] ← all started with new version

Deployment Strategy Comparison

| Strategy   | Downtime | Risk     | Rollback Speed              | Resource Cost       | Best For                                 |
|------------|----------|----------|-----------------------------|---------------------|------------------------------------------|
| Rolling    | Zero     | Medium   | Slow (roll back one by one) | Low (same infra)    | Most applications (default)              |
| Blue-Green | Zero     | Low      | Instant (switch back)       | High (double infra) | Critical apps needing instant rollback   |
| Canary     | Zero     | Very low | Fast (kill canary)          | Low-Medium          | High-traffic apps, new risky features    |
| Recreate   | Yes      | High     | Slow (full redeploy)        | Low                 | Dev/staging, or apps tolerating downtime |

The Observability Stack — Metrics, Logs, Traces, Alerts

Observability answers three questions: What happened? (logs), What is happening now? (metrics), and Why did it happen? (traces). Think of it like a doctor: symptoms (metrics), medical history (logs), and an MRI scan (traces).

Metrics: Micrometer → Prometheus → Grafana

How it works:
┌───────────────┐  scrape /actuator/prometheus  ┌───────────────┐  dashboards  ┌─────────┐
│  Spring Boot  │ ────────────────────────────→ │  Prometheus   │ ───────────→ │ Grafana │
│ (Micrometer)  │                               │ (time-series  │              │ (charts)│
└───────────────┘                               │   database)   │              └─────────┘
                                                └───────────────┘
// Custom metrics in your code
@Service
public class OrderMetrics {
    private final MeterRegistry meterRegistry;
    private final AtomicInteger activeOrders = new AtomicInteger();

    public OrderMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Gauge — goes up and down (current value). Register it once against a
        // live object; registering a plain int would snapshot a boxed value
        // that Micrometer could never see change.
        meterRegistry.gauge("orders.active.count", activeOrders);
    }

    public void recordOrderPlaced(String orderType) {
        // Counter — goes up only (total orders placed)
        meterRegistry.counter("orders.placed", "type", orderType).increment();
    }

    public void recordOrderProcessingTime(long millis) {
        // Timer — tracks duration distribution
        meterRegistry.timer("orders.processing.time")
            .record(millis, TimeUnit.MILLISECONDS);
    }

    public void recordActiveOrders(int count) {
        activeOrders.set(count);
    }
}

Logging: SLF4J → JSON → ELK Stack

Structured JSON logs are machine-readable. The ELK stack (Elasticsearch, Logstash, Kibana) collects, indexes, and visualizes them.

How it works:
┌────────────┐   JSON logs   ┌──────────┐   index    ┌───────────────┐  search   ┌────────┐
│ Spring Boot│ ─────────────→│ Logstash │ ────────→ │ Elasticsearch │ ────────→│ Kibana │
│  (SLF4J)   │  (structured) │ (parse)  │           │   (store)     │          │(search)│
└────────────┘               └──────────┘           └───────────────┘          └────────┘
<!-- logback-spring.xml — JSON structured logging -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <includeMdcKeyName>traceId</includeMdcKeyName>
      <includeMdcKeyName>spanId</includeMdcKeyName>
      <includeMdcKeyName>userId</includeMdcKeyName>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON" />
  </root>
</configuration>

// Output — one JSON object per log line:
{
  "timestamp": "2026-04-08T10:30:00.123Z",
  "level": "INFO",
  "logger": "com.example.OrderService",
  "message": "Order placed successfully",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "userId": "user-42",
  "orderId": "order-789",
  "amount": 49.99
}
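The encoder's only job is to emit one flat JSON object per event, merging the MDC keys into the payload. Here is a minimal, dependency-free sketch of that idea — the `JsonLogSketch` class and its field handling are illustrative, not LogstashEncoder's real internals (a real encoder also escapes values and adds timestamps):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonLogSketch {
    // Build one JSON object per log line from the event fields plus MDC context.
    static String jsonLine(String level, String logger, String message,
                           Map<String, String> mdc) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"level\":\"").append(level).append("\",");
        sb.append("\"logger\":\"").append(logger).append("\",");
        sb.append("\"message\":\"").append(message).append("\"");
        mdc.forEach((k, v) ->
            sb.append(",\"").append(k).append("\":\"").append(v).append("\""));
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> mdc = new LinkedHashMap<>();
        mdc.put("traceId", "abc123def456");
        mdc.put("userId", "user-42");
        System.out.println(jsonLine("INFO", "com.example.OrderService",
                "Order placed successfully", mdc));
    }
}
```

Because every key lands at the top level of one object, Logstash can index each field without any regex parsing.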

Tracing: Micrometer Tracing → Zipkin

A trace follows a single request as it travels through multiple services. Each service adds a "span" (its piece of the journey). You can see exactly where time was spent.

How it works:
┌──────────┐     ┌──────────┐     ┌──────────┐
│ API      │────→│ Order    │────→│ Payment  │
│ Gateway  │     │ Service  │     │ Service  │
│ span: 2ms│     │ span:45ms│     │ span:30ms│
└──────────┘     └──────────┘     └──────────┘
     │                │                │
     └────────────────┴────────────────┘
                      │
                      ▼
               ┌────────────┐
               │   Zipkin   │
               │ (trace UI) │
               └────────────┘

Trace ID: abc-123 (same across all services)
Total time: 77ms
Bottleneck: Order Service (45ms of the 77ms total) — investigate!
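The core mechanism is just an ID minted once at the edge and stamped on every hop's span. A toy sketch in plain Java — class, field, and service names are invented for illustration; in a real app Micrometer Tracing does this propagation for you:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class TraceSketch {
    // One span per service hop; every span carries the same trace ID.
    record Span(String traceId, String service, long durationMs) {}

    static List<Span> handleRequest() {
        String traceId = UUID.randomUUID().toString();      // created at the edge
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(traceId, "api-gateway", 2));
        spans.add(new Span(traceId, "order-service", 45));   // ID propagated downstream
        spans.add(new Span(traceId, "payment-service", 30)); // ...via request headers
        return spans;
    }

    public static void main(String[] args) {
        List<Span> trace = handleRequest();
        long total = trace.stream().mapToLong(Span::durationMs).sum();
        System.out.println("Trace " + trace.get(0).traceId()
                + " total: " + total + "ms");
    }
}
```

Because the trace ID is identical across services, Zipkin can stitch the spans back into one timeline and show where the time went.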
# application.yml — Tracing configuration
management:
  tracing:
    sampling:
      probability: 1.0    # Sample 100% in dev, 10% (0.1) in prod

spring:
  application:
    name: order-service

# pom.xml dependencies:
#   micrometer-tracing-bridge-brave
#   zipkin-reporter-brave

Alerting: Prometheus AlertManager → Slack/Email

How it works:
┌────────────┐  rule violated  ┌──────────────┐  notification  ┌───────────┐
│ Prometheus │ ──────────────→ │ AlertManager │ ─────────────→ │  Slack /  │
│  (rules)   │                 │  (routing)   │                │  Email    │
└────────────┘                 └──────────────┘                └───────────┘
# alert-rules.yml — Prometheus alerting rules
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "More than 10 errors/sec for 2 minutes"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is DOWN"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile response time > 2 seconds"
# alertmanager.yml — Route alerts to Slack
global:
  resolve_timeout: 5m

route:
  group_by: [alertname, severity]
  group_wait: 10s
  group_interval: 5m
  receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true

Complete Observability Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     YOUR APPLICATION                            │
│                                                                 │
│  Micrometer ──→ /actuator/prometheus ──→ Prometheus ──→ Grafana │
│  (metrics)                               (store)       (charts) │
│                                                                 │
│  SLF4J ──→ JSON logs ──→ Logstash ──→ Elasticsearch ──→ Kibana  │
│  (logging)                (parse)      (index)         (search) │
│                                                                 │
│  Micrometer ──→ Trace spans ──→ Zipkin                          │
│  Tracing        (propagated)   (trace UI)                       │
│                                                                 │
│  Prometheus ──→ Alert rules ──→ AlertManager ──→ Slack/Email    │
│  (thresholds)                  (routing)       (notification)   │
└─────────────────────────────────────────────────────────────────┘

Frequently Asked Questions

1. When should I use a Circuit Breaker vs a Rate Limiter?

Use a Rate Limiter to protect your own service from being overwhelmed by too many incoming requests (like a bouncer at a club). Use a Circuit Breaker to protect your service from wasting time calling a downstream service that is broken (like unplugging a dead appliance so it doesn't trip your house breaker). In production, use both together: rate limit what comes in, circuit break what goes out.
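To make the inbound/outbound distinction concrete, here is a deliberately tiny hand-rolled sketch of both guards. The class names are invented and the logic is stripped to its essence — in a real project you would use a library such as Resilience4j, which adds windows, half-open probing, and thread safety:

```java
public class GuardsSketch {
    // Inbound guard: fixed-window rate limiter — the "bouncer at the club".
    static class RateLimiter {
        private final int limit;
        private int count;
        RateLimiter(int limit) { this.limit = limit; }
        boolean tryAcquire() { return count++ < limit; }  // reject once full
        void resetWindow()   { count = 0; }               // new window, fresh quota
    }

    // Outbound guard: circuit breaker that opens after N consecutive failures —
    // "unplug the dead appliance". (Real breakers also have a half-open state.)
    static class CircuitBreaker {
        private final int threshold;
        private int failures;
        CircuitBreaker(int threshold) { this.threshold = threshold; }
        boolean allowCall()   { return failures < threshold; }
        void recordFailure()  { failures++; }
        void recordSuccess()  { failures = 0; }           // downstream recovered
    }
}
```

The limiter counts what arrives at your door; the breaker counts what fails when you knock on someone else's.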

2. What is the difference between Kafka and Spring Events?

Spring Events work inside a single application — like passing a note to someone in the same room. Kafka works across multiple applications running on different servers — like sending a letter through the post office. Use Spring Events for in-process decoupling (e.g., sending an email after order completion). Use Kafka when different services on different machines need to communicate asynchronously, and you need message durability (messages survive crashes).
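The in-process case boils down to a listener list in the same JVM. This miniature sketch (an invented class, not Spring's actual API) shows why it is fast but fragile — Kafka, by contrast, replaces the in-memory list with a durable, replicated log on separate broker machines:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class InProcessEvents {
    // Listeners live in the same JVM — the "passing a note in the same room" case.
    private final List<Consumer<String>> listeners = new ArrayList<>();

    void subscribe(Consumer<String> listener) {
        listeners.add(listener);
    }

    void publish(String event) {
        // Synchronous, in-memory, not durable: a crash here loses the event.
        listeners.forEach(l -> l.accept(event));
    }
}
```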

3. How does the Outbox Pattern prevent data inconsistency?

Without the outbox pattern, you save to the database and then publish to Kafka — two separate operations. If your app crashes between them, the database has the data but no event was published. The outbox pattern writes both the business data and the event to the same database transaction. Either both are saved or neither is. A separate background process then reads the outbox table and publishes to Kafka. Even if the publisher crashes, the event is still safely in the database, waiting to be published on the next run.
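The mechanics can be sketched with in-memory lists standing in for the two tables and the Kafka topic. Everything here is illustrative — a real implementation uses a JPA entity for the outbox row, a `@Transactional` service method, and a scheduled publisher:

```java
import java.util.ArrayList;
import java.util.List;

public class OutboxSketch {
    static final List<String> ordersTable = new ArrayList<>();
    static final List<String> outboxTable = new ArrayList<>();
    static final List<String> kafkaTopic  = new ArrayList<>();

    // Both writes happen in the same "transaction" (one atomic block here):
    // either both rows commit or neither does.
    static synchronized void placeOrder(String orderId) {
        ordersTable.add(orderId);
        outboxTable.add("OrderPlaced:" + orderId);
    }

    // Background relay: reads the outbox and publishes to Kafka.
    // If it crashes before clearing, the rows are still there on the next run.
    static synchronized void relayOutbox() {
        kafkaTopic.addAll(outboxTable);
        outboxTable.clear();
    }
}
```

Note that the relay gives you at-least-once delivery: a crash between publish and clear replays the event, so consumers must be idempotent.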

4. What is the best deployment strategy for a production application?

Rolling update is the best default choice — it has zero downtime, uses the same infrastructure, and is built into Kubernetes. For mission-critical apps where you need instant rollback, use Blue-Green (but it costs double the infrastructure). For high-traffic apps where you want to test new features safely, use Canary (route 5% of traffic first, then gradually increase). Recreate is only for development environments or apps that can tolerate a few minutes of downtime.
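For reference, the rolling-update knobs live directly on a Kubernetes Deployment. A minimal sketch with illustrative names, image tag, and counts:

```yaml
# deployment.yaml — rolling update: replace pods gradually, zero downtime
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the rollout
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: example/order-service:1.2.0
```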

5. How do Metrics, Logs, and Traces work together?

Think of investigating a production issue like being a detective. Metrics are the alarm that tells you something is wrong ("error rate spiked at 2:30 PM"). Logs give you the details ("NullPointerException in OrderService.java line 42 for user-789"). Traces show you the full journey of a request across services ("this request spent 200ms in order-service, then 3000ms stuck in payment-service — that is the bottleneck!"). Together, they answer: what happened, where it happened, and why it happened.

