Techniques to Debug Memory Leaks in Spring Boot Scheduled Apps on Docker (Without Guessing)

1. The “slow boil” problem I keep seeing in scheduled containers

When a Spring Boot app only wakes up on a schedule, it can look healthy for hours or days, then suddenly start crawling, get OOM-killed, or restart in Docker like it’s playing whack-a-mole with the orchestrator. The tricky part is that scheduled workloads don’t leak the way a busy web API does; they leak in pulses. A job runs, allocates a lot, forgets to release some of it, then sleeps, so the heap graph looks calm until it doesn’t. That’s why I never start with “tuning the JVM.” I start by proving whether memory is retained across runs, and what is being retained.

1.1 Why Docker makes leaks feel more random than they are

In containers, memory limits are real walls. If the JVM believes it has more headroom than the container allows, it may size the heap, metaspace, and native allocations in a way that looks fine on bare metal but becomes fatal inside Docker. Even with container-aware JVMs, I still see the same pattern: heap gradually climbs, GC becomes more frequent, pause times increase, and then the container gets OOM-killed before I even see a Java OutOfMemoryError. When that happens, I treat it as two separate questions: “Is my process using too much total memory?” and “Is the Java heap retaining objects between schedules?” Because if the heap is stable but RSS grows, the culprit might be native memory, direct buffers, thread stacks, or even a logging/metrics sidecar configuration.
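
To make the "does the JVM see the same wall as Docker" question concrete, here is a minimal sketch that prints the JVM's max heap next to the container limit. The class name ContainerMemoryCheck is hypothetical, and the sketch assumes a Linux container exposing one of the two standard cgroup files; which of them exists depends on the host's cgroup version.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

final class ContainerMemoryCheck {

    // Standard cgroup v2 and v1 locations; only one is expected to exist.
    private static final Path CGROUP_V2 = Path.of("/sys/fs/cgroup/memory.max");
    private static final Path CGROUP_V1 = Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes");

    /** Parses a cgroup limit value; "max" (cgroup v2) means no limit is set. */
    static OptionalLong parseLimit(String raw) {
        String value = raw.trim();
        if (value.isEmpty() || value.equals("max")) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(value));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Reads the first readable cgroup limit file, if any. */
    static OptionalLong readContainerLimit() {
        for (Path p : new Path[] {CGROUP_V2, CGROUP_V1}) {
            try {
                if (Files.isReadable(p)) {
                    return parseLimit(Files.readString(p));
                }
            } catch (IOException ignored) {
                // Try the next candidate path.
            }
        }
        return OptionalLong.empty();
    }

    /** Fraction of the container limit that the max heap alone could occupy. */
    static double heapShare(long maxHeapBytes, long limitBytes) {
        return (double) maxHeapBytes / limitBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        OptionalLong limit = readContainerLimit();
        System.out.println("JVM max heap bytes: " + maxHeap);
        if (limit.isPresent()) {
            System.out.printf("Container limit bytes: %d, heap share: %.2f%n",
                    limit.getAsLong(), heapShare(maxHeap, limit.getAsLong()));
        } else {
            System.out.println("No readable container memory limit found");
        }
    }
}
```

If the heap share is close to 1.0, there is little room left for metaspace, thread stacks, and direct buffers, which is one way a container can be killed while the heap itself still looks fine.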

1.2 The fastest leak test: does used heap baseline rise after each run

Before I open any profiler, I first try to answer one simple thing: after each scheduled execution, does the “floor” of used heap come back down to roughly the same baseline, or does it stair-step upward? If it stair-steps upward, that’s classic retention. If the heap oscillates but the container memory grows, I shift focus to native memory, threads, or direct buffers. This one observation prevents me from wasting hours tuning GC for a bug that is basically “I accidentally kept references forever.”
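
The stair-step check can be sketched in a few lines. HeapFloorTracker is a hypothetical helper, not something from Spring or the JDK: it records used heap after each run and reports whether every sample is higher than the one before it by more than a tolerance. System.gc() is only a hint (and can be disabled with -XX:+DisableExplicitGC), so the readings are approximate.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class HeapFloorTracker {

    private final List<Long> floors = new ArrayList<>();

    /** Call at the end of each scheduled run; requests a GC, then records used heap. */
    synchronized void sampleAfterRun() {
        System.gc(); // a hint only; the JVM may ignore it
        record(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
    }

    /** Records one "floor" sample in bytes. */
    synchronized void record(long usedBytes) {
        floors.add(usedBytes);
    }

    /**
     * True when there are at least three samples and each one exceeds
     * the previous one by more than toleranceBytes.
     */
    synchronized boolean isStairStepping(long toleranceBytes) {
        if (floors.size() < 3) {
            return false;
        }
        for (int i = 1; i < floors.size(); i++) {
            if (floors.get(i) - floors.get(i - 1) <= toleranceBytes) {
                return false;
            }
        }
        return true;
    }
}
```

Calling sampleAfterRun() as the last statement of the @Scheduled method and logging isStairStepping(...) every few runs is usually enough to decide between "retention" and "look outside the heap".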

1.3 What “leak” usually means in scheduled Spring Boot jobs

Most memory leaks in Java aren’t “leaking” in the C sense; they’re references I didn’t mean to keep. In scheduled apps, the most common offenders I personally see are in-memory caches without eviction, static collections used as “temporary” storage, accumulating log/trace context, ThreadLocal misuse in pool threads, response bodies read into giant strings and stored for debugging, and library-level caches that grow based on input cardinality, such as tags/labels, dynamic class generation, or unbounded metrics dimensions.
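
Of these, ThreadLocal misuse is the easiest to show in isolation. The sketch below uses hypothetical names (ThreadLocalCleanupDemo, RUN_BUFFER) and assumes the job runs on a pooled thread that outlives each run, so anything left in the ThreadLocal stays reachable until the thread dies.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalCleanupDemo {

    // Hypothetical per-run scratch buffer bound to the executing thread.
    private static final ThreadLocal<StringBuilder> RUN_BUFFER = new ThreadLocal<>();

    /** Processes a payload and always clears the ThreadLocal afterwards. */
    static int processRun(String payload) {
        RUN_BUFFER.set(new StringBuilder(payload));
        try {
            return RUN_BUFFER.get().length(); // stand-in for real processing
        } finally {
            // Without this, the pooled thread keeps the buffer until it terminates.
            RUN_BUFFER.remove();
        }
    }

    /** Reports whether the current thread still holds a buffer. */
    static boolean bufferRetainedOnCurrentThread() {
        return RUN_BUFFER.get() != null;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            int length = pool.submit(() -> processRun("x".repeat(1024))).get();
            boolean retained = pool.submit(() -> bufferRetainedOnCurrentThread()).get();
            System.out.println("processed=" + length + ", retained=" + retained);
        } finally {
            pool.shutdown();
        }
    }
}
```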

2. Methods I use to reproduce and isolate the leak inside Docker

The goal is to stop treating the container as a black box. I want a reproducible run, a memory timeline, and at least one artifact I can inspect offline.

2.1 Run the container with “evidence switches” instead of blind tuning

When I’m chasing retention, I want heap dumps on OOM, GC logs, and a clear max heap that respects the container. I also want predictable scheduling frequency so the leak reproduces quickly. I usually run the job more frequently in a test environment, even every 10–30 seconds, because waiting 12 hours for a leak to show up is a great way to become best friends with despair.
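
One way to keep those switches from silently disappearing is to check for them at startup. The sketch below is an assumption about which flags count as "evidence switches": it looks for standard HotSpot options (heap dump on OOM, a dump path, unified GC logging, and a heap size relative to container memory). The class name EvidenceSwitches and the /dumps path in the comment are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class EvidenceSwitches {

    // Example launch line (paths are placeholders):
    //   java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps \
    //        -Xlog:gc*:file=/dumps/gc.log -XX:MaxRAMPercentage=75.0 -jar app.jar
    static final List<String> EXPECTED = List.of(
            "-XX:+HeapDumpOnOutOfMemoryError",
            "-XX:HeapDumpPath=",
            "-Xlog:gc",
            "-XX:MaxRAMPercentage=");

    /** Returns the expected flag prefixes that no JVM argument starts with. */
    static List<String> missing(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        for (String expected : EXPECTED) {
            boolean found = jvmArgs.stream().anyMatch(arg -> arg.startsWith(expected));
            if (!found) {
                missing.add(expected);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> missing = missing(jvmArgs);
        if (missing.isEmpty()) {
            System.out.println("All evidence switches present");
        } else {
            System.out.println("Missing evidence switches: " + missing);
        }
    }
}
```

The heap dump path should point at a mounted volume; a dump written to the container's own filesystem is lost when the container is removed.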

2.2 The two artifacts that change everything: GC log + heap dump

GC logs tell me whether memory pressure is real and whether the JVM can reclaim what it allocates. A heap dump tells me what is still referenced and who is holding it. If I only look at GC logs, I can misread legitimate cache growth as a leak, or miss a leak that hides in old-gen. If I only look at heap dumps, I might not understand whether the growth is monotonic or periodic. Together, they give me a story, not a snapshot.
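
When I cannot get at the GC log file right away, a rough in-process complement is to read the JVM's collector counters before and after a run. This is a sketch, not a replacement for the GC log: GcSnapshot is a hypothetical record, and the counters only give totals, not pause-by-pause detail.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

record GcSnapshot(long collections, long collectionMillis) {

    /** Sums collection count and accumulated time across all collectors. */
    static GcSnapshot capture() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(count, millis);
    }

    /** Difference between this snapshot and an earlier one. */
    GcSnapshot since(GcSnapshot earlier) {
        return new GcSnapshot(collections - earlier.collections,
                collectionMillis - earlier.collectionMillis);
    }
}
```

Capturing one snapshot at the start of the scheduled method and logging GcSnapshot.capture().since(start) at the end shows whether GC work per run is growing over time, which is the trend the GC log would confirm in detail.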

2.3 Image references I like to keep nearby while debugging

Here are a few visual references that are handy when I explain the workflow to teammates. I’m putting them as plain URLs so you can paste them into your browser directly.

JVM memory areas diagram (heap/metaspace/threads):
https://www.baeldung.com/wp-content/uploads/2018/07/jvm-memory-structure.png

GC cycles visual overview (young/old collections concept):
https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/images/gc_phases.png

Docker memory limit vs process RSS (conceptual):
https://docs.docker.com/config/containers/resource_constraints/images/memory-limit.png

3. A real Java example: a scheduled job that “accidentally remembers everything”

To keep this practical, I’m going to show a leak that looks innocent in code review, works fine in dev, and then quietly eats your container in production. The pattern is simple: I keep per-run data in a static structure “just for debugging,” and because scheduled jobs run forever, “just for debugging” becomes “permanent archival of my own pain.”

3.1 The leaking version

In this example, I simulate a scheduled sync that downloads payloads, parses them, and stores a “trace” so I can inspect recent runs. The bug is that the trace store is unbounded and stores large strings, and it is static so it survives forever.

package com.example.leakdemo;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

@Component
public class PartnerSyncJob {

    /*
      Bug: unbounded, static retention across the whole app lifetime.
      In a scheduled app, lifetime == "forever", so this grows until OOM.
     */
    private static final Map<String, String> DEBUG_RUN_PAYLOADS = new ConcurrentHashMap<>();

    @Scheduled(fixedDelayString = "${sync.fixedDelayMs:15000}")
    public void sync() {
        String runId = Instant.now() + "_" + UUID.randomUUID();

        // Simulate a large payload per run (e.g., JSON, CSV, HTML, etc.)
        String payload = generateLargePayload();

        // Simulate processing
        int processed = payload.hashCode(); // pretend parsing & saving happened

        // "Just keep it for debugging" -> memory retention across runs
        DEBUG_RUN_PAYLOADS.put(runId, payload);

        System.out.println("Run " + runId + " processed=" + processed
                + ", debugStoreSize=" + DEBUG_RUN_PAYLOADS.size());
    }

    private String generateLargePayload() {
        // ~5 MB string (roughly), repeated forever => container says bye.
        String chunk = "x".repeat(1024);
        StringBuilder sb = new StringBuilder(5 * 1024 * 1024);
        for (int i = 0; i < 5 * 1024; i++) {
            sb.append(chunk);
        }
        return sb.toString();
    }
}

3.2 Why this leaks in a way GC cannot “fix”

Garbage collection only reclaims objects that are no longer reachable. Here, every payload string is reachable through DEBUG_RUN_PAYLOADS. Even if the scheduled method ends, those payloads are still referenced, so they sit in old-gen. The job runs again, creates another multi-MB payload, stores it again, and now old-gen grows. Eventually, the JVM tries harder GC cycles, CPU spikes, and either I get a Java heap OOM or the container gets killed first. If you’ve ever seen a scheduled container that “runs fine but slowly gets heavier,” this is the exact logic trap: the code does exactly what I asked it to do, not what I meant.

3.3 The fixed version: bounded retention + safer debugging

If I really need “recent runs,” I store only a small bounded window, and I store summaries rather than full payloads. If I must keep samples, I keep a small snippet. Most importantly, I ensure there is an eviction policy.

package com.example.leakdemo;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.UUID;

@Component
public class PartnerSyncJobFixed {

    /*
      Keep only last N run summaries, not whole payloads.
      This makes memory usage stable across infinite scheduling.
     */
    private static final int MAX_RUN_HISTORY = 50;
    private static final Deque<RunSummary> RUN_HISTORY = new ArrayDeque<>(MAX_RUN_HISTORY);

    @Scheduled(fixedDelayString = "${sync.fixedDelayMs:15000}")
    public void sync() {
        String runId = Instant.now() + "_" + UUID.randomUUID();

        String payload = generateLargePayload();
        int processed = payload.hashCode();

        // Store only a snippet and size metadata.
        String snippet = payload.substring(0, Math.min(payload.length(), 2000));
        addSummary(new RunSummary(runId, payload.length(), processed, snippet));

        System.out.println("Run " + runId + " processed=" + processed
                + ", historySize=" + RUN_HISTORY.size());
    }

    private void addSummary(RunSummary summary) {
        // Simple bounded deque eviction; stable memory footprint.
        synchronized (RUN_HISTORY) {
            if (RUN_HISTORY.size() == MAX_RUN_HISTORY) {
                RUN_HISTORY.removeFirst();
            }
            RUN_HISTORY.addLast(summary);
        }
    }

    private String generateLargePayload() {
        String chunk = "x".repeat(1024);
        StringBuilder sb = new StringBuilder(5 * 1024 * 1024);
        for (int i = 0; i < 5 * 1024; i++) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    private record RunSummary(String runId, int payloadChars, int processedHash, String payloadSnippet) {}
}

3.4 What I just changed, in practical terms

I removed the unbounded map that grows linearly with time. Instead of storing full payloads forever, I store a bounded number of lightweight summaries. That means the heap can return to a baseline after each run because the job is no longer keeping a growing chain of references. If I still need full payloads for troubleshooting, I write them to an external store with retention policies, not to my heap. Containers are for running code, not for becoming accidental museums.

4. Debug workflow: how I prove what’s retaining memory

Once I can reproduce the issue, I move through a repeatable pipeline: observe, capture, analyze, fix, validate.

4.1 Observe memory the way Docker sees it, not the way I wish it behaved

Inside Docker, I care about process RSS and container memory. If container memory steadily rises but Java heap doesn’t, it’s a hint that the issue is outside heap. If both rise, retention is likely. I also watch thread count because scheduled apps sometimes create new threads per run through misconfigured executors, and each thread has stack memory plus associated objects, which makes a container grow in a way that looks like a leak.
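
A small sketch of that side-by-side view: it reads the process RSS from /proc/self/status (Linux only, which covers typical Docker images) and prints it next to used heap and live thread count. ProcessFootprint is a hypothetical helper name.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalLong;

final class ProcessFootprint {

    /** Extracts VmRSS in kB from the lines of /proc/self/status. */
    static OptionalLong parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Typical format: "VmRSS:     204800 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                try {
                    return OptionalLong.of(Long.parseLong(parts[0]));
                } catch (NumberFormatException e) {
                    return OptionalLong.empty();
                }
            }
        }
        return OptionalLong.empty();
    }

    /** Reads RSS for the current process; empty when /proc is unavailable. */
    static OptionalLong readVmRssKb() {
        try {
            return parseVmRssKb(Files.readAllLines(Path.of("/proc/self/status")));
        } catch (IOException e) {
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        long heapUsedKb = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / 1024;
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        OptionalLong rssKb = readVmRssKb();
        System.out.println("heapUsedKb=" + heapUsedKb
                + ", threads=" + threads
                + ", rssKb=" + (rssKb.isPresent() ? String.valueOf(rssKb.getAsLong()) : "n/a"));
    }
}
```

Logged once per run, these three numbers separate the cases described above: RSS rising with flat heap points outside the heap, and a climbing thread count points at executor configuration.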

4.2 Capture a heap dump at the right moment

Heap dumps are most useful when taken after several scheduled cycles, right when the “baseline” should be stable but isn’t. In practice, I take one dump early and one dump later, then compare dominator trees. The object that grows between dumps is usually the answer, and the “path to GC root” tells me which code path is holding it.
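
For the early and late dumps, one option on HotSpot JVMs is to trigger them from inside the process through HotSpotDiagnosticMXBean. This is a sketch under that assumption; HeapDumper and the file naming scheme are hypothetical, and the target directory should be a mounted volume so the files survive the container.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;

final class HeapDumper {

    /** Builds a unique .hprof path; dumpHeap fails if the file already exists. */
    static Path dumpPath(Path dir, String label, long epochMillis) {
        return dir.resolve("heap-" + label + "-" + epochMillis + ".hprof");
    }

    /**
     * Writes a dump of live (reachable) objects only, which is the set
     * that matters when looking for retention.
     */
    static Path dump(Path dir, String label) throws IOException {
        Path target = dumpPath(dir, label, System.currentTimeMillis());
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toString(), true);
        return target;
    }
}
```

Calling dump(dir, "early") after a handful of runs and dump(dir, "late") after many more gives the two files to compare. Dumping pauses the application while the file is written, so I would only wire this into a test environment or behind a guarded endpoint.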

4.3 Read the heap dump like a detective, not like a tourist

When I open a heap dump in Eclipse MAT or VisualVM, I don’t start by searching for my own package name. I start with the biggest retained sizes and ask “what is the GC root?” If the GC root is a static field, it often means a cache or a singleton. If the GC root is a thread, I suspect ThreadLocal, executor queues, or blocked references waiting in a queue. If the GC root is a classloader, I suspect dynamic class generation or redeploy/reload issues. This is also where scheduled apps are sneaky: a queue can grow slowly if scheduling creates backpressure and tasks overlap, leaving run artifacts referenced in executor structures.

5. Docker + Scheduled apps: the leak patterns I see most often

This is where debugging becomes less mystical and more like pattern matching, because scheduled apps are predictable beasts.

5.1 Overlapping executions that accumulate work-in-memory

If a job runs every minute but sometimes takes two minutes, overlap happens. Overlap is not just “two executions at once”; it becomes “more queued tasks than I can finish.” That queue retains references to task arguments, payloads, closures, and sometimes whole result sets. The app isn’t leaking by reference mistake; it’s leaking by backlog. In that situation, fixing memory means fixing scheduling semantics, concurrency, and backpressure, not GC flags.
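
One way to put a hard cap on that backlog is to hand work to an executor with a bounded queue and reject what does not fit. The sketch below is plain java.util.concurrent rather than Spring configuration; BoundedJobRunner is a hypothetical wrapper, and skipping a run on rejection is an assumption that only suits jobs where a missed run is acceptable.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedJobRunner {

    private final ThreadPoolExecutor executor;

    BoundedJobRunner(int maxQueued) {
        // One worker, a bounded queue, and AbortPolicy:
        // excess work is rejected instead of being retained in memory.
        this.executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxQueued),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Returns false when the backlog is full, so the caller can skip this run. */
    boolean trySubmit(Runnable job) {
        try {
            executor.execute(job);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    /** Number of jobs waiting behind the one currently running. */
    int queued() {
        return executor.getQueue().size();
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

With maxQueued set to 1, at most one run executes and one waits; everything else is dropped at submission time, so the queue can never hold more than one payload's worth of references.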

5.2 Unbounded caches, especially when keys are high-cardinality

If I build a cache keyed by something like userId, partnerId, or request signature, but that domain is effectively infinite over time, the cache becomes a slow, polite OOM. In scheduled apps, it’s common to cache “the last response per partner” or “the last successful payload per day,” and a tiny design oversight turns it into “the last payload for every run forever.” If I need caching, I treat eviction like a first-class requirement, not an optional improvement.
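
As a minimal illustration of eviction as a first-class requirement, here is a size-bounded LRU cache built on LinkedHashMap from the JDK. BoundedLruCache is a hypothetical name, and the sketch is not thread-safe on its own; wrapping it with Collections.synchronizedMap or using a dedicated caching library are the usual alternatives.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        // accessOrder = true: iteration order goes from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently used entry when over capacity.
        return size() > maxEntries;
    }
}
```

For a "last response per partner" cache, the capacity would be sized to the number of active partners, so a run of new or one-off keys pushes out stale entries instead of growing the heap.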

5.3 Metrics and logging labels that explode memory silently

A classic production-only leak comes from label cardinality. If I include dynamic IDs in metric tags, the metrics registry can store a growing number of meter instances. That’s not a bug in Micrometer; it’s my tag design. Scheduled jobs are especially vulnerable because they often process many IDs in loops, and one “tag this with entityId” becomes a meter-per-entity retention structure. The result looks like a memory leak but is actually “I created millions of unique meters.”
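
The effect is easy to reproduce without any metrics library. The sketch below is a stand-in for a registry, not Micrometer's API: it keeps one counter per unique name and tag combination, which is enough to show how the number of retained objects follows tag cardinality. All names in it are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class TagCardinalityDemo {

    // Simulated registry: one entry per unique name + tag combination.
    private final Map<String, LongAdder> meters = new ConcurrentHashMap<>();

    void increment(String name, String tagKey, String tagValue) {
        String meterId = name + "{" + tagKey + "=" + tagValue + "}";
        meters.computeIfAbsent(meterId, id -> new LongAdder()).increment();
    }

    int meterCount() {
        return meters.size();
    }

    public static void main(String[] args) {
        TagCardinalityDemo perEntity = new TagCardinalityDemo();
        TagCardinalityDemo perOutcome = new TagCardinalityDemo();
        for (int entityId = 0; entityId < 10_000; entityId++) {
            // Unbounded tag: every entity creates a new retained meter.
            perEntity.increment("sync.processed", "entityId", String.valueOf(entityId));
            // Bounded tag: the meter count stays fixed no matter how many entities run.
            perOutcome.increment("sync.processed", "outcome",
                    entityId % 10 == 0 ? "failure" : "success");
        }
        System.out.println("perEntity meters=" + perEntity.meterCount()
                + ", perOutcome meters=" + perOutcome.meterCount());
    }
}
```

The per-entity detail still has a home: it belongs in logs or traces, where retention is handled by the backend rather than by the application's heap.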

6. Validation: how I know I actually fixed it

After the fix, I run the same schedule frequency and workload long enough to see multiple cycles. I’m looking for the baseline to stop climbing and for GC behavior to return to normal cadence. I also keep an eye on container RSS because the real-world failure mode is the container memory limit, not a pretty Java exception. If the baseline stays stable, I consider it fixed. If it still rises, I return to the heap dump comparison, because memory doesn’t lie—people do.

7. Content clusters I usually link internally for teams maintaining scheduled Spring Boot jobs

Scheduled memory leaks rarely live alone. When I write documentation for a team, I connect this topic to adjacent pieces so future incidents are faster to resolve, such as JVM memory model basics, GC logging interpretation, executor configuration for Spring scheduling, container memory sizing, and metrics tag cardinality guidelines. That “cluster” approach is what turns one-off firefighting into a stable operating model, which is the only kind of model I enjoy maintaining.

8. Closing thought: the simplest rule that saves the most containers

If a scheduled app runs forever, anything I store in-memory without a cap is effectively a time bomb. The heap is not my debug database, not my archive, not my long-term cache, and definitely not my scrapbook of payloads. When I respect that, memory leaks become a solvable engineering problem instead of a recurring horror franchise.

If you want, comment below with your setup details—your schedule frequency, Docker memory limit, JVM version, and what you observe first (heap stair-step, RSS growth, or thread count growth)—and I’ll help you narrow down the most likely leak pattern.
