Autoscaling Azure DevOps Agents with Container Apps and KEDA

How I built a zero-to-ten self-hosted agent pool that scales on demand, costs nothing when idle, and includes a warm standby for instant job pickup

What if your CI/CD agents only existed when you needed them? No idle VMs burning money overnight. No cold starts when the morning rush hits. Just agents that materialise when pipelines queue and vanish when they’re done.

This is the promise of event-driven autoscaling - and with Azure Container Apps Jobs and KEDA, it’s surprisingly straightforward to achieve.

[Figure: Container Apps agents architecture]

The Problem with Traditional Self-Hosted Agents

Self-hosted Azure DevOps agents typically run on VMs or containers that are always on. You pay for them 24/7, even when no pipelines are running. For a small team, this might be a few hundred dollars a month. For larger organisations with multiple pools, it adds up fast.

Microsoft-hosted agents solve the idle cost problem but introduce others:

  • Cold start delays of 30-60 seconds per job
  • Limited customisation - can’t install proprietary tools
  • No network access to private resources without complex workarounds

What I wanted was the best of both worlds: zero cost when idle, custom tooling, and fast job pickup.

The Solution: Container Apps Jobs with KEDA

Azure Container Apps has a feature called Jobs that’s perfect for this use case. Unlike regular Container Apps (which run continuously), Jobs spin up on demand, execute, and terminate.

KEDA (Kubernetes Event-Driven Autoscaling) is built into Container Apps and includes a scaler specifically for Azure Pipelines. It monitors your agent pool’s queue and triggers new containers when jobs are waiting.

How It Works

  1. Pipeline queues a job
  2. KEDA polls the Azure DevOps API (every 10-30s)
  3. Queue length > 0 is detected
  4. Container App Job is triggered
  5. Agent container starts (~30-60s)
  6. Agent registers with the pool
  7. Agent picks up the job and runs it
  8. Job completes → agent unregisters → container exits
  9. Queue empty → no new containers started

The entire lifecycle is automated. KEDA handles the scaling decisions, Azure manages the infrastructure, and you only pay for actual execution time.
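The queue signal KEDA scales on comes from the agent pool's job-requests endpoint in the Azure DevOps REST API. If you want to eyeball that signal yourself, here's a rough sketch - the org URL, pool ID, and the jq filter are illustrative assumptions, not the scaler's exact logic:

```shell
#!/usr/bin/env bash
# Inspect the signal KEDA scales on: job requests for an agent pool.
# AZP_URL and POOL_ID are hypothetical values - substitute your own.
AZP_URL="https://dev.azure.com/my-org"
POOL_ID=7   # numeric ID of the pool (e.g. p80-uksouth)

# Build the job-requests URL for a given pool ID.
pool_jobs_url() {
  echo "${AZP_URL}/_apis/distributedtask/pools/${1}/jobrequests"
}

# Rough queue-depth check: count requests not yet assigned to an agent.
# Needs a PAT in $AZP_TOKEN with "Agent Pools (read)" scope.
queued_jobs() {
  curl -s -u ":${AZP_TOKEN}" "$(pool_jobs_url "$POOL_ID")" |
    jq '[.value[] | select(.result == null and .assignTime == null)] | length'
}

pool_jobs_url "$POOL_ID"
```

Running `queued_jobs` while a pipeline is waiting should show a non-zero count - the same condition that makes KEDA trigger a job execution.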

Infrastructure as Code

The infrastructure is defined in Terraform. Here are the key components:

Container Apps Environment

First, we need an environment to host our jobs:

resource "azurerm_container_app_environment" "agents" {
  name                       = "cae-azp-agents"
  location                   = azurerm_resource_group.agents.location
  resource_group_name        = azurerm_resource_group.agents.name
  log_analytics_workspace_id = azurerm_log_analytics_workspace.agents.id
}

The Agent Job with KEDA Scaling

The magic happens in the Container App Job definition:

resource "azurerm_container_app_job" "agents" {
  name                         = "azp-agent-job"
  container_app_environment_id = azurerm_container_app_environment.agents.id
  
  replica_timeout_in_seconds = 1800  # 30 min max job duration

  event_trigger_config {
    scale {
      min_executions              = 0   # Scale to zero!
      max_executions              = 10  # Cap concurrent agents
      polling_interval_in_seconds = 10  # Check queue frequently

      rules {
        name             = "azure-pipelines"
        custom_rule_type = "azure-pipelines"

        metadata = {
          poolName                   = "p80-uksouth"
          targetPipelinesQueueLength = "1"  # 1 agent per job
        }
      }
    }
  }
  # ... container template with agent image
}

The targetPipelinesQueueLength = "1" means KEDA will spawn one agent per queued job. If 5 jobs are waiting, 5 containers spin up (up to max_executions).
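The target behaves like any KEDA metric: desired executions ≈ ceil(queued / target), capped at max_executions, less what's already running. A toy model of that decision - a simplification for intuition, not KEDA's actual source:

```shell
# Simplified model of the scaling decision (not KEDA's real code):
# desired = ceil(queued / target), capped at max, minus running executions.
executions_to_start() {
  local queued=$1 target=$2 max=$3 running=$4
  local desired=$(( (queued + target - 1) / target ))  # ceiling division
  (( desired > max )) && desired=$max
  local delta=$(( desired - running ))
  (( delta < 0 )) && delta=0
  echo "$delta"
}

executions_to_start 5 1 10 0   # 5 queued, none running -> start 5
executions_to_start 15 1 10 4  # demand capped at 10, 4 running -> start 6
```

With target = 1 and max_executions = 10, five queued jobs produce five containers; a burst of fifteen is throttled to ten concurrent agents.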

The N+1 Warm Standby Pattern

There’s one problem with pure event-driven scaling: cold starts. Every time the queue goes from empty to having jobs, there’s a 30-60 second delay while the container starts and the agent registers.

For bursty workloads (like when multiple PRs merge in quick succession), this adds up. The solution is an N+1 standby agent - a warm spare that’s ready to grab the next job instantly.

How N+1 Works

When a job is queued, both agents spin up:

Job Queued

    ├──► Job Agent (Container App Job)
    │       • Runs the pipeline job
    │       • Exits when done (--once mode)

    └──► Standby Agent (Container App)
            • Registers and waits
            • Stays running (persistent mode)
            • Picks up NEXT job instantly
            • Scales to 0 after 5 min idle

The standby uses a regular Container App (not a Job) with the same KEDA scale rule. It scales up when the queue has jobs and scales down after 5 minutes of inactivity.

resource "azurerm_container_app" "standby" {
  name          = "azp-agent-standby"
  revision_mode = "Single"

  template {
    min_replicas = 0
    max_replicas = 1

    custom_scale_rule {
      name             = "azure-pipelines-standby"
      custom_rule_type = "azure-pipelines"
      metadata = {
        poolName                   = "p80-uksouth"
        targetPipelinesQueueLength = "1"
      }
    }
  }
}

The difference is in the agent’s run mode. Job agents run with --once and exit after completing a single job. The standby runs persistently, handling multiple jobs while it’s up.

The Placeholder Agent Trick

There’s a quirk with Azure DevOps: if an agent pool is completely empty (no agents, not even offline ones), jobs targeting that pool fail immediately rather than waiting for an agent.

The fix is a placeholder agent - an offline agent that exists just to make the pool “valid”. We register it once using a manual Container App Job:

resource "azurerm_container_app_job" "placeholder" {
  name = "azp-agent-placeholder"
  
  manual_trigger_config {
    parallelism = 1
  }
  
  template {
    container {
      env {
        name  = "AZP_PLACEHOLDER"
        value = "1"
      }
    }
  }
}

When AZP_PLACEHOLDER=1, the start script registers the agent and exits immediately without cleaning up. The agent stays in the pool as “offline” forever, which is enough to make Azure DevOps queue jobs properly.

The Agent Docker Image

The agent runs Ubuntu 24.04 with the tools we need:

  • .NET 10 SDK
  • Node.js 20
  • Azure CLI
  • Terraform
  • PowerShell

The key to making it work with Container Apps Jobs is the start.sh script that handles three modes:

  1. One-shot mode (default): Run one job and exit (--once)
  2. Persistent mode (AZP_RUN_ONCE=false): Keep running for the standby
  3. Placeholder mode (AZP_PLACEHOLDER=1): Register and exit without cleanup
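The mode selection in start.sh might look like the sketch below. In the real script every path first registers via the standard agent entry point (./config.sh --unattended --url "$AZP_URL" --auth pat --token "$AZP_TOKEN" --pool "$AZP_POOL"), then either sets a cleanup trap that runs ./config.sh remove on exit, or - for the placeholder - deliberately skips it:

```shell
# Sketch of start.sh's mode selection (pure function, easy to test).
# The caller registers the agent first, then branches on the result:
#   placeholder -> exit with NO cleanup trap, so the agent stays offline
#   persistent  -> trap cleanup EXIT; ./run.sh        (standby agent)
#   once        -> trap cleanup EXIT; ./run.sh --once (job agent)
run_mode() {
  # $1 = AZP_PLACEHOLDER, $2 = AZP_RUN_ONCE
  if [ "$1" = "1" ]; then
    echo "placeholder"
  elif [ "$2" = "false" ]; then
    echo "persistent"
  else
    echo "once"
  fi
}

run_mode "${AZP_PLACEHOLDER:-0}" "${AZP_RUN_ONCE:-true}"
```

Keeping the branch logic in a small pure function makes it trivial to unit-test the one behaviour that's easy to get wrong: the placeholder path must never install the cleanup trap, or the agent unregisters itself and the pool goes empty again.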

Cost Analysis

Here’s what this architecture costs for a small team:

| Scenario | Monthly cost |
| --- | --- |
| Idle (no pipelines) | $0 |
| 100 hours of pipeline execution | ~$5-10 |
| 500 hours of pipeline execution | ~$25-50 |

Compare this to running even a single B2s VM 24/7: ~$30/month. And that’s without autoscaling or redundancy.

The N+1 standby adds ~$0.05-0.10/hour during active periods, but saves 30-60 seconds on every subsequent job in a burst.

Deployment Pipeline

The infrastructure deploys via Azure DevOps (using Microsoft-hosted agents for bootstrapping):

  1. Terraform Apply - Creates Container Apps environment, ACR, jobs
  2. Build Agent Image - Builds Dockerfile in ACR
  3. Register Placeholder - Runs the placeholder job once

After the first deployment, the pool is ready. Queue a job and watch KEDA spin up agents automatically.
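The last two steps can also be driven from a terminal with the az containerapp job commands. A dry-run sketch - rg-azp-agents is a hypothetical resource-group name, and the commands are printed rather than executed:

```shell
# Print the post-deployment az CLI calls (dry run - remove the echo
# wrapper, or pipe the output to sh, to execute them for real).
# The resource-group name is a hypothetical placeholder.
post_deploy_cmds() {
  echo "az containerapp job start --name azp-agent-placeholder --resource-group $1"
  echo "az containerapp job execution list --name azp-agent-job --resource-group $1 -o table"
}

post_deploy_cmds "rg-azp-agents"
```

The first command fires the one-off placeholder registration; the second lists job executions, which is a handy way to watch KEDA spin agents up as pipelines queue.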

Lessons Learned

KEDA polling interval matters. The default 30 seconds means jobs can wait up to 30s before an agent even starts spinning up. Reduce to 10s for faster response.

Placeholder agents are essential. Without at least one agent (even offline) in the pool, Azure DevOps rejects jobs immediately.

Cleanup traps need care. The agent’s default behaviour is to unregister on exit. For placeholders, you must disable the cleanup trap or the agent disappears when the container stops.

Container Apps vs Container Apps Jobs. Use Jobs for one-shot workloads (like pipeline agents), regular Container Apps for persistent services (like the standby).

Conclusion

Event-driven autoscaling for CI/CD agents is one of those “why didn’t I do this sooner” improvements. The infrastructure is simple, the cost savings are real, and the developer experience improves with faster job pickup during active periods.

The full Terraform code is available in my devops repository.


Have questions about this architecture? Find me on LinkedIn or drop me a message via the contact form.