docker, kubernetes in aws 😡


So, I had a problem. I decided to solve it with docker; now I had two problems. But I’m a glutton for punishment, so I decided to orchestrate my containers with kubernetes, and ended up with 99 problems.

What was I trying to solve? I wanted scalable, self-hosted Azure DevOps agents. The examples here and here got me started. Sadly, most of the examples target Azure Kubernetes Service, and by all accounts, Azure just seems to know how to do managed services. AWS, on the other hand, was an acute pain.

AWS’ kubernetes offering is called Elastic Kubernetes Service (eks), and by default you’re expected to configure your node groups with ec2 instances. You see, AWS loves ec2 instances. The addiction runs deep. But as they say: we move! I created my cluster to use Fargate instead.

Creating a docker image

To start with, I built the docker image based on the examples, then built and ran the container on my local machine (macOS). Sweet. Agents were registered with Azure DevOps (on-prem) and Bob’s your uncle. Then I tried to host this image on AWS and got the following error:

standard_init_linux.go:228: exec user process caused: no such file or directory

Google results pointed to crlf vs lf line endings and the dos2unix utility. Let me just say, I lost some time to this red herring. After searching just for the error code “standard_init_linux.go:228”, I realised the image was the culprit. You see, the Azure DevOps examples are based on Ubuntu images, but Fargate only supports Amazon Linux based containers, and their documentation says as much.

To address all this, I had to start with a new dockerfile:

# yup
FROM amazonlinux:latest

# The Ubuntu-based example sets DEBIAN_FRONTEND=noninteractive so apt-get never asks
# for confirmation; that is meaningless on Amazon Linux, so it stays commented out here.
# ENV DEBIAN_FRONTEND=noninteractive

# note: the Ubuntu examples install libicu67; on Amazon Linux the yum package is just libicu
RUN yum update -y && yum install -y git cmake gcc make \
    openssl-devel libssh2-devel openssh-server \
    ca-certificates curl wget jq tar libicu zip unzip \
    openssl tree \
    git-daemon

RUN yum install -y java-11-amazon-corretto-headless... #WHY?!?!?!
#....

All the apt-get install commands had to be replaced with yum install, where possible. I installed the dotnet core sdk, and then had some weird behaviour with yum install -y java-11-openjdk-devel: the package was simply not found. Eventually I settled for yum install -y java-11-amazon-corretto-headless, and that worked consistently.
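For completeness, this is roughly what those two installs look like in the dockerfile. The dotnet channel and install path below are my own choices, so treat it as a sketch rather than gospel:

# dotnet core sdk via Microsoft's generic install script
# (channel and install dir are assumptions; pick what your pipelines need)
RUN curl -sSL https://dot.net/v1/dotnet-install.sh -o /tmp/dotnet-install.sh \
    && bash /tmp/dotnet-install.sh --channel 6.0 --install-dir /usr/share/dotnet \
    && ln -s /usr/share/dotnet/dotnet /usr/local/bin/dotnet \
    && rm /tmp/dotnet-install.sh

# java: openjdk-devel was nowhere to be found, corretto works every time
RUN yum install -y java-11-amazon-corretto-headless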

The docker image is built and pushed to Amazon’s Elastic Container Registry (ecr). Now on to hosting.
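The build-and-push dance is the usual one. The account id, region and repository name below are placeholders:

aws ecr create-repository --repository-name azdo-agent

aws ecr get-login-password --region eu-central-1 \
    | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-central-1.amazonaws.com

docker build -t azdo-agent .
docker tag azdo-agent:latest 123456789012.dkr.ecr.eu-central-1.amazonaws.com/azdo-agent:latest
docker push 123456789012.dkr.ecr.eu-central-1.amazonaws.com/azdo-agent:latest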

Creating a “serverless” kubernetes cluster in AWS

The cluster creation came with its own challenges. I initially used eksctl, which creates and executes a dynamic CloudFormation template. However, since I wanted a repeatable process, I copied the resulting template and tweaked it before applying it.
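For context, the eksctl starting point was something along these lines (the cluster name matches the template further down, the region is an assumption), with the generated CloudFormation then lifted and tweaked:

eksctl create cluster \
    --name p80-solutions-devops-cluster \
    --region eu-central-1 \
    --fargate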

Now I had an image, and with my kubernetes manifest I ran:

kubectl apply -f /tmp/azdo.yaml --namespace=azure-devops

No dice. error: You must be logged in to the server (Unauthorized). Turns out that if you create the cluster with one iam role, forget about connecting to it with kubectl from another role. Just forget it. So tear it all down and redeploy it with the role you intend to use (or one that can assume that role).
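For the record, the less drastic route is to map the second iam role into the cluster’s aws-auth ConfigMap, using the credentials of the role that created the cluster. With eksctl it looks something like this (the role arn and username are placeholders):

eksctl create iamidentitymapping \
    --cluster p80-solutions-devops-cluster \
    --arn arn:aws:iam::123456789012:role/my-kubectl-role \
    --group system:masters \
    --username kubectl-admin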

After connecting, the pods were all stuck in the pending state, including all the coredns pods. Remember the addiction to ec2 instances? No coredns, no agents. The Update CoreDNS section of Amazon’s documentation explains that the coredns deployment is annotated with eks.amazonaws.com/compute-type: ec2, which keeps it off Fargate.

So, you need to run the following to fix that:

kubectl patch deployment coredns \
    -n kube-system \
    --type json \
    -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'

Then delete and recreate the pods: kubectl rollout restart -n kube-system deployment coredns
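Keep in mind the patch only helps if a Fargate profile actually matches the kube-system namespace. If you don’t have one yet, eksctl can add it, and then you can watch the coredns pods come up (the profile name here is my own choice):

eksctl create fargateprofile \
    --cluster p80-solutions-devops-cluster \
    --name coredns \
    --namespace kube-system

kubectl get pods -n kube-system -l k8s-app=kube-dns --watch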

After this, I was finally able to deploy the pods. Oh, did I mention I also got ecs working? But that’s a story for another day. I use the metadata from the pod name or ecs task name to name my agents, for ease of identification.
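On the kubernetes side, that naming trick is just the downward API. Assuming the container’s start script reads AZP_AGENT_NAME, as Microsoft’s example agent script does, a fragment like this in the pod spec is enough (the image and container name are placeholders):

containers:
  - name: azdo-agent
    image: 123456789012.dkr.ecr.eu-central-1.amazonaws.com/azdo-agent:latest
    env:
      - name: AZP_AGENT_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name   # the pod name becomes the agent name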

Below is my cloudformation template for creating the cluster:

AWSTemplateFormatVersion: 2010-09-09
Description: >-
  EKS cluster (dedicated VPC: false, dedicated IAM: true) [created and managed
  by eti@p82.com ]
Parameters:
  1Name: # I always use this to have the stack name easily accessible :)
    Description: Name
    Type: String
    Default: p80-solutions-azure-devops-eks-cluster
Mappings:
  ServicePrincipalPartitionMap:
    aws:
      EC2: ec2.amazonaws.com
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
    aws-cn:
      EC2: ec2.amazonaws.com.cn
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
    aws-us-gov:
      EC2: ec2.amazonaws.com
      EKS: eks.amazonaws.com
      EKSFargatePods: eks-fargate-pods.amazonaws.com
  Tags:
    Name:
      Value: "my-cluster"
    Appid:
      Value: "p80"
    Appname:
      Value: "p80-solutions"
    Costcenter:
      Value: "12345"
    Owner:
      Value: "eti@p82.com"
Resources:
  ClusterSharedNodeSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Communication between all nodes in the cluster
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}/ClusterSharedNodeSecurityGroup'
        - Key: Appid
          Value: !FindInMap 
            - Tags
            - Appid
            - Value
        - Key: Appname
          Value: !FindInMap 
            - Tags
            - Appname
            - Value
        - Key: Costcenter
          Value: !FindInMap 
            - Tags
            - Costcenter
            - Value
        - Key: Owner
          Value: !FindInMap 
            - Tags
            - Owner
            - Value
      VpcId: vpc-123456789
  ControlPlane:
    Type: 'AWS::EKS::Cluster'
    Properties:
      KubernetesNetworkConfig:
        IpFamily: ipv4
      Name: p80-solutions-devops-cluster
      ResourcesVpcConfig:
        EndpointPrivateAccess: true
        EndpointPublicAccess: true
        SecurityGroupIds:
          - !Ref ControlPlaneSecurityGroup
          - sg-123456
          - sg-789012
          - sg-345678
        SubnetIds:
          - subnet-12345678
          - subnet-56781234
          - subnet-90121234
      EncryptionConfig: 
      - Provider:
          KeyArn: 'arn:aws:kms:eu-central-1:123456789012:key/27d5e1dc-fd51-4a2a-bdc4-932f5e83bcce'
        Resources: 
        - secrets
      RoleArn: 'arn:aws:iam::123456789012:role/p80-solutions-eks-role'
      Logging:
        ClusterLogging:
          EnabledTypes:
            - Type: api
            - Type: audit
            - Type: controllerManager
            - Type: scheduler
            - Type: authenticator
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}/ControlPlane'       
        - Key: Appid
          Value: !FindInMap 
            - Tags
            - Appid
            - Value
        - Key: Appname
          Value: !FindInMap 
            - Tags
            - Appname
            - Value
        - Key: Costcenter
          Value: !FindInMap 
            - Tags
            - Costcenter
            - Value
        - Key: Owner
          Value: !FindInMap 
            - Tags
            - Owner
            - Value
        - Key: purpose
          Value: 'Azure DevOps'
      Version: '1.22'
  ControlPlaneSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Communication between the control plane and worker nodegroups
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}/ControlPlaneSecurityGroup'
        - Key: Appid
          Value: !FindInMap 
            - Tags
            - Appid
            - Value
        - Key: Appname
          Value: !FindInMap 
            - Tags
            - Appname
            - Value
        - Key: Costcenter
          Value: !FindInMap 
            - Tags
            - Costcenter
            - Value
        - Key: Owner
          Value: !FindInMap 
            - Tags
            - Owner
            - Value
      VpcId: vpc-123456789
  IngressDefaultClusterToNodeSG:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      Description: >-
        Allow managed and unmanaged nodes to communicate with each other (all
        ports)
      FromPort: 0
      GroupId: !Ref ClusterSharedNodeSecurityGroup
      IpProtocol: '-1'
      SourceSecurityGroupId: !GetAtt 
        - ControlPlane
        - ClusterSecurityGroupId
      ToPort: 65535
  IngressInterNodeGroupSG:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      Description: Allow nodes to communicate with each other (all ports)
      FromPort: 0
      GroupId: !Ref ClusterSharedNodeSecurityGroup
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref ClusterSharedNodeSecurityGroup
      ToPort: 65535
  IngressNodeToDefaultClusterSG:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      Description: Allow unmanaged nodes to communicate with control plane (all ports)
      FromPort: 0
      GroupId: !GetAtt 
        - ControlPlane
        - ClusterSecurityGroupId
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref ClusterSharedNodeSecurityGroup
      ToPort: 65535
      
Outputs:
  ARN:
    Value: !GetAtt 
      - ControlPlane
      - Arn
    Export:
      Name: !Sub '${AWS::StackName}::ARN'
  CertificateAuthorityData:
    Value: !GetAtt 
      - ControlPlane
      - CertificateAuthorityData
  ClusterSecurityGroupId:
    Value: !GetAtt 
      - ControlPlane
      - ClusterSecurityGroupId
    Export:
      Name: !Sub '${AWS::StackName}::ClusterSecurityGroupId'
  ClusterStackName:
    Value: !Ref 'AWS::StackName'
  Endpoint:
    Value: !GetAtt 
      - ControlPlane
      - Endpoint
    Export:
      Name: !Sub '${AWS::StackName}::Endpoint'
  FeatureNATMode:
    Value: Disable
  SecurityGroup:
    Value: !Ref ControlPlaneSecurityGroup
    Export:
      Name: !Sub '${AWS::StackName}::SecurityGroup'
  SharedNodeSecurityGroup:
    Value: !Ref ClusterSharedNodeSecurityGroup
    Export:
      Name: !Sub '${AWS::StackName}::SharedNodeSecurityGroup'
  SubnetsPrivate:
    Value: !Join 
      - ','
      - - subnet-12345678
        - subnet-56781234
    Export:
      Name: !Sub '${AWS::StackName}::SubnetsPrivate'
  SubnetsPublic:
    Value: subnet-90121234
    Export:
      Name: !Sub '${AWS::StackName}::SubnetsPublic'
  VPC:
    Value: vpc-123456789
    Export:
      Name: !Sub '${AWS::StackName}::VPC'