Page 1: Lessons learned from writing
Page 4: We are trying to build this…
Page 6: If you just read the headlines, it all sounds
Page 7: Kubernetes, Docker, serverless, microservices, infrastructure as code, distributed tracing, big data systems, data warehouses, data lakes, chaos engineering, zero-trust architecture, streaming architecture, immutable infrastructure, service discovery, service meshes, NoSQL, NewSQL, ChatOps, HugOps,
Page 8: But to me, it doesn’t feel
Page 13: Here’s something we don’t
Page 14: Building production-grade
Page 18:
Project | Examples | Time estimate
Managed service | ECS, ELB, RDS, ElastiCache | 1 – 2 weeks
Distributed system (stateless) | nginx, Node.js app, Rails app | 2 – 4 weeks
Distributed system (stateful) | Elasticsearch, Kafka, MongoDB | 2 – 4 months
Entire cloud architecture | Apps, DBs, CI/CD, monitoring, etc. | 6 – 24 months
Page 20: One trend I love: manage
Page 21: Manual DBA work
Page 23: we’ve created a reusable library of
Page 24: Primarily written in Terraform, Go,
Page 25: Off-the-shelf, battle-tested solutions for AWS, Docker, VPCs, VPN, MySQL, Postgres, Couchbase, Elasticsearch, Kafka, ZooKeeper,
Page 26: The library is used in production by hundreds of customers
Page 28:
Project | Examples | Time estimate |
Managed service | ECS, ELB, RDS, ElastiCache | 1 – 2 weeks | 1 day
Distributed system (stateless) | nginx, Node.js app, Rails app | 2 – 4 weeks | 1 day
Distributed system (stateful) | Elasticsearch, Kafka, MongoDB | 2 – 4 months | 1 day
Entire cloud architecture | Apps, DBs, CI/CD, monitoring, etc. | 6 – 24 months | 1 day
Page 30: In this talk, I’ll share what we
Page 31: I’m
ybrikman.com
Page 32: Co-founder of
Page 36:
Project | Examples | Time estimate
Managed service | ECS, ELB, RDS, ElastiCache | 1 – 2 weeks
Distributed system (stateless) | nginx, Node.js app, Rails app | 2 – 4 weeks
Distributed system (stateful) | Elasticsearch, Kafka, MongoDB | 2 – 4 months
Entire cloud architecture | Apps, DBs, CI/CD, monitoring, etc. | 6 – 24 months
Page 38: How can it possibly take that
Page 41: Yak shaving: a seemingly endless series of small tasks you have to do before you can do what you actually
Page 44: The production-grade
Page 45:
Task | Description | Example tools
Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet
Configure | Configure the software at runtime: e.g., configure port settings, file paths, users, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet
Provision | Provision the infrastructure: e.g., EC2 Instances, load balancers, network topology, security groups, IAM permissions, etc. | Terraform, CloudFormation
Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime: e.g., blue-green, rolling, canary deployments. | Scripts, orchestration tools (ECS, K8S, Nomad)
Page 46:
Task | Description | Example tools
Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, EBS Volumes, Cognito, Vault, CIS
Monitoring | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, alerting. | CloudWatch, DataDog, New Relic, Honeycomb
Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail
Page 47:
Task | Description | Example tools
Networking | VPCs, subnets, static and dynamic IPs, service discovery, service mesh, firewalls, DNS, SSH access, VPN access. | EIPs, ENIs, VPCs, NACLs, SGs, Route 53, OpenVPN
High availability | Withstand outages of individual processes, EC2 Instances, services, Availability Zones, and regions. | Multi AZ, multi-region, replication, ASGs, ELBs
Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | ASGs, replication, sharding, caching, divide and conquer
Performance | Optimize CPU, memory, disk, network, and GPU usage. Query tuning. Benchmarking, load testing, profiling. | Dynatrace, valgrind, VisualVM, ab, JMeter
Page 48:
Task | Description | Example tools
Cost optimization | Pick proper instance types, use spot and reserved instances, use auto scaling, nuke unused resources. | ASGs, spot instances, reserved instances
Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack
Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest
Page 49: Key takeaway: use a checklist to build
Page 50: Full checklist: gruntwork.io/devops-checklist/
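One way to make the checklist actionable is to track it as data and report what is still open. The sketch below is an illustration only: the `ChecklistItem` type, the `Missing` helper, and the done/not-done values are assumptions made for this example; only the task names come from the checklist above.

```go
package main

import "fmt"

// ChecklistItem is a hypothetical record for one row of the checklist.
type ChecklistItem struct {
	Task string
	Done bool
}

// Missing returns the tasks not yet marked done, preserving checklist order.
func Missing(items []ChecklistItem) []string {
	var missing []string
	for _, item := range items {
		if !item.Done {
			missing = append(missing, item.Task)
		}
	}
	return missing
}

func main() {
	// Task names are taken from the checklist; the Done flags are invented.
	items := []ChecklistItem{
		{Task: "Install", Done: true},
		{Task: "Configure", Done: true},
		{Task: "Provision", Done: true},
		{Task: "Deploy", Done: true},
		{Task: "Security", Done: false},
		{Task: "Monitoring", Done: false},
		{Task: "Logs", Done: true},
		{Task: "Tests", Done: false},
	}
	for _, task := range Missing(items) {
		fmt.Println("TODO:", task)
	}
}
```

Keeping the list in code (or in a README next to the module) makes the remaining work visible before a service is called production-ready.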
Page 52: What tools do you use to
Page 54: Here’s the toolset we’ve found
Page 55: [Diagram: five servers; networking, load balancers, databases, users, permissions, etc.] 1. Deploy all the basic infrastructure
Page 56: [Diagram: the same servers and infrastructure, with five VMs]
Page 57: [Diagram: the same servers, infrastructure, and VMs] 3. Some of the VMs form a cluster
Page 58: [Diagram: five servers; networking, load balancers, databases, users, permissions, etc.]
Page 59: [Diagram: five servers; networking, load balancers, databases, users, permissions, etc.]
Page 64: New way: make changes
Page 66: More time than making a
Page 67: If you make changes manually,
Page 68: And the next person to try to
Page 69: So then they’ll fall back and
Page 70: But making manual changes
Page 72: Key takeaway: tools are not enough
Page 74: It’s tempting to define all of your
[Diagram: environments dev, qa, test, stage, prod]
Page 75: Downsides: runs slower; harder to understand; harder to review (plan output unreadable); harder to test; harder to reuse code; need admin
[Diagram: environments dev, qa, test, stage, prod]
Page 76: Also, a mistake anywhere could break
[Diagram: environments dev, qa, test, stage, prod]
Page 77: [Diagram: environments qa, test, stage, prod]
Page 78: What you really want is
Page 79: [Diagram: MySQL, VPC, Frontend]
Page 81: And break it up into small, reusable,
[Diagram: a set of modules]
Page 82:
└ dev
└ stage
└ prod
Page 83:
└ dev
  └ vpc
  └ mysql
  └ frontend
└ stage
  └ vpc
  └ mysql
  └ frontend
└ prod
  └ vpc
  └ mysql
  └ frontend
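The layout above amounts to one folder per environment and component pair, so each pair can be deployed and changed on its own. A minimal sketch of that idea, assuming the environment and component names shown on the slide; the `LayoutPaths` helper is hypothetical and only enumerates the folders:

```go
package main

import (
	"fmt"
	"path"
)

// LayoutPaths lists one folder per (environment, component) pair,
// grouped by environment in the order given.
func LayoutPaths(envs, components []string) []string {
	paths := make([]string, 0, len(envs)*len(components))
	for _, env := range envs {
		for _, component := range components {
			paths = append(paths, path.Join(env, component))
		}
	}
	return paths
}

func main() {
	envs := []string{"dev", "stage", "prod"}
	components := []string{"vpc", "mysql", "frontend"}
	for _, p := range LayoutPaths(envs, components) {
		fmt.Println(p)
	}
}
```

Three environments and three components give nine small, separately managed units rather than one large definition.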
Page 85:
gruntwork-io
└ asg
└ alb
└ ssh
Page 88: /modules: implementation code, broken
Page 89: install-xxx: sub-module to install the
Page 90: run-xxx: sub-module to launch the
Page 91: xxx-cluster: sub-module to deploy
Page 92: xxx-yyy: sub-modules with shareable
Page 93: Each sub-module exposes variables for
Page 94: Small, configurable sub-modules
Page 95: As you can combine and compose
Page 96: /examples: Runnable example code for
Page 100: Typically, our tests deploy & validate
Page 101: Key takeaway: build infrastructure
Page 103: Infrastructure code rots very
Page 105: Infrastructure code without
Page 106: For general-purpose languages, we
Page 107: For infrastructure as code tools,
Page 109: We write these integration tests in
Page 110: Terratest philosophy: how
Page 113:
terraformOptions := &terraform.Options{
    TerraformDir: " /examples/vault-with-elb",
}
// Deferred, so the infrastructure is destroyed when the test ends
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Helper that checks the deployed infrastructure
validateServerIsWorking(t, terraformOptions)
Run terraform init and terraform apply to deploy
Page 114:
terraformOptions := &terraform.Options{
    TerraformDir: " /examples/vault-with-elb",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validateServerIsWorking(t, terraformOptions)
Validate the infrastructure
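The ordering in the snippet above matters: `defer` registers the destroy step before anything is deployed, so clean-up still runs when validation fails. The following is a dependency-free sketch of that pattern, not Terratest itself; `runLifecycle`, `errValidation`, and the step names are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errValidation is a hypothetical error used to simulate a failed check.
var errValidation = errors.New("server did not respond as expected")

// runLifecycle records the order of steps in a deploy/validate/destroy cycle.
// The deferred function appends "destroy" on every exit path.
func runLifecycle(validate func() error) (steps []string, err error) {
	defer func() { steps = append(steps, "destroy") }()

	steps = append(steps, "init+apply")
	if err = validate(); err != nil {
		return steps, err
	}
	steps = append(steps, "validate ok")
	return steps, nil
}

func main() {
	// Simulate a validation failure; "destroy" is still recorded last.
	steps, err := runLifecycle(func() error { return errValidation })
	fmt.Println(steps, err)
}
```

Without the deferred clean-up, a failed validation would leave the deployed resources running after the test exits.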
Page 115:
// Get IPs of servers
aws.GetPublicIpsOfEc2Instances(t, ids, region)
// Make HTTP requests in a retry loop
http.GetWithRetry(t, url, 200, expected, retries, sleep)
// Run command over SSH
Terratest has many tools built-in for validation
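A retry loop is useful here because freshly deployed servers typically need some time before they start answering requests. As a rough sketch of the idea (not Terratest's implementation), the hypothetical `retry` helper below re-runs a check with a fixed sleep between attempts. In a real test the check would be the HTTP request; here it is simulated by `flakyCheck` so the example runs without a network.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// demoSleep is an arbitrary short pause chosen for this example.
const demoSleep = 5 * time.Millisecond

// retry runs check up to maxAttempts times, sleeping between attempts,
// and returns nil as soon as one attempt succeeds.
func retry(maxAttempts int, sleep time.Duration, check func() error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = check(); lastErr == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(sleep)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// flakyCheck simulates a server that fails a given number of times
// before it starts responding.
func flakyCheck(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("server not ready")
		}
		return nil
	}
}

func main() {
	// Two simulated failures, then success on the third attempt.
	err := retry(5, demoSleep, flakyCheck(2))
	fmt.Println("result:", err)
}
```

The attempt count and sleep interval are trade-offs: too few attempts makes tests flaky, too many makes real failures slow to surface.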
Page 117: Note: tests create and destroy
Page 118: Pro tip #1: run tests in completely
Page 119: Pro tip #2: clean up left-over
Page 120: [Diagram: test pyramid with unit tests at the base, integration tests in the middle, and e2e tests at the top]
Page 121: As you go up the pyramid, tests get
Page 122: How the test pyramid works
Page 123: Unit tests for infrastructure code: test
Page 124: Integration tests for infrastructure code:
Page 125: [Diagram: test pyramid]
Page 126: Note the test times! This is another
Page 127: Make sure to check out Terratest best
Page 128: Key takeaway: infrastructure code
Page 130: Let’s put it all together:
Page 131:
Task | Description | Example tools
Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, EBS Volumes, Cognito, Vault, CIS
Monitoring | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, alerting. | CloudWatch, DataDog, New Relic, Honeycomb
Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail
Page 132: 2. Write some code
Page 136: 6. Promote that versioned code from
Page 138: Before…
Page 139: info@gruntwork.io