DevOps'ish is assembled by open source advocate, DevOps leader, Cloud Native Computing Foundation (CNCF) Ambassador, Kubernetes and KubeWeekly contributor, Chris Short
|
Another week, another bout of bad weather. Systems here in our home have gotten a bit more robust since our multi-day total blackout. I took a meeting this week in a house with no power. The meeting was short, but it demonstrated that if everything goes to hell in a handbasket, my systems are redundant enough to let me pass whatever batons need passing. But lately, it’s felt like a lot. You can feel the cost of communication when a cacophony of UPSes suddenly fills the house. Luckily, power was restored before we went to bed that night. What came later was something of a surprise: in 36 hours, Michigan received almost a quarter of its annual total of lightning strikes (a lot of them cloud to ground). While this didn’t seem to affect the services we consume, I can only imagine the hell it created for fire responders of all stripes. One of the worst incidents I was ever part of was a lightning strike that hit a datacenter’s generator transfer switch. It kicked off a chaotic series of events that took HVAC systems offline. The storm that night was hellacious, too. A datacenter can generate enough heat to make network switches act up, and that is a miserable series of events. Multiple systems failed or malfunctioned in unplanned or unanticipated ways. The fact that we weren’t up and running once temperatures started to cool down unlocked a new mystery, one that ultimately led us to restart our core switches because the heat had thrown the ASICs out of whack.
But there was never a single root cause. You could say the lightning strike was the root cause, but that hit systems outside the datacenter, related to power. Our systems went down because core switching had overheated. Cooling units inside the datacenter reset but didn’t actually start using refrigerant until they were reset again in a particular order (the cooling system was never supposed to respond the way it did). There’s never a single root cause for a large-scale outage (John Allspaw argues the point further below). Large-scale systems (and some not-so-large ones) are too complex for us to understand completely. We make assumptions past a certain threshold of knowledge. We have to start being more diligent about our assumptions and developing a better understanding of how our systems perform. We need to apply those lessons to reduce wasteful spending, protect our systems, and improve the quality of our services.
|
Process

xkcd: Every Data Table
* 2020 † 2021 🤣🤣🤣

Management platform for Infrastructure as Code Automation and Collaboration
See how env0 automates and simplifies the provisioning of cloud deployments for Terraform, Terragrunt, and GitOps workflows. Variables and secrets granularity, full CLI support, integration with OPA, dynamic RBAC, and quality-of-life features. Free Demo SPONSORED

Root cause of failure, root cause of success
“That’s the point of the thought exercise. 🙂 Finding a single ‘root cause’ of a failure is the same as finding a single ‘root cause’ of a success — subject to all pitfalls in doing so. 🙂 — John Allspaw”

How to audit and secure an AWS account
“But where do you start when it comes to securing your AWS account?” Here. You can start here.

Firewalls and middleboxes can be weaponized for gigantic DDoS attacks
“Academics discover novel DDoS attack vector abusing the TCP protocol… The new DDoS technique can be used to launch attacks with amplification factors in the realm of 1000x and more.” Well, this is bad.

Service Reliability Math that Every Engineer Should Know
“For a service to be up 99.99999% of the time, it can only be down at most 3 seconds every year. Unfortunately, achieving that milestone is a herculean task, even for the most experienced site reliability engineering teams.”

Facebook, Google, Isovalent, Microsoft and Netflix Launch eBPF Foundation as Part of the Linux Foundation
Lightning strikes.

Notes on the Perfidy of Dashboards
Some very salient points about dashboards. Dashboards need to be dynamic, provide context, and show useful, actionable information. “We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.”
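The reliability math behind that "seven nines" claim is easy to sanity-check yourself. Here's a minimal sketch (the helper name is mine, not from the article) that converts an availability target into the downtime budget it leaves you per year:

```python
# Downtime budget implied by an availability SLO.
# max_downtime_seconds is a hypothetical helper, not from the linked article.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 (ignoring leap years)

def max_downtime_seconds(availability_pct: float) -> float:
    """Seconds of downtime per year allowed at a given availability %."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999, 99.99999):
    print(f"{pct}% -> {max_downtime_seconds(pct):,.2f} seconds/year")
# 99.99999% works out to about 3.15 seconds a year, matching the quote.
```

Seeing that 99.9% still allows almost nine hours a year, while each extra nine divides the budget by ten, makes it obvious why every additional nine gets exponentially harder to deliver.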
|
Tools

Go, Rust “net” library affected by critical IP address validation vulnerability
“[A]pplications relying on net could be vulnerable to indeterminate Server-Side Request Forgery (SSRF) and Remote File Inclusion (RFI) vulnerabilities.”

Manage incidents directly from Slack 🧑‍🚒
Rootly helps automate the tedious manual work like creating incident channels, searching for runbooks, documenting the postmortem timeline, and more. Teams sized 20 to 2,000 manage hundreds of incidents daily and save thousands of engineering hours a year with Rootly. Get started in <5 min or book a demo to learn more and get Starbucks ☕ on us! SPONSORED

Security tools showcased at Black Hat USA 2021
This list is amazing. It includes Kubestriker, Kubesploit, and many other great tools showcased at Black Hat this year.

This is why Valve is switching from Debian to Arch for Steam Deck’s Linux OS
It always fascinates me when a project like Steam Deck chooses a Linux distro. Choosing Arch allows Steam to make more iterative updates quickly, including kernel changes, which would be a pain to do on Debian without some kind of forking. The thing that gets me is that I fully intend to own a Steam Deck, and I plan to use it as a desktop computer as well. I wonder how consumers will use this device and how Steam will handle supporting it.

A RedMonk Conversation: Arm64 for the Best Price/Performance on AWS: Why You Should Take The Graviton Challenge
Everyone knows I’m a big ARM fan. ARM dominating the mobile phone market is just the beginning. Cheaper but performant compute that allows savings to get passed on to the buyer is hard to pass up. If you’re smart, you’ll commit to being able to run new services on ARM very soon. Then figuring out how to port the rest of your codebase over the next two years should be a top priority. You’ll be light years ahead when some amazing chips start shipping. Note: This has to go better than your switch to IPv6.
A Container Security Checklist
“Published by O’Reilly, Liz Rice’s Container Security book provides a security checklist covering the need-to-know when deciding how to protect deployments running on containers. Liz gave us an outline of the checklist in her GOTOpia Europe 2020 presentation and took a deep dive into the specifics of certain likely vulnerabilities that you need to prevent.”

GNU nano is my editor of choice
And there is absolutely nothing wrong with loving nano. I’m using VS Code and the now built-in vim bindings. There is a nano keybindings plugin, too.

Bobbycar - A demonstration of Red Hat Open Hybrid Cloud, the platform for your IoT solutions
This is a really cool demo.

kubernetes-csi/csi-lib-utils
Common code for Kubernetes CSI sidecar containers (e.g., external-attacher, external-provisioner, etc.)

bytecodealliance/lucet
Lucet, the Sandboxing WebAssembly Compiler.

andy-5/wslgit
Use Git installed in Bash on Windows/Windows Subsystem for Linux (WSL) from Windows and Visual Studio Code (VSCode)

Call-for-Code-for-Racial-Justice/Five-Fifths-Voter
“Five Fifths Voter is a web application tool designed to enable and empower Black people and others to exercise their right to vote by ensuring their voice is heard”
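The “net” library item above comes down to parser disagreement: BSD-style parsers treat an IPv4 octet with a leading zero as octal, while the affected libraries read it as decimal, so the same string can validate against an allowlist as one host and connect to another. A minimal Python sketch of the ambiguity (the helper functions here are illustrative, not the actual Go/Rust code):

```python
# Sketch of the leading-zero ambiguity behind the Go/Rust "net" advisory.
# parse_octet/parse_ipv4 are hypothetical helpers for illustration only.
def parse_octet(part: str, octal_legacy: bool) -> int:
    # BSD inet_aton semantics: a leading zero means base 8.
    # The vulnerable behavior: always parse as base 10.
    if octal_legacy and part.startswith("0") and len(part) > 1:
        return int(part, 8)
    return int(part, 10)

def parse_ipv4(addr: str, octal_legacy: bool) -> str:
    return ".".join(str(parse_octet(p, octal_legacy)) for p in addr.split("."))

# One input string, two different hosts depending on the parser:
print(parse_ipv4("0177.0.0.1", octal_legacy=True))   # 127.0.0.1 (loopback!)
print(parse_ipv4("0177.0.0.1", octal_legacy=False))  # 177.0.0.1
```

If an allowlist check and the socket layer disagree on which interpretation to use, an attacker-supplied address like `0177.0.0.1` can slip past a “not loopback” check and still reach 127.0.0.1, which is the SSRF scenario the advisory describes. Rejecting leading zeros outright is the safe fix.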
|
DevOps’ish Tweet of the Week
|
Join in on the conversation on /r/devopsish for a stream of news, content, and commentary throughout the week. Want more? Be sure to check out the notes from this week’s issue to see what didn’t make it to the newsletter. Have feedback? Hit Reply.
|