Defcon: Preventing Overload with Graceful Feature Degradation

- Outline
  - What is the research and why does it matter?
  - How does it work?
  - How is the research evaluated?
  - Conclusion
- Overview
  - [[The Five C's]]
    - [ ] **Category**: What type of paper is this? A measurement paper? An analysis of an existing system? A description of a research prototype?
    - [ ] **Context**: Which other papers is it related to? Which theoretical bases were used to analyze the problem?
    - [ ] **Correctness**: Do the assumptions appear to be valid?
    - [ ] **Contributions**: What are the paper’s main contributions?
    - [ ] **Clarity**: Is the paper well written?
- Notes
  - Introduction
    - interesting that they focus on availability, not latency (when latency is arguably similarly important)
    - Latency matters!
      - [https://www.thinkwithgoogle.com/future-of-marketing/digital-transformation/the-google-gospel-of-speed-urs-hoelzle/](https://www.thinkwithgoogle.com/future-of-marketing/digital-transformation/the-google-gospel-of-speed-urs-hoelzle/)
    - TODO real surge of traffic, figure 1
    - Fail-slow
      - [https://www.micahlerner.com/2023/04/16/perseus-a-fail-slow-detection-framework-for-cloud-storage-systems.html](https://www.micahlerner.com/2023/04/16/perseus-a-fail-slow-detection-framework-for-cloud-storage-systems.html)
    - Metastable failures
      - [https://www.micahlerner.com/2022/07/11/metastable-failures-in-the-wild.html](https://www.micahlerner.com/2022/07/11/metastable-failures-in-the-wild.html)
  - Background
    - Datacenter capacity management
      - There is a clear connection to Flux here
      - Standard resources called RRUs
    - The overload problem
      - They have a table of what to do when there is overload
        - Table 1
      - They reference metastable failures explicitly!
      - Challenges can occur, for example COVID made projections for capacity growth totally wrong
        - TODO figure 2
      - Reasoning: degradation reduces user impact and may even bring some benefits, but it costs engineering effort
    - Related work
      - They talk about a variety of related work, but probably the most closely related is the "availability knob"
    - Graceful feature degradation
  - Defcon
    - Overview
      - TODO figure 3
    - Knob Definition Framework
      - Framework that runs without restart, reading the state of the flag
        - [https://12factor.net/config](https://12factor.net/config)
      - Two types of knobs
        - Server-side: adjust knobs in seconds, without propagation delays
        - Client-side: actually implemented on device
          - Push change to client
          - Mobile Configuration Pull (client pulls state from the server on a recurring basis); in an emergency there is a backup
      - Knob definition
        - TODO listing 1
        - Represents things about the knob in code (for example oncallers, namespacing, whether the knob is enabled)
      - Usage of the knob
        - If statements based on whether the knob is on or not
        - TODO listing 2
        - This also integrates with other common Meta systems for launching features
    - Knob Actuator Service
      - Service for storing knobs (maybe an index on the knobs)
      - MySQL database
      - "Knob metadata includes: (1) The engineering oncall responsible for the knob’s definition, (2) the engineering team responsible for the knob’s usage, and (3) a cache of recent resource and user experience test results (discussed later in this section)."
      - Oncallers can also make changes that the actuator service carries out
    - Knob Testing Framework
      - Gather data about what manipulating knobs does
      - Two types of tests
        - Small scale tests
          - different groups have the knobs turned off in different permutations (different defcon levels)
        - Quarterly 100% of user tests
          - Individual and across Meta tests
    - Degradation Policy
      - Different levels of degradation
      - L1 to L4 (L1 is the most serious)
      - Emergency responders can decide which policy to apply
  - Evaluation
    - Measurement methodology
      - Different datasources
        - Realtime monitoring
          - apparently what the hardware thinks is happening
        - Resource Utilization
          - they have this dataset based on loadtests (seems similar to Flux)
        - Transitive Resource Utilization
          - they used distributed tracing (which again seems similar to Flux)
        - User Behavior Measurement
          - they look at how users behave during a test
      - They have a forecast and backcast to improve the model over time / capture error
        - [https://facebook.github.io/prophet/](https://facebook.github.io/prophet/)
      - TODO figure 5
      - TODO figure 7
      - They can calculate what the savings are
    - Individual product tests
      - For different product areas, higher levels of degradation save more resources
      - TODO figure 8
      - Summary: small impact to user interactions, but there are big wins in not having the site overloaded
      - TODO table 2
    - Combined product tests
      - there are shared services whose performance they care about (they reference memcache)
        - Scaling memcache at FB paper
      - TODO figure 9
      - they also have user reports that correspond to using the knobs
        - figure 10
    - Transitive resource savings
      - Downstream services with no knobs can still benefit
        - For example, TAO doesn't get overloaded as well
      - TODO figure 12
      - TODO figure 13
      - Get breakdowns of why a downstream service is receiving traffic
        - Could some of this be done with the tracing framework?
      - There are also interdependencies between products (for example, Instagram -> Facebook)
    - Outage simulation testing
      - The pattern of trying to force failure is common among SRE teams
      - They test by redirecting traffic, forcing overload, and then using knobs. They claim this type of testing is general
      - TODO figure 17-20
    - Real world large scale outage
      - Incident manager will decide what degradation policy to apply
      - In the incident, they used L3 knobs instead of L2
  - Lessons Learned
    - Focus on user impact to define knobs
    - Graceful degradation doesn't come for free; you need to train people on how to use it
    - There needs to be buy-in from product teams
    - Knobs, once built, need regular maintenance
    - Independent system prevents failures
    - Developer experience is key
  - Conclusion
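
Sketch (mine, not from the paper) of how the knob definition, knob usage, and degradation policy notes above might fit together. Everything here is hypothetical: the names `Knob`, `KnobStore`, `apply_policy`, and the knob names are made up for illustration, the store is an in-memory stand-in rather than the real actuator service, and I am assuming that applying a level also degrades every less serious level (L1 is the most serious, so applying L1 turns off everything).

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Knob:
    """Hypothetical knob definition: namespaced name, owning oncall, level."""
    name: str    # e.g. "feed/autoplay_video" (made-up name)
    oncall: str  # oncall responsible for the knob's definition
    level: int   # 1..4, where 1 is the most serious degradation level


class KnobStore:
    """In-memory stand-in for knob state; state is read at call time, no restart."""

    def __init__(self) -> None:
        self._knobs: Dict[str, Knob] = {}
        self._degraded: Set[str] = set()

    def register(self, knob: Knob) -> None:
        self._knobs[knob.name] = knob

    def apply_policy(self, level: int) -> None:
        # Assumption: levels are cumulative, so applying level N degrades
        # every knob tagged N or any less serious (numerically higher) level.
        self._degraded = {k.name for k in self._knobs.values() if k.level >= level}

    def clear(self) -> None:
        self._degraded = set()

    def is_degraded(self, name: str) -> bool:
        return name in self._degraded


def render_feed(store: KnobStore) -> List[str]:
    """Knob usage: plain if statements around optional features."""
    parts = ["posts"]
    if not store.is_degraded("feed/autoplay_video"):
        parts.append("autoplay_video")
    if not store.is_degraded("feed/comments_preview"):
        parts.append("comments_preview")
    return parts


store = KnobStore()
store.register(Knob("feed/autoplay_video", oncall="feed_oncall", level=4))
store.register(Knob("feed/comments_preview", oncall="feed_oncall", level=2))

store.apply_policy(3)  # degrades L3 and L4 knobs only
print(render_feed(store))
```

The core feature ("posts") has no knob, so it is served at every level; only the optional features sit behind a check.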