Pacific Rim: Learning to eat soup with a knife
Credit to Author: Ross McKerchar | Date: Thu, 31 Oct 2024 12:36:53 +0000
Embedded architecture devices such as network appliances haven’t historically been top-of-the-backlog when it comes to security features, and during Pacific Rim they became the subject of an escalating arms race – one that blue teamers, and not just those at Sophos, must get a handle on.
The good news is that many of our existing principles transfer extremely well: more recent network appliance technology is based on well-understood operating systems such as Linux variants. The bad news is that some of those principles may need tweaking. While technology has progressed, there is still a high proportion of devices in the field running arcane, security-unaware embedded architectures – sitting on racks collecting dust.
Of course Sophos, as an information-security company, has a dual view of security and response; we respond not only to incidents that affect us as a company, but to incidents that affect our products and services – the “us” that is sent into the wider world. Our incident response processes, therefore, extend beyond our own corporate environment to the very infrastructure we deploy for our customers. It’s a particular kind of double vision, which – we hope – gives us a leg up on thinking about how to evolve incident-response principles to meet current needs.
Actually making the dual-view system work, though, requires close cooperation between the groups that develop our products and the group tasked with responding to security issues concerning them, our Product Security Incident Response Team (PSIRT). Since not all enterprises have (or have need of) a PSIRT, before we dig into our findings, it’s good to explain how our PSIRT operates.
Life in the Sophos PSIRT
Our PSIRT monitors several channels for information about new findings in Sophos products and services. For example, as we mentioned in a recent article that provided transparency into Sophos Intercept X (a follow-up explored our content update architecture), we’ve participated in an external bug bounty program since December 14, 2017 – as it turned out, just short of a year before the first ripples of what became Pacific Rim – and welcome the scrutiny and collaborative opportunities that this brings. Our responsible disclosure policy also offers ‘safe harbor’ for security researchers who disclose findings in good faith. In addition to handling external reports, we conduct our own internal testing and open-source monitoring.
When PSIRT gets an incoming security event, the team triages it – confirming, measuring, communicating, and tracking to ensure our response is proportionate, safe, and adequate. If necessary, we escalate issues to our Global Security Operations Centre (GSOC), which is follow-the-sun with over a dozen outposts coordinating on cases 24/7.
Our PSIRT drives remediation, working with our product SMEs to offer technical security guidance, and moving towards resolution alongside response standards – enabling our customers to effectively manage relevant risks in a timely manner. We aim to clearly communicate outcomes in actionable security advisories and comprehensive CVEs – including CVSS scores, and Common Weakness Enumeration (CWE) and Common Attack Pattern Enumeration and Classification (CAPEC) information.
In addition to being just generally best PSIRT practice, this all factors into our commitment to CISA’s Secure by Design initiative. In fact, Sophos was one of the first organizations to commit to the initiative’s pledge, and you can see details of our specific pledges here. (An essay from our CEO, Joe Levy, dives deeply into our commitment to Secure by Design and how, with everything we learned from Pacific Rim, we mean to carry that commitment forward.)
Of course, a good PSIRT doesn’t just wait for reports to come to it. In the background, as well as performing its own testing and research, the team also works to mature our product security standards, frameworks, and guidelines; perform root cause analyses; and continuously improve our processes based on feedback from both internal and external stakeholders.
All these tasks inform what we’ll discuss in the rest of this article, as we break down what we learned from iterating and improving our processes over the life of Pacific Rim. We’ll talk about principles – many of which we have implemented or are in the process of implementing ourselves – as a starting point for a longer conversation among practitioners about what effective and scalable response looks like when it comes to network appliances.
What we learned
Telemetry
It all starts with being able to capture state and changes on the device itself. Network appliances are often overlooked as devices in their own right, as their usual role is as “invisible” carriers of network traffic. However, treating them as devices in their own right is an important step toward providing observability on the device, which is essential for response.
Key challenges:
- Network plane vs control plane. We don’t want to monitor your network (the network plane). Not in the least. We do, however, want to monitor the device that manages your network (the control plane). The line between the two is often logical rather than material, but drawing it has become important to ensuring we can preserve customer privacy.
- On-device resource availability. These appliances are still small devices, with limited RAM and CPU resource availability. Telemetry capture functions must be streamlined to avoid unnecessary service degradation for the device’s primary function. (That said, resource capacity has improved in recent years – which, unfortunately, means it is easier for attackers to hide in the noise. Admins are less likely to accidentally wipe an attacker off a device with an inadvertently judicious hard reboot when they notice that the firewall is running slowly for the whole network, because the modern firewall can tolerate bloatware and thus doesn’t exhibit the same distress.)
- Noisy data capture. Network appliances are built differently. While a /tmp folder may be reasonably quiet on a user endpoint – and worthy of active monitoring – it can be considerably noisier on a network appliance. Tuning is important to make sure the telemetry isn’t flooded with noise.
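The three challenges above can be sketched as a single on-device filter. This is a minimal illustration, not a description of how any Sophos product works: the `Event` shape, the source names in `CONTROL_PLANE_SOURCES`, and the per-prefix budget are all hypothetical. It drops anything that is not a control-plane event, and it caps how many events a known-noisy location such as /tmp may emit before the budget is reset.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical source names; a real appliance will label its events differently.
CONTROL_PLANE_SOURCES = {"process", "file", "auth", "config"}


@dataclass(frozen=True)
class Event:
    source: str  # e.g. "process", "file", "flow"
    path: str    # file or binary path; empty when the event has none
    detail: str


class TelemetryFilter:
    """Keep control-plane events and cap the volume from noisy path prefixes."""

    def __init__(self, noisy_prefixes, per_prefix_budget):
        self.noisy_prefixes = tuple(noisy_prefixes)
        self.per_prefix_budget = per_prefix_budget
        self.seen = Counter()

    def accept(self, event: Event) -> bool:
        # Anything outside the control plane (e.g. customer traffic) is never captured.
        if event.source not in CONTROL_PLANE_SOURCES:
            return False
        # Events under a noisy prefix are kept only while the budget lasts.
        for prefix in self.noisy_prefixes:
            if event.path.startswith(prefix):
                self.seen[prefix] += 1
                return self.seen[prefix] <= self.per_prefix_budget
        return True

    def reset(self) -> None:
        """Start a new budget window; the caller decides how often to do this."""
        self.seen.clear()


telemetry_filter = TelemetryFilter(noisy_prefixes=["/tmp/"], per_prefix_budget=2)
events = [
    Event("flow", "", "customer traffic"),
    Event("process", "/usr/bin/sshd", "started"),
    Event("file", "/tmp/a", "write"),
    Event("file", "/tmp/b", "write"),
    Event("file", "/tmp/c", "write"),
]
kept = [e for e in events if telemetry_filter.accept(e)]
```

A simple counter keeps the filter cheap in RAM and CPU, which matters on constrained hardware. The trade-off is that events beyond the budget are lost, so the suppressed location becomes a blind spot that needs some other form of coverage.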
Streaming
Whether the detection occurs on the device or in a back-end data lake (more on that below), there will inevitably be a point at which the acquired telemetry should be sent off the device. While many of these principles are well-documented for the security monitoring field, there are some unique challenges for network appliances.
Key challenges:
- Host interference / NIC setup. Network appliances are already touchy when it comes to network interface management and how the host itself affects the traffic it carries. Adding in an extra data stream output often takes a fair bit of re-architecting. Good technology selections that cause minimal interference are vital to ensure a firebreak between response and device operation. osquery stands out as a great example of a technology that can support near-real-time querying while reducing the risk of resource impact.
- Collection vs. selection. Collection of the entirety of a user’s network traffic is both a massive privacy concern and an extremely inefficient form of detection engineering. “Selecting” the most relevant data using rulesets (that can be created, edited, tested, and deployed) is a standard practice for high-volume collection, but requires well-documented (and audited) selection criteria to make it work. This distinction also allows for judicious application of retention policies – longer for selected data and shorter for collection.
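The collection-versus-selection split can be sketched as a small ruleset in which every rule carries its own documented criterion, so the selection logic can be reviewed and audited. Everything here is an assumption made for illustration: the rule IDs, the event fields, and the retention periods are hypothetical and are not taken from any Sophos system.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed retention periods, in days: longer for selected data, shorter for the rest.
RETENTION_DAYS_SELECTED = 90
RETENTION_DAYS_COLLECTED = 7


@dataclass(frozen=True)
class SelectionRule:
    rule_id: str
    description: str  # the documented criterion that an auditor would review
    matches: Callable[[dict], bool]


# Hypothetical rules over hypothetical event fields ("type", "path").
RULES = [
    SelectionRule(
        "R001",
        "Process launched from a temporary directory",
        lambda e: e.get("type") == "process" and e.get("path", "").startswith("/tmp/"),
    ),
    SelectionRule(
        "R002",
        "Change to device configuration",
        lambda e: e.get("type") == "config_change",
    ),
]


def classify(event: dict, rules=RULES) -> dict:
    """Tag an event with the first rule that selects it and the retention that follows."""
    for rule in rules:
        if rule.matches(event):
            return {
                **event,
                "selected_by": rule.rule_id,
                "retention_days": RETENTION_DAYS_SELECTED,
            }
    # Not selected: kept only briefly as part of general collection.
    return {**event, "selected_by": None, "retention_days": RETENTION_DAYS_COLLECTED}


selected = classify({"type": "process", "path": "/tmp/x"})
unselected = classify({"type": "heartbeat"})
```

Recording the rule ID on each selected event makes it possible to answer, after the fact, why a given record was retained for longer.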
Triggers, tripwires, and detections
The next stage is discerning signal from noise. As cybersecurity specialists, we are often taught to look for the absence of the normal and the presence of the abnormal – but the definition of both varies widely in network appliances.
Key challenges:
- Telemetry choices + streaming choices = blind spots. Knowingly selecting a subset of collection, while necessary, creates gaps that need to be constantly re-assessed on the fly. Excluding /tmp from collection may be the right move to reduce noise, but leaves it as a perfect staging ground for malware. Practitioners must find ways to monitor these blind spots with lower-granularity “tripwires” such as file integrity monitoring.
- Writing detections over selected data. The subset of selected data is a good start, but it is likely to still contain too much noise to process directly. We found that this is the point at which to apply detection engineering practices to the selected data – ideally in a normalized schema alongside other security telemetry, to promote pivoting.
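A file integrity tripwire of the kind mentioned above can be sketched with nothing more than periodic hashing. This is a generic illustration under stated assumptions, not the mechanism Sophos uses: the function names are hypothetical, and a production version would need to bound file sizes and scan frequency to respect the device's limited resources.

```python
import hashlib
import os
import stat


def snapshot(root: str) -> dict:
    """Map each regular file under root to the SHA-256 digest of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # Skip symlinks, FIFOs, sockets, and devices; opening a FIFO could block.
                if not stat.S_ISREG(os.lstat(full).st_mode):
                    continue
                with open(full, "rb") as fh:
                    digests[full] = hashlib.sha256(fh.read()).hexdigest()
            except OSError:
                # The file vanished or is unreadable; ignore it for this snapshot.
                continue
    return digests


def diff(before: dict, after: dict) -> dict:
    """Report which paths were added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            path for path in before.keys() & after.keys() if before[path] != after[path]
        ),
    }
```

Calling `snapshot("/tmp")` on a schedule and passing consecutive results to `diff` yields a coarse signal (something changed, and where) without streaming every file event off the device.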
Response actions
We’re talking about core network infrastructure, which does not respond well to aggressive tactics. While on a user endpoint we may think nothing of terminating a suspected rogue process or isolating a device from a network, doing either on a network appliance could have a catastrophic impact on the availability of a user network. In our experience, firm guardrails at this stage, which set expectations and stop response activity from making the incident worse, were tremendously helpful.
Key challenges:
- Network availability impacts. “Turning it off and on again” hits different when we’re talking about an entire organization’s internet access. Implementing any response actions – scalable/automated or otherwise – must be treated as a potentially highly impactful business change, and must follow a change management process.
- Network vs control plane (again). It matters at the point of data collection, and it matters during remediation too. Knowing where jurisdiction ends between the responder and the user of the network is vital to ensure a limit of exploitation for response actions, and a limit of exposure for any adverse impact.
- Commercial and legal limitations. At this point, the conversation begins to expand beyond technical response practitioners to members of the extended response team – particularly Legal and the executive suite. Among the questions to raise with those stakeholders: Who owns the risk if a response action disables a network? Who owns the risk if that action isn’t taken, leaving the network vulnerable?
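The technical side of these guardrails can be sketched as a pre-flight check that every response action must pass before it runs. The sketch below is an assumption-laden illustration only: the `ResponseAction` fields, the two impact levels, and the `change_ticket` reference are hypothetical, and the commercial and legal questions above cannot be settled in code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Impact(Enum):
    LOW = 1   # e.g. collecting additional logs
    HIGH = 2  # e.g. restarting a service or rebooting the device


@dataclass(frozen=True)
class ResponseAction:
    name: str
    plane: str                           # "control" or "network"
    impact: Impact
    change_ticket: Optional[str] = None  # hypothetical change-management reference


def authorize(action: ResponseAction) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed response action."""
    # Responder jurisdiction ends at the control plane.
    if action.plane != "control":
        return False, "outside responder jurisdiction (network plane)"
    # Anything that could affect availability goes through change management first.
    if action.impact is Impact.HIGH and not action.change_ticket:
        return False, "high-impact action requires an approved change"
    return True, "ok"


decisions = {
    "collect_logs": authorize(ResponseAction("collect_logs", "control", Impact.LOW)),
    "reboot": authorize(ResponseAction("reboot", "control", Impact.HIGH)),
    "reboot_approved": authorize(
        ResponseAction("reboot", "control", Impact.HIGH, change_ticket="CHG-0001")
    ),
    "drop_flow": authorize(ResponseAction("drop_flow", "network", Impact.LOW)),
}
```

Returning a reason alongside the decision gives the extended response team a record of why an action was blocked or allowed, which is useful when questions of risk ownership come up later.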
Conclusion
Necessity is the mother of invention, and it is fair to say that Pacific Rim has shown us that there is more to do in the field of incident response for network appliances. The application of these basic principles has allowed us to protect our customers to a level that we never thought possible, but it has also identified some important limitations that practitioners need to address – some in their own organizations, some in-house at each vendor, some industry-wide. Topics such as network availability, data privacy, and limits of liability, when it comes to response actions, require not only technical but commercial and legal frameworks. Difficult as these topics may be to discuss, let alone implement, it is a conversation we must entertain in multiple venues if we are to keep up with the evolution of these threats.
Sophos X-Ops is happy to collaborate with others and share additional detailed IOCs on a case-by-case basis. Contact us via pacific_rim[@]sophos.com.
For the full story, please see our landing page: Sophos Pacific Rim: Counter-Offensive Against Chinese Cyber Threats.