Tuesday, February 28, 2017

Why We Can't Have Nice Things: The Great SSH Security Debate

So recently I found myself in another SSH security debate on Twitter. This is a subject long debated in the security community and one where I feel like we go in circles at times. Here are my thoughts about running SSH services at scale and where things sometimes go amiss.

On the standard vs. non-standard port debate

First, while I agree that running SSH on a non-standard port is an example of "security by obscurity," this has some tactical advantages. Before you read further, I suggest you try this experiment:

1. Spin up two Linux instances, in a public cloud running SSH; one using keys, and one using passwords.
1a. Start two SSH listeners, one using the standard port 22 and one using a random port not included in any default portscan list (e.g. a default nmap scan.) 
1b. Make the SSH ports generally accessible.
2. Instrument each instance with an intrusion detection tool capable of detecting brute force attempts. One (free) option is the OSSEC LIDS (log based intrusion detection) agent. Additional intrusion detection tools are available in the marketplaces.
3. Count the size of the message data structures, and number of associated alerts, each instance generates per day due to brute force activity on port 22 vs. the pseudorandom port.
4. Extrapolate this to a medium or large sized fleet of, say, ten or twenty thousand instances. 

In the public clouds, brute force alerts, like many alerts produced by many forms of automated reconnaissance and attempted intrusion campaigns, are supernumerary. In a medium sized fleet, you can easily generate more than one hundred thousand such alerts per day. Assuming you collect and retain security alerts and associated log data for, say, three to six months, we are now burning money (in the form of compute and storage) processing brute force alerts. We're also creating alert fatigue and distracting security analysts / hunters who could be working on something more productive like hunting structured threats. I would argue that running SSH on a non-standard port, in order to manage alert fatigue, is a useful tactic.
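To put rough numbers on the extrapolation in step 4, here is a back-of-the-envelope sketch. The per-instance alert rate and the alert size are assumptions I picked for illustration; substitute the counts you measured in step 3.

```python
# Back-of-the-envelope estimate of brute force alert volume and storage.
# All input figures are illustrative assumptions, not measurements.

def alert_volume(instances, alerts_per_instance_per_day, bytes_per_alert, retention_days):
    """Return (alerts per day, retained alerts, retained bytes) for a fleet."""
    alerts_per_day = instances * alerts_per_instance_per_day
    retained_alerts = alerts_per_day * retention_days
    retained_bytes = retained_alerts * bytes_per_alert
    return alerts_per_day, retained_alerts, retained_bytes


per_day, retained, size_bytes = alert_volume(
    instances=10_000,                # a "medium sized fleet" as described above
    alerts_per_instance_per_day=10,  # assumed; use your own step 3 count
    bytes_per_alert=2_000,           # assumed average alert plus log payload
    retention_days=180,              # six months, the upper end mentioned above
)
print(f"{per_day:,} alerts/day, {retained:,} retained, {size_bytes / 1e9:.1f} GB")
# prints: 100,000 alerts/day, 18,000,000 retained, 36.0 GB
```

Even with these modest assumptions the retained volume is mostly noise from port 22, which is the cost the non-standard port is meant to avoid.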

Where resources permit, I would actually suggest running three SSH services:

1) A functional service on a nonstandard port which accepts root logins;
2) A functional service on a nonstandard port which accepts non-root logins, if use cases for this exist;
3) A decoy, non-functional service on port 22 that accepts no logins and has no shell.

With this combination, you can adjust your analytics to lower the priority of campaigns against port 22 by unstructured threats that may never realize they're not interrogating a working service. At the same time, alerts involving campaigns against the working services can be raised in priority as these tend to suggest a more determined human attacker who has taken the trouble to find the real SSH services. These are more interesting things to hunt, and time better spent.
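The prioritization described above could be sketched roughly as follows. The nonstandard port numbers are invented for the example, and the three priority labels are my own shorthand rather than anything a particular analytics product defines.

```python
# Sketch: prioritise SSH brute force alerts by which listener was targeted.
# The working-service port numbers below are hypothetical examples.

DECOY_PORT = 22
WORKING_PORTS = {47022: "root", 47122: "non-root"}


def ssh_alert_priority(dest_port):
    """Return 'low' for the decoy, 'high' for a working service, else 'medium'."""
    if dest_port == DECOY_PORT:
        return "low"     # unstructured noise against the non-functional decoy
    if dest_port in WORKING_PORTS:
        return "high"    # someone took the trouble to find a real service
    return "medium"      # a port this scheme does not classify
```

An alert pipeline would call `ssh_alert_priority` with the destination port of each brute force alert and route the "high" results to the hunting queue first.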

On the Bastion Host Design Pattern

Few of us would argue that placing a Bastion in front of a production SSH service is not a useful tactic. Placing a bastion inline, like a firewall, feels safer and this makes everyone feel good. Where this goes wrong, too often, is in at least a couple of ways I can cite:

1. Assumptions may be made that the presence of the bastion has created a sort of impenetrable condition that allows security to be relaxed in the environment behind the bastion. A bastion host is not a perfect defense, any more than any other technology, and security needs to be applied systemically using the assumption that any single point, including a bastion, may fail at some time.

2. Ineffective or incoherent identity management and logging on SSH sessions or tunnels crossing the bastion. Figuring out exactly who did what - disambiguating user identity and / or context - too often proves to be infeasible in reality due to inadequate logging or user identity management across bastions.

Given a choice between a bastioned environment with weak logging where I cannot establish user context, and a non-bastioned environment with strong identity management and logging, where I can establish user identity, I'd actually tend to choose the latter. Given a choice, I think using a 2FA VPN with hardware tokens, as some are increasingly doing, with very strong identity management and logging, is often preferable to a simple bastion.

On SSH network security in the public cloud

Applying network access controls to SSH is a fine idea in principle that sometimes breaks down due to account sprawl. Organizations love to create lots of accounts in order to create safeguards against different product or application teams stepping on each other's work. Keeping a large and dynamic list of accounts connected to a "bastion" SSH VPC - via peering or VPN - can be harder than it sounds. Security groups don't cross accounts and there is no obvious way to manage security groups or network ACLs across the AWS account boundary. I sometimes wish that we could use account ID as a parameter in security groups in order to allow instances from any account, in the same billing entity organization, to reach certain network services. I've actually suggested this in the past but this is not a simple feature request.

Tuesday, January 03, 2017

There's More Than Five Tuples: Network Security Monitoring With Syscalls

In modern cloud environments like AWS, we have a few options for doing network security monitoring and anomaly detection. We could use native OS-level firewall logs; we could layer on an instance based firewall product; or we could gather flow logs as long as our instances are running in a VPC. The disadvantage of all these approaches is that they are so-called "5-tuple" data types - limited to source and destination IP addresses, ports, and protocols. This is evident when we examine VPC flow logs which are similar to netflow:

Syscall data is superior as it is a 15-tuple data type with much richer data including key contextual parameters in addition to the traditional five tuples:
  • The name and path of the process connected to the socket (e.g. which program or service engaged in the network activity)
  • The commands and / or arguments associated with the process (e.g. what was the program trying to do)
  • The user and group associated with the process (e.g. who did it)
  • The PID and PPID of the process
With conventional flow or firewall logs, this kind of contextual detail often requires manual live response which is inherently unscalable due to its labor-intensive nature. Even where we have hunting teams available to do live response and chase down anomalous network activity, intermittent network activity tends to resist root causing unless an analyst happens to be doing live response when the activity manifests. Malware frameworks have long used intermittent and irregular command-and-control beaconing, for this reason, in order to resist detection and prolong persistence. In the cloud, we have another problem: servers may or may not live long enough for an analyst to perform live response. In semi- or hyper-ephemeral environments where instances are frequently created and destroyed, there may simply be no instance to perform live response on by the time an alert rises to the top of the analysis queue, making threat hunting ineffective. In other cases, an instance may have been terminated by a developer or technician who thought this was the best course of action. In some environments, while servers persist long enough for live response, live response is infeasible when SSH or other command shells are unavailable due to philosophical design decisions that SSH or command shell accessibility is an anti-pattern.

The answer to all of this is to instrument in advance, and gathering syscall data is the best way I've found in six years to do network security monitoring in the cloud. The Threat Stack technology is possibly the best way to gather and process syscall data at scale. I've been doing active development in the tool for a while now and have arrived at a set of twenty rules that can detect any kind of network anomaly and produce actionable alerts. They're much more useful than most kinds of network alerts based on flow logs or firewall logs because I can do network behavior anomaly detection (NBAD) in fifteen dimensions instead of five. For example, consider these syscall logs on activity from an nmap scanner:

You can see all 15 tuples in the events which allow us to root cause this network behavior as an nmap scan in a matter of seconds without the need to get on the instance and perform live response in order to try and identify the process connected to the sockets, the user who invoked the process, the commands that were executed, and so on.
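To make the contrast with five-tuple logs concrete, here is a minimal sketch of what such an event might look like as a record. Only the contextual fields named in the list above are modelled; the field names and sample values are mine and do not reflect the actual Threat Stack schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkSyscallEvent:
    # The traditional five tuple
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    # Contextual fields that flow and firewall logs lack
    exe_path: str       # name and path of the process on the socket
    command_line: str   # commands and arguments
    user: str           # who did it
    group: str
    pid: int
    ppid: int


def root_cause(event):
    """Summarise who ran what, without logging in to the instance."""
    return (
        f"{event.user} ran '{event.command_line}' "
        f"({event.exe_path}, pid {event.pid}) -> "
        f"{event.dst_ip}:{event.dst_port}/{event.protocol}"
    )


# Made-up sample event using documentation address ranges.
event = NetworkSyscallEvent(
    "192.0.2.10", 51514, "203.0.113.9", 443, "tcp",
    "/usr/bin/nmap", "nmap 203.0.113.9", "alice", "support", 4242, 4100,
)
summary = root_cause(event)
```

With only the first five fields, the same event would tell us that something on 192.0.2.10 talked to port 443 and nothing more.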

Using syscall data on file events, one can also perform complex and subtle anomaly detection on file and process events and I plan to explore this next. Example use cases include sensitive file access by shells or editors; file access by data exfil vectors like scp or ftp; and anomalous file access in general. Anomalous file access is another promising method for doing behavioral detection of exploit attempts; for example, a random process, like a Dirty Cow exploit, writing to /etc/passwd, or a non-MySQL process writing to a MySQL configuration file.
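As a rough illustration of the file access idea, the sketch below flags writes to a sensitive file by any process not on an allow list. The file paths, the allow-listed process names and the function name are all assumptions for the example; a real rule set would be tuned per environment.

```python
import posixpath

# Hypothetical allow list: which executables may write to which sensitive files.
ALLOWED_WRITERS = {
    "/etc/passwd": {"useradd", "usermod", "userdel", "vipw"},
    "/etc/mysql/my.cnf": {"mysqld"},
}


def is_anomalous_write(file_path, exe_path):
    """Return True when a monitored file is written by an unexpected process."""
    allowed = ALLOWED_WRITERS.get(file_path)
    if allowed is None:
        return False  # this sketch does not monitor the file
    return posixpath.basename(exe_path) not in allowed
```

For instance, `is_anomalous_write("/etc/passwd", "/tmp/dirtyc0w")` evaluates to True, while the same write from `/usr/sbin/useradd` does not.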

Spacefolding Redux: Increasing Hunting Velocity and Mean-Time-To-Know (MTTK)

(Originally written for the Threat Stack blog)
In our last post, we took a look at traditional security incident response vs. the possibility to dramatically increase security velocity (which I affectionately nicknamed “spacefolding”).
We viewed this through the lens of a conventional response timeline that can take hours and days — versus seeing into exactly what occurred and decreasing the Mean Time-To-Know (MTTK) for a security incident -- because all of the relevant information is visible and available to you.
In this post, we’ll take this premise into a real-world example that may be familiar to many organizations running instances on AWS. Consider a routine scanning abuse complaint as an example investigation. When an EC2 instance is observed to be scanning another server, AWS security will issue an abuse report to the instance owner, a sort of admonishment that your instance has been naughty and its behavior must be dealt with.
These reports are typically very terse and may include few details other than that the destination port count was exactly one thousand. One thousand ports is exactly the number targeted by a default nmap (the network mapper) scan, and we can surmise that an unauthorized nmap scan is the likely explanation. How do we ascertain this? We need to ask and answer several questions:
  1. What? Was the cause of the activity indeed a scanner like nmap or was it a misbehaving application? The former is a case for security; the latter a case for the application owner.
  2. Who? If it was a scanner, which user ran the scan? Who did it?
  3. Why? Why did a user run a scanner? Was it really them or did someone else login as them using their password?
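The port-count clue that prompts these questions can be checked mechanically if flow records are available. This is a minimal sketch assuming flows arrive as (source, destination, destination port) tuples; the record shape and function names are mine, and a match only suggests a default nmap scan, it does not prove one.

```python
from collections import defaultdict

DEFAULT_NMAP_PORT_COUNT = 1000  # the figure quoted in the abuse report discussion


def distinct_ports_by_pair(flows):
    """Count distinct destination ports for each (source, destination) pair."""
    ports = defaultdict(set)
    for src, dst, dst_port in flows:
        ports[(src, dst)].add(dst_port)
    return {pair: len(seen) for pair, seen in ports.items()}


def looks_like_default_nmap(flows):
    """Return the pairs whose distinct port count is exactly one thousand."""
    return [
        pair
        for pair, count in distinct_ports_by_pair(flows).items()
        if count == DEFAULT_NMAP_PORT_COUNT
    ]
```

This answers only part of "What?"; the "Who?" and "Why?" questions still need process and user context.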
Answering these questions, in most organizations, takes lots of time and effort.
First, we have to locate the offending EC2 instance and identify whose EC2 account it is running in.
Next, we have to identify the instance and / or application owner and ask them if they can explain this behavior. If the answer is No, as it probably is, we have to obtain keys to the instance in order to login and investigate ourselves. By this time, the running state we could use to solve this mystery has expired and is no longer present for us to observe — the scanner has stopped, its network connections have expired, and the user who did it is no longer logged in.
We set about examining system logs and find nothing of significance. Ephemeral events like process execution and network activity do not typically leave traces in system logs because this level of detail would grow the logs until they swamped the file system.
If we’re lucky, there may be some shell command history that we can use to identify which user ran the nmap scanner and scanned the complainant server. If we’re unlucky, we may have to examine authentication logs for the entire day, or more, and question each user one by one until we eventually learn that a user did indeed run the nmap scanner while troubleshooting network connectivity to a remote instance. The user, a support technician, forgot to specify a port parameter and accidentally ran a default nmap scan which covers a thousand common ports, which was flagged by the EC2 security team.
Figure 1. Syscall events reveal that the network activity came from the nmap scanner.
Figure 2. Syscall events identify the original command run by the user who invoked the nmap scanner.
Figure 3. Unified authentication events identify where and when the user logged in.

There IS a Better Way

How does using Threat Stack in a routine case like this improve our Mean-Time-To-Know?
With Threat Stack, our team can easily replay system calls at the time of the scan and answer our first two questions above — What? and Who? — using the resulting data. This would allow us to fast forward to step three — Why? — and simply ask the user why she ran this command.
The time savings in even this routine case are significant: minutes instead of hours. This is a velocity increase of a factor of sixty! In more complex cases, I project, the velocity increase may range as high as 200 times. Increased velocity provides blue teams with a tactical advantage; and as blue teams will tell you, they will gladly exploit any tactical advantage they can, because too often the attackers have the advantage. The ability to detect and respond closer to the speed of threats will provide a massive increase in productivity for overloaded security incident response teams.
For more advanced threat hunting teams, a velocity increase provides the ability to disrupt threats, before significant damage is done, instead of simply detecting and responding to losses that have already occurred in the past. This would be the secondary, and probably much larger, benefit of increased velocity.

A Final Word . . .

Spacefolding, for our purposes in this case, refers to a platform like Threat Stack, and the benefit is reducing MTTK. While we are still constrained by the laws of physics and spacetime, we can still significantly impact our response velocity, and potentially disrupt attackers using a purpose-built platform.

Tuesday, November 01, 2016

Increasing Security Velocity With "Spacefolding"

(Originally written for the Threat Stack blog)
I recently added a Starz subscription to my Amazon Prime and found a new supply of science fiction movies. One of these, Deja Vu, is a time travel story from a decade ago; a weird mashup of the post-9/11 terror attack genre mixed with science fiction. In the film, a terror attack takes place in New Orleans and a small army of government men-in-black from various state and Federal agencies respond. Because the attack involved a ferry, the NTSB and FBI collaborate along with elements of the ATF, including a talented investigator played by Denzel Washington.
While the FBI / NTSB task force sets about the painstaking work of accident reconstruction and crime scene forensics, Denzel’s character is recruited by a sort of super-secret element of DHS using an experimental technology called “spacefolding” to directly observe the past. The “spacefolding” machine displays a single point in space exactly 48 hours in the (relative) past. The DHS time scientists recruit Denzel’s character because they realize they need an investigator to know where to look, in order to be looking in the right place during the prelude to the attack, and solve the case by witnessing the perpetrators in action.

OK, you’re saying, I’m due back on Earth now. All of this is fun science fiction and vaguely entertaining, but what does it have to do with anything, let alone security velocity?
Well, back, in the real world, we cannot fold space and observe the past — but what if we could?

We have experienced similar challenges in the realms of security threat hunting, host intrusion detection, and incident response for decades. When investigating IDS and other alerts, security teams often try to partially reconstruct into the past and divine what happened. This examination of a running system is called live response and involves the sifting of logs and artifacts for clues not altogether unlike an accident reconstruction or crime scene forensic technician, albeit less formal in methodology.
Consider the differential time and effort cost of the two approaches in the film:
Security analysts examine current state including things like open ports and sockets, attached processes, file handles, and active user sessions. If current state is unrevealing, because the activity under investigation took place in the past, analysts gather logs and file systems and start creating timelines — another method of attempting to reconstruct the past.
What if we could actually see the past instead of painstakingly reconstructing it? This would give us a massive shortcut to answering questions during live response and routine investigation. My recent work with Threat Stack is as close to Spacefolding as I can imagine getting — using the TTY Timeline, one can actually go back to events that occurred in the past, observe what happened, and get answers in minutes. For a typical security team, this can reduce live response time from hours to minutes (100–200x). Consider the difference between conventional live response and observing past security events by “spacefolding”:
So how could we observe the past? Much of what we know about physics suggests we are stuck in a linear time existence. There may be additional dimensions, including some with possibilities of nonlinear time, but that doesn’t help us here. What we can do is to record state in great detail and play it back using a reference monitor connected to an enormous logging and analytics engine. Imagine using something like auditd to record all system calls or syscalls: command and process activity, file activity, network connections with attached processes, user logins and privilege elevations, and TTY command history. If we record this level of detail into a database that allows us to query and sift the data, we can observe detailed state and past events on a server instance.
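The record-then-query idea can be sketched with nothing more than an in-memory SQLite table. The schema, the event kinds and the sample rows are invented for illustration; this is not how auditd or Threat Stack store data, only a toy showing why a queryable event log lets us look at a past time window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "ts INTEGER, host TEXT, user TEXT, pid INTEGER, kind TEXT, detail TEXT)"
)


def record(ts, host, user, pid, kind, detail):
    """Store one observed event (login, exec, connect, ...)."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
        (ts, host, user, pid, kind, detail),
    )


def replay(host, start_ts, end_ts):
    """Return events on a host inside a past time window, oldest first."""
    cursor = conn.execute(
        "SELECT ts, user, kind, detail FROM events "
        "WHERE host = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (host, start_ts, end_ts),
    )
    return cursor.fetchall()


# Made-up history for a hypothetical host called web-01.
record(100, "web-01", "alice", 4100, "login", "ssh from 198.51.100.7")
record(160, "web-01", "alice", 4242, "exec", "nmap 203.0.113.9")
record(161, "web-01", "alice", 4242, "connect", "203.0.113.9:443/tcp")
record(900, "web-01", "bob", 5000, "exec", "ls")

window = replay("web-01", 150, 200)
```

Here `window` holds the exec and connect events from the time of interest, even though the process and its sockets would be long gone on the live system.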

In my next post, I will dig deeper into an actual use case of “Spacefolding” with Threat Stack and how it can dramatically increase security velocity. 

Sunday, July 12, 2015

Simply Explained: Why Do We Need So Many Security Test Things?

Software is like entropy. It is difficult to grasp, weighs nothing, and obeys the second law of thermodynamics; i.e. it always increases.
- Norman Ralph Augustine

Why is security testing so complicated? Why do we need so many kinds of security tests for web applications including both dynamic analysis (or DAST) and static analysis (or SAST) - in addition to threat modeling, whatever that is?

During the recent controversy on airplane security topics, I finally thought of an analogy to explain why we need so many different kinds of activities in software security lifecycle endeavors. Typically among the first questions asked about software security is something to the effect of,  "should we use this vulnerability scanner or that?" The answer, of course, is that you need more than a simple vulnerability scan to test modern web applications. 

In the vulnerability management world, things are simpler - scanners interrogate listening ports and test binary services for known vulnerabilities using a catalogue of known flaws. It's a simple test -because- there is a catalogue; the list of vulnerabilities you're looking for has been provided in advance by security researchers and / or vulnerability management vendors. These assumptions break down in the web application space because web applications tend to be unique; they've not been tested or assessed before, so there is no catalogue or list of known flaws to look for. There are generic tests for well-known classes of issues, but these have widely varying effectiveness, and no single test or "scanner" can thoroughly assess any web application running on any framework.

An analogy with airplane manufacturing is one way to think about this. If dynamic analysis is analogous to the wind tunnel test for an airplane, think of static analysis as a sort of X-Ray. 

In the aviation world, they perform complex X-ray imaging techniques like computed tomography (CT). These kinds of tests are used to find subtle defects in materials or cases where precision tolerances are out of spec. These kinds of defects in materials and workmanship might never be detected in a wind tunnel test because the necessary conditions for failure aren't quite right or the duration isn't long enough (you can't keep the plane in the wind tunnel for years in order to perform a complete simulation of its life cycle). In the software world, we perform static analysis as a sort of "X-ray" to look for subtle flaws and imprecise tolerances in the code that might fail at some point in the future if conditions become just right.

Perhaps better known is the so-called "wind tunnel" test where the aircraft is exposed to in-flight conditions to test performance and find out what it takes to create failure. Airplane engine testing includes putting high velocity water and hail, and even dead birds, into the engine while running to see if the engine can handle unwanted input without failure. This is sort of similar to dynamic software analysis where a running application is exposed to inappropriate input in order to see if it fails or if it recovers and keeps running. Instead of hail and dead birds, during security testing, we feed software various kinds of unexpected or unwanted data and input.
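In miniature, the software version of throwing hail at the engine looks something like the sketch below. The target function and the list of unwanted inputs are toy examples of my own; real dynamic analysis tools generate far larger and smarter input sets against a running application.

```python
def parse_age(text):
    """Toy target: parse a non-negative age from user input."""
    value = int(text)
    if value < 0:
        raise ValueError("age must be non-negative")
    return value


# A handful of "hail and dead birds" for the toy target.
UNWANTED_INPUTS = ["42", "", "-1", "abc", "9" * 50, None, "4 2", "\x00"]


def exercise(target, inputs):
    """Feed each input to the target and record how it coped."""
    results = {}
    for item in inputs:
        try:
            target(item)
            results[repr(item)] = "ok"
        except ValueError:
            results[repr(item)] = "rejected"  # handled: recovers and keeps running
        except Exception as exc:
            results[repr(item)] = f"failed: {type(exc).__name__}"  # unplanned failure
    return results


results = exercise(parse_age, UNWANTED_INPUTS)
```

The interesting entries are the ones marked "failed": inputs the designer did not anticipate, which is where a human tester starts digging.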

The reason these automated tests are accompanied by manual testing  - the so-called "pentesting" - is that the machines can't think and the test designers can't anticipate every possible failure state in every application. In the airplane world, there are only a few things that can reasonably be expected to interfere with a running engine - clouds, rain, hail, fog, birds, lightning, possibly supercooled liquid water, and various atmospheric gases. There just aren't that many kinds of things in the air. In the software world, abuse cases are supernumerary - there are millions and millions of permutations of unwanted data inputs that can be thrown at a running application. While not all make sense to test, and only a handful may actually cause failure, sometimes only a human can figure out just the right input to produce a failure state that is "exploitable" - meaning it creates failure in a way that yields control of the program.

Lastly, the activity known as threat modeling is also sort of poorly understood. Threat modeling is a sort of systematic design review that seeks to uncover design flaws - often design aspects based on assumptions made during the design phase that, while convenient or expedient, prove to be flawed upon closer examination. In other words, something is working as designed, but the design is insecure. An example in the aviation world would be reviewing the design of the onboard networks to see if there are any potential interconnections between control and entertainment systems that could be abused by a malicious passenger under the right conditions. Security design flaws, and their underlying assumptions, cannot usually be found by any automated test, at least not until the machines can think. 

The output of threat modeling is often more likely to be design flaws (things that are working as designed, but whose design is insecure) more than security defects (things that are insecure because they fail when exposed to unexpected conditions). 

Hopefully this explains why we have so many kinds of test cycles. Each test activity finds things the others cannot, and the only way to thoroughly test and assess product security is to use all three techniques. When the machines can think, and can create their own code, we may have code with security quality indexes that are an order of magnitude higher than we have today. It seems likely that the first thought that will be articulated by the first sentient program, if and when it arrives, will be to critique the quality of the human-written code it was created with.

Wednesday, June 24, 2015

Windows ICMP Redux: Don't Be Sad

"Have I got this straight, Jonesy? A million dollar computer tells you you're chasing an earthquake, but you don't believe it, and you come up with this on your own?"
-Captain Mancuso,
 The Hunt For Red October

This is a follow-up to a post from days of yore - Tracking Down Random ICMP in Windows. To summarize: ICMP endpoints now can often be identified with netsh traces (and yes, if you're wondering, this is a good thing).

Possibly you're wondering, at this point, why we care about something as arcane as identifying a process sending ICMP packets. I personally care about this because I hate it when there is nothing to the left of the equals sign - where the status of the case is having to say "we don't know" (something engineers hate to say, and will only say under duress and in a sort of low, pathetic tone). Tracking down the process connected to an ICMP packet stream under Windows has historically been hard and made many people sad - including myself, when I wrote the original post what seems like a lifetime ago. A wide range of replies and suggestions ranged from "use Ethereal (now Wireshark)" to "run the Sysinternals thingy" to "run that thing that does..things". Sniffers don't generally give you a process name or ID, of course, and members of the venerable Sysinternals suite like Process Monitor and TCPview don't catch ICMP due to differences in the way the protocol is implemented. I've asked Microsoft about this over the years, and talked with Russinovich at a conference in 2011 or 2012, and for a long time there wasn't a simple answer.

ICMP is a pretty good protocol option for someone who wants to use a network to do things - say scans, data exfil or even C2 - without being noticed because it tends to have free rein. Security teams rarely interfere with ICMP because network teams tend to regard it as both critical to normal operations and somewhat shrouded in mystery (at least as to what exactly needs to be able to ping what, in order to avoid failing and taking half the business with it). With that combination, any interference is often regarded as tantamount to recklessly introducing risk of disruption and chaos. The upshot is that ICMP is often subject to few or no access controls. ICMP is absolutely vital in the devops and software maintenance fields, as anyone who has listened to network complaints from developers can attest, as these often feature a sort of customary preamble that goes something like "I cannot ping the things, so that means the firewall has taken down the network, and that makes me sad." ICMP, it would seem, is the primary network diagnostic tool for most kinds of distributed applications.

Security and network teams tend not to notice ICMP traffic unless it becomes annoying or disruptive due to quantity or volume. Security analysts and hunters may not look twice at anomalous ICMP traffic because of the familiarity or availability heuristics - analysts are often quick to assume that ICMP activity is "normal" because they see the protocol used constantly on their networks. (Interestingly, these heuristics are thought to increase under cognitive load and may explain some of the "we didn't see the alert" breach and incident scenarios. Security incident analysts increasingly fail to recognize indications and warnings when they are task saturated or working in interrupt-driven roles like support or administration). Security teams may avoid investigation of anomalous ICMP activity and instead hope it is harmless because it has been hard to reach a conclusion without a lot of tedious mucking about in a debugger or other low level tools (which tends to be annoyingly disruptive and consequently makes everybody sad). A threat actor with persistence on a network manager's Windows PC could do almost anything with ICMP with little or no expectation of being discovered and shut out.

Sometime around 2010, Network Monitor (the Microsoft sniffer for Windows) added the ability to display pids (process IDs) and image names (process names) in the trace data. This data is not in the IP protocol specs, of course, and is therefore not present in any captured packets; Network Monitor obtains this detail through some magic of the Windows API. This is incredibly useful but, as with the Sysinternals tools, ICMP endpoint process details are not captured. However, there is another way. It turns out that netsh added a "trace" command in Windows 7 which takes packet captures. This is the syntax for "netsh trace":

The following commands are available:

Commands in this context:

?              - Displays a list of commands.
convert        - Converts a trace file to an HTML report.
correlate      - Normalizes or filters a trace file to a new output file.
diagnose       - Start a diagnose session.
dump           - Displays a configuration script.
help           - Displays a list of commands.
show           - List interfaces, providers and tracing state.
start          - Starts tracing.
stop           - Stops tracing.

These netsh packet captures are taken somewhat differently than in Network Monitor - I don't know just what the differences in the implementation details are - because they (mostly) include process details for ICMP traffic. For example, here is a netsh trace of ICMP activity viewed in Network Monitor:

In this case we can see the process is nmap, which is not something generally expected of a user, even if they were the sort expected to be running tools that ping everything in sight all day long. If the process were something unknown, or unidentified, we could locate the binary and start malware analysis as usual (and round up twice the usual number of suspects).

The process is multi-step but simple and quick:

1. Take a network trace using netsh trace start capture=yes 

2. Stop the trace when finished using netsh trace stop
3. Open the capture file in Network Monitor. The tree on the left will not be organized as usual and instead contains a list of things called "NetEvent Activity ID x". You may have to search (click Find) or trawl through the data to locate the traffic you're interested in.
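Steps 1 and 2 are easy to wrap in a small script. The sketch below uses only the two netsh invocations given above; the function name and the injectable `run` parameter are my own additions so the sequence can be exercised without touching netsh. On a real Windows host the commands generally need an elevated prompt.

```python
import subprocess

START_CMD = ["netsh", "trace", "start", "capture=yes"]
STOP_CMD = ["netsh", "trace", "stop"]


def capture_trace(wait_for_traffic, run=subprocess.run):
    """Start a netsh trace, call wait_for_traffic(), then always stop the trace."""
    run(START_CMD, check=True)
    try:
        wait_for_traffic()
    finally:
        # Stop even if waiting raised, so a trace is not left running.
        run(STOP_CMD, check=True)
```

A call such as `capture_trace(lambda: time.sleep(60))` (after `import time`) would capture a minute of traffic; step 3, opening the capture in Network Monitor, remains manual.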

Thursday, March 13, 2014

Much About Doing Nothing

One of the most important functions of a security engineer is to ensure that nothing is done.

As I write this, we seem to be in a period of historic demand for security resources; every recruiter in my region seems to be searching for candidates to fill security positions. My voice mailbox fills up once a week with messages from recruiters and hiring managers looking for referrals and my inbox contains hundreds of security job descriptions. One of the side effects of this demand zone for security engineers is an uptick in horror stories, from a wide variety of organizations, about new security people who cannot triage, or even read, security vulnerability data. Lacking the ability to thoughtfully sort the important from the trivial, some simply dump PDF or Powerpoint reports containing hundreds or thousands of vulnerability line items and demand everything be immediately fixed without regard to the importance or relevance of the issues. Others demand that any issues rated "critical" or "high" become a fire drill, requiring around the clock remediation effort and / or suspension of normal business priorities, without considering the vulns in the context of their actual risk profile and attack surface. A remotely exploitable vuln with arbitrary code execution may indeed be critical, assuming it has been confirmed a true positive, and should be patched, but if it is not exposed to unauthenticated or external users, it is probably not an emergency requiring activation of incident response plans. At the same time, the number of issues that are genuinely urgent, among those that are confirmed and serious in nature, is often smaller than those which are not urgent and may be handled by vulnerability management processes on a non-emergency basis. Vulnerabilities that are not exposed to unauthenticated users; confirmed blocked by (H|N)IDS or WAF devices; or inaccessible outside their VLAN due to layer three ACLs may not be emergencies.
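The triage reasoning above can be written down as a few explicit checks. This is a simplified sketch: the `Finding` fields, the three outcome labels and the rule that any single mitigating condition downgrades a finding are my assumptions, and real triage involves more judgment than a function can capture.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    severity: str                     # e.g. "critical", "high", "medium", "low"
    confirmed: bool                   # reproduced as a true positive
    exposed_to_unauthenticated: bool
    blocked_by_ids_or_waf: bool
    reachable_outside_vlan: bool


def triage(finding):
    """Return 'verify', 'emergency' or 'routine' for a scanner line item."""
    if not finding.confirmed:
        return "verify"  # confirm it is a true positive before asking for work
    mitigated = (
        not finding.exposed_to_unauthenticated
        or finding.blocked_by_ids_or_waf
        or not finding.reachable_outside_vlan
    )
    if finding.severity in {"critical", "high"} and not mitigated:
        return "emergency"
    return "routine"  # handle through normal vulnerability management
```

Under these rules a confirmed critical finding that is only reachable by authenticated internal users comes out as "routine", which is the point: it should be patched, but not as a fire drill.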

This is one of the worst sorts of failure modes a security engineer can experience; a Quixotic quest to eliminate all risk creates as many direct threats to the success of the business as a security incident. Such a quest diverts valuable engineering resources, distracts business teams, and degrades the ability of the business to execute. Additional consequences may include alienation of development and business teams, making relationship building and collaboration difficult. An efficient and accurate vulnerability management program should have at least a 3:1 ratio between asks made for doing little or nothing and asks for taking action. One of the most important functions of a security engineer is arranging for development, server, app and network teams to do nothing. This probably sounds nonsensical, but consider that security vulnerability datasets contain anything up to a 10:1 ratio between non-actionable and actionable issues. False positives abound, but so-called "nontextual" or noise line items are often more numerous. Noise issues are not false positives, strictly speaking, as the test code or criteria evaluated true and did not produce a type 1 error. Rather, they are issues that are inaccurate or irrelevant because the set of conditions the test measures does not precisely match the set of conditions under which the vulnerability is present. For example, some of the older vuln signatures in use may simply flag versions in service banners or port numbers. A service banner cannot measure patch levels with precision, and a port number may be used by anything; when high-numbered ports are flagged on Windows servers, these are often simply the result of RPC applications allocating dynamic ports. A good security engineer would research the relevant CVE-to-patch mapping and identify listening services using netstat or tcpview during triage of such issues in order to eliminate them from the list of action items.

In the web application space, false positives and noise abound, and it can be even more important for a skilled analyst or engineer to reproduce security bug candidates. The simple presence of error messages, return codes or echoing of encoded text does not a working exploit make. In many cases, development teams are acting correctly to reject result sets from web scanners that have not been triaged or confirmed by someone who knows how to use a web proxy/debugger.
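
As a rough sketch of how banner-only and port-only findings might be set aside for manual verification before anyone is asked to act, consider the Python below. The `evidence` labels and the dictionary layout are assumptions made for illustration; real scanner exports use their own fields.

```python
# Evidence types that, per the discussion above, cannot establish a
# vulnerability on their own. The labels are hypothetical.
WEAK_EVIDENCE = {"banner", "port"}


def split_by_evidence(findings):
    """Separate findings resting only on a banner or port number from the rest."""
    needs_manual_check, stronger = [], []
    for finding in findings:
        if finding["evidence"] in WEAK_EVIDENCE:
            needs_manual_check.append(finding)
        else:
            stronger.append(finding)
    return needs_manual_check, stronger


if __name__ == "__main__":
    sample = [
        {"id": "F1", "evidence": "banner"},      # version string in a service banner
        {"id": "F2", "evidence": "port"},        # high-numbered port open on a Windows host
        {"id": "F3", "evidence": "reproduced"},  # confirmed by hand with a proxy/debugger
    ]
    weak, strong = split_by_evidence(sample)
    print("verify before assigning:", [f["id"] for f in weak])
    print("candidates for action:", [f["id"] for f in strong])
```

Items in the first bucket are the ones to check against patch levels and listening processes; they are not dismissed automatically, only held back from the stakeholders until someone has looked.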

For each confirmed and important vulnerability in a data set, there may be dozens or hundreds of such false positive and noise line items. No incremental improvement is realized by "fixing" nonexistent security vulnerabilities, and tasking anyone with such issues, other than a security engineer learning to triage, is a waste of time. Whenever non-actionable issues are being reviewed, or unnecessary work is being considered, the security engineer should be fighting alongside the stakeholders for the cause of doing nothing. Given the volume of false positive and noise line items that tend to exist, a security engineer doing effective triage may spend as much or more time helping stakeholders avoid unnecessary work than asking them to do necessary work. Eliminating unnecessary work, and identifying non-urgent issues that may be deferred into routine vulnerability management or development cycles on a non-emergency basis, is one of the most important services a security engineer can provide in support of the preservation of scarce resources and the need of the business to execute. This approach may also be successful in building effective relationships with stakeholder groups, making it easier to request action when the time actually does arrive.

Wednesday, February 05, 2014

Assessing Supply Chain Risk

As I write this, there is much talk about recent reports that the Target breach originated from a third party contractor's network that apparently had remote access to the Target network. The subcontractor's network was reportedly penetrated and used to pivot into the Target network, sort of the digital equivalent of tunneling into a bank from the basement of the parking garage. This illustrates how there are few unimportant supply chain members today; the era of "we don't need to worry about security because nobody would want to hack us" is over. As we are seeing with recent data breaches, smaller organizations may be penetrated for the sole purpose of pivoting to penetrate the larger and more important organizations in the supply chain. Any organization that is a member of an important supply chain is potentially the weakest link and the largest material risk to the entire chain. Supply chain risk may manifest as software product insecurity, vendor operational and network security issues, or both. This post is more about the former - software security risk - which seems to be the larger problem set in many organizations.

Recently I sat through a number of vendor pitches and asked questions which apparently prompted some people to ask me what sort of security questions should be incorporated into vendor evaluations. Software risk assessment is still a very unstructured, subjective and slow process, as we still don't have definitive standards or metrics that are universally accepted. It is not uncommon to see meetings spend more time on technical dogfights about standards, probabilities, or rankings of various bugs and bug classes than actually assessing risk. Also problematic is the risk of covering the same ground repeatedly when inexperienced but enthusiastic security analysts conduct ad hoc or unstructured threat modeling. I've spent approximately half my career as a software vendor, and half as a consumer, and it has been very illuminating to see the security issue from both sides of the table. This post is a synopsis of my experience with successful supply chain security management.

For vendors, it is important to engage sincerely with security teams. Customer security analysts can be very disruptive and inconvenient to the product sales cycle and it can be tempting to try to circumvent these obstacles. Such attempts may occasionally succeed, but in today's security landscape, they may also result in the vendor being singled out for special attention or scrutiny, making the security checkpoint process even longer. A more repeatable approach is to find someone among your staff who can serve as a product security manager, interfacing between customer security teams and product managers in order to understand and meet expectations of security policies, compliance, and outcomes. 

Vendors may find that different sections of their customer base have larger and smaller security expectations, leading some vendors to conclude that security is a feature request to be prioritized. This approach will almost never succeed because 1) security teams, as sales execs sometimes say, "can say no to everything and yes to nothing" - they almost never have any budget authority and 2) this approach tends to signal that the vendor's security management process follows a "do the minimum" philosophy and is consequently under-resourced, relatively unsophisticated, and probably not highly effective.

When ferreting out supply chain and software risk as a customer, I would, in addition to the usual questions, suggest the following:

It may be productive, at times, to drill down on any areas vendors steer discussion away from; these may be weak points waiting to be discovered. At the same time, it is important to avoid becoming entrenched on an issue and learn to recognize when a "rathole" is genuinely empty, or a proverbial horse is dead, and move on.

I'm also a bit of a fan of exploring problem spaces that appear, at first glance, to be empty as this has, on occasion, yielded interesting revelations. Few problem spaces are ever truly empty and those that appear so are sometimes arranged to look this way. Here again it is important to avoid becoming fixated on something that is not as interesting as it seems.

It is important to avoid taking an adversarial stance. Security efforts can succeed only when supply chain members work together. The job of the security team is to see the iceberg, not steer the ship; the duty of the security team is to identify risk and present alternatives. The question of risk appetite and residual risk tolerance, in the final analysis, is for the business stakeholders. Residual risk will never be eliminated completely without shutting down business initiatives and hobbling growth.

Avoid descending into competition for the status of smartest person in the room. Realistically, someone who has risen to be a product manager or software architect at a modern software manufacturer is quite possibly the smartest person in the room, at least among the engineers, and any such competition may be largely pointless. (This is not a self-reference; I would never claim to be the smartest person in the room, and nothing useful would result from such a claim).

Product security is driven by economic forces, as I will blog about next, and not a function of raw intelligence, so even the most brilliant software architects may have lingering security design flaws in their products worth considering. Bear in mind that a design flaw is often a very contentious thing; to be called a design flaw, an issue must generally be self-evident and obviously beyond question as to security impact.

These are the general classes of questions I ask, and have been asked, on the topic of software product security and supply chain risk.

Product security management. What does the vendor's SDL (security development life-cycle) look like? Do they undertake static or dynamic analysis, threat modeling, penetration testing (by whom)? Do they simply run a scanner twice a year? How many open and closed security bugs and design flaws do they track? (Most vendors will not release this, and if the answer is zero, they may be just getting started.) How do customers report security issues and how are these managed? What, if any, standards, frameworks, or compliance regimes is the product designed to utilize or participate in? What, if any, third party audits or attestations exist?

Authentication and authorization: the fundamentals. How do clients authenticate? How is authorization handled? How are session fixation and brute force detected or prevented? How many auth bypass or privilege elevation bugs have been fixed? (If the answer is zero, they have probably never tested.) Are there any unauthenticated entry points? What do they do? Is there any code that trusts security assertions from a client, or any client side authorization code that could be tampered with?

Audit and logging. The fundamental issue here is whether, when the CSO asks "what happened, and who did it?", the question can be answered. How rich and pervasive are logs and audit trails? Can unauthorized access be detected? Can abuse or accidents by authorized users be root caused? Are web services and external entry points instrumented well enough to detect abuse or attack? Can logs be transmitted to a SOC or SIEM tool for post-processing and analysis? What, if any, SIEM tools can parse the various event logs?

Data. What class of data would be present in the product or system? Does the product replicate databases, unstructured data or directory data from AD or LDAP, and how is this secured? How is data in motion secured? Data at rest? Could data loss be detected? Is TLS pervasive or are there sections that use HTTP, FTP, or the like? What encryption technologies are used and are they standards based or is some crypto code purpose-built? What, if any, auditing or testing of encryption implementation details has been performed and by whom?

Attack surface. How many entry points does the product have? What connectivity requirements do they have? Which, if any, may be Internet facing? Is there a notion of "secure by default" or is hardening left to the customer? Is there any automated security configuration management or auditing? How are patches and updates delivered, how often, and how are customers notified?

Cloud. If the product is hosted, or a cloud-like or SaaS product, how is security responsibility assigned? How are security configuration and policies managed, and by whom? How are security incidents detected and customers notified? How many security incidents have been detected and handled? (If the answer is zero, the experiential base may be short, or detection capabilities may not exist.) Is there a notion of multi-tenancy and if so, how are tenant security boundaries implemented - application layer, database, or both? Has the implementation been tested or audited?

Finally, a sort of essay question: What are the principal security issues to be aware of?
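
Several of the questions above share a pattern: an answer of zero is a warning sign rather than good news. For anyone recording vendor answers in a structured way, here is a minimal Python sketch of that pattern. The question keys and the `zero_answer_flags` helper are hypothetical names chosen for illustration; the notes attached to each key paraphrase the parenthetical remarks in the question classes above.

```python
from typing import Dict, List, Optional

# Questions where "zero" suggests inexperience rather than a clean record.
# Keys are illustrative assumptions, not a standard questionnaire.
ZERO_IS_A_WARNING = {
    "security_bugs_tracked": "the vendor may be just getting started",
    "auth_bypass_bugs_fixed": "they have probably never tested",
    "incidents_detected": "experience may be short, or detection may not exist",
}


def zero_answer_flags(answers: Dict[str, Optional[int]]) -> List[str]:
    """Return a note for each count question the vendor answered with zero.

    A value of None means the vendor declined to answer, which is common
    and is not flagged here.
    """
    flags = []
    for question, note in ZERO_IS_A_WARNING.items():
        if answers.get(question) == 0:
            flags.append(f"{question}: {note}")
    return flags


if __name__ == "__main__":
    vendor_answers = {
        "security_bugs_tracked": 0,
        "auth_bypass_bugs_fixed": 3,
        "incidents_detected": None,  # declined to answer
    }
    for flag in zero_answer_flags(vendor_answers):
        print(flag)
```

A flag here is a prompt for a follow-up conversation, not a verdict on the vendor.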