(where you can thrill to the adventures of a detection science lead)
I've...seen things...you people wouldn't believe. Attack ships on fire off the shoulder of Orion; I watched c-beams glitter in the dark near the Tannhäuser Gate...
Tuesday, January 03, 2017
Spacefolding Redux: Increasing Hunting Velocity and Mean-Time-To-Know (MTTK)
In our last post, we compared traditional security incident response with the possibility of dramatically increasing security velocity (which I affectionately nicknamed “spacefolding”).
We viewed this through the lens of a conventional response timeline that can take hours or days, versus seeing exactly what occurred and decreasing the Mean-Time-To-Know (MTTK) for a security incident because all of the relevant information is visible and available to you.
In this post, we’ll apply this premise to a real-world example that may be familiar to many organizations running instances on AWS.
Consider a routine scanning abuse complaint as an example investigation. When an EC2 instance is observed to be scanning another server, AWS security will issue an abuse report to the instance owner: a sort of admonishment that your instance has been naughty and its behavior must be dealt with.
These reports are typically very terse and may include few details beyond the fact that the destination port count was exactly one thousand. One thousand ports is exactly the number targeted by a default nmap (the network mapper) scan, so we can surmise that an unauthorized nmap scan is the likely explanation. How do we confirm this? We need to ask and answer several questions:
What? Was the cause of the activity indeed a scanner like nmap or was it a misbehaving application? The former is a case for security; the latter a case for the application owner.
Who? If it was a scanner, which user ran the scan? Who did it?
Why? Why did a user run a scanner? Was it really them, or did someone else log in as them using their password?
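As a rough first pass on the What? question, the one-thousand-port observation can be checked mechanically if you have any record of outbound connections from the instance. The sketch below is only an illustration: it assumes the connections are already available as (destination IP, destination port) pairs, and the function names are hypothetical. A count of exactly one thousand distinct ports is suggestive of a default nmap scan, not proof of one.

```python
from collections import defaultdict

# A default nmap scan probes 1,000 ports per target.
NMAP_DEFAULT_PORT_COUNT = 1000


def distinct_ports_by_target(connections):
    """Map each destination address to the set of ports contacted on it.

    `connections` is an iterable of (dst_ip, dst_port) pairs; this input
    shape is an assumption made for the example.
    """
    ports = defaultdict(set)
    for dst_ip, dst_port in connections:
        ports[dst_ip].add(dst_port)
    return ports


def looks_like_default_nmap(connections):
    """Return the destinations whose distinct port count is exactly 1,000."""
    by_target = distinct_ports_by_target(connections)
    return sorted(
        ip for ip, ports in by_target.items()
        if len(ports) == NMAP_DEFAULT_PORT_COUNT
    )
```

A chatty application that opens thousands of connections to a single port would not be flagged here, which is the distinction the What? question is trying to draw.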
Answering these questions, in most organizations, takes lots of time and effort.
First, we have to locate the offending EC2 instance and identify which AWS account it is running in.
Next, we have to identify the instance and/or application owner and ask them if they can explain this behavior. If the answer is no, as it probably is, we have to obtain keys to the instance in order to log in and investigate ourselves. By this time, the running state we could use to solve this mystery has expired and is no longer present for us to observe: the scanner has stopped, its network connections have expired, and the user who did it is no longer logged in.
We set about examining system logs and find nothing of significance. Ephemeral events like process execution and network activity do not typically leave traces in system logs, because this level of detail would grow the logs until they swamped the file system.
If we’re lucky, there may be some shell command history that we can use to identify which user ran the nmap scanner and scanned the complainant server. If we’re unlucky, we may have to examine authentication logs for the entire day, or more, and question each user one by one until we eventually learn that a user did indeed run the nmap scanner while troubleshooting network connectivity to a remote instance. The user, a support technician, forgot to specify a port parameter and accidentally ran a default nmap scan, which covers a thousand common ports and was flagged by the AWS security team.
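To make the "if we’re lucky" case concrete, here is a minimal sketch of combing shell history for nmap invocations that never selected ports. It assumes you already have the history lines as strings. The helper names are hypothetical, and the option list covers only the nmap port-related options I am sure of (-p, -F, --top-ports, and -sn, which skips port scanning altogether).

```python
import shlex

# nmap options that change which ports are scanned, or skip port scanning.
PORT_OPTIONS = ("-p", "-F", "--top-ports", "-sn")


def is_default_port_scan(command):
    """Return True if `command` runs nmap without selecting ports."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes in a history line; treat it as not a match.
        return False
    # Only a bare leading `sudo` is handled; `sudo -u ...` is not.
    if tokens and tokens[0] == "sudo":
        tokens = tokens[1:]
    # Compare the basename so /usr/bin/nmap also matches.
    if not tokens or tokens[0].rsplit("/", 1)[-1] != "nmap":
        return False
    return not any(tok.startswith(PORT_OPTIONS) for tok in tokens[1:])


def find_default_scans(history_lines):
    """Return the history lines that look like default-port nmap scans."""
    return [line.strip() for line in history_lines if is_default_port_scan(line)]
```

Keep in mind that shell history is written by the user's own shell, can be edited or disabled, and often carries no timestamps, so a hit here is a lead rather than evidence.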
Figure 1. Syscall events reveal that the network activity came from the nmap scanner.
Figure 2. Syscall events identify the original command run by the user who invoked the nmap scanner.
Figure 3. Unified authentication events identify where and when the user logged in.
There IS a Better Way
How does using Threat Stack in a routine case like this improve our Mean-Time-To-Know?
With Threat Stack, our team can easily replay system calls from the time of the scan and answer our first two questions above (What? and Who?) using the resulting data. This would allow us to fast forward to the third question (Why?) and simply ask the user why she ran this command.
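The correlation behind the three figures can be sketched in a few lines. This is not Threat Stack's API or event schema. The dictionaries below are a hypothetical record shape invented for illustration, with integer timestamps, and the function name is made up as well. The idea is simply to walk from network connections to the process that made them, then to the user and command, then to that user's login.

```python
from collections import Counter


def explain_scan(target_ip, connects, execs, logins):
    """Answer What? and Who? for connections made to `target_ip`.

    Hypothetical record shapes (assumptions for this example only):
      connects: {"pid": int, "dst_ip": str, "dst_port": int}
      execs:    {"pid": int, "user": str, "command": str, "time": int}
      logins:   {"user": str, "source_ip": str, "time": int}
    """
    # What: which process made the most connections to the target?
    pids = Counter(e["pid"] for e in connects if e["dst_ip"] == target_ip)
    if not pids:
        return None
    pid, _ = pids.most_common(1)[0]

    # Who: the exec event for that pid carries the user and command line.
    proc = next((e for e in execs if e["pid"] == pid), None)
    if proc is None:
        return None

    # Where from: that user's most recent login at or before the exec.
    prior = [
        entry for entry in logins
        if entry["user"] == proc["user"] and entry["time"] <= proc["time"]
    ]
    login = max(prior, key=lambda entry: entry["time"], default=None)

    return {"command": proc["command"], "user": proc["user"], "login": login}
```

With the command, user, and login in hand, the only step left for a human is the conversation about why the command was run.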
The time savings in even this routine case are significant: minutes instead of hours, a velocity increase on the order of a factor of sixty! In more complex cases, I project, the velocity increase may range as high as 200 times. Increased velocity provides blue teams with a tactical advantage, and as blue teams will tell you, they will gladly exploit any tactical advantage they can, because too often the attackers have the advantage. The ability to detect and respond closer to the speed of threats will provide a massive increase in productivity for overloaded security incident response teams.
For more advanced threat hunting teams, a velocity increase provides the ability to disrupt threats before significant damage is done, instead of simply detecting and responding to losses that have already occurred. This would be the secondary, and probably much larger, benefit of increased velocity.
A Final Word . . .
Spacefolding, for our purposes in this case, refers to using a platform like Threat Stack, and the benefit is a reduced MTTK. While we are still constrained by the laws of physics and spacetime, we can significantly improve our response velocity, and potentially disrupt attackers, using a purpose-built platform.