Thursday, November 01, 2007

Detecting Data Breaches and Sensitive Information Leaks

Recently I've found myself in a number of discussions about detecting losses of "non public information" or NPI: high value data like medical data, credit cards and social security numbers. Many products can detect NPI in transit across a network; Snort, for example, has good regexps for all kinds of NPI including socials and cards. Interestingly, this is one of the capabilities organizations would most like to have, yet it is among the least likely to be in place and performing well. What makes it so hard to detect and prevent these sorts of data breaches? Let's consider the detection problem.

NPI data (credit card numbers, social security numbers and the like) can be reliably detected by IDS or "extrusion detection" devices, including Snort, using regular expressions. I've used regexps like these in the past for social security numbers, and Bleeding Snort has ones that are even better:

SSNs delimited by dashes or spaces

\b\d{3}[- ]*\d{2}[- ]*\d{4}\b
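As a rough illustration, the dash-or-space pattern above can be exercised with Python's re module. The sample payload and the function name find_ssn_candidates are made up for this sketch. Note that the undelimited order number also matches, because the delimiters in the pattern are optional; that is exactly the sort of false positive discussed later in the thread.

```python
import re

# The dash-or-space pattern quoted above, compiled unchanged.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]*\d{2}[- ]*\d{4}\b")


def find_ssn_candidates(payload: str) -> list[str]:
    """Return every substring of the payload that matches the SSN pattern."""
    return SSN_PATTERN.findall(payload)


# Hypothetical payload: one SSN-shaped value and one unrelated 9-digit number.
sample = "name=J. Doe ssn=078-05-1120 zip=10001 order=123456789"
print(find_ssn_candidates(sample))  # prints ['078-05-1120', '123456789']
```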

SSNs delimited by any char not a letter, number or underscore


Match 1000 or more numbers, non-delimited (for the case where the bad guys remove the delimiters prior to transmission and you have tuned snort extremely well):


And so on. The problem rapidly becomes that no matter how smart your regular expressions and / or content inspection technologies are, the possible evasion methods are supernumerary - that is, so numerous that they're practically infinite. Data can be encoded, obfuscated and encrypted in any number of ways that defeat regular expression based detection. Add to that the fact that not all application protocols are ASCII (text) based; there are a vast number of obscure binary application protocols out there that are not going to cooperate with regexp based detection. What can we do? I recently participated in a discussion of the subject on the focus-ids list, reproduced below:
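To make the evasion point concrete, here is a small sketch showing that two trivial transformations are enough to get past the SSN pattern quoted earlier. The record, the helper name is_detected and the choice of transformations are assumptions made for the example; real evasion can be far more varied.

```python
import base64
import re

# Same dash-or-space SSN pattern as quoted earlier in the post.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]*\d{2}[- ]*\d{4}\b")


def is_detected(payload: str) -> bool:
    """True if the payload contains something shaped like an SSN."""
    return SSN_PATTERN.search(payload) is not None


record = "ssn=078-05-1120"  # hypothetical record

# Evasion 1: Base64-encode the record before sending it.
encoded = base64.b64encode(record.encode("ascii")).decode("ascii")

# Evasion 2: replace each digit with a letter (0->a, 1->b, ...).
substituted = record.translate(str.maketrans("0123456789", "abcdefghij"))

print(is_detected(record))       # True
print(is_detected(encoded))      # False
print(is_detected(substituted))  # False
```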

From: on behalf of Craig Chamberlain
Sent: Mon 10/15/2007 3:37 PM
Subject: RE: Using Snort to find creditcard data?

This has been an area of interest for me for some time. It's very true
the regexp based detection technologies can produce high rates of false
positives and are easily evaded. It's not uncommon for data leaks to
take place over vpns; a case study like this was presented at blackhat
this year. Even without encryption, the number of possible obfuscation
techniques is quite large (and we're assuming the data is ASCII; there
are probably enough obscure back end applications with binary protocols
to keep a good sized protocol dissector development team frustrated indefinitely).

I've seen some good success combining specification based techniques -
like these regexps - with behavioral detection - such as using netflow
or other flow data, for example, to detect unexpected large or long
duration data streams headed for places that don't make sense (e.g.
foreign networks, foreign countries or external networks with which no
business relationship exists). It seems to often be the case that
systems containing high-value data have a predictable enough network
behavioral repertoire that this kind of behavioral detection performs
well.

This kind of behavioral detection, optionally corroborated with
available specification based detection such as regexp detects, can have
acceptably low false positive rates. Another advantage of flow data is
that it is hard to evade detection of the fact that you're moving a lot
of data; you can obfuscate and encrypt the traffic but you can't conceal
the fact that a quantity of traffic (and presumably data, if the payload
is not garbage) is being transmitted. Of course, if an obvious attack of
some sort precedes all of this - with a resulting detect or detects from
an IDS to corroborate - then confidence is again higher.
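
One simple way to picture the volume side of this is a per-host baseline of outbound bytes, with anything far above the baseline flagged for review. The sketch below is only an assumption about how such a check could look; the function name, the three-sigma cutoff and the sample numbers are invented, and real flow tools use richer models.

```python
from statistics import mean, pstdev


def is_volume_outlier(history: list[int], observed: int, sigmas: float = 3.0) -> bool:
    """Flag an observed byte count that sits far above a host's own history."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # No variation in the history: anything above the baseline stands out.
        return observed > baseline
    return observed > baseline + sigmas * spread


# Hypothetical daily outbound megabytes for one high-value server.
history = [100, 120, 110, 90, 105]
print(is_volume_outlier(history, 120))   # False
print(is_volume_outlier(history, 5000))  # True
```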


Craig Chamberlain
Principal Security Consultant

-----Original Message-----
From: []
On Behalf Of Ofer Shezaf
Sent: Tuesday, October 02, 2007 7:25 AM
Subject: RE: Using Snort to find creditcard data?

All the answers were good but also partial as the subject is far from

There are a few aspects to detecting credit card numbers on the network,
and I will try to address them:

1. Matching credit card numbers
2. Handling false positives
3. Evasion
4. Logging

Matching Credit Card Numbers
Valid card numbers:

1. Are 13-16 digits long. This is easy to detect using regular
expressions but may result in a lot of false positives. A lot of IDs are
in this range.

2. Conform to the Luhn checksum function. Being a checksum function it
matches 1 out of 10 numbers in the range. Since many times applications
that use numbers of this length use an entire range, there will still be
false positives. Luhn cannot be verified using regular expressions and
would require code.

3. Have certain prefixes which were assigned to issuers. A pretty good
table of assigned prefixes can be found in Wikipedia, but I'm not sure
it is comprehensive (
Prefixes further reduce false positives and can be implemented using a
(complex) regular expression. Using prefixes introduces a risk of false
negatives due to omission of less common prefixes. For example we have
not been aware until recently of Bankcard from Austria. This is
especially a problem internationally.
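
The three checks above can be sketched in a few lines of Python. The checksum in item 2 is commonly known as the Luhn algorithm. The function names are invented for this example, and KNOWN_PREFIXES is a deliberately tiny, incomplete subset (Visa, MasterCard 51-55, American Express) used only to show the mechanism, which also illustrates the false negative risk of a short prefix list.

```python
import re

# Item 1: runs of 13 to 16 digits.
CANDIDATE = re.compile(r"\b\d{13,16}\b")

# Item 3: illustrative and incomplete prefix list; a real list is much longer.
KNOWN_PREFIXES = ("4", "51", "52", "53", "54", "55", "34", "37")


def luhn_valid(number: str) -> bool:
    """Item 2: Luhn check. Double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_candidates(payload: str) -> list[str]:
    """Return digit runs that pass the length, Luhn and prefix checks."""
    return [
        match
        for match in CANDIDATE.findall(payload)
        if luhn_valid(match) and match.startswith(KNOWN_PREFIXES)
    ]


# 4111111111111111 is a widely used test number; the second value fails Luhn.
print(find_card_candidates("id=4111111111111111 ref=4111111111111112"))
# prints ['4111111111111111']
```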

False positives

The problem is that the above rules generate a lot of false positives.
Most false positives are related to normal application traffic using
long ASCII numbers. Such an application would usually use a range and
therefore hit some valid numbers.

Since the PCI requirement is for "Encrypt transmission of cardholder
data (only) across open, public networks", another source of false
positives are applications that transmit credit card numbers
intentionally and legally.

The solution for such false positives would be exceptions, which I'm not
sure Snort is the best solution for and would require an application
layer IDS. A network layer exception would be limited to addresses and
ports while a good exception would be by a specific property of the
transaction such as URL and parameter (for HTTP traffic). For web
traffic I would use for example something like ModSecurity. But I'm

Evasion

It is important to note that any such mechanism will detect only
erroneous use of credit card numbers. Even the simplest transformation
function on the numbers will enable them to bypass detection, so most
malicious usage would not be detected.

There is also an issue with leakage through encrypted channels, since
PCI requires encryption, leakage would many times be encrypted. IDS
limitations regarding encrypted traffic have been discussed extensively
here (
and elsewhere.

Logging

Assuming that we did everything right and built a system for detecting
credit card numbers on the network, we cannot keep the number as we
would violate PCI in the detection system. Solutions are:

(a) Encrypt all collected information

(b) Mask the credit card number

~ Ofer Shezaf

Re: Using Snort to find creditcard data? Oct 19 2007 06:59AM
Siim Põder (siim p6drad-teel net) (1 replies

-----Original Message-----
From: listbounce (at) securityfocus (dot) com On Behalf Of Siim Põder
Sent: Friday, October 19, 2007 2:59 AM
To: Craig Chamberlain
Cc: focus-ids (at) securityfocus (dot) com
Subject: Re: Using Snort to find creditcard data?


Craig Chamberlain wrote:
> Good point; what I'm suggesting is that while it's relatively easy to
> hide or obfuscate the data itself, it is hard to conceal the fact that
> data - or packets - are being transmitted, possibly using a
> recognizable application protocol, to an unexpected destination, which
> can be a useful last-ditch detection mechanism when the other methods
> fail - or can be a useful corroboration when correlated with the other
> detect data.

There is bound to be some sort of legitimate production traffic. For example, if there are https connections coming in to a specific machine and specific port, you can detect if that machine starts sending out data on its own or starts accepting connections on another port.
However, if the same port starts serving credit card numbers (obfuscated) or even hides the credit card numbers in tcp sequence numbers (or does something even more subtle, such as serving them by changing the case of "A" letters in http connections from certain addresses), the movement of data should be extremely hard to detect.


RE: Using Snort to find creditcard data? Oct 19 2007 03:45PM
Craig Chamberlain (craig chamberlain Q1Labs com)
What I'm describing in this bit is actually a behavioral detection technique rather than the specification or regexp based method (though the combination of the two is often preferable, where available), as the detection failure scenarios for data inspection are endless, as you point out. What I'm suggesting is that while methods and scenarios for obfuscating or concealing data beyond detection or inspection are supernumerary, there are a few elements that can usually be reliably detected using flow data, assuming the reporting devices themselves have not been compromised:

1. network transmission took place between two IP sockets
2. some number of bytes and packets were transmitted, which can be measured
3. the destination address was or was not part of the organization which owns the source
4. the destination address is or is not within expectations (e.g. is it a foreign country or organization with which no business relationship exists)
5. (possibly, if you have some packet content samples with the flow data) the application protocol appears to be a known application or network protocol
6. this apparent application usage is or is not within expectations - and is or is not typical
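
As a sketch of how items 2 through 4 could be checked against a single flow record, consider the following. Everything here is hypothetical: the Flow fields, the network ranges (private and documentation addresses) and the byte threshold would all come from your own environment and flow collector.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class Flow:
    """A simplified flow record; real netflow data carries more fields."""
    src: str
    dst: str
    dst_port: int
    bytes_sent: int


# Hypothetical site configuration.
INTERNAL_NETS = [ip_network("10.0.0.0/8")]
EXPECTED_PARTNER_NETS = [ip_network("198.51.100.0/24")]
BYTE_THRESHOLD = 50_000_000


def in_any(address: str, networks) -> bool:
    """True if the address falls inside any of the given networks."""
    ip = ip_address(address)
    return any(ip in net for net in networks)


def assess_flow(flow: Flow) -> list[str]:
    """Return the list of reasons a flow deserves a closer look."""
    findings = []
    if not in_any(flow.dst, INTERNAL_NETS):
        findings.append("destination outside organization")
        if not in_any(flow.dst, EXPECTED_PARTNER_NETS):
            findings.append("destination not an expected partner")
    if flow.bytes_sent > BYTE_THRESHOLD:
        findings.append("unusually large transfer")
    return findings


print(assess_flow(Flow("10.1.2.3", "203.0.113.9", 443, 120_000_000)))
```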

With this information, behavioral detection can sometimes find data leaks in the form of anomalous network behavior such as the appearance of a new application protocol e.g. a vpn, ssl or ssh connection - especially on a non standard port - with a remote destination which is unexpected (while the encrypted data itself is beyond inspection, the application protocol itself may be recognizable); or a direct SQL connection from a client desktop, where they normally connect through a middleware application, or something else.
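
A minimal sketch of the "new behavior" idea: learn which application and port combinations each high-value host normally uses, then flag anything not seen before. The class name, host names and protocol labels are invented for the example, and it assumes the application protocol has already been identified by some other component.

```python
from collections import defaultdict


class BehaviorBaseline:
    """Tracks the (application, port) pairs observed for each host."""

    def __init__(self) -> None:
        self._seen = defaultdict(set)

    def learn(self, host: str, application: str, port: int) -> None:
        """Record a behavior observed during the baseline period."""
        self._seen[host].add((application, port))

    def is_new(self, host: str, application: str, port: int) -> bool:
        """True if this host has never been seen using this application and port."""
        return (application, port) not in self._seen.get(host, set())


baseline = BehaviorBaseline()
baseline.learn("db01", "https", 443)

print(baseline.is_new("db01", "https", 443))  # False: normal behavior
print(baseline.is_new("db01", "ssh", 2222))   # True: ssh on a non standard port
```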

We use a technique we call anomaly detection in the QRadar tool to detect the appearance of new behaviors for selected high-value systems and networks. This is useful either as a corroboration to methods like regexp based inspection or as a potential method of finding information leaks that evade these detection methods. Of course, there are scenarios where new behaviors result from normal changes; none of these methods are perfect, but they are all useful. Combining them through correlation can sometimes improve accuracy of detection even further. In my experience, the best method of finding misuse patterns like data leaks is through sophisticated event correlation - either manual or programmatic - though it needs to be done programmatically in order to scale.
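
To illustrate what programmatic correlation might mean at its very simplest, the sketch below counts how many independent indicators agree for one host and maps that to a confidence label. This is not how QRadar or any particular product works; the Evidence fields, the thresholds and the labels are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Independent indicators gathered for one host in one time window."""
    content_match: bool    # a regexp based detect fired
    flow_anomaly: bool     # unexpected destination or volume in flow data
    new_behavior: bool     # a previously unseen application or port appeared
    prior_ids_alert: bool  # an attack detect preceded the activity


def confidence(evidence: Evidence) -> str:
    """More corroborating indicators means higher confidence in a leak."""
    count = sum([
        evidence.content_match,
        evidence.flow_anomaly,
        evidence.new_behavior,
        evidence.prior_ids_alert,
    ])
    if count >= 3:
        return "high"
    if count == 2:
        return "medium"
    if count == 1:
        return "low"
    return "none"


print(confidence(Evidence(False, True, True, True)))    # high
print(confidence(Evidence(True, False, False, False)))  # low
```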

Craig Chamberlain
Principal Security Consultant
craig (at) q1labs (dot) com
