Nine to One, Baby, One in Nine: Surveillance by the Numbers

Published on July 8, 2014

There’s a great deal of interesting material in this weekend’s big Washington Post story on collection of Internet communications under §702 of the FISA Amendments Act. But in a way, the single fact that has gotten the most attention (that 90% of §702-acquired communications in a trove provided by Edward Snowden were sent by someone other than the target) is also the least surprising. Or rather, it’s surprising only in that the percentage is not much higher. Doubtless we’ll hear as much from the intelligence community’s friends in Washington once they’ve roused themselves from their post-cookout torpor and read the piece. But the number is as low as it is largely because it represents only a tiny processed and minimized fraction of the NSA’s full intake. What’s notable, however, is that this means the Post’s back-of-the-envelope calculation of persons affected by the agency’s dragnet is also almost certainly too low by several orders of magnitude.

First, consider why you’d actually expect a much higher proportion of non-target communications. Suppose the NSA has tasked your e-mail address for collection, and we want to characterize the communications they intercept using the Post’s method, looking at the “FROM” line of every message and tallying those originating with targets versus non-targets. All the e-mails you send will be from one target account: yours. Every unique e-mail account from which you receive a message, unless it also happens to be tasked for collection, will count in the “non-target” column. Now, many ordinary e-mail users will receive messages from nine other people in the span of half an hour, never mind several years, even after stripping out sales pitches for herbal Viagra. So as applied to the raw intake on an e-mail account, you’d expect the Post’s methodology to yield not a ratio of nine-to-one, but of hundreds, if not thousands, to one.
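To make the tallying concrete, here is a minimal Python sketch of the counting method as described above. It is not the Post's actual procedure: the function name, the addresses, and the mailbox (one tasked account that sends 50 messages and hears from 200 distinct correspondents) are all invented for illustration.

```python
def tally_senders(messages, tasked):
    """Count unique sender accounts, split into tasked (target) and non-target.

    `messages` is a list of dicts with a "from" key; `tasked` is a set of
    lowercase addresses tasked for collection. Both are hypothetical inputs.
    """
    senders = {m["from"].lower() for m in messages}
    targets = senders & tasked        # senders that are themselves tasked
    non_targets = senders - tasked    # everyone else who wrote in
    return len(targets), len(non_targets)


# Hypothetical raw intake on a single tasked account.
tasked = {"target@example.org"}
messages = [{"from": "target@example.org"} for _ in range(50)]
messages += [{"from": f"contact{i}@example.net"} for i in range(200)]

n_target, n_non_target = tally_senders(messages, tasked)
print(f"{n_non_target}-to-{n_target}")  # prints "200-to-1"
```

However many messages the target sends, the target column stays at one account, while every new correspondent adds to the non-target column.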

The Post is not, however, looking at raw intake, but at a small sample of the minuscule fraction of the total intake that is ultimately processed and minimized by NSA analysts. In a white paper released in August of 2013, NSA claimed that only 0.025% of the total traffic they “touch” is ultimately flagged for human review. One assumes the ratio is higher for collection of communications content under §702, especially under PRISM, but we’re almost certainly still talking about a fairly small percentage of total collection that ends up being processed and disseminated. That means the nine-to-one ratio, which also excludes “minimized” (but mostly retained) U.S. person communications, is for communications that have already, in a sense, passed through a “relevance filter,” though as the Post makes clear, many are not in any intuitive sense actually relevant to intelligence purposes.

Other factors reducing the total: A surprisingly high proportion of these communications—about 75% of the total—are classified as “instant messages,” which may be explicable if it includes text-message substitutes like Apple’s iMessage or WhatsApp, and especially if every line in an exchange is effectively counted as a distinct “message,” by which tally even a relatively brief conversation could easily comprise dozens of “communications.” Most of us, after all, have e-mail correspondence with hundreds or thousands of acquaintances and strangers over the course of a year, but engage in IM conversations with a far smaller pool of close friends and co-workers. Finally, a “target package” included with the Post story notes that the single target in question, Muhammad Tahir Shazad, “maintains over 60 DNI selectors,” which translates roughly to “Internet accounts tasked for collection.” If genuine counterterrorism or counterintelligence targets maintain large numbers of disposable “burner” accounts (in an attempt to thwart surveillance) used almost exclusively for exchanging messages with a tiny pool of co-conspirators, themselves also likely targets, and if these are also (understandably) the most relevant to intelligence purposes and therefore the most likely to be reviewed, then the target/non-target ratio in the pool of messages reviewed and processed by analysts is going to be artificially much, much higher than the ratio in the unprocessed raw intake database.

Bracketing that for a moment, let’s look at the napkin-math the Post employs to extrapolate from their sample:

In a June 26 “transparency report,” the Office of the Director of National Intelligence disclosed that 89,138 people were targets of last year’s collection under FISA Section 702. At the 9-to-1 ratio of incidental collection in Snowden’s sample, the office’s figure would correspond to nearly 900,000 accounts, targeted or not, under surveillance.

But this is not quite right. As ODNI’s rather misleading transparency report does at least acknowledge, they are counting “targets,” which includes both natural persons and groups or corporations, while the Post is effectively counting accounts or selectors. As the “target package” quoted above makes clear, even an individual human target may be associated with dozens of separate accounts. For a corporate or other collective “target,” the number may easily run into the thousands. So even ignoring the filtering that processed and minimized communications have already gone through, and treating “nine” as the correct multiplier for present purposes, the correct equation here is not (89,138) x (9), but (89,138) x (average selectors tasked per target) x (9). Translating this to actual human beings is, of course, somewhat tricky, but it seems far less likely that innocent non-targets will be using dozens of burner accounts to conduct intimate but innocuous conversations.
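A rough Python sketch of that corrected equation shows how sensitive the result is to the one quantity nobody outside the agency knows. Only the 89,138 figure and the multiplier of nine come from the sources discussed here. The averages of 1, 5, 20, and 60 selectors per target are purely hypothetical, and 60 is borrowed from a single individual's target package, not from any reported average. Note too that the Post's "nearly 900,000" counts targets along with non-targets, which is why it is closer to ten times 89,138 than nine.

```python
TARGETS = 89_138            # targets reported by ODNI, as quoted by the Post
NON_TARGET_MULTIPLIER = 9   # non-targets per target account in the Post's sample


def affected_accounts(targets, selectors_per_target, multiplier):
    """Non-target accounts implied when the multiplier applies per selector."""
    return targets * selectors_per_target * multiplier


# Hypothetical averages of selectors tasked per target (assumptions only).
for avg_selectors in (1, 5, 20, 60):
    total = affected_accounts(TARGETS, avg_selectors, NON_TARGET_MULTIPLIER)
    print(f"{avg_selectors:>2} selectors per target: {total:,} non-target accounts")
```

With one selector per target the result is 802,242, essentially the Post's estimate minus the targets themselves. At an assumed average of 60 it is 48,134,520, and that is before accounting for the filtering the sample has already passed through.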

Even restricting ourselves to the pool of reviewed and minimized communications, then, the correct count for “accounts affected” by the NSA’s surveillance dragnet is almost certainly not a “mere” 900,000, but many millions. It’s dangerous to try to extrapolate too far in this way, because social networks overlap, and so linear multiplication is not necessarily a reliable way of assessing how many new accounts are affected by expanded surveillance: At some point monitoring additional nodes on the network yields, so to speak, diminishing marginal returns by this metric. Still, considering that the “non-target multiplier” for the raw intake database is almost certainly significantly higher than nine, we are plausibly talking about hundreds of millions of accounts affected globally. No doubt this yields some useful intelligence, as searches under general warrants invariably will. But it is certainly worth asking whether creating an architecture of collection on this almost inconceivable scale is a necessary, proportionate, or indeed, sane means of acquiring that intelligence.