---
A collection of traces of web requests and responses over an encrypted SSH tunnel. The collection spans traces of connections to 2000 sites, collected four times a day from February 2006 through April 2006. Each connection was encrypted; the traces include only the TCP headers, and not the payload. More details are in the paper and README.
---

The data in the following files:

md5sum                            filename
996f7aabd164dcb7e7e8911e6aafb1d3  pcap-logs-0.tar.bz2
c52cf294470a22dd35c046a24ba06cfc  pcap-logs-1.tar.bz2
43afebc21d5303bd8ceb8aa682178710  pcap-logs-2.tar.bz2
32e27f86827b59c1c0a487efdb249149  pcap-logs-3.tar.bz2
9bcfb8668ffcd9388509edcc785ef3b3  pcap-logs-incomplete.tar.bz2

were collected as described in:

@inproceedings{liberatore06identifying,
  title = {Inferring the Source of Encrypted HTTP Connections},
  author = {Marc Liberatore and Brian Neil Levine},
  booktitle = {Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS 2006)},
  year = {2006},
  www_pdf_url = {http://prisms.cs.umass.edu/brian/pubs/liberatore.ccs2006.pdf},
}

If you use this data set in your own published research, please refer to it by citing the above paper.

Each tarball contains files created by tcpdump and unpacks into a directory named "pcap-logs". Within this directory are subdirectories, one per collection run, each named with the start time of that run. Each subdirectory contains 2000 files, one per site, each named with the time the trace for that site occurred and a unique site identifier.

There are holes in the sequence, due to various failures during the collection: do not assume it is continuous from start to end. In particular, collections with errors are in a subdirectory named "incomplete" one level below "pcap-logs", the contents of which are in the correspondingly-named tarball.

The site identifier is consistent across all traces. Sites 11 and 987 were in the same subdomain (cs.umass.edu) as the collecting machine.
Sites 56, 59, 190, 338, 542, 569, 614, 878, 1154, 1407, 1488, 1496, 1522, 1696, 1740, and 1824 were in the same domain (umass.edu) but a different subdomain from the collecting machine.

The traces are of traffic sent over an SSH tunnel to a web proxy on a separate host directly connected via 100 Mbps Ethernet.

Marc Liberatore
liberato@cs.umass.edu
29 September 2006
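
The following is a minimal Python sketch, not part of the data set distribution, showing one way to check downloaded tarballs against the md5sums listed above and to look up whether a site identifier belongs to one of the site groups named above as being near the collecting machine. The function names (md5_of, verify_tarballs, site_locality) and the locality labels are hypothetical, chosen only for illustration; the checksums and site identifiers are copied from this README.

```python
import hashlib
from pathlib import Path

# Checksums copied from the md5sum table in this README.
EXPECTED_MD5 = {
    "pcap-logs-0.tar.bz2": "996f7aabd164dcb7e7e8911e6aafb1d3",
    "pcap-logs-1.tar.bz2": "c52cf294470a22dd35c046a24ba06cfc",
    "pcap-logs-2.tar.bz2": "43afebc21d5303bd8ceb8aa682178710",
    "pcap-logs-3.tar.bz2": "32e27f86827b59c1c0a487efdb249149",
    "pcap-logs-incomplete.tar.bz2": "9bcfb8668ffcd9388509edcc785ef3b3",
}

# Site identifiers this README lists as sharing a (sub)domain with the
# collecting machine.
SAME_SUBDOMAIN = frozenset({11, 987})
SAME_DOMAIN = frozenset({
    56, 59, 190, 338, 542, 569, 614, 878,
    1154, 1407, 1488, 1496, 1522, 1696, 1740, 1824,
})


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarballs(directory):
    """Map each expected tarball name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


def site_locality(site_id):
    """Classify a site identifier by its relation to the collecting machine.

    The labels are illustrative only (an assumption of this sketch).
    """
    if site_id in SAME_SUBDOMAIN:
        return "same-subdomain"
    if site_id in SAME_DOMAIN:
        return "same-domain"
    return "other"
```

For example, verify_tarballs(".") reports the status of each tarball in the current directory, and site_locality(987) returns "same-subdomain". Sites in either group may be worth treating separately in analyses that depend on network distance from the collecting machine.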