ICLab: A Global, Longitudinal Internet Censorship Measurement Platform

Abstract

Researchers have studied Internet censorship for nearly as long as attempts to censor contents have taken place. Most studies have however been limited to a short period of time and/or a few countries; the few exceptions have traded off detail for breadth of coverage. Collecting enough data for a comprehensive, global, longitudinal perspective remains challenging. In this work, we present ICLab, an Internet measurement platform specialized for censorship research. It achieves a new balance between breadth of coverage and detail of measurements, by using commercial VPNs as vantage points distributed around the world. ICLab has been operated continuously since late 2016. It can currently detect DNS manipulation and TCP packet injection, and overt “block pages” however they are delivered. ICLab records and archives raw observations in detail, making retrospective analysis with new techniques possible. At every stage of processing, ICLab seeks to minimize false positives and manual validation. Within 53,906,532 measurements of individual web pages, collected by ICLab in 2017 and 2018, we observe blocking of 3,602 unique URLs in 60 countries. Using this data, we compare how different blocking techniques are deployed in different regions and/or against different types of content. Our longitudinal monitoring pinpoints changes in censorship in India and Turkey concurrent with political shifts, and our clustering techniques discover 48 previously unknown block pages. ICLab’s broad and detailed measurements also expose other forms of network interference, such as surveillance and malware injection.

Publication
The 41st IEEE Symposium on Security and Privacy

ICLab has been running and measuring Internet censorship since late 2016. We are happy to share the analyzed data used in our recent paper, "ICLab: A Global, Longitudinal Internet Censorship Measurement Platform", accepted to the IEEE Symposium on Security and Privacy 2020.

Our data is in CSV format with the following columns:

1. filename: name of raw data file (for internal use)
2. server_t: timestamp of when the measurement was conducted (e.g., 2017-01-01T00:03:55.797Z)
3. country: ISO 3166-1 alpha-2 country code
4. as_number: Autonomous System Number
5. schedule_name: web test list (i.e., Alexa global top list, CitizenLab, or Berkman Center)
6. url
7. dns
8. dns_reason: true = manipulated, false = unmanipulated
9. dns_all
10. dns_reason_all
11. http_status
12. block: true = block page, false = normal
13. body_len
14. http_reason
15. packet_updated: true = injected, false = no injection
16. packet_reason
17. censored_updated: true = censored, false = uncensored
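
As a rough illustration of how the columns above might be consumed, the following Python sketch counts the unique URLs flagged as censored per country. It is not part of the ICLab tooling. The helper names (`is_true`, `parse_server_t`, `censored_urls_by_country`) are hypothetical, and it assumes that flag columns contain the text `true` or `false` and that the file may or may not carry a header row.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Column names in the order listed above. Whether the published files include
# a header row is an assumption; pass has_header=False if they do not.
COLUMNS = [
    "filename", "server_t", "country", "as_number", "schedule_name", "url",
    "dns", "dns_reason", "dns_all", "dns_reason_all", "http_status", "block",
    "body_len", "http_reason", "packet_updated", "packet_reason",
    "censored_updated",
]


def is_true(value):
    """Interpret a flag cell; assumes 'true'/'false' text in any letter case."""
    return (value or "").strip().lower() == "true"


def parse_server_t(value):
    """Parse a timestamp shaped like 2017-01-01T00:03:55.797Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def censored_urls_by_country(handle, has_header=True):
    """Count unique URLs with censored_updated = true, per country code."""
    fieldnames = None if has_header else COLUMNS
    reader = csv.DictReader(handle, fieldnames=fieldnames)
    seen = set()
    for row in reader:
        if is_true(row.get("censored_updated")):
            # A set de-duplicates repeated measurements of the same URL.
            seen.add((row["country"], row["url"]))
    return Counter(country for country, _ in seen)
```

To use it, open a data file in text mode (for example `open(path, newline="")`, where `path` is wherever the CSV was saved) and pass the handle to `censored_urls_by_country`. Counting unique (country, URL) pairs rather than rows keeps repeated measurements of the same page from inflating the totals.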