ICLab: A Global, Longitudinal Internet Censorship Measurement Platform

Abstract

Researchers have studied Internet censorship for nearly as long as attempts to censor content have taken place. Most studies, however, have been limited to a short period of time and/or a few countries; the few exceptions have traded off detail for breadth of coverage. Collecting enough data for a comprehensive, global, longitudinal perspective remains challenging. In this work, we present ICLab, an Internet measurement platform specialized for censorship research. It achieves a new balance between breadth of coverage and detail of measurements by using commercial VPNs as vantage points distributed around the world. ICLab has been operated continuously since late 2016. It can currently detect DNS manipulation and TCP packet injection, and overt “block pages” however they are delivered. ICLab records and archives raw observations in detail, making retrospective analysis with new techniques possible. At every stage of processing, ICLab seeks to minimize false positives and manual validation. Within 53,906,532 measurements of individual web pages, collected by ICLab in 2017 and 2018, we observe blocking of 3,602 unique URLs in 60 countries. Using this data, we compare how different blocking techniques are deployed in different regions and/or against different types of content. Our longitudinal monitoring pinpoints changes in censorship in India and Turkey concurrent with political shifts, and our clustering techniques discover 48 previously unknown block pages. ICLab’s broad and detailed measurements also expose other forms of network interference, such as surveillance and malware injection.

Publication
The 41st IEEE Symposium on Security and Privacy

ICLab has been running and measuring Internet censorship since late 2016. We are happy to share the analyzed data used in our recent paper, "ICLab: A Global, Longitudinal Internet Censorship Measurement Platform", accepted to the IEEE Symposium on Security and Privacy 2020.

Our data is hosted on several platforms for public access. Please contact us if you encounter any issues when downloading the data.

Our data is in CSV format with the following columns:

  1. filename: name of raw data file (for internal use)
  2. server_t: timestamp at which the measurement was conducted (e.g., 2017-01-01T00:03:55.797Z)
  3. country: ISO 3166-1 alpha-2 country code
  4. as_number: Autonomous System Number
  5. schedule_name: web test list (i.e., Alexa global top list, CitizenLab, or Berkman Center)
  6. url
  7. dns
  8. dns_reason: true = manipulated, false = unmanipulated
  9. dns_all
  10. dns_reason_all
  11. http_status
  12. block: true = block page, false = normal page
  13. body_len
  14. http_reason
  15. packet_updated: true = injected, false = no injection
  16. packet_reason
  17. censored_updated: true = censored, false = uncensored
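
As an illustration of how these columns might be consumed, the following is a minimal Python sketch that counts distinct censored URLs per country. It is not part of the ICLab tooling. The function names are hypothetical, and it assumes that the columns appear in the order listed above, that flags are spelled "true"/"false", and that a header row, if present, repeats the column names.

```python
import csv
from collections import Counter

# Column order as listed above (assumed to match the order in the file).
COLUMNS = [
    "filename", "server_t", "country", "as_number", "schedule_name",
    "url", "dns", "dns_reason", "dns_all", "dns_reason_all",
    "http_status", "block", "body_len", "http_reason",
    "packet_updated", "packet_reason", "censored_updated",
]


def is_true(value):
    # Assumption: flags are written as "true"/"false" in any letter case.
    # Missing fields (None) are treated as false.
    return (value or "").strip().lower() == "true"


def censored_urls_by_country(lines):
    """Count distinct censored URLs per country code.

    `lines` is any iterable of CSV text lines, e.g. an open file.
    """
    reader = csv.DictReader(lines, fieldnames=COLUMNS)
    seen = set()
    for row in reader:
        # Skip a header row if the file has one (assumption: it repeats
        # the column names).
        if row["filename"] == "filename":
            continue
        if is_true(row["censored_updated"]):
            seen.add((row["country"], row["url"]))
    return Counter(country for country, _ in seen)


# Usage (the file name is a placeholder, not the real download name):
#   with open("iclab_data.csv", newline="") as f:
#       print(censored_urls_by_country(f).most_common(10))
```

The same pattern applies to the other flag columns: filtering on dns_reason, block, or packet_updated instead of censored_updated would separate DNS manipulation, block pages, and packet injection.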