"Assessing the Privacy Benefits of Domain Name Encryption" presented at Internet Engineering Task Force 110 Meeting


Abstract: As Internet users have become more savvy about the potential for their Internet communication to be observed, the use of network traffic encryption technologies (e.g., HTTPS/TLS) is on the rise. However, even when encryption is enabled, users leak information about the domains they visit via DNS queries and via the Server Name Indication (SNI) extension of TLS. Two recent proposals to ameliorate this issue are DNS over HTTPS/TLS (DoH/DoT) and Encrypted SNI (ESNI). In this paper we aim to assess the privacy benefits of these proposals by considering the relationship between hostnames and IP addresses, the latter of which are still exposed. We perform DNS queries from nine vantage points around the globe to characterize this relationship. We quantify the privacy gain offered by ESNI for different hosting and CDN providers using two different metrics, the k-anonymity degree due to co-hosting and the dynamics of IP address changes. We find that 20% of the domains studied will not gain any privacy benefit since they have a one-to-one mapping between their hostname and IP address. On the other hand, 30% will gain a significant privacy benefit with a k value greater than 100, since these domains are co-hosted with more than 100 other domains. Domains whose visitors' privacy will meaningfully improve are far less popular, while for popular domains the benefit is not significant. Analyzing the dynamics of IP addresses of long-lived domains, we find that only 7.7% of them change their hosting IP addresses on a daily basis. We conclude by discussing potential approaches for website owners and hosting/CDN providers for maximizing the privacy benefits of ESNI.

IETF 110 - Measurements and Analysis for Protocols Research Group Q&A

I am very thankful to all attendees of IETF 110 - MAPRG for their constructive comments and questions. Due to the limited time available for the session, I could not take all questions from the attendees who were in the queue, nor could I reply to all questions asked via the public live chat channel. I therefore tried my best to note down as many comments and questions as possible before the session ended. Please find my responses below, and do not hesitate to reach out to me via email or Twitter for further discussion.

1. Eric Rescorla: I don’t see how redirecting to a malicious target works, because the TLS handshake will just fail. I mean, if you have a network-level vulnerability, perhaps?

–> A1: Thank you, Eric, for pointing this out. My apologies for the miscommunication: by “redirecting to a malicious host” on Slide 3, I meant redirection via DNS tampering. With TLS in place, such a redirection should not succeed, since the connection will simply break due to certificate errors.

2. Erik Nygren: Note that “ESNI” isn’t in TLS 1.3. It is a separate optional and in-progress draft (now called “ECH”). This “ESNI is in TLS 1.3” has caused a bunch of confusion in media reports claiming countries are blocking TLS 1.3 when they were just blocking the ESNI/ECH extension.

–> A2: Thank you, Erik, for reminding us that ESNI is a separate, optional extension for TLS 1.3, and that its reworked version is now called ECH. We conducted this study before ECH was introduced; I should have updated the slide to reflect the new work-in-progress ECH draft. And yes, I totally agree about the confusion in media reports. In particular, China is not blocking the TLS 1.3 standard as a whole, but only targeting TLS handshakes whose SNI extension is encrypted.

3. Andrew Campling: An interesting presentation. In thinking about domain name encryption, did you consider the possible privacy and security downsides of centralization too?

–> A3: Thank you, Andrew, for the question. I first understood “centralization” to mean “centralizing all DoTH queries into a single trusted resolver”, which I do believe is not good for privacy. We have proposed a resolution strategy that distributes queries across multiple resolvers so that no single resolver can learn a user’s entire browsing history.

As for “centralization” in the sense of web co-location, I also agree that the concentration of web hosting in a handful of major providers (e.g., Google, Cloudflare) is worrisome to some extent. In another study published in SIGCOMM CCR, we find that the Web has indeed become even more centralized, with the vast majority of websites hosted by major providers whose hosting infrastructures are well provisioned (i.e., high performance, resilient to DDoS, etc.). Our thoughts on Slide 22 are therefore based on this already-centralized nature of many websites on the Internet. In other words, if many websites are already hosted by a given provider, that provider should try to host them in a manner that maximizes the privacy benefit of domain name encryption technologies, by grouping more websites under the same IP and/or dynamically rotating domain-IP mappings.

Other attendees (Jari Arkko, George Michaelson) also pointed out that it really depends on one’s privacy model: who is your attacker? Whom do you trust? Which websites are considered (not) sensitive? Answering these questions would help users and site owners make better browsing and hosting decisions.

4. Dave Plonka: agreed it doesn’t change these results, but the notion of blocking ESNI vs TLS on the web is a key reminder, and it could be measured (in future work).

–> A4: Yes, we are conducting follow-up work to understand these blocking behaviors in the wild. I will definitely submit another talk to a future MAPRG meeting once the measurement results are finalized. Our preliminary results show that some censors are actively targeting domain name encryption technologies to hinder their adoption.

5. Jason Livingood: It seems the main concern is the network sees destination IP addresses but w/o that I am unclear how a network can forward packets to the correct destination. And I am unsure the recommendation in the PDF is the right one. You first have to accept that TLS & ECH do not bring “meaningful privacy benefits”. I think TLS & ECH actually do bring incredibly meaningful privacy benefits. So the recommendation then is to use hosting providers with a high number of sites in a given IP address. Which could have the side effect of increasing centralization and could even drive further privacy issues - just at a different layer of the stack

–> A5: Exactly! Once a full deployment of DoTH+ECH is in place, the destination IP address is the last piece of metadata visible to on-path observers, and it is hard (impossible?) to hide. One may suggest the use of Tor, I2P, VPNs, etc., but these tools address a much stronger, yet different, privacy threat model, which is orthogonal to the privacy risk that domain name encryption technologies aim to address.

We also strongly believe that domain name encryption is an important step toward enhancing web users' privacy. We have seen numerous cases in which ISPs and corporate network operators have massively tracked their users' browsing history, for various purposes including monetization, based on the domain name information exposed via DNS queries and the SNI field of TLS handshakes.

Regarding the concern about “centralization of web hosting” that we mention on Slide 22, please refer to the response A3 above.

6. Eric Rescorla & Hugo Salgado: I’m not sure why we think that just having 10 domains per IP does not provide material privacy? Like, if it’s “facebook.com” and “telegram.com” that’s fantastically useful.

–> A6: When conducting this work, we asked ourselves this same question. Since we only look at domain-IP mappings obtained via active DNS measurement in this study, all co-hosted domains per IP are considered equally significant. In practice, however, one domain can be more popular, with more visitors at different geographical locations, than the other domains co-hosted with it. Therefore, we hypothesized a threshold of 100 for the co-location degree to provide an ideal privacy benefit. Nonetheless, when it comes to fingerprinting an entire website based on all IP addresses of the servers contacted to load its different resources, we find (in a follow-up work) that even when a website is co-hosted with more than 100 other websites, we could still precisely fingerprint it based solely on the sequence of IPs contacted when fetching the site’s resources.

7. Lars Prehn: When you take a look at the median colocation value for IPs, would your takeaways change if you took a look at the min/max? Especially the latter would be interesting to me as it suggests a 1-to-N mapping for at least some part of the world.

–> A7: In our data, the median and mean values are similar and lead to the same observation. This is because when a domain is hosted on multiple IPs, these IPs are often shared by a similar number of domains. In other words, the difference between the min and max values is not statistically significant, so the takeaways would stay unchanged.

8. Alissa Cooper: how do you assess K for understanding K-anonymity in this case?

–> A8: We estimate the k-anonymity value of each domain by the number of domains co-hosted with it. If example.com is hosted on IP1 with 30 other domains, then k=30. If example.com is hosted on multiple IPs, we take the median of k across all of these IPs. Slides 11 and 12 provide the details of this assessment.
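As a rough illustration of the estimate described in A8, the following Python sketch computes k for each domain from a list of (domain, IP) pairs. The input format, the function name estimate_k, and the sample domains and addresses are assumptions made for this example and are not taken from the study's actual pipeline. Following the example.com case above, k for an IP is counted as the number of other domains sharing it.

```python
from collections import defaultdict
from statistics import median


def estimate_k(mappings):
    """Estimate k per domain from (domain, ip) pairs.

    `mappings` is a hypothetical input format: one pair per
    observed DNS resolution of a domain to an IP address.
    """
    domains_on_ip = defaultdict(set)
    ips_of_domain = defaultdict(set)
    for domain, ip in mappings:
        domains_on_ip[ip].add(domain)
        ips_of_domain[domain].add(ip)

    # For each IP, k is the number of *other* domains hosted on it;
    # a domain on several IPs gets the median of its per-IP values.
    return {
        domain: median(len(domains_on_ip[ip]) - 1 for ip in ips)
        for domain, ips in ips_of_domain.items()
    }


# Made-up sample data using documentation-reserved names and addresses.
sample = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("c.example", "192.0.2.1"),
    ("a.example", "192.0.2.2"),
    ("solo.example", "192.0.2.3"),
]
k_values = estimate_k(sample)
```

In this sample, solo.example has a one-to-one mapping and therefore k=0, while a.example sits on one shared and one dedicated IP, so its median k is 1.0.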

9. Jari Arkko: To clarify Ekr’s comment about our document wrt distributing queries, we found that there are strategies that are bad (e.g. round robin) but also that there are strategies that are potentially good (e.g., pick a different resolver for different clients, or different base names). None of this is necessarily straightforward, but there are potential benefits to be had, but one does have to avoid the most obvious and surprisingly bad algorithm :-)

10. Eric Rescorla: Different base names doesn’t work because it’s very common to have sites use multiple ETLD. If you have something which you can demonstrate will work, I’m of course interested in hearing about it

–> A9&10: We investigated different distribution strategies in our MADWeb ‘20 paper, and indeed found that round robin is not ideal for privacy. It may even do more harm than good, as eventually all resolvers will learn everything about a user’s browsing history. The concern about ETLD that Eric raised is valid too. Perhaps we can try splitting queries based on the type/category of domain names. As we continue conducting follow-up work on this topic, we will definitely reach out for your feedback.
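To make the round-robin concern concrete, here is a small Python sketch that simulates which domains each resolver ends up seeing under two strategies: plain round robin, and a per-name assignment based on a hash of the domain. The resolver names, function names, and the hash-based assignment are assumptions chosen for illustration; this is not necessarily the strategy evaluated in the paper, and it does not address the multiple-ETLD issue raised in question 10.

```python
import hashlib
from itertools import cycle

# Hypothetical resolver labels, used only for this simulation.
RESOLVERS = ["resolver-a", "resolver-b", "resolver-c"]


def round_robin_exposure(queries, resolvers):
    """Send each successive query to the next resolver in turn."""
    seen = {r: set() for r in resolvers}
    for domain, resolver in zip(queries, cycle(resolvers)):
        seen[resolver].add(domain)
    return seen


def hashed_exposure(queries, resolvers):
    """Always send a given domain to the same resolver, chosen by hash."""
    seen = {r: set() for r in resolvers}
    for domain in queries:
        digest = hashlib.sha256(domain.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(resolvers)
        seen[resolvers[index]].add(domain)
    return seen


# A user revisiting the same four (made-up) sites three times.
visits = ["a.example", "b.example", "c.example", "d.example"] * 3
rr_view = round_robin_exposure(visits, RESOLVERS)
hashed_view = hashed_exposure(visits, RESOLVERS)
```

With repeated visits, every resolver in rr_view ends up having seen all four domains, whereas in hashed_view each domain is only ever exposed to a single resolver.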

11. Erik Nygren: I’m curious if there is any difference in IP sharing between IPv4 and IPv6. (There is somewhat less of a need to rely on SNI with IPv6, so there are at least some sites with shared IPs with IPv4 but unique IPs for IPv6.)

–> A11: I totally agree on this point. Once IPv6 is fully adopted, virtual hosting and the use of SNI would become less relevant, since web co-location to save IP space may no longer be necessary. In our dataset, fewer than 15% of domains currently support IPv6, so we did not include IPv6 analyses in our study, to keep the presentation of our results simple. This aligns with an earlier talk in the MAPRG session by Paul Hoffman, who also finds that only 14% of domains support IPv6 in his dataset, curated by crawling Wikipedia. Once IPv6 adoption becomes more prevalent, it will definitely be useful to repeat our study from the perspective of IPv6.
