Google’s Privacy Sandbox Dilemma

Published on February 21, 2022

The Chrome Privacy Sandbox initiative raises many questions for the open web and is a real challenge for Google. Here is an overview of the matter and why I think Google is facing a dilemma.

This is the first part of an article about our view at Weborama on Google’s Privacy Sandbox initiative. The second part will be published in November.

For the sake of keeping this article readable, I won’t go into summarizing the whole story of cookies (first or third party), the (crucial) question of Data Privacy Protection, the importance for the open-web industry to keep a viable way of making money out of content, nor will I go into discussing the pros and cons of using cookies for marketing purposes (their risks regarding privacy, their advantages regarding privacy, and their benefits for the open-web industry in general).

No, I won’t do all that, and I’ll assume you already know the following:

The challenge of the Privacy Sandbox is quite big. In a nutshell, Google wants to:

  • Block any cross-domain identification (third-party cookies and all possible workarounds, which I’ll come back to later)
  • Allow the industry to continue its business without hurting advertising campaign performance too much (user identification is key today for targeting, attribution and reporting)

Given that all of the industry’s use cases rely today on third-party cookies, and hence on cross-domain identifiers, it’s not difficult to see why this is a really challenging project (and it’s not surprising that Google had to postpone it for a full year).

Edge Computing is Privacy-friendly by design

At its core, the Sandbox is built around a simple idea (I’m paraphrasing Google here): cross-domain identification is the root of all evil regarding privacy. Actors should not be able to cross-reference the data they collect about web users, but should rather operate on their own properties, or access strictly anonymized and aggregated data.

At the same time, Google wants to preserve marketing and advertising use-cases in a world where any Chrome browser is indistinguishable from another (hence, with appropriate alternatives to cookies).

All functionalities of the Privacy Sandbox are based on the same design: any user-centric data has to be stored locally (within the browser itself) in a secure and non-exposed location, scoped by domain. Chrome would then do any computation needed locally and would send to the relevant actors (advertisers, tech platforms) aggregated (and possibly delayed) reports.

This way, no cross-domain identification is needed and advertising money can still flow (according to Google).
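To picture that design, here is a minimal Python sketch of a browser that keeps events in a store scoped by domain and only hands out counts. The `BrowserStore` class, its method names and the example domains are my own illustrative assumptions, not part of any Chrome API, and the real proposals also aggregate across many browsers, which this toy model ignores.

```python
from collections import Counter, defaultdict


class BrowserStore:
    """Toy model of user data kept inside one browser (hypothetical, not a Chrome API)."""

    def __init__(self) -> None:
        # Raw events are partitioned by the domain that recorded them.
        self._events = defaultdict(list)

    def record(self, domain: str, event: str) -> None:
        """Store an event locally, visible only within its own domain scope."""
        self._events[domain].append(event)

    def aggregated_report(self, domain: str) -> dict:
        """Return event counts for one domain; raw events and identifiers stay local."""
        return dict(Counter(self._events.get(domain, [])))


store = BrowserStore()
store.record("shop.example", "view:running-shoes")
store.record("shop.example", "view:running-shoes")
store.record("news.example", "view:article")

shop_report = store.aggregated_report("shop.example")
print(shop_report)
```

The point of the sketch is the boundary: `shop.example` gets counts about its own events and nothing about what happened on `news.example`.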

To make it clearer, let me take some quick examples: the Attribution Reporting API is a good one. Instead of using a cookie to propagate the user’s identifier across domains (and let relevant actors of the supply chain match an ad impression with a click and a conversion, for instance), Chrome would keep track of the events itself (it saw an ad, clicked on it and then converted), and would afterwards report this information to the relevant players (e.g. ad ‘ab23’ was clicked on campaign ‘xyz78’ and triggered a conversion flagged ‘012’).
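As a rough illustration of that flow, the sketch below models a browser that remembers ad clicks locally and builds the report itself when a conversion happens. Everything here is an assumption made for the example: the `AdClick` and `LocalAttribution` names, the last-click rule and the report shape are mine, and the actual Attribution Reporting API works differently in its details (delays, noise, registration headers).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdClick:
    ad_id: str
    campaign_id: str
    advertiser: str  # domain where a conversion may later happen


class LocalAttribution:
    """Hypothetical browser-side attribution log; not the real API surface."""

    def __init__(self) -> None:
        self._clicks = []

    def on_click(self, click: AdClick) -> None:
        # The click is remembered by the browser, not by an ad server cookie.
        self._clicks.append(click)

    def on_conversion(self, advertiser: str, flag: str) -> Optional[dict]:
        # Assumption: last-click attribution, searching the most recent clicks first.
        for click in reversed(self._clicks):
            if click.advertiser == advertiser:
                # The report carries ad, campaign and conversion flag, but no user ID.
                return {"ad": click.ad_id, "campaign": click.campaign_id, "conversion": flag}
        return None


browser = LocalAttribution()
browser.on_click(AdClick(ad_id="ab23", campaign_id="xyz78", advertiser="shop.example"))
report = browser.on_conversion("shop.example", flag="012")
print(report)
```

The report reproduces the example from the text (ad ‘ab23’, campaign ‘xyz78’, conversion ‘012’) without any field that could identify the browser.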

With FLoC or FLEDGE, the idea is the same: you don’t keep track on your servers of the navigation history of all the users you see on your network; you let Chrome remember what kind of products it has seen (running shoes, mobile phones, 4K TVs…) and which behavioral groups it belongs to (FLoC).

In other words, with the Privacy Sandbox, we go from a centralized paradigm, based on the third-party cookie pivot, to a decentralized paradigm, based on locally stored and computed data, or, even more simply: from server-side computing to edge computing.

As a matter of fact, according to all the Privacy Sandbox functionalities specified so far, Chrome instances should keep data to themselves, compute locally whatever metrics are needed, get assigned to (k-anonymous) groups if needed, and expose those finalized metrics only when needed, to whom needed, without ever leaking any data that would allow cross-domain identification to happen.
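The k-anonymity condition boils down to a size threshold: a group is only usable if enough browsers share it. Here is a minimal sketch of that check, where the function name, the group labels, the sizes and the value of k are all made up for illustration and do not come from any Sandbox specification.

```python
def exposable_groups(group_sizes: dict, k: int) -> set:
    """Keep only the groups that contain at least k browsers."""
    return {group for group, size in group_sizes.items() if size >= k}


# Hypothetical group sizes; the threshold k=1000 is an arbitrary illustration.
sizes = {"running-shoes": 5400, "4k-tv": 1200, "rare-hobby": 3}
safe = exposable_groups(sizes, k=1000)
print(sorted(safe))
```

A group of three browsers is dropped because belonging to it would almost be an identifier in itself.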

Cross-domain identification is what the Sandbox wants to prevent

The whole point of the Sandbox is to prevent cross-domain identification of browsers.

Let me say this again: if it’s possible, by any means, to identify a Chrome instance on the network and to share that identifier across domains, then the Sandbox has failed and the ecosystem will end up even worse (privacy-wise and business-wise) than it was before (with third-party cookies).

Google identified this risk from the very beginning of the Privacy Sandbox project. See part 2.3, Mitigating workarounds, of the Privacy Sandbox homepage on Chromium’s blog:

As we’re removing the ability to do cross-site tracking with cookies, we need to ensure that developers take the well-lit path of the new functionality rather than attempt to track users through some other means.

Indeed, there would be nothing worse than a false sense of security provided by a weak Sandbox that would still let some actors do cross-domain identification without cookies. Users would feel that their private data (navigation history and behavior) is safe and not shared among opaque actors, while it still would be.

Such identification (without relying on cookies) is possible and is known as “browser fingerprinting”.

Browser Fingerprinting

In web development, a (digital) fingerprint is an identifier built from bits of data that are considered specific to the browser and yet stable enough over time. For instance, your operating system, the version of your browser and your IP address are very valuable data for building a fingerprint of your device: they are not likely to change anytime soon and the combination of the three should be unique enough to differentiate you from your neighbor.
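To make this concrete, here is a deliberately naive sketch that hashes those three attributes into an identifier. The `fingerprint` function, the choice of SHA-256 and the sample values are my own assumptions for the example; real fingerprinting scripts use many more signals.

```python
import hashlib


def fingerprint(os_name: str, browser_version: str, ip_address: str) -> str:
    """Derive a short identifier from attributes the server can observe."""
    raw = "|".join([os_name, browser_version, ip_address])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


# Documentation-range IP addresses (RFC 5737), used here as placeholders.
me = fingerprint("macOS 12.2", "Chrome 98", "203.0.113.7")
me_again = fingerprint("macOS 12.2", "Chrome 98", "203.0.113.7")
neighbor = fingerprint("macOS 12.2", "Chrome 98", "203.0.113.42")

print(me == me_again, me == neighbor)
```

No cookie is involved: any server that observes the same attributes computes the same value, which is exactly what makes a fingerprint usable across domains.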

A good fingerprint should have good unicity (the fingerprint is unique to the device across the population) and good stability (it does not change frequently). The challenge is that increasing the former tends to decrease the latter.
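Both properties can be measured on a sample of devices. The sketch below uses simple definitions of my own (share of devices with a fingerprint nobody else has, and share of devices whose fingerprint is unchanged between two observations) on a made-up population of four devices.

```python
from collections import Counter


def unicity(fingerprints: list) -> float:
    """Share of devices whose fingerprint is held by no other device."""
    counts = Counter(fingerprints)
    return sum(1 for fp in fingerprints if counts[fp] == 1) / len(fingerprints)


def stability(before: list, after: list) -> float:
    """Share of devices whose fingerprint is identical in both observations."""
    return sum(1 for a, b in zip(before, after) if a == b) / len(before)


# Four made-up devices observed on two days as (OS, IP); the last one changes IP.
day1 = [("mac", "198.51.100.1"), ("mac", "198.51.100.2"),
        ("win", "198.51.100.3"), ("android", "198.51.100.4")]
day2 = [("mac", "198.51.100.1"), ("mac", "198.51.100.2"),
        ("win", "198.51.100.3"), ("android", "198.51.100.9")]

os_only_1 = [os_name for os_name, _ in day1]
os_only_2 = [os_name for os_name, _ in day2]

print(unicity(os_only_1), stability(os_only_1, os_only_2))  # OS alone: 0.5 1.0
print(unicity(day1), stability(day1, day2))                  # OS + IP: 1.0 0.75
```

Adding the IP address makes every device distinguishable in this sample, but one fingerprint no longer survives from one day to the next.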

Don’t get me wrong, for fingerprint trackers, the IP address is not a silver bullet: it may be unstable depending on your ISP and your device (for instance, a mobile phone connected to the Internet via a 4G network will see its public IP change regularly). In such cases, the IP might be ignored by the fingerprinting algorithm, because it would cause the fingerprint to change too frequently, making it unstable.

Conversely, the IP address might also “hide” several devices very well. Think about your office, where hundreds of employees share the building’s IP address. In such a case, the IP is stable (it should not change anytime soon, as it’s very likely to be a static IP address assigned to your office by your company’s ISP) but it’s not unique (it carries requests made by different browsers). Combining it with other bits of data specific to the browser would, in this case, help make the fingerprint more unique.

And there are lots of other bits of data that could be used to build a fingerprint (this is called the fingerprinting surface of the browser), and as you can guess, Google has planned to limit or block all of them within the Sandbox (even though it’s impossible to entirely remove the fingerprinting surface without breaking the web).

I won’t go into details here (if you want to, refer to the Privacy Budget and the User-Agent Client Hints), but what is important to keep in mind is that, in a majority of cases, the IP address is one of the most powerful pieces of information within the fingerprinting surface (most of the time, it provides a valuable source of entropy): it distinguishes devices from one another very well and is (most of the time) stable over time.
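“Entropy” here can be read in the usual Shannon sense: the more evenly an attribute splits the population, the more bits it contributes to a fingerprint. A minimal sketch on a made-up sample of four devices (the function name and the values are assumptions for the example):

```python
import math
from collections import Counter


def entropy_bits(values: list) -> float:
    """Shannon entropy (in bits) of one attribute observed across a population."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


# Made-up sample of four devices.
os_values = ["mac", "mac", "win", "win"]
ip_values = ["198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4"]

print(entropy_bits(os_values))  # 1.0
print(entropy_bits(ip_values))  # 2.0
```

In this sample the operating system only splits the population in two (1 bit), while the IP address singles out each device (2 bits, the maximum for four devices).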

For the Sandbox, disabling third-party cookies means (also) disabling IP addresses

So, blocking fingerprints only works if IP addresses are hidden from web servers. Otherwise, fingerprints would still work in many cases, and a workaround to third-party cookies would survive the Privacy Sandbox.

Remember the goal of the Sandbox? At its core, it’s all about blocking any kind of cross-domain identification.

And fingerprints are a cross-domain identification mechanism that makes extensive use of IP addresses.

Hence, IP address (in)visibility is key for the Sandbox to work as expected, but we’ll see in the next article that it’s also one of the most challenging problems Google has to solve.
