We recently announced our investment in a marketing analytics platform called Jumpshot: https://blog.avast.com/2015/05/29/avast-data-drives-new-analytics-engine/
It’s a very exciting investment for us for two reasons. First, they create really interesting and unique market insights. Go to jumpshot.com if you want to see some of it; there’s a free trial that anyone can sign up for. Second, they do it using a proprietary algorithm that strips all the personally identifiable information (PII) out of the data they use. It’s the only data stripping tool we’ve seen that does that successfully.
Here’s how Jumpshot works:
Data is collected on computers and Android devices through the browser. Each record contains a set of fields that help Jumpshot algorithms assign the clickstream data appropriately. These fields include:
- Installation identifiers (proprietary identifiers that do not contain any PII)
- URL being visited
- Referral URL (if this exists)
- Window identifier
- Tab identifier
- Additional fields for processing purposes
In reality, the information Avast passes on to Jumpshot looks like this:
- Identifier: 00002437-705b-4bc6-b062-54b7ea511c93
- URL being visited: http://www.cnn.com/US/?hpt=sitenav
- Referral URL: http://edition.cnn.com/
- Window identifier: 3
- Tab identifier: 42
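To make the shape of such a record concrete, here is a minimal Python sketch. The class name `ClickRecord` and its field names are hypothetical and chosen only for illustration; the actual record format is not published here, and the "additional fields for processing purposes" are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClickRecord:
    """Illustrative model of one clickstream record (field names are assumptions)."""
    installation_id: str       # proprietary identifier, contains no PII
    url: str                   # URL being visited
    referrer: Optional[str]    # referral URL, None if there is none
    window_id: int             # browser window identifier
    tab_id: int                # browser tab identifier


# The sample values listed above, expressed as a record.
record = ClickRecord(
    installation_id="00002437-705b-4bc6-b062-54b7ea511c93",
    url="http://www.cnn.com/US/?hpt=sitenav",
    referrer="http://edition.cnn.com/",
    window_id=3,
    tab_id=42,
)
```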
Prior to processing, all records are automatically scanned for PII, and all PII parameter values are removed from the raw data. To strip PII, Jumpshot uses a proprietary algorithm that calculates multiple statistical features for parameters on all known websites. Based on these statistical values, only parameters that are proven not to be PII are whitelisted and their values are kept. All parameter values that are not whitelisted are stripped, which means they are overwritten with the word “REMOVED”. The stripping of PII is done on Avast premises in Prague, to ensure that the PII never leaves our hands.
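The whitelisting step can be sketched roughly as follows. This is not the proprietary algorithm: the statistical analysis that decides which parameters are safe is replaced by a hard-coded, hypothetical `WHITELIST` of (domain, parameter) pairs, and `strip_pii` is an illustrative name. Only the final step is shown, where every value that is not whitelisted is overwritten with REMOVED.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist of (domain, parameter) pairs considered safe.
# In the real system this set is derived statistically, not written by hand.
WHITELIST = {("www.cnn.com", "hpt")}


def strip_pii(url, whitelist=WHITELIST):
    """Overwrite every query parameter value that is not whitelisted."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [
        (name, value if (parts.netloc, name) in whitelist else "REMOVED")
        for name, value in pairs
    ]
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


# A whitelisted parameter keeps its value:
#   strip_pii("http://www.cnn.com/US/?hpt=sitenav")
#   -> "http://www.cnn.com/US/?hpt=sitenav"
# An unknown parameter on a made-up shop domain is overwritten:
#   strip_pii("http://shop.example/order?email=a%40b.com")
#   -> "http://shop.example/order?email=REMOVED"
```

The default here is to remove: a parameter survives only if it has been explicitly marked as safe for that domain, which matches the whitelist approach described above.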
Let’s have a look at an example. With a shopping site like Amazon, the URL before stripping contains some PII:
The algorithm automatically replaces the PII with the word REMOVED in order to protect our users’ privacy, like this:
The stripping process doesn’t end here, though. Next is aggregation. Data processing is performed once a day in a cascade of data transforming and aggregating map-reduce jobs. Aggregations are typically applied on a per-domain (website) and per-URL (web page) basis. To further protect our users’ privacy, we only accept websites where we can observe at least 20 users. This ensures that no reverse engineering is possible on the aggregated data: there’s nothing that can lead back to a specific user. All aggregated data is then stored in an RDBMS (currently PostgreSQL) on a per-domain and keyword basis.
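As a rough illustration of the per-domain aggregation and the 20-user threshold, here is a small in-memory Python sketch. The real pipeline runs as map-reduce jobs; the function name `aggregate_by_domain`, the input shape (pairs of installation identifier and URL) and the output fields are all assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MIN_USERS = 20  # threshold stated above: at least 20 observed users per website


def aggregate_by_domain(records, min_users=MIN_USERS):
    """Count distinct users and visits per domain, dropping small domains.

    `records` is an iterable of (installation_id, url) pairs.
    """
    users = defaultdict(set)    # domain -> set of installation identifiers
    visits = defaultdict(int)   # domain -> number of page visits
    for installation_id, url in records:
        domain = urlsplit(url).netloc
        users[domain].add(installation_id)
        visits[domain] += 1
    # Only domains seen by enough distinct users make it into the output,
    # and the output holds counts only, never the identifiers themselves.
    return {
        domain: {"users": len(ids), "visits": visits[domain]}
        for domain, ids in users.items()
        if len(ids) >= min_users
    }
```

For example, a domain visited by 19 distinct installations would be dropped entirely, while one visited by 20 would appear with its user and visit counts.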
These aggregated results are the only thing that Avast makes available to Jumpshot customers and end users.
Users can remove themselves from the system in two ways: by unchecking the “Statistics” box in the Avast browser add-on settings (see attached picture), or by sending an email to customer support requesting to have their information deleted. If a user wants to be deleted, the system automatically blacklists their user ID from all data transforming activities.
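The blacklisting step could look something like the following sketch, which filters opted-out identifiers before any transformation runs. The function name `drop_opted_out` and the record shape (pairs of installation identifier and URL) are hypothetical; how the real system stores and applies its blacklist is not described here.

```python
def drop_opted_out(records, blacklist):
    """Return only the records whose installation identifier is not blacklisted.

    `records` is an iterable of (installation_id, url) pairs and
    `blacklist` is a set of installation identifiers that opted out.
    """
    return [record for record in records if record[0] not in blacklist]


# Example with made-up identifiers: "id-b" has asked to be deleted.
records = [
    ("id-a", "http://www.cnn.com/US/?hpt=sitenav"),
    ("id-b", "http://edition.cnn.com/"),
]
kept = drop_opted_out(records, blacklist={"id-b"})
```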
By focusing on protecting our users, we also ensure that the data Jumpshot customers get is accurate: the larger the data pool, the more statistically valid the data customers get to work with. So Jumpshot has a vested interest in protecting our users' privacy.
If you have any questions, please don't hesitate to ask.