Google Analytics data obfuscation and duplication – News Couple


You may not have known this, but there is a really useful Google Analytics demo account that you can use to see how Google Analytics works in a real business context (with data from the Google Merchandise Store). However, you can access the account with nothing more than read-only permissions. This is annoying if you want to customize the setup.

Don’t worry, I have a solution for you! By harnessing the awesome power of customTask, you can create a duplicate of the data collected on any website where you can modify the tracking (e.g. via Google Tag Manager). Even better, the data will be obfuscated using a dictionary of English words (you can edit this list), with each string in the payload hashed predictably against that dictionary.

As always, you can find this solution in my customTask Builder.

[Image: customTask Builder]

Huge thanks to Jaakko Ojalehto, my brilliant developer colleague at 8-bit-sheep. He came up with the string replacement algorithm.



How to set it up

You will need to fetch the latest version of the code from the customTask Builder tool. See also the instructions on how to deploy a customTask.

In Google Tag Manager, the Custom JavaScript variable will end up looking something like this:

function () {
  // customTask Builder by Simo Ahava
  //
  // More information about customTask: https://www.simoahava.com/analytics/customtask-the-guide/
  //
  // Change the default values for the settings below.

  // obfuscate: Obfuscates the entire hit payload (using a dictionary of words consistently) and dispatches it to the trackingId you provide.
  // https://bit.ly/2RectUl
  var obfuscate = {
    tid: 'UA-12345-1',
    dict: ['tumble', 'noble', 'flourish', 'abandon', 'liberal', 'team', 'conflict', 'collar', 'tiger', 'stun', 'grace', 'resource', 'phantom', 'imagine', 'information', 'hall', 'sweet', 'agriculture', 'bingo', 'relative'],
    stringParams: ['uid','ua','dr','cn','cs','cm','ck','cc','ci','gclid','dclid','dl','dh','dp','dt','cd','cg[1-5]','linkid','an','aid','av','aiid','ec','ea','el','ti','ta','in','ic','iv','pr\\d{1,3}id','pr\\d{1,3}nm','pr\\d{1,3}br','pr\\d{1,3}ca','pr\\d{1,3}va','pr\\d{1,3}cc','pr\\d{1,3}cd\\d{1,3}','tcc','pal','col','il\\d{1,3}nm','il\\d{1,3}pi\\d{1,3}id','il\\d{1,3}pi\\d{1,3}nm','il\\d{1,3}pi\\d{1,3}br','il\\d{1,3}pi\\d{1,3}ca','il\\d{1,3}pi\\d{1,3}va','il\\d{1,3}pi\\d{1,3}cd\\d{1,3}','promo\\d{1,3}id','promo\\d{1,3}nm','promo\\d{1,3}cr','promo\\d{1,3}ps','sn','sa','st','utc','utv','utl','exd','cd\\d{1,3}','xid','exp','_utmz'],
    priceParams: ['tr','ts','tt','ip','pr\\d{1,3}pr','id\\d{1,3}pi\\d{1,3}pr'],
    priceModifier: Math.random(),
    medium: ['organic', 'referral', 'social', 'cpc'],
    replaceString: function () { /* ... */ },
    init: function () { var c = []; obfuscate.dict.forEach(function () { /* ... */ }); /* ... */ }
  };

  // DO NOT EDIT ANYTHING BELOW THIS LINE
  if (typeof obfuscate === 'object' && typeof obfuscate.init === 'function') obfuscate.init();

  var readFromStorage = function (key) {
    if (!window.Storage) {
      // From: https://stackoverflow.com/a/15724300/2367037
      var value = '; ' + document.cookie;
      var parts = value.split('; ' + key + '=');
      if (parts.length === 2) {
        return parts.pop().split(';').shift();
      }
    } else {
      return window.localStorage.getItem(key);
    }
  };

  var writeToStorage = function (key, value, expireDays) {
    if (!window.Storage) {
      var expiresDate = new Date();
      expiresDate.setDate(expiresDate.getDate() + expireDays);
      document.cookie = key + '=' + value + ';expires=' + expiresDate.toUTCString();
    } else {
      window.localStorage.setItem(key, value);
    }
  };

  var globalSendHitTaskName   = '_ga_originalSendHitTask';

  return function (customTaskModel) {

    window[globalSendHitTaskName] = window[globalSendHitTaskName] || customTaskModel.get('sendHitTask');

    customTaskModel.set('sendHitTask', function (sendHitTaskModel) {

      var originalSendHitTaskModel = sendHitTaskModel,
          originalSendHitTask      = window[globalSendHitTaskName],
          canSendHit               = true;

      try {

        if (canSendHit) {
          originalSendHitTask(sendHitTaskModel);
        }

        // obfuscate
        if (typeof obfuscate === 'object' && obfuscate.hasOwnProperty('tid') && obfuscate.hasOwnProperty('dict') && obfuscate.hasOwnProperty('stringParams') && obfuscate.hasOwnProperty('priceParams') && obfuscate.hasOwnProperty('replaceString') && obfuscate.hasOwnProperty('priceModifier')) {
          var _o_hitPayload = sendHitTaskModel.get('hitPayload');
          obfuscate.stringParams.forEach(function(strParam) {
            var regexParam = new RegExp('[?&]' + strParam + '=[^&]+', 'g');
            var paramsInHitpayload = _o_hitPayload.match(regexParam) || [];
            paramsInHitpayload.forEach(function(keyValue) { /* ... */ });
          });
          obfuscate.priceParams.forEach(function(prParam) { /* ... */ });
          _o_hitPayload = _o_hitPayload
            .replace(
              '&tid=' + sendHitTaskModel.get('trackingId') + '&', 
              '&tid=' + obfuscate.tid + '&'
            )
            .replace(/[?&]aip($|&|=[^&]*)/, '')
            .replace(/[?&]c[sm]=[^&]*/g, '')
            .replace(/[?&]uip=[^&]*/g, '');
          if (Math.random() <= 0.10) {
            _o_hitPayload += 
              '&cs=' + obfuscate.dict[Math.floor(Math.random()*obfuscate.dict.length)] + 
              '&cm=' + obfuscate.medium[Math.floor(Math.random()*obfuscate.medium.length)];
          }
          _o_hitPayload += '&uip=' + 
            (Math.floor(Math.random() * 255) + 1) + '.' +
            (Math.floor(Math.random() * 255) + 0) + '.' +
            (Math.floor(Math.random() * 255) + 0) + '.' +
            (Math.floor(Math.random() * 255) + 0);
          _o_hitPayload += '&aip=1';
          sendHitTaskModel.set('hitPayload', _o_hitPayload, true);
          originalSendHitTask(sendHitTaskModel);
        }
        // /obfuscate

      } catch(err) {
        originalSendHitTask(originalSendHitTaskModel);
      }

    });

  };
}

That’s a lot of code, because it turns out that consistently obfuscating the data and taking care of all the other potential pitfalls of Google Analytics data duplication isn’t exactly trivial.

Anyway, to set the thing up, you’ll need to edit the configuration object within the var obfuscate = {...} block. Here are the configuration keys and how to use them. Note! All the keys are required for the solution to work. If you remove any of the keys, the obfuscation will be skipped.

Key | Default value | Description
tid | UA-12345-1 | The tracking ID you want to send the data to. Only one tracking ID is supported at this time.
dict | ['tumble', 'noble'...] | Dictionary of words to be used. Don’t add too many (20 should be enough). When the function is initialized, it automatically generates compound words from each item in the dictionary.
stringParams | ['uid','ua'...] | All the Measurement Protocol parameters that will be treated as strings and replaced with words from the dictionary. Parameter names are regular expression patterns.
priceParams | ['tr','ts'...] | All the Measurement Protocol parameters that will be treated as prices and modified with the priceModifier value (see below). Parameter names are regular expression patterns.
priceModifier | Math.random() | The modifier used to adjust all the prices in the payload. The default value (Math.random()) basically means that prices will be multiplied by a random factor between 0.00 and 1.00.
medium | ['organic', 'referral'...] | List of campaign mediums that will be randomly assigned to 10% of hits (to get some source/medium variance).
replaceString | function | Internal function, do not modify.
init | function | Internal function, do not modify.

You will want to edit tid at the very least. The other settings have fully functional default values, so you don’t need to touch them unless you want to. For example, you may want to rewrite dict to include words that actually relate to some real industry.

To get the most out of your data, you’ll need to add this customTask to all the tags that send hits to Google Analytics from your website. This way you will get the most comprehensive and realistic set of data.

How it works

The obfuscation itself is fairly involved.

First, when the tag is run for the first time, the obfuscator is initialized. This initialization basically takes the dictionary of words and generates a compound of each word with every word in the dictionary (itself included). So the final length of the dictionary is n + n^2, where n is the initial length of the dictionary. For example, if this is your initial dictionary:

['baby', 'rock', 'sweet']

The final dictionary will be:

['baby', 'rock', 'sweet', 'baby-baby', 'baby-rock', 'baby-sweet', 'rock-baby', 'rock-rock', 'rock-sweet', 'sweet-baby', 'sweet-rock', 'sweet-sweet']
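
The expansion described above can be sketched in a few lines. This is only an illustration: expandDictionary is a hypothetical helper name, and the actual init function in the customTask Builder is minified and may be written differently.

```javascript
// Hypothetical helper (not the builder's actual init function).
// Returns the original words followed by every ordered pair joined with '-'.
function expandDictionary(dict) {
  var expanded = dict.slice(); // start with a copy of the original words
  dict.forEach(function (first) {
    dict.forEach(function (second) {
      expanded.push(first + '-' + second);
    });
  });
  return expanded;
}

var finalDict = expandDictionary(['baby', 'rock', 'sweet']);
// finalDict.length is 3 + 3 * 3 = 12
```

With the default 20-word dictionary the same logic yields 20 + 400 = 420 entries.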

Obfuscation itself is a multi-step process.

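  1. First, all the string parameters from the configuration are iterated over. If a match is found in the payload, the value of the string parameter is first converted to Base64. Then a simple algorithm is used to convert this encoded string into a number, which is then reduced to an index into the dictionary.
obfuscate.stringParams.forEach(function(strParam) {
  var regexParam = new RegExp('[?&]' + strParam + '=[^&]+', 'g');
  var paramsInHitpayload = _o_hitPayload.match(regexParam) || [];
  paramsInHitpayload.forEach(function(keyValue) { /* ... */ });
});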

This means that every single string will have a fixed counterpart in the dictionary. Some strings will naturally map to the same dictionary word, but that’s okay because we’re not aiming for perfect replication here, and this also makes it harder to reverse-engineer the translated strings back to their original values.

If the string contains the / symbol, then each part separated by a slash is translated individually. This way URLs keep their structure. Along the same lines, if the string contains http: or https:, then the protocol will not be translated, because GA requires valid URLs in certain parameters.

Finally, if the string consists of several words (separated by whitespace), each word is translated individually.
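
To make the replacement rules more concrete, here is a rough sketch. It is not the builder's actual replaceString function, which is minified and not reproduced in full here. The names toBase64 and wordFor, the character-code sum used as the "simple algorithm", and the use of encodeURIComponent before btoa are all assumptions made for illustration. It also assumes the value has already been URL-decoded.

```javascript
// Hypothetical sketch; the real replaceString in the customTask Builder may differ.
var dict = ['tumble', 'noble', 'flourish', 'abandon', 'liberal'];

// Assumption: encode to ASCII first so btoa never throws on non-Latin1 input.
function toBase64(str) {
  return btoa(encodeURIComponent(str));
}

// Map a single token to a dictionary word, always the same word for the same token.
function wordFor(token) {
  if (token === '') return token; // keep empty segments so slashes survive
  var encoded = toBase64(token);
  var sum = 0;
  for (var i = 0; i < encoded.length; i++) {
    sum += encoded.charCodeAt(i); // assumed "simple algorithm": sum of char codes
  }
  return dict[sum % dict.length];
}

function replaceString(value) {
  // Leave the protocol untouched so the result is still a valid URL.
  var match = value.match(/^(https?:\/\/)(.*)$/);
  var protocol = match ? match[1] : '';
  var rest = match ? match[2] : value;
  return protocol + rest.split('/').map(function (segment) {
    // Translate each whitespace-separated word on its own.
    return segment.split(/\s+/).map(wordFor).join(' ');
  }).join('/');
}

var obfuscated = replaceString('https://www.example.com/some page/');
```

The result keeps the protocol, the number of path segments, and the number of words per segment, and calling replaceString again with the same input returns the same output.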

  2. Next, the price parameters are matched against the hit payload in a similar way. If a match is found, the price is multiplied by the priceModifier from the configuration. Every price sent by this tracker is adjusted by the same ratio.
obfuscate.priceParams.forEach(function(prParam) {
  var regexParam = new RegExp('[?&]' + prParam + '=[^&]+', 'g');
  var paramsInHitpayload = _o_hitPayload.match(regexParam) || [];
  paramsInHitpayload.forEach(function(keyValue) { /* ... */ });
});
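
As an illustration of the price step, the following sketch applies a single modifier to every matching price parameter. The function name adjustPrices, the two-decimal rounding, and the sample payload are assumptions. The builder's own code may format the numbers differently.

```javascript
// Hypothetical sketch of the price adjustment, not the builder's actual code.
function adjustPrices(hitPayload, priceParams, priceModifier) {
  priceParams.forEach(function (prParam) {
    // Capture the key (with its leading ? or &) and the value separately.
    var regexParam = new RegExp('([?&]' + prParam + '=)([^&]+)', 'g');
    hitPayload = hitPayload.replace(regexParam, function (whole, key, value) {
      var price = parseFloat(value);
      if (isNaN(price)) return whole; // leave non-numeric values untouched
      return key + (price * priceModifier).toFixed(2); // rounding is an assumption
    });
  });
  return hitPayload;
}

var adjusted = adjustPrices(
  'v=1&tr=10.00&pr1pr=8.00&t=transaction',
  ['tr', 'pr\\d{1,3}pr'],
  0.8
);
// adjusted === 'v=1&tr=8.00&pr1pr=6.40&t=transaction'
```

Because the same modifier is applied everywhere, the ratio between the prices in a hit stays the same after adjustment.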
  3. Then, the tracking ID in the payload is replaced with the one you provide in the configuration object. At the same time, the parameters aip, cs, cm, and uip (Anonymize IP, Campaign Source, Campaign Medium, and IP Override, respectively) are removed from the payload.
_o_hitPayload = _o_hitPayload
  .replace(
    '&tid=' + sendHitTaskModel.get('trackingId') + '&', 
    '&tid=' + obfuscate.tid + '&'
  )
  .replace(/[?&]aip($|&|=[^&]*)/, '')
  .replace(/[?&]c[sm]=[^&]*/g, '')
  .replace(/[?&]uip=[^&]*/g, '');
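
To see what these replacements do to a hit, here is the same chain wrapped in a small function and run against a made-up payload. The wrapper name retargetPayload, the tracking ID UA-99999-9, and the sample payload are hypothetical and used only for illustration.

```javascript
// Hypothetical wrapper around the replacement chain shown above.
function retargetPayload(hitPayload, originalTid, newTid) {
  return hitPayload
    .replace('&tid=' + originalTid + '&', '&tid=' + newTid + '&')
    .replace(/[?&]aip($|&|=[^&]*)/, '')
    .replace(/[?&]c[sm]=[^&]*/g, '')
    .replace(/[?&]uip=[^&]*/g, '');
}

var before = 'v=1&tid=UA-99999-9&cs=google&cm=cpc&uip=1.2.3.4&t=pageview';
var after = retargetPayload(before, 'UA-99999-9', 'UA-12345-1');
// after === 'v=1&tid=UA-12345-1&t=pageview'
```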
  4. Finally, 10% of all hits are assigned a random campaign source (from the dictionary), along with a random medium from the medium list provided in the configuration.

[Image: example data]

Also, a random IP address is generated for the hit. Yes, for every single hit.

Next, the IP address is anonymized using the Anonymize IP parameter.

if (Math.random() <= 0.10) {
  _o_hitPayload += 
    '&cs=' + obfuscate.dict[Math.floor(Math.random()*obfuscate.dict.length)] + 
    '&cm=' + obfuscate.medium[Math.floor(Math.random()*obfuscate.medium.length)];
}

_o_hitPayload += '&uip=' + 
  (Math.floor(Math.random() * 255) + 1) + '.' +
  (Math.floor(Math.random() * 255) + 0) + '.' +
  (Math.floor(Math.random() * 255) + 0) + '.' +
  (Math.floor(Math.random() * 255) + 0);
_o_hitPayload += '&aip=1';

Modifying the IP addresses like this leads to some interesting data in the Service Provider report:

[Image: Service Provider report]

The last thing that happens is that the hit is sent to the tracking ID you provided.

sendHitTaskModel.set('hitPayload', _o_hitPayload, true);
originalSendHitTask(sendHitTaskModel);

Caveats

It’s not perfect data duplication. Here are some of the things the script struggles with:

  • All campaign information is removed from the original hit. So source/medium attribution will not follow the original processing logic. To counter this, I generate a random source/medium that is assigned to 10% of all hits.

  • Prices are adjusted by the same ratio, not to the same value. Thus, if you have transaction revenue of 10.00 and product revenue of 8.00 with a modifier of 0.8, the end result will be transaction revenue of 8.00 and product revenue of 6.40. This means that someone could deduce the original prices if they assume, for example, that the transaction revenue is the sum of all product revenues multiplied by their quantities (as is often the case).

  • Integer values are not modified. So custom metrics, event values, quantities, etc. are not obfuscated. I did this because I don’t think integers encode information that can be used to determine the original source of the data. Prices are modified because with a given set of prices the user can guess the origin of the data, but not so much with integers. I’d be happy to modify this in the future if enough people think it’s necessary.

Final thoughts

Whether or not this solution is useful to you, I can guarantee that it was a lot of fun to write! It would have been easy to simply obfuscate the data: just replace each string with a random GUID or something like that. But figuring out the dictionary replacement was much more difficult.

The algorithm I chose (with the help of Jaakko Ojalehto) for the replacement is not perfect. The distribution is not uniform. But I think that’s fine. You’ll only end up with 420 words by default anyway, so there’s going to be a lot of overlap, since even a simple site will produce far more than 420 unique strings in the data.

[Image: obfuscated page titles]

Even if you don’t find this dataset useful, I guarantee you’ll enjoy looking at the combinations of strings produced by the replacement algorithm. In fact, I had to modify the dictionary I had at first, because it resulted in compounds like beat-child sweet-laughter, which I think might raise some eyebrows when viewing the data in a training session.

Let me know in the comments if this solution needs improvement!


