1 00:00:00,000 --> 00:00:11,880 Thank you once again for coming for the last talk of the day, at least for you. 2 00:00:11,880 --> 00:00:17,840 With me is Thomas Persson. He's a business developer at Digitalist Sweden, he has worked with tracking 3 00:00:17,840 --> 00:00:25,380 and digital analytics since 2010, and has been a contributor to open source since 2007. 4 00:00:25,380 --> 00:00:28,920 What happens under the hood when you visit a website, what data is actually shared 5 00:00:28,920 --> 00:00:31,720 about you, and who can actually access this data? 6 00:00:31,720 --> 00:00:34,720 In this talk, Thomas will answer these questions. 7 00:00:34,720 --> 00:00:35,720 Please begin. 8 00:00:35,720 --> 00:00:36,720 All right. 9 00:00:36,720 --> 00:00:37,720 Thank you. 10 00:00:37,720 --> 00:00:42,520 Yeah, I'm getting a bit tired now. 11 00:00:42,520 --> 00:00:50,680 Last session, for me at least, for this day. I'm really looking forward to having a long 12 00:00:50,680 --> 00:00:56,600 weekend after this, but let's push it over the finish line. 13 00:00:56,600 --> 00:01:04,760 So in this presentation, I will talk on a somewhat higher level about privacy on 14 00:01:04,760 --> 00:01:11,040 the web, try to explain what happens under the hood and what type of data is shared when we're visiting 15 00:01:11,040 --> 00:01:20,320 websites, also talk about personally identifiable information in relation to web analytics, and 16 00:01:20,320 --> 00:01:27,000 then finally give you some examples of how to manage this in Matomo. 17 00:01:27,000 --> 00:01:29,240 So let's get started. 18 00:01:29,240 --> 00:01:37,320 Let's see if I can just find that correct window. 19 00:01:37,320 --> 00:01:41,400 So yeah, I've been introduced already, so I will skip that. 20 00:01:41,400 --> 00:01:48,720 You probably know Digitalist by this time, so I will skip that too so we can move forward.
21 00:01:48,720 --> 00:01:55,520 But here's the kind of issues we have right now with privacy, and the reason why I think 22 00:01:55,520 --> 00:02:01,560 a lot of organizations are moving away from tools like Google Analytics, but basically 23 00:02:01,560 --> 00:02:12,320 any cloud provider that is not based in the EU. I'm trying to do the simple explanation 24 00:02:12,320 --> 00:02:13,440 here at least. 25 00:02:13,440 --> 00:02:15,160 So this is me. 26 00:02:15,160 --> 00:02:16,920 I'm a citizen of Sweden. 27 00:02:16,920 --> 00:02:19,080 I live in Stockholm. 28 00:02:19,080 --> 00:02:28,400 So I can go to stockholm.se, where I can find a lot of information for me as a citizen. 29 00:02:28,400 --> 00:02:35,440 I live in Sweden, and we have a law in Sweden called OSL. 30 00:02:35,440 --> 00:02:41,400 It's a Swedish law about publicity and secrecy. 31 00:02:41,400 --> 00:02:46,040 It's been in place for quite some time and that law applies to me as a Swedish citizen. 32 00:02:46,040 --> 00:02:50,480 And it applies to the organization running stockholm.se. 33 00:02:50,480 --> 00:02:56,200 We also have GDPR, which is a European law, and that applies to both of us as well, since 34 00:02:56,200 --> 00:03:04,840 the EU is the level above us, so to speak. 35 00:03:04,840 --> 00:03:10,800 To provide this service, stockholm.se probably has a lot of service providers. 36 00:03:10,800 --> 00:03:15,840 They probably have a company doing their hosting, their development and so on. 37 00:03:15,840 --> 00:03:17,880 Maybe design work or whatever. 38 00:03:17,880 --> 00:03:24,520 So a lot of actors are involved in this work, and these laws apply to all of them. 39 00:03:24,520 --> 00:03:30,680 So when stockholm.se writes an agreement with these companies, these laws apply. 40 00:03:30,680 --> 00:03:34,200 OSL and GDPR apply to these contracts.
41 00:03:34,200 --> 00:03:38,960 The problem, though, is when you start to write contracts with companies that are 42 00:03:38,960 --> 00:03:45,800 not part of the European Union, for instance if you use Office 365 or if you sign up for 43 00:03:45,800 --> 00:03:54,040 Google Analytics on your site. The problem here is that Microsoft or Google or other 44 00:03:54,040 --> 00:04:01,760 companies, or basically any cloud provider, live under other laws. 45 00:04:01,760 --> 00:04:09,680 And in this particular example, we have something called FISA 702, and that means that even 46 00:04:09,680 --> 00:04:15,920 though we write a contract with Microsoft, these laws will override that, and Microsoft 47 00:04:15,920 --> 00:04:19,400 can't do anything about that, even though they would like to. 48 00:04:19,400 --> 00:04:26,960 So this is kind of the main problem with using cloud services, and what FISA 702 actually 49 00:04:26,960 --> 00:04:39,640 says here is that if any American governmental institution basically requests data held 50 00:04:39,640 --> 00:04:47,760 in a Microsoft facility, Microsoft has to give it to that organization, and they can't stop 51 00:04:47,760 --> 00:04:51,640 that in any way, at least not right now. 52 00:04:51,640 --> 00:04:57,400 So this means that any data that is stored in a facility owned by Microsoft or Google 53 00:04:57,400 --> 00:05:06,120 or whatever, and it doesn't matter if their physical servers are in Europe, you have to 54 00:05:06,120 --> 00:05:13,180 think about it as if it's readable by the American public organizations, and we'll look 55 00:05:13,180 --> 00:05:17,880 a little bit more into the problems with this later on. 56 00:05:17,880 --> 00:05:22,360 So we will also look into the actual details. 57 00:05:22,360 --> 00:05:27,320 So Google Analytics is obviously what we discuss when we talk about Matomo.
58 00:05:27,320 --> 00:05:32,880 So yes, when we use Google Analytics, we need to think about the fact that we are leaking data 59 00:05:32,880 --> 00:05:41,120 to foreign powers like the NSA and so on. 60 00:05:41,120 --> 00:05:46,320 We also have to think about the fact that we're actually leaking personal data without having consent, 61 00:05:46,320 --> 00:05:50,080 and this is an issue with GDPR in Europe. 62 00:05:50,080 --> 00:05:55,840 We're also selling data about our visitors, and that's more related to Google specifically. 63 00:05:55,840 --> 00:06:03,800 Historically, this became quite big with the whistleblower Edward Snowden. 64 00:06:03,800 --> 00:06:10,040 It didn't start then, but it got really, really clear to us: 65 00:06:10,040 --> 00:06:19,960 when he shared the NSA documents about what was happening, people started to realize how 66 00:06:19,960 --> 00:06:22,200 big this really was. 67 00:06:22,200 --> 00:06:29,040 So we talked about the law called FISA. It's actually called the Foreign Intelligence Surveillance 68 00:06:29,040 --> 00:06:30,040 Act. 69 00:06:30,040 --> 00:06:38,320 It was a law created in the USA to actually protect American citizens, because there 70 00:06:38,320 --> 00:06:44,760 had been a lot of surveillance going on, phone surveillance mainly and other things, without 71 00:06:44,760 --> 00:06:47,360 any regulations in the United States. 72 00:06:47,360 --> 00:06:50,960 It was actually created to protect American citizens. 73 00:06:50,960 --> 00:07:00,640 9/11 was the event that really changed how the American state 74 00:07:00,640 --> 00:07:02,200 started to act. 75 00:07:02,200 --> 00:07:08,320 People at the NSA have described in many, many places that there is a time before and 76 00:07:08,320 --> 00:07:13,560 after 9/11, and we'll see a little bit more about the statistics.
77 00:07:13,560 --> 00:07:22,240 They basically changed the laws to make it even easier to surveil society, and in 78 00:07:22,240 --> 00:07:26,680 particular people that are not citizens of the United States. 79 00:07:26,680 --> 00:07:30,160 So Europe is a target market for this. 80 00:07:30,160 --> 00:07:33,560 There are a lot of systems in place at the NSA. 81 00:07:33,560 --> 00:07:39,960 Some examples are systems called Upstream, PRISM, and Treasure Map. 82 00:07:39,960 --> 00:07:48,680 These are basically systems that are snooping on internet traffic, phones, etc. 83 00:07:48,680 --> 00:07:53,920 And they have direct access to data centers at Google, Microsoft, Amazon, etc. 84 00:07:53,920 --> 00:07:58,400 So it's really large scale. 85 00:07:58,400 --> 00:08:06,400 We also have to know that internet traffic is passing across United States borders, 86 00:08:06,400 --> 00:08:11,760 and this means that it often gets caught in this system called Upstream. 87 00:08:11,760 --> 00:08:17,720 We'll look into more details about how easily this actually happens and what type of data 88 00:08:17,720 --> 00:08:20,320 they can actually tap into. 89 00:08:20,320 --> 00:08:25,680 So is this really a problem in a country like Sweden? 90 00:08:25,680 --> 00:08:33,360 Is this really something that government agencies use to approach Swedish citizens? 91 00:08:33,360 --> 00:08:36,000 So Sweden is a rather small country in Europe. 92 00:08:36,000 --> 00:08:39,680 We have about 10 million citizens. 93 00:08:39,680 --> 00:08:46,000 But the interesting part here is that Google, Amazon, Microsoft, and a lot of other companies, 94 00:08:46,000 --> 00:08:51,000 I think it was about 50 companies in the beginning after the Snowden incident, have started to 95 00:08:51,000 --> 00:09:01,880 report how often they hand over data to public agencies in the States. 96 00:09:01,880 --> 00:09:04,240 So this is the data from Google.
97 00:09:04,240 --> 00:09:08,820 And as you can see, the growth here is pretty massive. 98 00:09:08,820 --> 00:09:21,920 So they handed over information about 235,000 accounts in the first six months of 2020. 99 00:09:21,920 --> 00:09:27,600 The data is lagging a bit behind, but the numbers are actually growing. 100 00:09:27,600 --> 00:09:33,960 In the same period in Sweden, we had 370 cases from Google where they handed out information, 101 00:09:33,960 --> 00:09:40,200 and you can think about this as, okay, maybe this is information that the Swedish police 102 00:09:40,200 --> 00:09:43,160 is asking the NSA to give them. 103 00:09:43,160 --> 00:09:46,400 But that data is actually not included here. 104 00:09:46,400 --> 00:09:54,140 This is only data that American organizations asked for. 105 00:09:54,140 --> 00:09:59,480 So we don't know what this is and it's totally out of our control. 106 00:09:59,480 --> 00:10:07,760 I did some research and checked Google, Microsoft, and Apple, and as you can see, there's quite 107 00:10:07,760 --> 00:10:10,120 a lot of information handed out. 108 00:10:10,120 --> 00:10:16,600 And if you summarize this, data is actually requested about Swedish 109 00:10:16,600 --> 00:10:19,320 citizens from these four companies eight times a day. 110 00:10:19,320 --> 00:10:24,720 So there's no doubt that this is a huge, huge problem. 111 00:10:24,720 --> 00:10:31,400 And because of this, we need to change the way we work, and I think this is extra critical if 112 00:10:31,400 --> 00:10:40,520 you're a public organization handling citizen data or whatever. It goes for private companies 113 00:10:40,520 --> 00:10:43,320 as well, but it's super important. 114 00:10:43,320 --> 00:10:52,520 So what I usually say is that the public sector needs to secure our infrastructure. 115 00:10:52,520 --> 00:10:59,280 We need to use open technologies such as Matomo, but that's just one part of the problem.
116 00:10:59,280 --> 00:11:04,000 We need to secure the traffic between the residents and the authorities so it never 117 00:11:04,000 --> 00:11:05,000 leaves Europe. 118 00:11:05,000 --> 00:11:10,720 This is another problem that we saw: if the internet traffic just passes through the United 119 00:11:10,720 --> 00:11:25,240 States, it's much more probable that it gets caught in this surveillance network. 120 00:11:25,240 --> 00:11:29,880 So let's go into details of the web. 121 00:11:29,880 --> 00:11:34,560 So what happens when I visit five websites under the hood? 122 00:11:34,560 --> 00:11:40,640 So I did a little animation with a plugin called Lightbeam for Firefox. 123 00:11:40,640 --> 00:11:46,400 So what we're seeing here is what happens; these are requests that happen under the 124 00:11:46,400 --> 00:11:54,240 hood, and these are actually public-sector-related websites in Sweden, and you can see in the 125 00:11:54,240 --> 00:12:01,480 spider diagram on the left that all of them under the hood request information from Google 126 00:12:01,480 --> 00:12:02,480 in this case. 127 00:12:02,480 --> 00:12:04,880 They're using Google Analytics basically. 128 00:12:04,880 --> 00:12:12,720 So that means that Google can profile all these visits, not only on these particular 129 00:12:12,720 --> 00:12:23,160 websites; they can create profiles on an individual level across pretty sensitive sites. 130 00:12:23,160 --> 00:12:27,880 So yes, they use Google Analytics, and we need to remember that this is something that 131 00:12:27,880 --> 00:12:29,440 Google sells, of course. 132 00:12:29,440 --> 00:12:37,040 83% of the revenue two years ago, I think, was coming from ad sales basically. 133 00:12:37,040 --> 00:12:41,700 So it's a huge problem. 134 00:12:41,700 --> 00:12:47,640 So it's not just about privacy and that they're leaking data; we're actually selling 135 00:12:47,640 --> 00:12:50,920 our visitors' behavior to Google.
136 00:12:50,920 --> 00:12:54,800 And Google is just one part of the problem. 137 00:12:54,800 --> 00:12:58,800 You need to be aware of everything you do on your website. 138 00:12:58,800 --> 00:13:04,520 So this is another Swedish site; it's actually the government website in Sweden, and I've 139 00:13:04,520 --> 00:13:08,240 been picking on them for a few years now. 140 00:13:08,240 --> 00:13:11,240 The good thing is that they have actually changed this now. 141 00:13:11,240 --> 00:13:16,520 So if you go to this site, they actually changed how it works, and I think I affected them 142 00:13:16,520 --> 00:13:17,920 a bit. 143 00:13:17,920 --> 00:13:20,160 But the example is still good. 144 00:13:20,160 --> 00:13:26,800 So if you go to a page called contact on the Swedish government site and you look under 145 00:13:26,800 --> 00:13:33,240 the hood, you do a request in your browser, it's sent to their server, and back we 146 00:13:33,240 --> 00:13:34,240 get HTML. 147 00:13:34,240 --> 00:13:35,240 You all know this. 148 00:13:35,240 --> 00:13:44,040 And in this HTML, we're actually requesting 33 other resources from six different domains. 149 00:13:44,040 --> 00:13:50,160 And for each of these requests, the way the internet works is that we're actually sending my IP address 150 00:13:50,160 --> 00:13:52,680 from my browser to every server. 151 00:13:52,680 --> 00:14:00,560 And if we dig down in this and look into a specific one of these requests, it's a little 152 00:14:00,560 --> 00:14:09,240 JavaScript file called find.js, and it's coming from this domain, dl.episerver.net. 153 00:14:09,240 --> 00:14:15,280 And what's interesting here is that with this request, we're actually also sending something 154 00:14:15,280 --> 00:14:17,000 called the header. 155 00:14:17,000 --> 00:14:21,880 The header contains another piece of information that is called the referrer. 156 00:14:21,880 --> 00:14:27,720 We also have information such as the language in my browser.
157 00:14:27,720 --> 00:14:33,360 We can see what browser I have in the user agent, et cetera, et cetera. 158 00:14:33,360 --> 00:14:39,880 So basically the information shared with this server, dl.episerver.net, is the same kind 159 00:14:39,880 --> 00:14:46,120 of information that you use when you track data in Matomo, meaning all the sites that 160 00:14:46,120 --> 00:14:51,760 we request resources from just to present our website are basically able to do the same 161 00:14:51,760 --> 00:14:56,780 profiling that we can do with Matomo or any other tool on our website. 162 00:14:56,780 --> 00:15:03,120 So you really need to control what type of resources you include on your websites, and 163 00:15:03,120 --> 00:15:08,120 they have to be within the European Union if you're a European company. 164 00:15:08,120 --> 00:15:09,440 All right. 165 00:15:09,440 --> 00:15:16,200 So there are some web standards where you can actually try to block this. 166 00:15:16,200 --> 00:15:23,960 There's something called the referrer policy, and if you set this to no-referrer, you actually 167 00:15:23,960 --> 00:15:27,480 can block sending the referrer information. 168 00:15:27,480 --> 00:15:34,080 The problem, though, is that this standard is not respected by Microsoft, so all Microsoft 169 00:15:34,080 --> 00:15:40,720 browsers will send that anyway, and it's also not respected by iPhones. 170 00:15:40,720 --> 00:15:45,640 So Safari and iPhones will send that even though you try to block it, meaning quite 171 00:15:45,640 --> 00:15:55,080 a large percentage of the visitors you will have on your website, since these are quite common 172 00:15:55,080 --> 00:15:57,760 software or hardware combinations. 173 00:15:57,760 --> 00:15:58,760 All right. 174 00:15:58,760 --> 00:16:02,880 So to summarize, we do 26 calls.
175 00:16:02,880 --> 00:16:09,960 Eight of these calls go to four different domains in the United States, and in all these 176 00:16:09,960 --> 00:16:15,480 we share the IP address and the URL, meaning that all these domains can basically start 177 00:16:15,480 --> 00:16:18,480 profiling our visitors. 178 00:16:18,480 --> 00:16:21,800 All right. 179 00:16:21,800 --> 00:16:29,180 So the next thing, because that was really about the URLs and the IP addresses, which are sensitive 180 00:16:29,180 --> 00:16:34,940 information that we need to manage and handle in our Matomo installation, for instance. 181 00:16:34,940 --> 00:16:39,400 But what about personal data in general? 182 00:16:39,400 --> 00:16:45,680 So personal data can be things as simple as IP addresses, as we mentioned before, 183 00:16:45,680 --> 00:16:51,680 email addresses, zip codes, social security numbers, et cetera, et cetera. 184 00:16:51,680 --> 00:16:53,900 This list you've probably seen. 185 00:16:53,900 --> 00:16:59,640 This means that when we're setting up tracking, we do not want to store these things in Matomo. 186 00:16:59,640 --> 00:17:07,800 And usually this is not something people do, even though there are issues, of course, sometimes. 187 00:17:07,800 --> 00:17:17,320 But more problematic is the fact that data that, in combination with other data, 188 00:17:17,320 --> 00:17:24,120 identifies an individual is also to be considered personal data. 189 00:17:24,120 --> 00:17:31,360 And the best example I have of this is when you start to store location information. 190 00:17:31,360 --> 00:17:38,520 So in this example, I have a little village in Sweden with 55 inhabitants. 191 00:17:38,520 --> 00:17:39,520 It's called Lomtresk. 192 00:17:39,520 --> 00:17:43,160 So now you know a bit of Swedish as well.
193 00:17:43,160 --> 00:17:48,400 The interesting part here is if you store that dimension of data together with your 194 00:17:48,400 --> 00:17:53,200 page views, you can start to get problems pretty quickly. 195 00:17:53,200 --> 00:18:00,480 So think about the other meta information or metadata you store in analytics or in your 196 00:18:00,480 --> 00:18:03,600 Matomo installation. 197 00:18:03,600 --> 00:18:10,680 The first one would be, let's say, a custom dimension with the profession. 198 00:18:10,680 --> 00:18:13,980 Or maybe someone has been searching for a job. 199 00:18:13,980 --> 00:18:19,400 Or something from which we can identify a job role. 200 00:18:19,400 --> 00:18:26,880 Or we can identify that the language is Turkish in the browser. 201 00:18:26,880 --> 00:18:29,320 And maybe we can identify that this is a woman. 202 00:18:29,320 --> 00:18:35,120 You can think about how quickly it starts to get easy to identify this person just 203 00:18:35,120 --> 00:18:41,440 from these quite anonymous combinations of data. 204 00:18:41,440 --> 00:18:45,160 We can continue. 205 00:18:45,160 --> 00:18:51,360 Just an unusual mobile phone would actually be a problematic thing here. 206 00:18:51,360 --> 00:18:56,760 Maybe everyone in Lomtresk knows who has an iPhone of a specific model. 207 00:18:56,760 --> 00:19:04,640 Or you can find that in an image on Facebook or whatever and start to identify visitors 208 00:19:04,640 --> 00:19:05,640 here. 209 00:19:05,640 --> 00:19:11,320 So that leads us to the point that personal data is not something fixed. 210 00:19:11,320 --> 00:19:19,600 We need to continuously monitor this and we need to continuously be able to act. 211 00:19:19,600 --> 00:19:26,680 And what we also need to know is that when we include other services on our website, 212 00:19:26,680 --> 00:19:29,760 visitors are sharing this information with them.
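The point about combinations of seemingly anonymous dimensions can be sketched as a small-group check: count how many visits share each combination of attributes and flag the combinations that occur fewer than `k` times. This is only a rough illustration. The field names (`city`, `language`, `device`), the threshold `k`, and the sample rows are assumptions made for the example, not Matomo's data model.

```python
from collections import Counter

def risky_combinations(visits, keys, k=2):
    """Return attribute combinations shared by fewer than k visits.

    A combination that only one or two visits share may point at a
    single person even though no single field is personal on its own.
    """
    counts = Counter(tuple(v.get(key) for key in keys) for v in visits)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical visit rows, invented for this example.
visits = [
    {"city": "Stockholm", "language": "sv", "device": "iPhone"},
    {"city": "Stockholm", "language": "sv", "device": "iPhone"},
    {"city": "Lomtresk", "language": "tr", "device": "iPhone"},
]

unique = risky_combinations(visits, ["city", "language"])
print(unique)
```

In a village of 55 inhabitants almost every combination ends up in such a small group, which is why the location dimension is the risky one here.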
213 00:19:29,760 --> 00:19:36,840 So if you add services like these, most of them are big in Sweden, there are a lot of things that you need to 214 00:19:36,840 --> 00:19:41,600 look into and make sure that you're not sharing data with. 215 00:19:41,600 --> 00:19:44,320 I actually created a little tool. 216 00:19:44,320 --> 00:19:45,520 It's actually a JavaScript snippet. 217 00:19:45,520 --> 00:19:52,720 It's on my GitHub profile. What this one does, I'm actually going to show 218 00:19:52,720 --> 00:19:54,480 you. 219 00:19:54,480 --> 00:20:03,440 So I copy this code, and I can basically do this on any website. 220 00:20:03,440 --> 00:20:12,120 So let's do it on matomocamp.org and see if we are okay here. 221 00:20:12,120 --> 00:20:14,200 I'm going to open the inspector. 222 00:20:14,200 --> 00:20:22,600 We're going to just paste this little script and what it will do is to start a request. 223 00:20:22,600 --> 00:20:26,360 Okay, this one was pretty good. 224 00:20:26,360 --> 00:20:33,600 It only included one external script, and the country it was requested from was Germany. 225 00:20:33,600 --> 00:20:43,480 So let's do this at the Swedish government again and see how well they are behaving. 226 00:20:43,480 --> 00:20:49,880 I'm going to go there and I'm going to paste the JavaScript and now we're requesting a 227 00:20:49,880 --> 00:20:57,040 lot of websites, and I can actually see that they have improved since I picked on them. 228 00:20:57,040 --> 00:21:03,240 From all of these sites, we are pulling different scripts and here you can see the country from 229 00:21:03,240 --> 00:21:04,960 where they are requested. 230 00:21:04,960 --> 00:21:08,880 So they did pretty well.
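The idea behind the demo can be sketched in a few lines: parse a page's HTML and list every host other than the page's own that a resource is loaded from. This is not the JavaScript tool shown in the talk; it is a minimal Python sketch of the same idea. The class and function names, the page URL, and the sample HTML (including the path to `find.js` and the `cdn.example.com` host) are assumptions for illustration, and the country lookup from the demo is left out because it needs an external GeoIP source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class ResourceCollector(HTMLParser):
    """Collect hosts of scripts, images, iframes and links that are
    not served from the page's own host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname
        self.third_party_hosts = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img", "iframe"):
            url = attrs.get("src")
        elif tag == "link":
            url = attrs.get("href")
        else:
            url = None
        if not url:
            return
        # Resolve relative and protocol-relative URLs against the page URL.
        host = urlparse(urljoin(self.page_url, url)).hostname
        if host and host != self.page_host:
            self.third_party_hosts.add(host)

def third_party_hosts(page_url, html):
    collector = ResourceCollector(page_url)
    collector.feed(html)
    return sorted(collector.third_party_hosts)

# Hypothetical page markup, invented for this example.
sample_html = """
<script src="https://dl.episerver.net/find.js"></script>
<script src="/js/local.js"></script>
<img src="//cdn.example.com/logo.png">
"""
hosts = third_party_hosts("https://www.example.se/contact", sample_html)
print(hosts)
```

Every host in the result receives the visitor's IP address and, unless a referrer policy prevents it, the URL of the page being viewed.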
231 00:21:08,880 --> 00:21:14,720 That was not the scenario six months ago, and this is good because this means that even 232 00:21:14,720 --> 00:21:23,280 though they share data with these players, they're even in Sweden here, so if they have 233 00:21:23,280 --> 00:21:30,360 a contract with them not to share data with anyone else, we're actually legally safe. 234 00:21:30,360 --> 00:21:38,440 That wouldn't be the case if the country of the scripts were the United States, 235 00:21:38,440 --> 00:21:44,560 and one very common thing is to use CDNs to load JavaScript, for instance, and that would 236 00:21:44,560 --> 00:21:50,120 be a problem. 237 00:21:50,120 --> 00:21:54,520 So what do we do when the accident occurs? 238 00:21:54,520 --> 00:22:00,960 Storing personal data is something you should actually plan for, because you will 239 00:22:00,960 --> 00:22:07,360 do it; that's usually how I talk to my clients. 240 00:22:07,360 --> 00:22:11,320 You can't really stop this from happening. 241 00:22:11,320 --> 00:22:15,400 You need to plan for when it happens. 242 00:22:15,400 --> 00:22:24,080 So you have to have routines in place to act when you find personal data in your analytics 243 00:22:24,080 --> 00:22:28,120 and this is the action plan I usually show. 244 00:22:28,120 --> 00:22:32,200 So first we need to assess how severe this is. 245 00:22:32,200 --> 00:22:38,260 I mean how sensitive is the data that we have leaked? 246 00:22:38,260 --> 00:22:42,480 As quickly as possible we prevent further collection. 247 00:22:42,480 --> 00:22:46,480 So data is not collected more than necessary. 248 00:22:46,480 --> 00:22:55,000 In Sweden we actually have to report within 72 hours after the incident; we notify 249 00:22:55,000 --> 00:23:05,600 our reporting organization about the incident, basically for GDPR and data leakage. 250 00:23:05,600 --> 00:23:11,080 What we need to do after this is of course to try to clear or anonymize the data.
251 00:23:11,080 --> 00:23:21,920 We need to document the incident and potentially inform the victims; if users' data was 252 00:23:21,920 --> 00:23:27,920 really sensitive we might even have to contact them. And then I usually say, let's 253 00:23:27,920 --> 00:23:32,760 have a retrospective to see if we can avoid this happening again. 254 00:23:32,760 --> 00:23:40,880 Maybe this is about training, because most of these mistakes are human, but not always. 255 00:23:40,880 --> 00:23:44,000 Sometimes there are technical solutions. 256 00:23:44,000 --> 00:23:52,500 Clearing data, though, is something that we need to work quite a lot with, and that's actually 257 00:23:52,500 --> 00:23:59,400 something that is possible if you're using Matomo but it's not necessarily easy just 258 00:23:59,400 --> 00:24:01,320 because you can. 259 00:24:01,320 --> 00:24:04,520 So deleting data in Matomo usually looks like this. 260 00:24:04,520 --> 00:24:09,680 First of all you have to find the data, you have to identify the actual visitors and the 261 00:24:09,680 --> 00:24:10,680 raw data. 262 00:24:10,680 --> 00:24:14,160 So you have to find the visitor IDs that are affected. 263 00:24:14,160 --> 00:24:21,120 You need to delete these visits (I will show you some examples soon), and then you need to 264 00:24:21,120 --> 00:24:26,080 reprocess your visitor log so that the aggregated data reports are updated. 265 00:24:26,080 --> 00:24:34,520 So for instance if you find an email address in your page view report you would first need 266 00:24:34,520 --> 00:24:43,520 to find the visitors that sent this data in and delete those and then you need to reprocess 267 00:24:43,520 --> 00:24:50,160 your aggregated reports for the dates when that data was stored. 268 00:24:50,160 --> 00:24:58,240 So that's quite a big job.
269 00:24:58,240 --> 00:25:04,040 One thing you can do to start monitoring this is actually to use a plugin called the alert 270 00:25:04,040 --> 00:25:13,680 plugin and in that plugin you can start to set up rules to identify personal sensitive 271 00:25:13,680 --> 00:25:18,200 information and I will show you an example of this. 272 00:25:18,200 --> 00:25:23,680 So what you do, for instance, if you want to start monitoring your page views, 273 00:25:23,680 --> 00:25:27,920 is set up an alert, and you can use a regular expression. 274 00:25:27,920 --> 00:25:36,040 In the example here I'm matching a date, but I will give you some more examples. And then if we have more 275 00:25:36,040 --> 00:25:43,840 than one page view containing a date I will get an alarm or a report that at least highlights 276 00:25:43,840 --> 00:25:45,920 that I need to look at this. 277 00:25:45,920 --> 00:25:50,880 So an example here is that you can find a lot of regular expressions online. 278 00:25:50,880 --> 00:25:54,080 So in this case I have the example of an email address. 279 00:25:54,080 --> 00:26:00,000 So let's monitor if we have email addresses in our events or our page views and ping me 280 00:26:00,000 --> 00:26:04,800 if that happens. 281 00:26:04,800 --> 00:26:10,080 Finally you can combine these into a long string so you don't have to create individual 282 00:26:10,080 --> 00:26:14,440 reports for every monitor you have. 283 00:26:14,440 --> 00:26:20,400 The drawback of this is that the alert plugin will only process once a day or every night. 284 00:26:20,400 --> 00:26:28,560 You can't do that more often and it's also hard to identify something like a name or 285 00:26:28,560 --> 00:26:29,560 something like that. 286 00:26:29,560 --> 00:26:38,400 So it's only data that is quite easy to identify that you can find this way. 287 00:26:38,400 --> 00:26:45,960 Once you find it you can use a tool called GDPR tools and try to delete things that way.
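The regular-expression monitoring described above can be tried outside Matomo before an alert is configured. The sketch below combines an email pattern and a date pattern into one expression with `|`, in the spirit of the "long string" mentioned in the talk, and flags page URLs that match. The two patterns and the sample URLs are illustrative assumptions, not the expressions from the slides, and the regex dialect accepted by the alert plugin may differ from Python's.

```python
import re

# Deliberately simple patterns; real-world email and date formats vary.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
DATE = r"\d{4}-\d{2}-\d{2}"

# One combined expression, so a single rule covers both cases.
PII_PATTERN = re.compile(f"(?:{EMAIL})|(?:{DATE})")

def flag_urls(urls):
    """Return the URLs that contain something matching a PII pattern."""
    return [u for u in urls if PII_PATTERN.search(u)]

# Hypothetical page URLs, invented for this example.
page_urls = [
    "/contact",
    "/search?q=anna@example.com",
    "/booking/1990-05-17",
    "/news/2021",
]
flagged = flag_urls(page_urls)
print(flagged)
```

As the talk notes, this only catches data with a recognizable shape; a personal name in a URL would pass straight through.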
288 00:26:45,960 --> 00:26:53,440 You need to find the visitor ID and when you do so you can delete that. 289 00:26:53,440 --> 00:26:57,760 You can also use the API, as in the example below. 290 00:26:57,760 --> 00:27:11,120 If you have a lot of data you can create a script and call the API many times. 291 00:27:11,120 --> 00:27:14,440 I can also show you what is a bit problematic. 292 00:27:14,440 --> 00:27:22,240 So in this case I look at the pages report, and now I'm just simulating, but let's say 293 00:27:22,240 --> 00:27:27,460 that this is my name on our site, so it's not really personal data. 294 00:27:27,460 --> 00:27:33,520 But what if I wanted to delete this information because it's sensitive? 295 00:27:33,520 --> 00:27:43,800 So what I would do is to click on this one, segmented visitor log. 296 00:27:43,800 --> 00:27:51,200 So when I click this one I will actually get the visits and in this case I even have a 297 00:27:51,200 --> 00:27:57,240 little JavaScript in my browser so I actually have a little button here that allows me to 298 00:27:57,240 --> 00:27:59,320 delete this directly. 299 00:27:59,320 --> 00:28:05,040 But otherwise you can hover on the IP address and you can actually find the visitor ID up 300 00:28:05,040 --> 00:28:06,040 there. 301 00:28:06,040 --> 00:28:07,600 You can probably see it on the screen. 302 00:28:07,600 --> 00:28:15,360 This is the information that you need to grab and then use in the GDPR tools and apply 303 00:28:15,360 --> 00:28:16,360 it here. 304 00:28:16,360 --> 00:28:23,000 It's called visitor ID. 305 00:28:23,000 --> 00:28:25,520 You can actually steal this little number. 306 00:28:25,520 --> 00:28:29,280 So as you see it's starting to get quite time consuming. 307 00:28:29,280 --> 00:28:35,560 If you have a lot of personal data this is a terribly time consuming job. 308 00:28:35,560 --> 00:28:39,560 This little JavaScript helps me. 309 00:28:39,560 --> 00:28:44,440 What it actually does is that it calls the API.
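Scripting the deletion, as suggested above for larger amounts of data, could start from a helper that builds the API request for a batch of visit IDs. The sketch below only constructs the URL and the form-encoded body; it sends nothing. The method name `PrivacyManager.deleteDataSubjects` and the `visits[i][idsite]` / `visits[i][idvisit]` parameter layout are assumptions about the Matomo reporting API and should be checked against the API reference of your own instance. The base URL, token, and IDs are placeholders.

```python
from urllib.parse import urlencode

def build_delete_request(base_url, token_auth, id_site, id_visits):
    """Build (url, body) for deleting a batch of visits.

    Assumption: the instance exposes PrivacyManager.deleteDataSubjects
    and accepts the visits as an indexed list of idsite/idvisit pairs.
    """
    params = {
        "module": "API",
        "method": "PrivacyManager.deleteDataSubjects",
        "format": "json",
        "token_auth": token_auth,
    }
    for i, id_visit in enumerate(id_visits):
        params[f"visits[{i}][idsite]"] = id_site
        params[f"visits[{i}][idvisit]"] = id_visit
    return f"{base_url.rstrip('/')}/index.php", urlencode(params)

# Placeholder values, invented for this example.
url, body = build_delete_request(
    "https://matomo.example.org/", "SECRET_TOKEN", 1, [101, 102]
)
print(url)
print(body)
```

Sending the body as a POST request rather than as a query string keeps the token out of server access logs. The aggregated reports for the affected dates still need to be reprocessed afterwards, as described earlier in the talk.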
310 00:28:44,440 --> 00:28:47,440 I have just created a link to that. 311 00:28:47,440 --> 00:29:01,380 It's actually this little line here and I automatically add the ID visit to delete it. 312 00:29:01,380 --> 00:29:10,640 So the idea I have for the future is to actually build a PII monitoring plugin for Matomo. 313 00:29:10,640 --> 00:29:18,560 That plugin would first of all set up some predefined rules to identify things like email 314 00:29:18,560 --> 00:29:24,800 addresses, credit card numbers, et cetera, so that we have alarms. 315 00:29:24,800 --> 00:29:31,280 I would also set it up to watch more types of data directly like events, page views, 316 00:29:31,280 --> 00:29:32,280 referrer URLs. 317 00:29:32,280 --> 00:29:36,720 You can come up with a lot of ideas. 318 00:29:36,720 --> 00:29:40,600 I would also make it possible to add exceptions. 319 00:29:40,600 --> 00:29:49,320 Sometimes a rule will identify something that is actually okay. 320 00:29:49,320 --> 00:29:52,280 So we want to add an exception to that. 321 00:29:52,280 --> 00:29:56,960 I would also like to make it easy to delete data so the report would actually give me 322 00:29:56,960 --> 00:30:04,520 all the visitor IDs and make it possible to delete them directly. 323 00:30:04,520 --> 00:30:10,240 I would also like it to be possible to just anonymize the data and maybe not delete it. 324 00:30:10,240 --> 00:30:15,760 So let's say I found a credit card number but I still want to have the traffic stored 325 00:30:15,760 --> 00:30:16,760 for my website. 326 00:30:16,760 --> 00:30:25,880 I just want to anonymize that information. 327 00:30:25,880 --> 00:30:31,200 But that's not really possible today unless you write SQL scripts. 328 00:30:31,200 --> 00:30:40,720 So these are kind of the ideas that I have around this plugin. 329 00:30:40,720 --> 00:30:43,160 It shouldn't be too hard to do the basics.
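The plugin ideas listed above (predefined rules, exceptions, and anonymizing instead of deleting) can be sketched as one small function that masks matches in a stored value while leaving approved values alone. This is a rough sketch of the concept only, not an existing Matomo plugin or API. The rule names, the patterns, the `[email]` / `[card]` placeholders, and the sample values are all assumptions.

```python
import re

# Hypothetical predefined rules: rule name -> pattern.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 to 16 digits, optionally separated by single spaces or dashes.
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def anonymize(value, exceptions=frozenset()):
    """Mask rule matches in value; return (masked_value, triggered_rules).

    Matches listed in exceptions are known to be fine (for example a
    public contact address) and are left untouched.
    """
    triggered = []
    for name, pattern in RULES.items():
        def mask(match, name=name):
            if match.group(0) in exceptions:
                return match.group(0)
            triggered.append(name)
            return f"[{name}]"
        value = pattern.sub(mask, value)
    return value, triggered

# Hypothetical stored URLs, invented for this example.
masked, rules_hit = anonymize("/pay?card=4111 1111 1111 1111&mail=anna@example.com")
kept, none_hit = anonymize("/contact?to=info@example.com",
                           exceptions={"info@example.com"})
print(masked, rules_hit)
print(kept, none_hit)
```

Masking keeps the visit and its page view in the statistics while removing the sensitive fragment, which is the behaviour the talk says currently requires hand-written SQL.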
330 00:30:43,160 --> 00:30:48,040 The hard part is of course to start monitoring combinations of data. 331 00:30:48,040 --> 00:30:57,560 Like it would be almost impossible to do what I showed you with the small village Lundträsk 332 00:30:57,560 --> 00:31:01,680 and identify those individuals. 333 00:31:01,680 --> 00:31:07,560 Alright, so the last slide actually or the second last slide. 334 00:31:07,560 --> 00:31:12,480 But usually this is what I recommend people to look into if you want to secure your web 335 00:31:12,480 --> 00:31:14,000 analytics. 336 00:31:14,000 --> 00:31:19,160 So first of all collect consent from your visitors and inform them that you are tracking. 337 00:31:19,160 --> 00:31:22,080 This is all from the GDPR. 338 00:31:22,080 --> 00:31:23,840 Check your data collection. 339 00:31:23,840 --> 00:31:29,440 Find the obvious collection of personally identifiable information and set up your Matomo instance 340 00:31:29,440 --> 00:31:30,800 properly. 341 00:31:30,800 --> 00:31:39,200 That means anonymizing IP addresses, not storing data longer than needed, etc, etc. 342 00:31:39,200 --> 00:31:43,280 Also make sure to manage the quality of your applications, 343 00:31:43,280 --> 00:31:49,920 so they are not sending personally identifiable information because that's normally one of 344 00:31:49,920 --> 00:31:52,800 the issues we have when we get bad data. 345 00:31:52,800 --> 00:31:56,280 It's often the applications that are wrong. 346 00:31:56,280 --> 00:32:00,000 Also make sure to secure your technical infrastructure. 347 00:32:00,000 --> 00:32:05,680 The service has to be within Europe and it has to be owned by a European company. 348 00:32:05,680 --> 00:32:12,080 You can't use non-European cloud providers in Europe because you're violating GDPR by doing so. 349 00:32:12,080 --> 00:32:17,200 Also make sure to limit the time for how long you store data.
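To make "anonymizing IP addresses" concrete, here is a small Python sketch of byte masking, the general technique of zeroing the last part of an address before it is stored. It uses only the standard `ipaddress` module. The function name and the default of two masked bytes are choices made for this example, and the sketch does not claim to reproduce how Matomo implements its own setting.

```python
import ipaddress


def anonymize_ip(ip: str, masked_bytes: int = 2) -> str:
    """Zero out the last `masked_bytes` bytes of an IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(ip)
    # Number of leading bits to keep; never below zero.
    keep_bits = max(addr.max_prefixlen - 8 * masked_bytes, 0)
    # strict=False lets ip_network accept an address with host bits set
    # and simply clears them.
    network = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(anonymize_ip("192.168.100.25"))      # two bytes masked
    print(anonymize_ip("192.168.100.25", 1))   # one byte masked
```

Masking more bytes gives stronger anonymization but makes location reports coarser, so the number of bytes is a trade-off to decide per site.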
350 00:32:17,200 --> 00:32:25,440 Make sure to have processes in place to monitor data and make sure that you know how to act 351 00:32:25,440 --> 00:32:28,840 when you find problems. 352 00:32:28,840 --> 00:32:31,920 Alright, so that was the last slide. 353 00:32:31,920 --> 00:32:37,560 I think it's time for questions. 354 00:32:37,560 --> 00:32:40,040 Thank you for this very interesting talk. 355 00:32:40,040 --> 00:32:48,160 If there are any questions, please post them or write them down in the corresponding chat. 356 00:32:48,160 --> 00:32:55,440 There's been some small discussion of the privacy test and I will share it if it's alright 357 00:32:55,440 --> 00:32:56,440 for you. 358 00:32:56,440 --> 00:32:57,440 Yeah, I saw that. 359 00:32:57,440 --> 00:32:58,440 Nice, Lukas. 360 00:32:58,440 --> 00:33:05,240 That one, of course, is a good thing. 361 00:33:05,240 --> 00:33:09,520 So who wants to help me write the PII plugin, Lukas? 362 00:33:09,520 --> 00:33:17,960 Are you ready to help? 363 00:33:17,960 --> 00:33:18,960 Check my calendar. 364 00:33:18,960 --> 00:33:19,960 Yeah, of course. 365 00:33:19,960 --> 00:33:22,960 It's the same here, I think. 366 00:33:22,960 --> 00:33:28,640 But it's definitely something I think a lot of people would appreciate. 367 00:33:28,640 --> 00:33:35,720 Alright, so if we don't have more questions, I'm going to say thank you for today and hope 368 00:33:35,720 --> 00:33:36,720 you enjoyed it. 369 00:33:36,720 --> 00:33:41,400 I'm pretty tired now, so I'm going to have a really nice weekend with my family, maybe 370 00:33:41,400 --> 00:33:48,880 go out and pick some mushrooms and yeah, be outside. 371 00:33:48,880 --> 00:33:53,840 There would be one last question, if you would be willing to answer it. 372 00:33:53,840 --> 00:33:59,280 Is there any other kind of information, apart from cookies and JavaScript code, sent 373 00:33:59,280 --> 00:34:02,480 to third parties?
374 00:34:02,480 --> 00:34:10,800 Anything you basically include, so it would be fonts, images, anything. 375 00:34:10,800 --> 00:34:15,600 Anything you request might actually share that information. 376 00:34:15,600 --> 00:34:21,080 Thank you for that. 377 00:34:21,080 --> 00:34:28,080 And if there aren't any other questions, which I think there aren't, then yes, enjoy a long 378 00:34:28,080 --> 00:34:29,080 weekend. 379 00:34:29,080 --> 00:34:30,080 Yes, thank you. 380 00:34:30,080 --> 00:34:31,080 Let's see. 381 00:34:31,080 --> 00:34:38,080 Maybe more questions coming, we'll see. 382 00:34:38,080 --> 00:34:42,760 Oh yeah, a ton of them. 383 00:34:42,760 --> 00:34:46,240 Maybe they're just thanking you for the great talks. 384 00:34:46,240 --> 00:34:49,240 We'll see. 385 00:34:49,240 --> 00:34:54,360 Two seconds. 386 00:34:54,360 --> 00:34:58,160 It seems to be the case that there aren't any more questions coming. 387 00:34:58,160 --> 00:34:59,160 Let's have a nice weekend. 388 00:34:59,160 --> 00:35:00,160 That sounds good. 389 00:35:00,160 --> 00:35:01,160 So let's call it here. 390 00:35:01,160 --> 00:35:02,160 Yeah. 391 00:35:02,160 --> 00:35:07,480 As always, if there are any other questions, they can still be asked. 392 00:35:07,480 --> 00:35:10,600 And yeah, have a great day, have a great weekend. 393 00:35:10,600 --> 00:35:30,920 Thanks for your talks.