The size of the World Wide Web

[Image: sizeofweb]

Reading a paper by Lev Manovich I came across a reference to the web site WorldWideWebSize.com which graphs the size of the World Wide Web. The web site searches Google and Bing daily for different words from a corpus and then uses the total results to estimate the size of the web.

When you know, for example, that the word ‘the’ is present in 67,61% of all documents within the corpus, you can extrapolate the total size of the engine’s index by the document count it reports for ‘the’. If Google says that it found ‘the’ in 14.100.000.000 webpages, an estimated size of the Google’s total index would be 23.633.010.000.
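The extrapolation described in that passage can be sketched in a few lines of Python. This is only a minimal illustration of the single-word division as stated, not the site's actual method; the function name and parameters are my own. Note that dividing 14.1 billion by 0.6761 gives roughly 20.9 billion rather than the 23.6 billion quoted, so the site's published figure presumably reflects different inputs or additional steps that the passage does not spell out.

```python
def estimate_index_size(reported_hits: int, corpus_fraction: float) -> float:
    """Extrapolate the total index size from the hit count for one word.

    reported_hits:   number of pages the search engine reports for the word
    corpus_fraction: share of reference-corpus documents that contain the word
                     (0.6761 for 'the' in the quoted example)
    Both names are hypothetical; they are not taken from WorldWideWebSize.com.
    """
    if not 0 < corpus_fraction <= 1:
        raise ValueError("corpus_fraction must be in the range (0, 1]")
    # If the word appears in this fraction of all documents, the full index
    # is the reported hit count scaled up by the inverse of that fraction.
    return reported_hits / corpus_fraction


if __name__ == "__main__":
    estimate = estimate_index_size(14_100_000_000, 0.6761)
    print(f"Estimated index size: {estimate:,.0f} pages")
```

Because the estimate depends entirely on the hit counts the engines report, any change in how they count or report results would shift the graph, which may be one reason the estimates move around so much.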

In the screen grab above you can see that the estimated size can change dramatically over time. It is hard to tell why.

Around the World Conference

Last week we held our third Around the World Conference on the subject of “Big Data”. We had some fabulous panels from countries including Ireland, Canada, Israel, Nigeria, Japan, China, Australia, USA, Belgium, Italy, and Brazil.

The Around the World Conference streams speakers and panels from around the world out to everyone on the net. We also edit and archive the video clips. This model allows for a sustainable conversation across continents that doesn’t involve flying people around. It allows a lot of people who wouldn’t usually be included to speak. We do find there are technical hiccups, but that happens at on-site conferences too.

Editorialisation Et Nouvelles Formes De Publication

In the last couple of weeks I’ve been at two interesting conferences and have taken research notes.

  1. I gave a keynote on “Big Data and the Humanities” at the Northwestern Research Computation Day (link to my research notes). I gave a lot of examples of projects and visualizations.
  2. At the Éditorialisation Et Nouvelles Formes De Publication (link to my research notes) conference I spoke about “Publishing Tools: A Theatre of Machines”. I showed how text analysis machines have evolved.

TSA’s Secret Behavior Checklist to Spot Terrorists

The Intercept has published the TSA’s behaviour checklist for spotting terrorists as part of two stories. See “Exclusive: TSA’s Secret Behavior Checklist to Spot Terrorists”. The checklist is part of a SPOT (Screening of Passengers by Observation Techniques) Referral Report that is filled out when someone is “spotted” by the TSA. The SPOT Referral Report includes all sorts of behaviours like “Arrives late for flight …”. The idea of the report is that behaviours are assigned points, and if someone gets more than a certain number of points the suspect is referred to a Law Enforcement Officer (LEO). A second story from The Intercept is titled “Exclusive: TSA ‘Behavior Detection’ Program Targeting Undocumented Immigrants, Not Terrorists”.

Is it Research or is it Spying? Thinking-Through Ethics in Big Data AI and Other Knowledge Sciences

Is it Research or is it Spying? Thinking-Through Ethics in Big Data AI and Other Knowledge Sciences has just been published online. It was written with Bettina Berendt and Marco Büchler and came out of a Dagschule retreat where a group of us started talking about ethics and big data. Here is the abstract:

“How to be a knowledge scientist after the Snowden revelations?” is a question we all have to ask as it becomes clear that our work and our students could be involved in the building of an unprecedented surveillance society. In this essay, we argue that this affects all the knowledge sciences such as AI, computational linguistics and the digital humanities. Asking the question calls for dialogue within and across the disciplines. In this article, we will position ourselves with respect to typical stances towards the relationship between (computer) technology and its uses in a surveillance society, and we will look at what we can learn from other fields. We will propose ways of addressing the question in teaching and in research, and conclude with a call to action.

A PDF of our author version is here.

NSA phone record collection does little to prevent terrorist attacks, group says

One of the key issues raised by Snowden is whether all this surveillance works. The Washington Post has a story from a year ago titled “NSA phone record collection does little to prevent terrorist attacks, group says”. The story is based on a report.

Snowden Surveillance Archive

Canadian Journalists for Free Expression and partners have announced and released a searchable Snowden Surveillance Archive. This archive is,

a complete collection of all documents that former NSA contractor Edward Snowden leaked in June 2013 to journalists Laura Poitras, Glenn Greenwald and Ewen MacAskill, and subsequently were published by news media, such as The Guardian, The New York Times, The Washington Post, Der Spiegel, Le Monde, El Mundo and The Intercept.

It is dynamic. As new documents are published they will be added.

You can hear the announcement and Snowden in CBC’s stream of Snowden Live: Canada and the Security State.

One thing I don’t understand is why, in at least one case, the archived document is of lower quality than the one originally released. For example, compare the Snowden Archive version of the CSEC document about Olympia with the version from the Globe and Mail. The Snowden Archive one is both cropped and full of compression artefacts (or something).

One of the points that both Snowden and the speakers who followed made is that the massive SIGINT system that has been set up doesn’t prevent terrorist attacks. Instead, it can be used retrospectively to look back at some event and figure out who did it, or to develop intelligence about someone who has been targeted. One of the speakers followed up on the implications of retrospective surveillance: for citizens it means that things you do now might come back to haunt you.

Why Watching the Watchers Isn’t Enough: Michael Geist

Michael Geist gives a good talk on Why Watching the Watchers Isn’t Enough. This talk was part of a symposium on Pathways To Privacy.

Geist’s point is that oversight is not enough. Those who now provide oversight have come out to say that they are on the job and that the CSE’s activities are legal. That means that oversight isn’t really working. The surveillance organizations and those tasked with oversight seem to be willfully ignoring the interpretation of experts that the gathering and sharing of metadata is the gathering and sharing of information about Canadians.

He talked about how C-51 affects privacy by allowing information sharing way beyond what is needed for counter-terrorism. C-51 puts in place a legal framework for which no amount of oversight will make a difference. C-51 allows information to be shared between agencies about “activities that undermine the security of Canada.” An opinion piece in the Toronto Star by Craig Forcese and Kent Roach of antiterrorlaw.ca suggests that this could be interpreted as license to spy on students protesting tuition fees without municipal permission, eco-activists protesting illegally, and so on.

Canadian Spies Collect Domestic Emails in Secret Security Sweep

The Intercept and CBC have been collaborating on stories based on documents leaked by Edward Snowden. One recent story is about how Canadian Spies Collect Domestic Emails in Secret Security Sweep. CSE is collecting email going to the government and flagging suspect emails for analysts.

An earlier story, titled “CSE’s Levitation project: Expert says spy agencies ‘drowning in data’ and unable to follow leads”, describes the LEVITATION project that monitors file uploads to free file hosting sites. The idea is to identify questionable uploads and then to figure out who is uploading the materials.

Glenn Greenwald (see the embedded video) questions the value of this sort of mass surveillance. He suggests that mass surveillance impedes the ability to find terrorists. The problem is not getting more information, but connecting the dots of what one has. In fact, the slides that you can get to from both of these stories show that CSE is struggling with too much information and with analytical challenges.