Data Management Plan Recommendation

Today I deposited a Data Management Plan Recommendation for Social Science and Humanities Funding Agencies (http://hdl.handle.net/10402/era.42201) in our institutional repository ERA. This report/recommendation was written by Sonja Sapach with help from me and Catherine Middleton. We recommended that:

Agencies that fund social science and humanities (SSH) research should move towards requiring a Data Management Plan (DMP) as part of their application processes in cases where research data will be gathered, generated, or curated. In developing policies, funding agencies should consult the community on the values of stewardship and research that would be strengthened by requiring DMPs. Funding agencies should also gather examples and data about reuse of archived data in the social sciences and humanities and encourage due diligence among researchers to make themselves aware of reusable data.

On the surface the recommendation seems rather bland. SSHRC has for decades required the deposit of data from the research it funds. The problem, however, is that few of us pay attention because it is one more thing to do, and something that shares hard-won data with others that you may want to continue milking for research. What we lack is a culture of thinking of the deposit of research data as a scholarly contribution the way the translation and edition of important cultural texts is. We need a culture of stewardship, as a TC3+ (tri-council) document put it. See Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship in Canada (PDF).

Given the potential resistance of colleagues, it is important that we understand the arguments for requiring planning around data management, and that is one of the things we do in this report. Another issue is how to effectively require, at the funding proposal stage, something (like a Data Management Plan) that would show how the researchers are thinking through the issue. To that end we document the approaches of other funding bodies. The point is that this is not actually that new and some research communities are further ahead.

At the end of the day, what we really need is a recognition that depositing data so that it can be used by other researchers is a form of scholarship. Such scholarship can be assessed like any other scholarship. What is the data deposited and what is its quality? How is the data deposited? How is it documented? Can it have an impact?

You can find this document also at Catherine Middleton’s web site and Sonja Sapach’s web site.

 

Medical Privacy Under Threat in the Age of Big Data

The Intercept has a good introductory story about Medical Privacy Under Threat in the Age of Big Data. I was surprised at how valuable medical information is. Here is a quote:

[h]e found a bundle of 10 Medicare numbers selling for 22 bitcoin, or $4,700 at the time. General medical records sell for several times the amount that a stolen credit card number or a social security number alone does. The detailed level of information in medical records is valuable because it can stand up to even heightened security challenges used to verify identity; in some cases, the information is used to file false claims with insurers or even order drugs or medical equipment. Many of the biggest data breaches of late, from Anthem to the federal Office of Personnel Management, have seized health care records as the prize.

The story mentions Latanya Sweeney, who is the Director of the Data Privacy Lab at Harvard. She did important research on Discrimination in Online Ad Delivery and has a number of important papers on health records, such as a recent work on Matching Known Patients to Health Records in Washington State Data, which showed how one could de-anonymize Washington State health data that is for sale by searching news databases. We are far more unique than we think we are.

I should add that I came across an interesting blog post by Dr. Sweeney on Tech@FTC arguing for an interdisciplinary field of Technology Science. (Sweeney was the Chief Technologist at the FTC.)

Depositing Archives

We have recently deposited two research archives here at the University of Alberta. One is the John B. Smith Archive. You can download bundles or the complete archive which can be found at http://hdl.handle.net/10402/era.41201. Amy Dyrbye and I worked with John B. Smith to assemble this, document it and deposit it in ERA (the Education and Research Archive).

Another archive that we are building is a collection around Gamergate. The DOI for this is:

doi:10.7939/DVN/10253

For this we are using Dataverse, which allows us to manage the archive and choose which parts to publish.

Given the work that goes into developing and documenting these archives I would argue that they should be considered scholarly work, but that is another matter.

KIAS shrinks carbon footprints “Around The World”

The Office of Sustainability at the University of Alberta has recognized our work at the Kule Institute for Advanced Study to develop models for sustainable research. They have published a nice story about the Around the World conference that we run with the title, KIAS shrinks carbon footprints “Around The World”. The question we need to ask ourselves is whether our academic reward system isn’t encouraging flying to conferences where other means of meeting would work. What would it mean to do sustainable research?

diyMatrix: Bertin’s Manual


I have long been interested in Jacques Bertin, a pioneer in thinking about visualization. His Semiology of Graphics is a classic. I had been thinking it would be great to try or simulate his way of doing cluster analysis with physical matrices, which he called “dominos”. I was therefore pleased to see that someone has recreated his matrices; see DIY Matrix.

Charles Perin, Pierre Dragicevic, and Jean-Daniel Fekete have updated the matrices and fabricated a version for a CHI’15 workshop on Investigating the Challenges of Making Data Physical (PDF).

Update: They also have a web application called Bertifier that allows you to try it virtually. This interactive tool allows you to choose different ways of decorating the blocks and will then also reorder them. It is fascinating to play with.


Now I have something I want to print on a fabricator.

The size of the World Wide Web

sizeofweb

Reading a paper by Lev Manovich I came across a reference to the web site WorldWideWebSize.com which graphs the size of the World Wide Web. The web site searches Google and Bing daily for different words from a corpus and then uses the total results to estimate the size of the web.

When you know, for example, that the word ‘the’ is present in 67,61% of all documents within the corpus, you can extrapolate the total size of the engine’s index by the document count it reports for ‘the’. If Google says that it found ‘the’ in 14.100.000.000 webpages, an estimated size of the Google’s total index would be 23.633.010.000.

In the screen grab above you can see that the estimated size can change dramatically over time. Hard to tell why.
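The extrapolation described in the quoted passage can be sketched in a few lines of Python. This is only a minimal illustration of the ratio idea, not the site's actual code: the function names `estimate_index_size` and `mean_estimate` are hypothetical, and averaging over several words is my assumption about how "different words from a corpus" might be combined.

```python
from typing import Iterable, Tuple


def estimate_index_size(reported_hits: int, corpus_fraction: float) -> float:
    """Extrapolate an index size from one word's reported hit count.

    corpus_fraction is the share of corpus documents containing the word
    (0.6761 for 'the' in the quoted example).
    """
    if not 0.0 < corpus_fraction <= 1.0:
        raise ValueError("corpus_fraction must be in (0, 1]")
    return reported_hits / corpus_fraction


def mean_estimate(observations: Iterable[Tuple[int, float]]) -> float:
    """Average the per-word estimates (an assumed way of combining words)."""
    estimates = [estimate_index_size(hits, frac) for hits, frac in observations]
    if not estimates:
        raise ValueError("need at least one observation")
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Figures taken from the quoted passage.
    estimate = estimate_index_size(14_100_000_000, 0.6761)
    print(f"Estimated index size: {estimate:,.0f} pages")
```

Note that plain division of the quoted numbers gives roughly 20.9 billion rather than the 23.6 billion in the passage, so the site's own calculation may involve steps the quotation does not describe.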

Around the World Conference


Last week we held our third Around the World Conference on the subject of “Big Data”. We had some fabulous panels from countries including Ireland, Canada, Israel, Nigeria, Japan, China, Australia, USA, Belgium, Italy, and Brazil.

The Around the World Conference streams speakers and panels from around the world out to everyone on the net. We also edit and archive the video clips. This model allows for a sustainable conversation across continents that doesn’t involve flying people around. It allows a lot of people who wouldn’t usually be included to speak. We do find there are technical hiccups, but that happens in on-site conferences too.

Editorialisation Et Nouvelles Formes De Publication

In the last couple of weeks I’ve been at two interesting conferences and took research notes.

  1. I gave a keynote on “Big Data and the Humanities” at the Northwestern Research Computation Day (link to my research notes). I gave a lot of examples of projects and visualizations.
  2. At the Éditorialisation Et Nouvelles Formes De Publication (link to my research notes) conference I spoke about “Publishing Tools: A Theatre of Machines”. I showed how text analysis machines have evolved.

TSA’s Secret Behavior Checklist to Spot Terrorists

The Intercept has published the TSA’s behaviour checklist for spotting terrorists as part of two stories. See Exclusive: TSA’s Secret Behavior Checklist to Spot Terrorists. The checklist is part of a SPOT (Screening of Passengers by Observation Techniques) Referral Report that is filled out when someone is “spotted” by the TSA. It includes all sorts of behaviours like “Arrives late for flight …”. The idea of the report is that behaviours are assigned points, and if someone gets more than a certain number of points they are referred to a Law Enforcement Officer (LEO). A second story from The Intercept, Exclusive: TSA ‘Behavior Detection’ Program Targeting Undocumented Immigrants, Not Terrorists, claims that the program is catching undocumented immigrants rather than terrorists.

Is it Research or is it Spying? Thinking-Through Ethics in Big Data AI and Other Knowledge Sciences

Is it Research or is it Spying? Thinking-Through Ethics in Big Data AI and Other Knowledge Sciences has just been published online. It was written with Bettina Berendt and Marco Büchler and came out of a Dagstuhl retreat where a group of us started talking about ethics and big data. Here is the abstract:

“How to be a knowledge scientist after the Snowden revelations?” is a question we all have to ask as it becomes clear that our work and our students could be involved in the building of an unprecedented surveillance society. In this essay, we argue that this affects all the knowledge sciences such as AI, computational linguistics and the digital humanities. Asking the question calls for dialogue within and across the disciplines. In this article, we will position ourselves with respect to typical stances towards the relationship between (computer) technology and its uses in a surveillance society, and we will look at what we can learn from other fields. We will propose ways of addressing the question in teaching and in research, and conclude with a call to action.

A PDF of our author version is here.