Documenting the Now develops tools and builds community practices that support the ethical collection, use, and preservation of social media content.
I’ve been talking with the folks at MassMine (I’m on their Advisory Board) about tools that can gather information off the web and I was pointed to the Documenting the Now project that is based at the University of Maryland and the University of Virginia with support from Mellon. DocNow have developed tools and services around documenting the “now” using social media. DocNow itself is an “appraisal” tool for twitter archiving. They then have a great catalog of twitter archives they and others have gathered which looks like it would be great for teaching.
MassMine is at present a command-line tool that can gather different types of social media. They are building a web interface version that will make it easier to use and they are planning to connect it to Voyant so you can analyze results in Voyant. I’m looking forward to something easier to use than Python libraries.
Speaking of which, I found a TAGS (Twitter Archiving Google Sheet) which is a plug-in for Google Sheets that can scrape smaller amounts of Twitter. Another accessible tool is Octoparse that is designed to scrape different database driven web sites. It is commercial, but has a 14 day trial.
One of the impressive features of Documenting the Now project is that they are thinking about the ethics of scraping. They have a Social Labels set for people to indicate how data should be handled.
Australian students who have raised privacy concerns describe the incident involving a Canadian student as ‘freakishly disrespectful’
The Guardian has a story about CEO of exam monitoring software Proctorio apologises for posting student’s chat logs on Reddit. Proctorio provides software for monitoring (proctoring) students on their own laptop while they take exams. It uses the video camera and watches the keyboard to presumably watch whether the student tries to cheat on a timed exam. Apparently a UBC student claimed that he couldn’t get help in a timely fashion from Proctorio when he was using it (presumably with a timer going for the exam.) This led to Australian students criticizing the use of Proctorio which led to the CEO arguing that the UBC student had lied and providing a partial transcript to show that the student was answered in a timely fashion. That the CEO would post a partial transcript shows that:
staff at Proctorio do have access to the logs and transcripts of student behaviour, and
that they don’t have the privacy protection protocols in place to prevent the private information from being leaked.
I can’t help feeling that there is a pattern here since we also see senior politicians sometimes leaking data about citizens who criticize them. The privacy protocols may be in place, but they aren’t observed or can’t be enforced against the senior staff (who are the ones that presumably need to do the enforcing.) You also sense that the senior person feels that the critic abrogated their right to privacy by lying or misrepresenting something in their criticism.
This raises the question of whether someone who misuses or lies about a service deserves the ongoing protection of the service. Of course, we want to say that they should, but nations like the UK have stripped citizens like Shamina Begum of citizenship and thus their rights because they behaved traitorously, joining ISIS. Countries have murdered their own citizens that became terrorists without a trial. Clearly we feel that in some cases one can unilaterally remove someones rights, including the right to life, because of their behaviour.
Smart software controls the prices and products you see when you shop online – and sometimes it can go spectacularly wrong, discovers Chris Baraniuk.
The BBC has a stroy about The bad things that happen when algorithms run online shops. The story describes how e-commerce systems designed to set prices dynamically (in comparison with someone else’s price, for example) can go wrong and end up charging customers much more than they will pay or charging them virtually nothing so the store loses money.
The story links to an instructive blog entry by Michael Eisen about how two algorithms pushed up the price on a book into the millions, Amazon’s $23,698,655.93 book about flies. The blog entry is a perfect little story about about the problems you get when you have algorithms responding iteratively to each other without any sanity checks.
Vinay Prabhu, chief scientist at UnifyID, a privacy startup in Silicon Valley, and Abeba Birhane, a PhD candidate at University College Dublin in Ireland, pored over the MIT database and discovered thousands of images labelled with racist slurs for Black and Asian people, and derogatory terms used to describe women. They revealed their findings in a paper undergoing peer review for the 2021 Workshop on Applications of Computer Vision conference.
Another one of those “what were they thinking when they created the dataset stories” from The Register tells about how MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs. The MIT Tiny Images dataset was created automatically using scripts that used the WordNet database of terms which itself held derogatory terms. Nobody thought to check either the terms taken from WordNet or the resulting images scoured from the net. As a result there are not only lots of images for which permission was not secured, but also racists, sexist, and otherwise derogatory labels on the images which in turn means that if you train an AI on these it will generate racist/sexist results.
The article also mentions a general problem with academic datasets. Companies like Facebook can afford to hire actors to pose for images and can thus secure permissions to use the images for training. Academic datasets (and some commercial ones like the Clearview AI database) tend to be scraped and therefore will not have the explicit permission of the copyright holders or people shown. In effect, academics are resorting to mass surveillance to generate training sets. One wonders if we could crowdsource a training set by and for people?
Within a few days of the announcement that libraries, schools and colleges across the nation would be closing due to the COVID-19 global pandemic, we launched the temporary National Emergency Library to provide books to support emergency remote teaching, research activities, independent scholarship, and intellectual stimulation during the closures. […]
The blog entry points to what the HathiTrust is doing as part of their Emergency Temporary Access Service which lets libraries that are members (and the U of Alberta Library is one) provide access to digital copies of books they have corresponding physical copies of. This is only available to “member libraries that have experienced unexpected or involuntary, temporary disruption to normal operations, requiring it to be closed to the public”.
It is a pity the IS NEL was discontinued, for a moment there it looked like large public service digital libraries might become normal. Instead it looks like we will have a mix of commercial e-book services and Controlled Digital Lending (CDL) offered by libraries that have the physical books and the digital resources to organize it. The IA blog entry goes on to note that even CDL is under attack. Here is a story from Plagiarism Today:
Though the National Emergency Library may have been what provoked the lawsuit, the complaint itself is much broader. Ultimately, it targets the entirety of the IA’s digital lending practices, including the scanning of physical books to create digital books to lend.
The IA has long held that its practices are covered under the concept of controlled digital lending (CDL). However, as the complaint notes, the idea has not been codified by a court and is, at best, very controversial. According to the complaint, the practice of scanning a physical book for digital lending, even when the number of copies is controlled, is an infringement.
A cache of data reviewed by Reuters provides insight into the operation, detailing tens of thousands of malicious messages designed to trick victims into giving up their passwords that were sent by BellTroX between 2013 and 2020.
It was bound to happen. Reuters has an important story that an Obscure Indian cyber firm spied on politicians, investors worldwide. The firm, BellTroX InfoTech Services, offered hacking services to private investigators and others. While we focus on state-sponsored hacking and misinformation there is a whole murky world of commercial hacking going on.
The Citizen Lab played a role in uncovering what BellTroX was doing. They have a report here about Dark Basin, a hacking-for-hire outfit, that they link to BellTroX. The report is well worth the read as it details the infrastructure uncovered, the types of attacks, and the consequences.
The growth of a hack-for-hire industry may be fueled by the increasing normalization of other forms of commercialized cyber offensive activity, from digital surveillance to “hackingback,” whether marketed to private individuals, governments or the private sector. Further, the growth of private intelligence firms, and the ubiquity of technology, may also be fueling an increasing demand for the types of services offered by BellTroX. At the same time, the growth of the private investigations industry may be contributing to making such cyber services more widely available and perceived as acceptable.
They conclude that the growth of this industry is a threat to civil society.
What is it became so affordable and normalized that any unscrupulous person could hire hackers to harass an ex-girlfriend or neighbour?
The Canadian Comparative Literature Association (CCLA/ACLC) celebrated in 2019 its fiftieth anniversary. The association’s annual conference, which took place from June 2 to 5, 2019 as part of the Congress of the Humanities and Social Sciences of Canada at UBC (Vancouver), provided an opportunity to reflect on the place of comparative literature in our institutions. We organized a joint bilingual roundtable bringing together comparatists and digital humanists who think and put in place collaborative editorial practices. Our goal was to foster connections between two communities that ask similar questions about the modalities for the creation, dissemination and legitimation of our research. We wanted our discussions to result in a concrete intervention, thought and written collaboratively and demonstrating what comparative literature promotes. The manifesto you will read, “Knowledge is a commons – Pour des savoirs en commun”, presents the outcome of our collective reflexion and hopes to be the point of departure for more collaborative work.
Thanks to a panel on the Journal in the digital age at CSDH-SCHN 2020 I learned about the manifesto, Knowledge is a commons – Pour des savoirs en commun. The manifesto was “written colingually, that is, alternating between English and French without translating each element into both languages. This choice, which might appear surprising, puts into practice one of our core ideas: the promotion of active and fluid multilingualism.” This is important.
The manifesto makes a number of important points which I summarize in my words:
We need to make sure that knowledge is truly made public. It should be transparent, open and reversible (read/write).
We have to pay attention to the entire knowledge chain of research to publication and rebuild it in its entirety so as to promote access and inclusion.
The temporalities, spaces, and formats of knowledge making matter. Our tools and forms like our thought should be fluid and plural as they can structure our thinking.
We should value the collectives that support knowledge-making rather than just authoritative individuals and monolithic texts. We should recognize the invisible labourers and those who provide support and infrastructure.
We need to develop inclusive circles of conversation that cross boundaries. We need an ethics of open engagement.
We should move towards an active and fluid multilingualism (of which the manifesto is an example.)
Writing is co-writing and re-writing and writing beyond words. Let’s recognize a plurality of writing practices.
Is “excellence” really the most efficient metric for distributing the resources available to the world’s scientists, teachers, and scholars? Does “excellence” live up to the expectations that academic communities place upon it? Is “excellence” excellent? And are we being excellent to each other in using it?
Rhetoric of excellence – it looks at how there is little consensus around what excellence between disciplines. Within disciplines it is negotiated and can become conservative.
Is “excellence” good for research – the second section argues that there is little correlation between forms of excellence review and long term metrics. They go on to outline some of the unfortunate side-effects of the push for excellence; how it can distort research and funding by promoting competition rather than collaboration. They also talk about how excellence disincentivizes replication – who wants to bother with replication if
Alternative narratives – the third section looks at alternative ways of distributing funding. They discuss looking at “soundness” and “capacity” as an alternatives to the winner-takes-all of excellence.
So much more could and should be addressed on this subject. I have often wondered about the effect of the success rates in grant programmes (percentage of applicants funded). When the success rate gets really low, as it is with many NEH programmes, it almost becomes a waste of time to apply and superstitions about success abound. SSHRC has healthier success rates that generally ensure that most researchers gets funded if they persist and rework their proposals.
Hypercompetition in turn leads to greater (we might even say more shameless …) attempts to perform this “excellence”, driving a circular conservatism and reification of existing power structures while harming rather than improving the qualities of the underlying activity.
Ultimately the “adjunctification” of the university, where few faculty get tenure, also leads to hypercompetition and an impoverished research environment. Getting tenure could end up being the most prestigious (and fundamental) of grants – the grant of a research career.
Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we’ll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.
On the Google Developvers Blog there is an interesting post on Text Embedding Models Contain Bias. Here’s Why That Matters. The post talks about a technique for using Word Embedding Association Tests (WEAT) to see compare different text embedding algorithms. The idea is to see whether groups of words like gendered words associate with positive or negative words. In the image above you can see the sentiment bias for female and male names for different techniques.
While Google is working on WEAT to try to detect and deal with bias, in our case this technique could be used to identify forms of bias in corpora.