Upload datasets, generate reports, and download them in seconds!
OpenAI has just released a plug-in called Code Interpreter which is truly impressive. You need to have ChatGPT Plus to be able to turn it on. It then allows you to upload data and to use plain English to analyze it. You write requests/prompts like:
What are the top 20 content words in this text?
It then interprets your request and describes what it will try to do in Python. Then it generates the Python and runs it. When it has finished, it shows the results. You can see examples in this Medium article:
I’ve been trying to see how I can use it to analyze a text. Here are some of the limitations:
It can’t handle large texts. It can be used to study a book-length text, but not a collection of books.
It frequently tries to load NLTK or other libraries and then fails. What is interesting is that it then tries other ways of achieving the same goal. For example, I asked for adjectives near the word “nature” and when it couldn’t load the NLTK POS library it then accessed a list of top adjectives in English and searched for those.
It can generate graphs of different sorts, but not interactive ones.
It is difficult to get the full transcript of an experiment, where by “full” I mean the Python code, the prompts, the responses, and any graphs generated. You can ask for an iPython notebook with the code, which you can download. Perhaps I can also get a PDF with the images.
The Code Interpreter is in beta so I expect they will be improving it. It is nonetheless very impressive how it can translate prompts into processes. Particularly impressive is how it tries different approaches when things fail.
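To give a sense of what this looks like in practice, here is a minimal sketch of the kind of Python that could answer the two prompts mentioned above: the top content words in a text, and adjectives near a word like “nature” when no part-of-speech tagger is available. This is my own illustration, not the code that Code Interpreter actually generated. The function names, the tiny stopword list, and the tiny adjective list are all assumptions made for the example; a real run would use much longer lists.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "was", "for", "on", "with", "as", "by"}

# Tiny illustrative stand-in for a list of common English adjectives (an assumption).
COMMON_ADJECTIVES = {"human", "wild", "good", "beautiful", "great",
                     "true", "new", "old"}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_content_words(text, n=20, stopwords=STOPWORDS):
    """Return the n most frequent words that are not in the stopword list."""
    counts = Counter(w for w in tokenize(text) if w not in stopwords)
    return counts.most_common(n)


def adjectives_near(text, target, adjectives=COMMON_ADJECTIVES, window=3):
    """Count listed adjectives within `window` tokens of each occurrence of `target`.

    This is the fallback idea: with no POS tagger, match nearby words
    against a fixed adjective list instead of tagging them.
    """
    tokens = tokenize(text)
    found = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            nearby = tokens[start:i] + tokens[i + 1:i + 1 + window]
            found.update(w for w in nearby if w in adjectives)
    return found.most_common()


if __name__ == "__main__":
    sample = ("The wild nature of the island is beautiful. "
              "Human nature is old and wild.")
    print(top_content_words(sample, n=5))
    print(adjectives_near(sample, "nature"))
```

The wordlist approach is cruder than tagging: it misses adjectives that are not on the list and cannot tell when a listed word is being used as a noun or verb. That is the trade-off the tool accepted when its first attempt failed.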
Code Interpreter could make data analysis and manipulation much more accessible. Without learning to code you can interrogate a data set and potentially run other processes. It is possible to imagine an unshackled Code Interpreter that could access the internet and do all sorts of things (like running a paper-clip business).
A technological whodunit—featuring Parliament, computer scientists, and a tipsy plane flight
Arun sent me a link to a neat story about How Canada Accidentally Helped Crack Computer Translation. The story is by Christine Mitchell and is in the Walrus (June 2023). It describes how IBM got ahold of a magnetic reel tape with 14 years of the Hansard – the translated transcripts of the Canadian Parliament. IBM went on to use this data trove to make advances in automatic translation.
The story mentions the politics of automated translation research in Canada. I have previously blogged about the Booths, who were recruited by the NRC to Saskatchewan to work on automated translation. They were apparently pursuing a statistical approach like the one IBM took later on, but their funding was cut.
Speaking of automatic translation, Canada had a computerized system, METEO, for translating daily weather forecasts from Environment Canada. This ran from 1981 to 2001 and was an early successful implementation of automatic translation in the real world. It came out of work at the TAUM (Traduction Automatique à l’Université de Montréal) research group at the Université de Montréal that was set up in the late 1960s.
Arun sent me the link to a good paper by Jeff Pooley on Surveillance Publishing in the Journal of Electronic Publishing. The article compares Google’s link-based ranking of pages to citation analysis (which inspired Brin and Page). It looks at how both web search and citation analysis have been monetized, by Google and by citation network services like Web of Science. Now publishing companies like Elsevier make money off tools that report on and predict publishing activity. We write papers with citations and publish them. Then we buy services built on our citational work, and administrators buy services telling them who publishes the most and where the hot areas are. As Pooley puts it,
Siphoning taxpayer, tuition, and endowment dollars to access our own behavior is a financial and moral indignity.
The article also points out that predictive services have been around since before Google. The insurance and credit rating businesses have used surveillance for some time.
Pooley ends by talking about how these publication surveillance tools then encourage quantification of academic work and facilitate local and international prioritization. The Anglophone academy measures things and discovers itself so it can then reward itself. What gets lost is the pursuit of knowledge.
In that sense, the “decision tools” peddled by surveillance publishers are laundering machines—context-erasing abstractions of our messy academic realities.
The full abstract is here:
This essay develops the idea of surveillance publishing, with special attention to the example of Elsevier. A scholarly publisher can be defined as a surveillance publisher if it derives a substantial proportion of its revenue from prediction products, fueled by data extracted from researcher behavior. The essay begins by tracing the Google search engine’s roots in bibliometrics, alongside a history of the citation analysis company that became, in 2016, Clarivate. The essay develops the idea of surveillance publishing by engaging with the work of Shoshana Zuboff, Jathan Sadowski, Mariano-Florentino Cuéllar, and Aziz Huq. The recent history of Elsevier is traced to describe the company’s research-lifecycle data-harvesting strategy, with the aim to develop and sell prediction products to university and other customers. The essay concludes by considering some of the potential costs of surveillance publishing, as other big commercial publishers increasingly enter the predictive-analytics business. It is likely, I argue, that windfall subscription-and-APC profits in Elsevier’s “legacy” publishing business have financed its decade-long acquisition binge in analytics. The products’ purpose, moreover, is to streamline the top-down assessment and evaluation practices that have taken hold in recent decades. A final concern is that scholars will internalize an analytics mindset, one already encouraged by citation counts and impact factors.
Thanks to my colleague Yasmeen, I was included in an important CFREF, Bridging Divides – Research and Innovation led by Anna Triandafyllidou at Toronto Metropolitan University. Some of the topics I hope to work on include how information technology is being used to surveil and manage immigrants and, conversely, how immigrants use information technology.
Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.
The Center for AI Safety has issued a very short Statement on AI Risk (see the sentence above). This has been signed by the likes of Yoshua Bengio and Geoffrey Hinton. I’m not sure if it is an alternative to the much longer Open Letter, but it focuses on the warning without any prescription as to what we should do. The Open Letter was criticized by many in the AI community, so perhaps CAIS was trying to find wording that could bring together “AI Scientists” and “Other Notable Figures.”
I personally find this alarmist. I find myself less and less impressed with ChatGPT as it continues to fabricate answers of little use (because they are false). I tend to agree with Elizabeth Renieris, who is quoted in this BBC story on Artificial intelligence could lead to extinction, experts warn to the effect that there are a lot more pressing immediate issues with AI to worry about. She says,
“Advancements in AI will magnify the scale of automated decision-making that is biased, discriminatory, exclusionary or otherwise unfair while also being inscrutable and incontestable,” she said. They would “drive an exponential increase in the volume and spread of misinformation, thereby fracturing reality and eroding the public trust, and drive further inequality, particularly for those who remain on the wrong side of the digital divide”.
All the concern about extinction has me wondering if this isn’t a way of hyping AI to make everyone and every AI business more important. If there is an existential risk then it must be a priority, and if it is a priority then we should be investing in it because, of course, the Chinese are. (Note that the Chinese have actually presented draft regulations that they will probably enforce.) In other words, the drama of extinction could serve the big AI companies like OpenAI, Microsoft, Google, and Meta in various ways:
The drama could convince people that there is real disruptive potential in AI so they should invest now! Get in before it is too late.
The drama could lead to regulation which would actually help the big AI companies as they have the capacity to manage regulation in ways that small startups don’t. The big will get bigger with regulation.
Last week I gave the 2023 Annual Public Lecture in Philosophy. You can Watch a Recording here. The talk was on The Eliza Effect: Data Ethics for Machine Learning.
I started the talk with the case of Kevin Roose’s interaction with Sydney (Microsoft’s name for Bing Chat) where it ended up telling Roose that it loved him. From there I discussed some of the reasons we should be concerned with the latest generation of chatbots. I then looked at the ethics of LAION-5B as an example of how we can audit the ethics of projects. I ended with some reflections on what an ethics of AI could be.
AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts. These protocols should ensure that systems adhering to them are safe beyond a reasonable doubt.
This letter to AI labs follows a number of essays and opinions suggesting that maybe we are going too fast and should show restraint. This comes in the face of the explosive interest in large language models after ChatGPT.
Gary Marcus wrote an essay on his Substack on “AI risk ≠ AGI risk” arguing that just because we don’t have AGI doesn’t mean there isn’t risk associated with the Mediocre AI systems we do have.
We have summoned an alien intelligence. We don’t know much about it, except that it is extremely powerful and offers us bedazzling gifts but could also hack the foundations of our civilization. We call upon world leaders to respond to this moment at the level of challenge it presents. The first step is to buy time to upgrade our 19th-century institutions for a post-A.I. world and to learn to master A.I. before it masters us.
The editors-in-chief of Nature and Science told Nature’s news team that ChatGPT doesn’t meet the standard for authorship. “An attribution of authorship carries with it accountability for the work, which cannot be effectively applied to LLMs,” says Magdalena Skipper, editor-in-chief of Nature in London. Authors using LLMs in any way while developing a paper should document their use in the methods or acknowledgements sections, if appropriate, she says.
It makes sense to document use, but why would we document use of ChatGPT and not, for example, use of a library or of a research tool like Google Scholar? What about the use of ChatGPT demands that it be acknowledged?
Sinykin talks about this as an “act as groundbreaking as the research itself” which seems a bit of an exaggeration. It is important that data is being reviewed and published, but it has been happening for a while in other fields. Nonetheless, this is a welcome initiative, especially if it gets attention like the LARB article. In 2013 the Tri-Council (of research agencies in Canada) called for a culture of research data stewardship. In 2015 I worked with Sonja Sapach and Catherine Middleton on a report on a Data Management Plan Recommendation for Social Science and Humanities Funding Agencies. This looks more at the front end of requiring plans from people submitting grant proposals that are asking for funding for data-driven projects, but this was so that data could be made available for future research.
Sinykin’s essay looks at the poetry publishing culture in the US and how white it is. He shows how data can be used to study inequalities. We also need to ask about the privilege of English poetry and that of culture from the Global North. Not to mention research and research infrastructure.