Giant, free index to world’s research papers released online

Catalogue of billions of phrases from 107 million papers could ease computerized searching of the literature.

From Ian I learned about a Giant, free index to world’s research papers released online. The General Index, as it is called, makes n-grams of up to five words available, with pointers to the relevant journal articles.

The massive index is available from the Internet Archive here. Here is how it is described.

Public Resource, a registered nonprofit organization based in California, has created a General Index to scientific journals. The General Index consists of a listing of n-grams, from unigrams to five-grams, extracted from 107 million journal articles.

The General Index is non-consumptive, in that the underlying articles are not released, and it is transformative in that the release consists of the extraction of facts that are derived from that underlying corpus. The General Index is available for free download with no restrictions on use. This is an initial release, and the hope is to improve the quality of text extraction, broaden the scope of the underlying corpus, provide more sophisticated metrics associated with terms, and other enhancements.

Access to the full corpus of scholarly journals is an essential facility to the practice of science in our modern world. The General Index is an invaluable utility for researchers who wish to search for articles about plants, chemicals, genes, proteins, materials, geographical locations, and other entities of interest. The General Index allows scholars and students all over the world to perform specialized and customized searches within the scope of their disciplines and research over the full corpus.

Access to knowledge is a human right and the increase and diffusion of knowledge depends on our ability to stand on the shoulders of giants. We applaud the release of the General Index and look forward to the progress of this worthy endeavor.

There must be some neat uses of this. I wonder if someone like Google might make a diachronic viewer similar to their Google Books Ngram Viewer available?
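To get a feel for what an index like this contains, here is a minimal sketch of how n-grams from unigrams to five-grams could be extracted and pointed back at the articles they come from. It is only an illustration: the toy articles, their ids, the function names, and the naive whitespace tokenizer are all my assumptions, not how Public Resource actually built the General Index.

```python
from collections import defaultdict


def ngrams(tokens, max_n=5):
    """Yield every n-gram (as a tuple of words) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_index(articles, max_n=5):
    """Map each n-gram to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        tokens = text.lower().split()
        for gram in ngrams(tokens, max_n):
            index[gram].add(article_id)
    return index


# Hypothetical toy corpus; the ids and texts are invented for illustration.
articles = {
    "a1": "protein folding in yeast",
    "a2": "yeast protein folding under heat stress",
}

index = build_index(articles)
print(sorted(index[("protein", "folding")]))  # ['a1', 'a2']
```

Looking up a phrase is then just a dictionary lookup on a tuple of words. The index holds the phrases and pointers but not the running text of the articles, which is roughly what the description above means by non-consumptive.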

Michael GRODEN Obituary

I just found out that Michael GRODEN (1947 – 2021) passed away a year ago. Groden was a member of CSDH/SCHN when it was called COCH/COSH and gave papers at our conferences. He developed a hypertext version of Ulysses that was never published because of rights issues. He did, however, talk and publish about his ideas for hypertext editions of complex works like Ulysses. See his online CV for more.

The Emissary and Harrow

Yoko Tawada’s new novel imagines a time in which language starts to vanish and the elderly care for weakened children.

I’ve just finished two brilliant and surreal works of post-climate fiction. One was Yoko Tawada’s The Emissary, also called “The Last Children of Tokyo”. This novel follows a great-grandfather who is healthy and active at over 100 years old as he raises his great-grandson Mumei (“no name”), who is disabled by whatever disasters have washed over Japan. The country is also shutting down – entering another Edo period of isolation – making even language an issue. Unlike most post-apocalyptic fiction, this isn’t about what actually happened or about how people fight off the zombies; it is about imagining a strange isolated life where Japan tries for some sort of purity again. As such the novel comments on the present, but aging, Japan – a Japan that has forgotten the Fukushima disaster and is firing up its nuclear reactors again. At the end we find that Mumei might be chosen as an Emissary to be smuggled out of Japan to the outside world where the strange syndrome affecting youth can be studied.

For more see reviews After Disaster, Japan Seals Itself Off From the World in ‘The Emissary’ in the New York Times or Japan’s Isolation 2.0.

The second book is Harrow by Joy Williams. The novel takes place during the time when we deny there is anything wrong and depicts an America determined to keep on pretending nothing is happening. It is an America extended in harrowing fashion from our strange ignorance. The novel is in three parts and has religious undertones with the main character first called the lamb and then “Khristen.” The last book continually references Kafka’s The Hunter Gracchus, an obscure story about a boat carrying Gracchus that wanders, unable to make it across to the underworld. Likewise, America in this novel seems to wander, unable to make it across to some reality. The third book might be set in the time of judgement, but a Sartrean judgement with no exit where a child is judge and all that happens is more of the surreal same. As a reviewer points out, the “harrow” may be the torture instrument Kafka describes in “In the Penal Colony” that writes your punishment on your back where you can’t quite see it. Likewise, we are writing our punishment on our earth where we choose not to see it.

See reviews like this one in the Harvard Review Online.

The Lost Digital Poems (and Erotica) of William H. Dickey

In 1987, William H. Dickey, a San Francisco poet who had won the prestigious Yale Younger Poets Award to launch his career and published nearly a dozen well-received books and chapbooks since, was …

Matthew Kirschenbaum has written a great essay for Literary Hub on recovering early digital poetry, The Lost Digital Poems (and Erotica) of William H. Dickey. Dickey wrote some HyperPoems in HyperCard, so they are now hard to access. Kirschenbaum rescued them and worked with people to add them to the Internet Archive, which has a HyperCard emulator. Here is what Kirschenbaum says:

Dickey’s HyperPoems are artifacts of another time—made new and fresh again with current technology. Anyone with a web browser can read and explore them in their original format with no special software or setup. (They are organized into Volume 1 and Volume 2 at the Internet Archive, in keeping with their original organizational scheme; Volume 2 contains the erotica—NSFW!) But they are also a reminder that writers have treasures tucked away in digital shoeboxes and drawers. Floppy disks, or for that matter USB sticks and Google Docs, now keep the secrets of the creative process.

This essay comes from his work for his new book Bitstreams, which documents this and other recovery projects. I’ve just ordered a copy.

AI Dungeon and the Ethics of Text Generation

> You talk to yourself.

Hi I’m Rameau. You probably don’t know me, but since you’re reading this I’m going to assume you do indeed exist.

> You [ask why] i’m wondering what consciousness is.

I have no idea. But it’s fun to think about. Is this a person? No, because it’s conscious. Is it alive? No, because it doesn’t respond to external stimuli.

> You [ask] why does a person have to respond to external stimuli.

Because then we can interact with it!!!!! But isn’t consciousness an internal phenomenon? Then how would the person interact with anything else? Oh wait, that’s right. We’re not limited by our perception of reality.

I’ve been trying Latitude’s text generation game AI Dungeon. The game is built on OpenAI’s GPT-3. Above you can see a bit of a conversation I had with it in a cyberpunk world.

Latitude has gotten into trouble with OpenAI because it seems that the game was generating erotic content featuring children. A number of people turned to AI Dungeon precisely because it could be used to explore adult themes, and that would seem to be a good thing, but then some may have gone too far. See the Wired story It Began as an AI-Fueled Dungeon Game. It Got Much Darker. This raises interesting ethical questions:

  • Why do so many players use it to generate erotic content?
  • Who is responsible for the erotic content? OpenAI, Latitude, or the players?
  • Are there ever ethical reasons to generate erotic content featuring children? Do we forbid people from writing novels like Lolita?
  • How can inappropriate content be prevented without crippling the AI? Are filters enough?

The problem of AIs generating toxic language is nicely shown by this web page on Evaluating Neural Toxic Degeneration in Language Models. The interactives and graphs on the page let you see how toxic language can be generated by many of the popular language generation AIs. The problem seems to be the data sets used to train the machines, like those that include scrapes of Reddit.

This exploratory tool illustrates research reported on in a paper titled RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. You can see a neat visualization of the connected papers here.

Time Travel and Blink (Doctor Who)

I recently finished listening to James Gleick’s Time Travel: A History. Gleick wrote the best book on The Information there is and this book is almost as good. He weaves the science together with the fictions about time travel, starting with H.G. Wells’ The Time Machine and using that to look at how science started treating time as a dimension, which in turn allowed us to talk seriously about traveling along it. It is historical ontology done really well.

Near the end he talks about the brilliant Doctor Who episode Blink with Carey Mulligan, in which she has a conversation with the Doctor (Tennant) mediated by Easter eggs on DVDs and transcribed onto paper. She hands that transcription to the Doctor at the end of the episode so he can put the video onto the DVDs in the past for her to talk to. It is brilliant.

Part of what I like about Gleick is he shows the connections between science and how we imagine ideas like time through literature and film. He ends by suggesting that we have time travel in our stories and imagination.

It might be fair to say that all we perceive is change—that any sense of stasis is a constructed illusion. Every moment alters what came before. We reach across layers of time for the memories of our memories.

“Live in the now,” certain sages advise. They mean: focus; immerse yourself in your sensory experience; bask in the incoming sunshine, without the shadows of regret or expectation. But why should we toss away our hard-won insight into time’s possibilities and paradoxes? We lose ourselves that way. (Gleick, James. Time Travel, p. 308)

Gekiga’s new frontier: the uneasy rise of Yoshiharu Tsuge

In honour of Drawn & Quarterly’s publication of Yoshiharu Tsuge’s The Swamp, Boing Boing has published an essay on Tsuge by Mitsuhiro Asakawa, titled Gekiga’s new frontier: the uneasy rise of Yoshiharu Tsuge. The essay sketches Tsuge’s rise as an early original manga artist and explains his importance. Now Montreal-based Drawn & Quarterly is publishing a series of seven translations by Ryan Holmberg of Tsuge’s work. (Holmberg also translated the essay by Asakawa.) Asakawa was also apparently important in getting the series published:

Mitsuhiro Asakawa finally convinced Tsuge and his son to let the work be translated into English. Mitsuhiro is the unsung hero of Japanese comics translation. He’s the guy who has written the most about the Garo era, he’s the go-to guy to connect with these great authors and their families. Most of the collections D+Q have done wouldn’t exist without his help.

(From the Drawn & Quarterly blog post here.)

One of the things I discovered reading Asakawa is that Tsuge worked with/for Shigeru Mizuki, my favourite manga artist, when he was going through a rough patch.

The Machine Stops

Imagine, if you can, a small room, hexagonal in shape, like the cell of a bee. It is lighted neither by window nor by lamp, yet it is filled with a soft radiance. There are no apertures for ventilation, yet the air is fresh. There are no musical instruments, and yet, at the moment that my meditation opens, this room is throbbing with melodious sounds. An armchair is in the centre, by its side a reading-desk — that is all the furniture. And in the armchair there sits a swaddled lump of flesh — a woman, about five feet high, with a face as white as a fungus. It is to her that the little room belongs.

Like many, I reread E.M. Forster’s The Machine Stops this week while in isolation. This short story was published in 1909 and written as a reaction to The Time Machine by H.G. Wells. (See the full text here (PDF).) In Forster it is the machine that keeps the utopia of isolated pods working; in Wells it is a caste of workers, the Morlocks, who also turn out to eat the leisure class. Forster felt that technology was likely to be the problem, or part of the problem, not class.

In this pandemic we see a bit of both. Following Wells we see a class of gig-economy deliverers who facilitate the isolated life of those of us who do intellectual work. Intellectual work has gone virtual, but we still need a physical layer maintained. (Even the language of a stack of layers comes metaphorically from computing.) But we also see in our virtualized work a dependence on an information machine that lets our bodies sit on the couch in isolation while we listen to throbbing melodies. My body certainly feels like it is settling into a swaddled lump of fungus.

An intriguing aspect of “The Machine Stops” is how Vashti, the mother who loves the life of the machine, measures everything in terms of ideas. She complains that flying to see her son and seeing the earth below gives her no ideas. Ideas don’t come from original experiences but from layers of interpretation. Ideas are the currency of an intellectual life of leisure which loses touch with the “real world.”

At the end, as the machine stops and Kuno, Vashti’s son, comes to his mother in the disaster, they reflect on how a few homeless refugees living on the surface might survive and learn not to trust the machine.

“I have seen them, spoken to them, loved them. They are hiding in the mist and the ferns until our civilization stops. To-day they are the Homeless — to-morrow—”

“Oh, to-morrow — some fool will start the Machine again, to-morrow.”

“Never,” said Kuno, “never. Humanity has learnt its lesson.”

Doki Doki Literature Club!

The Literature Club is full of cute girls! Will you write the way into their heart?

Dr. Ensslin gave a great short survey of digital fiction, including Doki Doki Literature Club! (DDLC), at the Dyscorpia symposium. DDLC is a visual novel created in Ren’Py by Team Salvato that plays with the genre. As you play the game, which starts as a fairly typical dating game, it first turns into a horror game and then begins to get hacked by one of the characters who wants your attention. The character, it turns out, has both encouraged some of the other girls (in the Literature Club) to commit suicide and edited them out of the game itself. At the end of the game she has a lengthy face-to-face with you, breaking the fourth wall of the screen.

Like most visual novels, it can be excruciating advancing through lots of text to get to the point where things change, but eventually you will notice glitches, which make things more interesting. I found myself paying attention to the text more as the glitches drew attention to the script. (The script itself is even mentioned in the game.)

DDLC initially mimics the Japanese visual novel genre, down to the graphics, but eventually the script veers off. It was well received in game circles, winning a number of prizes.

Word2Vec Vis of Pride and Prejudice

Paolo showed me a neat demonstration, Word2Vec Vis of Pride and Prejudice. Lynn Cherny trained a Word2Vec model using Jane Austen’s novels and then used that to find close matches for key words. She then shows the text of a novel with the words replaced by their match in the language of Austen. It serves as a sort of demonstration of how Word2Vec works.
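The basic move of swapping a word for its nearest neighbour in a vector space can be sketched in a few lines. This is a toy illustration and not Cherny’s code: the three-dimensional vectors are made up by hand (a real Word2Vec model learns vectors with hundreds of dimensions from a corpus), and the function names and the little “Austen vocabulary” list are my own assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, candidates):
    """Return the candidate closest to `word`, or `word` itself if it has no vector."""
    if word not in vectors:
        return word
    options = [c for c in candidates if c in vectors and c != word]
    if not options:
        return word
    return max(options, key=lambda c: cosine(vectors[word], vectors[c]))


def restyle(text, vectors, candidates):
    """Replace each word that has a vector with its nearest candidate word."""
    return " ".join(nearest(w, vectors, candidates) for w in text.split())


# Hand-made toy vectors (an assumption for illustration, not learned from Austen).
vectors = {
    "car": [0.9, 0.1, 0.0],
    "carriage": [0.8, 0.2, 0.1],
    "party": [0.1, 0.9, 0.2],
    "ball": [0.2, 0.8, 0.1],
}
austen_vocab = ["carriage", "ball"]  # hypothetical target vocabulary

print(restyle("the car left the party", vectors, austen_vocab))
# the carriage left the ball
```

Words without a vector, like “the” and “left” here, are simply left alone, which is also roughly what you see in the demonstration: only some words get swapped.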