IEEE Spectrum has an interview with Michael Jordan that touches on the Delusions of Big Data and Other Huge Engineering Efforts. He is worried about white noise and false positives: if a dataset is big enough you can always find something that correlates with what you want, but that doesn't mean the correlation is causal or informative. He predicts a "big-data winter" after the bubble of excitement pops.
After a bubble, when people invested and a lot of companies overpromised without providing serious analysis, it will bust. And soon, in a two- to five-year span, people will say, “The whole big-data thing came and went. It died. It was wrong.” I am predicting that.
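To make the worry concrete, here is a small, purely synthetic simulation (my own sketch, not anything from the interview): generate thousands of columns of random noise and at least one of them will correlate with your target strongly enough to pass a naive significance test.

```python
# Synthetic illustration of spurious correlation in wide datasets.
# Everything here is random noise; the sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 1000, 10_000

target = rng.standard_normal(n_rows)   # the thing we "want" to explain

best = 0.0
for _ in range(n_features):
    feature = rng.standard_normal(n_rows)            # pure noise
    r = abs(np.corrcoef(target, feature)[0, 1])
    best = max(best, r)

# With 10,000 noise columns the strongest |r| is typically well above 0.1,
# which a naive single-comparison test would call highly significant.
print(f"Strongest correlation found among {n_features} noise columns: {best:.3f}")
```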
Nate Hoffelder on The Digital Reader blog has broken a story about how Adobe is Spying on Users, Collecting Data on Their eBook Libraries. He and Ars Technica report that Adobe's Digital Editions 4 sends data home about what you read and how far (what page) you get. The data is sent in plain text.
Hoffelder used a tool called Wireshark to look at what was being sent out from his computer.
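This kind of check is easy to reproduce. Here is a rough sketch in Python with scapy rather than the Wireshark GUI Hoffelder used; the network interface name is an assumption and it needs packet-capture privileges.

```python
# A rough sketch of the kind of check Hoffelder did with Wireshark,
# here with Python and scapy: capture outgoing HTTP traffic and print
# any readable payloads. The interface name is an assumption; run with
# capture privileges (e.g. as root).
from scapy.all import sniff, Raw, TCP

def show_plaintext(packet):
    # Anything readable here is being sent unencrypted over the wire.
    if packet.haslayer(TCP) and packet.haslayer(Raw):
        payload = bytes(packet[Raw].load)
        try:
            print(payload.decode("utf-8"))
        except UnicodeDecodeError:
            pass

# Port 80 only: HTTPS traffic (port 443) would show up as ciphertext.
sniff(iface="en0", filter="tcp port 80", prn=show_plaintext, count=50)
```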
On Thursday I heard a great talk by Ashley Esarey on "Understanding Chinese Information Control and State Preferences for Stability Maintenance." He has been studying a dataset of over 4,000 censorship directives issued by the Chinese state to website administrators to do things like stop mentioning Obama's inauguration in headlines or delete all references to certain issues. I hadn't realized how hierarchical and human the Chinese control of the internet was. Directives came from all levels and seem, at times, to have been ignored.
In his talk Esarey mentioned how the China Digital Times has been tracking various internet censorship issues in China. At that site I found some fascinating stories and lists of censored words.
From The Intercept I followed a link to a Buzzfeed Exclusive: Hundreds Of Devices Hidden Inside New York City Phone Booths. Buzzfeed found that the company that manages the advertising surrounding New York phone booths had installed beacons that could interact with apps on smartphones as they passed by. The beacons are made by Gimbal, which claims to have "the world's largest deployment of industry-leading Bluetooth Smart beacons…" The Buzzfeed article describes what information can be gathered by these beacons:
Gimbal has advertised its "Profile" service. For consumers who opt in, the service "passively develops a profile of mobile usage and other behaviors" that allow the company to make educated guesses about their demographics ("age, gender, income, ethnicity, education, presence of children"), interests ("sports, cooking, politics, technology, news, investing, etc."), and the "top 20 locations where [the] user spends time (home, work, gym, beach, etc.)."
The image above is from Buzzfeed, who got it from Gimbal; it illustrates how Gimbal collects data about "sightings" that can be aggregated and mined both by Gimbal and by third parties who pay for the service. Apple is, however, responsible for an important underlying technology, iBeacon. If you want the larger picture on beacons and the hype around them, see the BEEKn site (which is about "beacons, brands and culture on the Internet of Things") or read about Apple's iBeacon technology. I am not impressed with the use cases described. They are mostly about advertisers telling us (without our permission) about things on sale. Beacons can be used for very location-specific information, like the Tulpenland (tulip garden) app, but outdoors you can do this with geolocation. A better use would be indoors, for museums where GPS doesn't work, as Prophets Kitchen is doing for the Rubens House Antwerp Museum, though the implementation shown looks really lame (multiple choice questions about Rubens!). The killer app for beacons has yet to appear, though mobile payments may be it.
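As a side note on the technology, an iBeacon itself broadcasts very little: a proximity UUID plus major/minor numbers and a calibration value. Here is a rough sketch (my own illustration, not Gimbal's code) of decoding those fields from a made-up advertisement payload; the profiling described above happens in the apps and servers that recognize the identifiers, not in the beacon.

```python
# A rough sketch of decoding the standard iBeacon advertisement fields;
# the sample payload below is made up for illustration. Note how little
# a beacon broadcasts: just identifiers and a calibration value. The
# tracking happens in the apps and servers that recognize those IDs.
import struct
import uuid

def parse_ibeacon(manufacturer_data: bytes):
    # Apple's company ID (0x004C, little-endian), then type 0x02, length 0x15.
    if manufacturer_data[:4] != b"\x4c\x00\x02\x15":
        return None
    proximity_uuid = uuid.UUID(bytes=manufacturer_data[4:20])
    major, minor = struct.unpack(">HH", manufacturer_data[20:24])
    tx_power = struct.unpack("b", manufacturer_data[24:25])[0]  # dBm at 1 m
    return proximity_uuid, major, minor, tx_power

# Made-up example payload.
sample = (b"\x4c\x00\x02\x15"
          + uuid.UUID("f7826da6-4fa2-4e98-8024-bc5b71e0893e").bytes
          + b"\x00\x01"      # major
          + b"\x00\x2a"      # minor
          + b"\xc5")         # tx power (-59 dBm)
print(parse_ibeacon(sample))
```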
What is interesting is that the Intercept article indicates that users don't appreciate being told they are being watched. It seems that we only mind being spied on when we are personally told that we are being spied on, but that may be an unwarranted inference. We may come to accept a level of tracking as the price we pay for cell phones that are always on.
In the meantime New York has apparently ordered the beacons removed, but beacons are reportedly installed in other cities. Of course there are also Canadian installations.
The folks behind the Google Ngram Viewer have developed a new tool called Bookworm. It has a number of corpora (the example above uses bills from beta.congress.gov). It lets you build more complex queries, and you can upload your own data.
Bookworm is hosted by the Cultural Observatory at Harvard, directed by Erez Lieberman Aiden and Jean-Baptiste Michel, who were behind the Ngram Viewer. They have recently published a book, Uncharted, where they talk about different cultural trends they studied using the Ngram Viewer. The book is accessible, though a bit light.
Evgeny Morozov has a nice essay in Le Monde Diplomatique (English Edition, August 2014) on Whilst you whistle in the shower: How much for your data? (article on LMD here). He raises questions about the monetization of all of our data and how we are willing to give up more and more data. He describes the limited options being debated on the issue of data and privacy,
the future offered to us by Lanier and Pentland fits into the German ordoliberal tradition, which sees the preservation of market competition as a moral project, and treats all monopolies as dangerous. The Google approach fits better with the American school of neoliberalism that developed at the University of Chicago. Its adherents are mostly focused on efficiency and consumer welfare, not morality; and monopolies are never assumed to be evil just because they are monopolies; some might be socially beneficial.
The essay covers some of the same ground that Mike Bulajewski covered in The Cult of Sharing about how the gift economy rhetoric is being hijacked by monetization interests.
Since established taxi and hotel industries are detested, the public debate has been framed as a brave innovator taking on sluggish, monopolistic incumbents. Such skewed presentation, while not inaccurate in all cases, glosses over the fact that the start-ups of the sharing economy operate on the pre-welfare model: social protections for workers are minimal, they have to take on risks previously assumed by their employers, and there are almost no possibilities for collective bargaining.
This week SSHRC announced the new partnership grants awarded including one I am a co-investigator on, NovelTM: Text Mining the Novel.
This project brings together researchers and partners from 21 different academic and non-academic institutions to produce the first large-scale quantitative history of the novel. Our aim is to bring new computational approaches in the field of text mining to the study of literature as well as bring the unique knowledge of literary studies to bear on larger debates about data mining and the place of information technology within society.
NovelTM is led by Andrew Piper at McGill University. At the University of Alberta I will be gathering a team that will share the resulting computing methods through TAPoR and develop recipes or tutorials so that others can try them.
The Upshot in the New York Times has a nice article titled In One America, Guns and Diet. In the Other, Cameras and ‘Zoolander.’: Inequality and Web Search Trends by David Leonhardt (August 18, 2014). They combined data from Google on favorite searches by county with socio-economic data to show which searches correlate with richer and poorer areas. While few of the correlations are surprising, they provide details that one wouldn't think of. Not only are religious searches more common in poorer areas, but so are searches for "about hell" and "antichrist." In wealthy areas, by contrast, people search for "holiday greetings," presumably because they are more likely to live far from family.
Anyway, it is a neat study that illustrates how the aggregation of different datasets can work.
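The basic recipe is simple enough to sketch. Assuming two hypothetical CSV files keyed by county FIPS code (one with search rates by term, one with census income figures), a few lines of pandas will surface the terms most associated with richer and poorer counties; the file names and column names below are made up.

```python
import pandas as pd

# Hypothetical inputs: search rates per county and census income data,
# both keyed by county FIPS code. File and column names are placeholders.
searches = pd.read_csv("search_rates_by_county.csv")   # columns: fips, term, rate
income = pd.read_csv("county_income.csv")              # columns: fips, median_income

merged = searches.merge(income, on="fips")

# Correlate each search term's rate with median household income and
# list the terms most associated with poorer and richer counties.
corr = (merged.groupby("term")
              .apply(lambda g: g["rate"].corr(g["median_income"]))
              .sort_values())

print("Terms skewing toward poorer counties:\n", corr.head(10))
print("Terms skewing toward richer counties:\n", corr.tail(10))
```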
Over the last month I've been to a number of conferences that I have been writing notes on.
- At the beginning of July I was at DH 2014 in Lausanne, Switzerland, where I gave a workshop with Stéfan Sinclair on Your Very Own Voyant, participated in some panels and gave a paper (also with Stéfan).
- I was at a Dagstuhl seminar on data science and digital humanities at the end of July. We had a fascinating conversation. I ended up in a workshop on the ethics of big data, which is going to become yet another thing I wish I had the time to study properly.
- At the beginning of August I went to a workshop at Waterloo in honour of Frank Wm. Tompa, Exploiting Text. The workshop had speakers, including myself, who spoke to issues that Tompa was interested in, from dictionaries to algorithms for text retrieval. I was often lost in the algorithm talks, but it was fascinating to listen to a different view of text.
Fotis pointed me to this set of tutorials on Text Analysis with Topic Models for the Humanities and Social Sciences. The tutorials are built around Python, but most of them could be done with other tools. While I haven't worked through the whole set, they look like a great primer on text mining, visualization and interpretation. I particularly like how they include different datasets (British novels, French plays …) to play with.
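For a flavour of what the tutorials cover, here is a minimal topic-modelling sketch using scikit-learn rather than the tutorials' own code; the three snippets of text below are placeholders standing in for the real novel and play datasets.

```python
# A minimal sketch of topic modelling with scikit-learn's LDA.
# The documents are placeholders, not the tutorials' datasets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "It was the best of times, it was the worst of times",
    "Call me Ishmael. Some years ago, never mind how long precisely",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune",
]

# Turn the documents into a word-count matrix, dropping common stopwords.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit a small LDA model and print the top words for each topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```

On a real corpus you would swap the placeholder list for the novels or plays and raise the number of topics; the interesting interpretive work starts once the top-word lists come out.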