Glossographia

Phrontistic revisions

For those of you who may not know, I run a sister site to this blog, The Phrontistery, which in one form or another has been around since 1996, and which features an online dictionary of rare words, glossaries on various topics, and other language-related resources. While the site has been more or less dormant for a few years – mostly I’ve just been keeping the place tidy without adding any new content, I’ve had a slow(ish) summer and so took the opportunity to get things up and running smoothly there again, with a bunch of new content and a new site layout. Over the years I’ve given a lot of thought to somehow combining the two sites, e.g., by moving Glossographia over there or something, but I’ve never had the energy to figure out how difficult that would be. Let me know if you think that would be a terrible (or great) idea, in which case I don’t have to think about it any more.

Figurative is my middle name

Stephen Figurative Chrisomalis. Has a nice ring to it, don’t you think?

In seriousness, an offhand remark I made to my wife this morning, that “Compliance is my middle name,” led me on a very interesting search for the origins of the figurative use of middle name to refer to a paramount or notable characteristic of a person. The entry for middle name in the OED has been relatively recently updated, and includes numerous instances of this figurative use going back to 1905, where the New York Journal has “For retiring you’re—well, that’s your middle name.” and other quotations going up to the present. I did a little further searching around and was able to find an earlier one going back to 1902, in the Manitoba Free Press, quoting a correspondent from Dawson, Yukon Territory (and you’ve got to know that I love it when I antedate something and it turns out to be Canadian):

Source: Manitoba Free Press (Feb 28, 1902), p. 1

This isn’t quite enough to associate it firmly with the Klondike Gold Rush, but it’s a possibility. I was able to find some others from the early 20th century (mostly American, but no others from Canada) and then onward from there. But I also was intrigued when I looked at the Google Ngram for ‘is my middle name’:

Google Ngram search for 'is my middle name' and variants — Google Ngram search for ‘is my middle name’ and variants

That spike peaking right around 1920 is really interesting; equally interesting is that thereafter, it drops back down to relatively modest levels until the 1960s, and then takes off again, reaching its historical peak in the late 1980s and keeping right on going. Now, it’s clear that the initial rise starts well before World War I, so this isn’t something directly associated with soldiers’ slang or the general mixing of dialects during and after the war, but looking at the Google Books results around 1920, this really seems to have been a fad at the time – most of the uses of “is [pronoun] middle name” are non-literal. But by the time that Agatha Christie wrote, in The Murder of Roger Ackroyd (1926: 144), “‘Modesty is certainly not his middle name.’ ‘I wish you wouldn’t be so horribly American, James.’”, the fad was already on the wane – although it never disappeared entirely, for the next several decades it was quite rare.

I was a little surprised to see that there was no immediate bump related to the country anthem Sixteen Tons, first recorded in 1946, and whose rendition by Tennessee Ernie Ford in 1955 reached #1 on the Billboard charts for six weeks, but its line, “Fightin’ and trouble are my middle name” is in a middle verse and perhaps had little linguistic effect (although it was re-recorded many times throughout the 60s and beyond). The post-1965 bump could equally have been inspired by Bobby Vinton’s 1963 single, “Trouble is My Middle Name“, although it peaked only at #33, and, if I may say so, is not really very good. Regardless of the specific impetus, once it took off, it became strongly idiomatic, and today the phrase has become so well-known that it is covered in TV Tropes and elsewhere. I’m confident that my readers will regale me with their favourite examples.

How do you pronounce Detroit?

As you may know, I work at Wayne State University in Detroit, Michigan. Detroit’s been in the news a lot lately, regarding its bankruptcy and a whole lot of other things that, if I were to start talking about them at any length, would just send me off into a rage. This would not be pretty.

So instead, let’s talk about the language of Detroit. Detroit has two main English language varieties: first, the local variety of African American Vernacular English (AAVE), which has been studied by John Rickford, Geneva Smitherman, Roger Shuy, and others; and second, the variety of English that has undergone the Northern Cities Vowel Shift. Here, Penny Eckert’s work is of the greatest significance, building on the work of Bill Labov, but with a specific focus on southeastern Michigan. Most black Detroiters speak the first variety, and most white, locally-born Detroiters speak the second, with some exceptions. Today I’m going to focus just on the second variety, but I’m not going to talk about the dialect as a whole. Instead, I want to talk about variation, and some innovations I’ve noticed, in the speech of Detroit-area residents in the pronunciation of a single word – the place-name Detroit itself.

Now, if you’re like me, and like most North Americans, you probably pronounce Detroit something like /dɪˈtɹɔɪt/, or (for those unfamiliar with IPA, di-TROIT, two syllables, with second syllable stress and rhyming closely with adroit (sound sample). You can hear a clear example of this pronunciation, for instance, in this commercial for the Detroit Zoo. Other examples could be found pretty readily, so I won’t belabour the point. This is the standard pronunciation of Detroit and the baseline for today’s discussion.

There is a second variant heard locally, which has first syllable primary stress rather than second-syllable stress, so, in other words, and where the unstressed first vowel becomes /i/, so, in other words, /ˈdiˌtɹɔɪt/ (DEE-troit). The second syllable then has secondary stress (it can’t be entirely unstressed or the vowel would have to reduce). You hear this pronunciation sporadically and without any particular association with any class or ethnic group, but it’s less common, and we’ll leave it alone.

Lately, however, I’ve been hearing a third pronunciation in a lot of commercials, local news, and the like, in which white Detroit-area residents pronounce their home city as /dɪˈtɹʌɪt/ or even /dɪˈtɹəɪt/, with the first part of the OI vowel unrounded and fronted almost to a schwa. For the non-phonologically-inclined, what seems to be happening is that di-TROIT starts to sound more like di-TRITE. Here’s a good example from a Youtube video from Detroit Real Estate Investing. If you’re not convinced, try loading up both this video and the one from the Detroit Zoo, and running them one after the other to compare a couple of times. Still not convinced? Try this video for America’s Best Value Inn, for another example. Still not convinced? Try this clip from a video made by a local man, with two very clear examples right in a row.

As far as I’ve heard, this variant is only used by people from the Detroit area who have the Northern Cities shift – i.e., it’s not used by people from Milwaukee or Cleveland or Buffalo. It’s not, as far as I can tell, part of the standard analysis of the Northern Cities shift or the specific changes found in the Detroit area, after a lot of time poking around the sound files on Penny Eckert’s website. I haven’t noticed this with any other words that contain the diphthong oi. I wondered whether it might be typical of other words ending in oit or in oi followed by other voiceless stops (e.g. p, t, k). But the problem is that very few words in English end in oit (really just exploit, which is moderately uncommon, and quoit and adroit, both of which are very rare) or oi followed by any voiceless stop consonant (we could add voip and hoick but that’s about it). So we don’t have a lot of other words to compare it to, without going and doing some sociolinguistic research. (This is a hint to any of my future students who may be reading this post).

I can’t find any publication discussing this phenomenon – I’m not a phonologist and would love to hear from someone who could link this up to the Northern Cities shift more broadly. I don’t have any explanation for it, but it’s widespread enough that it deserves some attention.

But wait – we’re not done yet! The reason I wondered about the role of voiceless stops is that while I work in Detroit, I live across the international border in Windsor, Ontario, which is essentially Detroit’s Canadian suburb, and I am a native speaker of Canadian English. Most speakers of Canadian English, including myself, have what’s called the Canadian raising, in which the /aɪ/ diphthong of right and ripe, and the /aʊ/ diphthong of about and house, is raised to /ʌɪ/ or /ʌʊ/ before voiceless consonants and especially voiceless stops – which is why some Americans think that Canadians say aboot or aboat. We don’t, but it might sound that way. And because I, like most Windsorites, have a pretty strong Canadian raising, I pronounce right as /ɹəɪt/ or /ɹʌɪt/, starting with a mid-vowel. Notice that this diphthong is exactly the same as the diphthong in the innovative pronunciation of Detroit. In other words, the ‘oi’ of Detroit for some Detroiters is the same vowel sound as Canadians have in right or trite. They start in completely different diphthongs – /aɪ/ and /ɔɪ/ – but end in the same place.

To make things even more complicated, there is a fourth variant of Detroit used only, as far as I can tell, by older speakers of Canadian English. This is a three-syllable version, /dɪˈtɹɔɪ.ɪt/, or di-TROY-it, to rhyme (non-ironically, I promise!) with destroy it. Most of the users of this variant are Ontario-born native speakers of Canadian English born in the 1950s or earlier. It can be heard, most famously, in the song, The Wreck of the Edmund Fitzgerald by folk singer Gordon Lightfoot, as heard here. It is also typical of the Canadian hockey commentator / blowhard / redneck Don Cherry. I have certainly heard it in Windsor although I suspect it is more common in central and eastern Ontario than here in southwestern Ontario. To this day, and despite all evidence to the contrary over the five years I’ve worked here, my mother (who is of a similar generation, born and raised east of Toronto) refuses to quite believe me that the two-syllable pronunciation is even acceptable or possible. I don’t have any explanation for the emergence of this variant either, but it’s obviously been around for many decades. I’d also love to know whether it’s found widely in any younger Canadian English speakers.

So, in summary, there are four distinct variants of the pronunciation of Detroit, all of which you might hear in the broader Detroit area on any given day:

/dɪˈtɹɔɪt/ (di-TROIT), used by locals and most other English speakers
/ˈdiˌtɹɔɪt/ (DEE-troit), used by locals sporadically
/dɪˈtɹʌɪt/ or /dɪˈtɹəɪt/ (di-TRITE), used by locals who have a strong NCVS
/dɪˈtɹɔɪ.ɪt/ (di-TROY-it), used by (some) older Canadians, including some in Windsor.

Glossographic news of the week

Another busy week of news relevant to readers of this blog:

A few months ago, the Times Literary Supplement reported on the curious case of A.D. Harvey, who was unmasked as a serial creator of false personae who collectively had created a self-perpetuating network of literary fraud, of course all Harvey himself, until the false story of a meeting between Dickens and Dostoyevsky was unmasked and, with it, Harvey himself. Now, this week, the Guardian interviews Harvey, giving a fascinating glimpse into the sort of person who would spend decades creating false identities and fictitious scholarship.

Some of my readers who are keen on cryptography have probably already seen this article in Wired, discussing the fascinating Kryptos Sculpture and its secret decipherment by the NSA years before its official CIA decipherment. The Kryptos Sculpture is one of those things that, in the absence of context, would clearly cause Phaistos 0r Voynich-level excitement in future decipherers.

Not to leave my typography buffs out in the cold, this week the news has been going around about Paul Mathis, the Australian restaurateur who has created a new letter of the alphabet, a logogram ‘Ћ’ for the word ‘the’ to parallel & for ‘and’, at the cost of $38,000. Alas, I don’t see this one catching on, even though the sign already exists in most character sets as a Cyrillic character. I just want to know what costs $38,000 to develop an already-existing character.

The always-fascinating Language Log has a post this week about the fascinating Potosí miners’ language, a mixed language of Spanish, Quechua and Aymara used from the 16th century to this day by miners in central Bolivia. The survival of this fascinating variety is highly dependent on the continuity of traditional mining practices and a multiethnic speech community.

For those of you who have a PhD or are in a doctoral program, you may want to check out this visualization of the lengths of dissertations at the University of Minnesota. My field, anthropology, is second-longest (after history) and has the widest range of any discipline, unsurprisingly in a discipline that spans both natural science and humanities. As for me? My dissertation checks in at 663 pages and is an extreme outlier in any field. Woohoo!

Great news in Native American baseball sociolinguistics: the Arizona Diamondbacks hosted the first-ever baseball game broadcast in Navajo (or indeed, any other Native American language), in honour of their Native American Recognition Day. Now if only we could do something about those pesky mascots elsewhere in the league …

You may have heard this week that J.K. Rowling, the author of the blockbuster Harry Potter series, was unmasked this week as the author of a crime novel, The Cuckoo’s Calling, published under the name of Robert Galbraith. While some of the evidence leading to the break was ordinary sleuthing, there’s a neat discussion in Ben Zimmer’s column in the Wall Street Journal of the role of forensic stylometry, or the linguistic analysis of texts to ascertain authorship, in confirming and breaking the story, with a complementary essay at Language Log by Patrick Juola, who did the analysis, of the science underlying it.

I was so very pleased to see the first post in nearly a year over at the philology blog, Stæfcræft & Vyākaraṇa. And this one is great – a historical linguistic analysis of the Finnish expletive used by Linux developer Linus Torvalds, with digressions into Indo-European mythology. I hate to disagree, with Torvalds, though: there certainly are enough swear words in English, although the Finnish ones sure are fun too!

Stephen Houston and Alexandre Tokovinine write over at the Maya Decipherment blog about some newly-analyzed earspools and a hair ornament bearing Maya glyphs. It’s a shame to not have any provenance on these, part of the great tragedy that is looting in Mayan archaeology, but fascinating nonetheless to see Maya writing outside of monuments and codices, in a decorative context.

Lastly, Stefan Fatsis, the author of Word Freak and general expert on Scrabble, writes this week in the New York Times about the decision by Hasbro to fold the National Scrabble Association, effectively ending its sponsorship of competitive Scrabble. While the immediate effect may be slight – Hasbro’s commitment has been waning for several years and the independent NASPA is going strong, as far as I know – it’s sad to see the abdication of responsibility among game manufacturers for the cultures that keep them vibrant.

Conservative skewing in Google N-gram frequencies

Google Ngram Viewer is a great tool, especially for rough-and-ready searching and visualization of linguistic trends, and as a teaching tool to introduce students to lots of interesting questions we can ask about language variation and patterning. I use it all the time. The default search parameters are for 1800 – 2000, and the Culturomics project notes that, “the best data is the data for English between 1800 and 2000. Before 1800, there aren’t enough books to reliably quantify many of the queries that first come to mind; after 2000, the corpus composition undergoes subtle changes around the time of the inception of the Google Books project.” Elsewhere, the Culturomics FAQ notes that, “Before 2000, most of the books in Google Books come from library holdings. But when the Google Books project started back in 2004, Google started receiving lots of books from publishers. This dramatically affects the composition of the corpus in recent years and is why our paper doesn’t use any data from after 2000.”

OK, so we’ve been warned that the data from before 2000 is very different than the data from after 2000, and especially that 2004 marked a significant change in the corpus. Caveat lector, or whatever you will. But I want to know: In what ways have these ‘subtle changes’ changed the Google N-gram corpus, and therefore, what biases in word frequencies do scholars of language need to account for?

Lately, I’ve had some interest in post-2000 changes in word frequencies for my Lexiculture class project for the fall, and so I’ve been looking at N-gram data going up to 2008 (the last date you can search). I have found some very weird declines in words that probably aren’t actually declining in relative frequency:

It seems notable that all of these words start to decline shortly after 2000, with a particularly steep decline right around 2004-05. All of these words, I would argue, should be stable or increasing in frequency: these are words associated with modern technology and social life. Conversely, many timeless words (e.g., table, lamp, daughter) are flat or rising after 2000. It’s possible that intuitions about what should be happening to words can be wrong. But why are they all wrong in the same direction, and why do they all decline all at the same time?

– One possibility is that the data from 2000 onward aren’t complete yet. There could be some books published over the past few years that haven’t been integrated into Google Books and thus don’t end up in the Ngram viewer. But in any case, n-grams measure a word’s frequency relative to all words published in that year, so the fact that the collection isn’t complete should not affect relative word frequencies at all.

– It’s possible that Google Books has systematically missed archiving books oriented towards technology, but why would that be the case? In fact, if tech-savvy publishers are more likely to submit their works to Google Books (which I think is plausible) than your average publisher, the effect should be to increase these words’ frequency.

– It’s possible that, in the absence of the controlled digitization of books from libraries that characterized the early period of Google Books digitization, and the work done to manage metadata in creating the N-Gram Viewer’s early dataset, massive error has crept into the database. But again, why would this affect particularly modern words negatively, while not affecting words whose frequencies has not been changed?

I think I have a better answer. I think that the N-gram Viewer may be skewed, not because anything significant is being missed, but because something significant is being added. There is a growing tendency for cheap electronic reprints of public domain books to come out and be immediately included in Google Books, with the publication date listed as the date of its electronic reprinting. If Levi Leonard Conant’s book The Number Concept (1896) is scanned and reprinted by Echo Books in 2007, the Google Books metadata doesn’t recognize it as an 1896 book at all. It’s digitized and scanned twice, once (correctly) as an 1896 book and again as a 2007 book. In fact, because it’s in the public domain, I could make my own e-book version for sale as a 2013 book and have it listed again. And while that’s not likely to have a huge effect, imagine every reprint of A Tale of Two Cities or Wuthering Heights that has flooded the market since the invention of e-books, stimulated by and reinforced by projects like Google Books.

Now, I suppose there is a case to be made that the 2007 reprint of Conant is, in some way, a 2007 book. After all, reprints have never been excluded from Google Books and there are plenty of pre-electronic 20th century reprints of Wuthering Heights in the corpus. But each of those earlier reprints represents a costly decision by a publisher that a particular book is important enough and will be read widely enough to warrant its republication. From a ‘culturomics’ perspective, there’s a case to be made that these reprints really constitutes a cultural ‘signal’ in the year of its reprinting, and from a linguistic perspective, we presume that lots of readers will read the words, no matter if they are obsolescent at the time. But as the cost of producing reprints as e-books (or print-on-demand) declines, the ‘culturomic’ value of these books also declines, because publishers no longer need to be concerned about whether many (or even any) people buy these books. The author is long dead, so there are no royalties, and there are no or minimal up-front publishing costs. So Google Books is now being flooded with material that may be largely unread and does not reflect the linguistic or cultural values of the time. Its primary effect, for the N-gram viewer, is to skew relative word frequencies in a way that makes 2013 resemble 1913 more than it actually does. That’s a conservative bias, for those following along at home.

We can then derive a couple of corollaries to check if this theory is correct:

– There are likely to be some words that, while still increasing in frequency, do not increase in frequency quite as much as their actual use should indicate. These are words that have shot up out of nowhere over the past few years, and are continuing to accelerate, but their N-gram shows a tapering off. We see a great example of this in a word like transgender, where we see, right around 2004, a clear decline in the acceleration of its frequency, counter to expectations.

– If some word frequencies are artificially depressed, some other word frequencies must be artificially inflated. But which ones? There are likely to be other words that were very common in the 19th and early 20th centuries (the period where most of these reprints are going to come from), but have been on the decline for a long time and are now quite rare, that show an apparent ‘rejuvenation’ after 2004. Again, we find such a word: negro (uncapitalized), which is virtually non-existent in contemporary written English but was at its peak in the period from 1880-1920, and which shows a clear ‘bump’ after 2004 which can’t possibly be real. You can even see this to a lesser degree with a word like honesty, which (for reasons perhaps best left unanalyzed) had been in decline throughout the 20th century but experiences a bump, again, right around 2004.

In summary, because the Google Books corpus today is derived largely from publisher submissions, and because there is a major signal coming from reprints of public domain books published before 1922, n-grams from 2004 onward (and, to a lesser degree, from 2000-2004) are skewed to make modern words appear more infrequent than they actually are, and obsolescent words more common than they are. The moral is not that Google is evil or conservative or that culturomics is stupid or that the N-gram Viewer is fatally flawed. I do think, nonetheless, that we ought to be aware that the specific kinds of unintentional skewing that are being produced are ones that tend, in a conservative direction, to replicate the linguistic and cultural values of a century ago. This problem is not going away, absent a systematic effort to eliminate reprints from some future N-gram dataset, and it may even be getting worse as electronic reprints become more and more common. Stick to the pre-2000 data, though, just like they advise, and you’ll be in good shape.

Thanks to Julia Pope for her consultation and assistance on aspects of Google Books metadata and cataloging practices.

Share this:

Share this:

Share this:

Share this:

Share this: