Conservative skewing in Google N-gram frequencies

Google Ngram Viewer is a great tool, especially for rough-and-ready searching and visualization of linguistic trends, and as a teaching tool to introduce students to lots of interesting questions we can ask about language variation and patterning.  I use it all the time.  The default search parameters are for 1800 – 2000, and the Culturomics project notes that, “the best data is the data for English between 1800 and 2000. Before 1800, there aren’t enough books to reliably quantify many of the queries that first come to mind; after 2000, the corpus composition undergoes subtle changes around the time of the inception of the Google Books project.” Elsewhere, the Culturomics FAQ  notes that, “Before 2000, most of the books in Google Books come from library holdings. But when the Google Books project started back in 2004, Google started receiving lots of books from publishers. This dramatically affects the composition of the corpus in recent years and is why our paper doesn’t use any data from after 2000.”

OK, so we’ve been warned that the data from before 2000 is very different from the data from after 2000, and especially that 2004 marked a significant change in the corpus. Caveat lector, or whatever you will. But I want to know: in what ways have these ‘subtle changes’ altered the Google N-gram corpus, and therefore, what biases in word frequencies do scholars of language need to account for?

Lately, I’ve had some interest in post-2000 changes in word frequencies for my Lexiculture class project for the fall, and so I’ve been looking at N-gram data going up to 2008 (the last date you can search).  I have found some very weird declines in words that probably aren’t actually declining in relative frequency:

digital

framework

archive

electronic

international

information

technology

gender

interdisciplinary

It seems notable that all of these words start to decline shortly after 2000, with a particularly steep decline right around 2004-05. All of these words, I would argue, should be stable or increasing in frequency: these are words associated with modern technology and social life. Conversely, many timeless words (e.g., table, lamp, daughter) are flat or rising after 2000. It’s possible that intuitions about what should be happening to words can be wrong. But why are they all wrong in the same direction, and why do they all decline at the same time?

– One possibility is that the data from 2000 onward aren’t complete yet.  There could be some books published over the past few years that haven’t been integrated into Google Books and thus don’t end up in the Ngram viewer.  But in any case, n-grams measure a word’s frequency relative to all words published in that year, so the fact that the collection isn’t complete should not affect relative word frequencies at all.
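To make the arithmetic concrete, here is a minimal sketch of how relative frequency is computed; the counts are invented for illustration. It shows why a smaller but unbiased sample leaves the ratio unchanged:

```python
# Relative frequency of a word in a given year: its token count divided by
# the total number of tokens in all books dated to that year.
# All counts below are invented for illustration.

def relative_frequency(word_count: int, total_tokens: int) -> float:
    """N-gram-style relative frequency for one year."""
    return word_count / total_tokens

# Suppose only half the books for a year have been scanned. If the sample
# is unbiased, both counts halve and the ratio is unchanged.
full = relative_frequency(5_000, 1_000_000_000)
half = relative_frequency(2_500, 500_000_000)
assert full == half  # incompleteness alone doesn't skew relative frequency
```

The point, as above, is that a merely incomplete collection preserves ratios; only a *biased* change in corpus composition can move them.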

– It’s possible that Google Books has systematically missed archiving books oriented towards technology, but why would that be the case?  In fact, if tech-savvy publishers are more likely than the average publisher to submit their works to Google Books (which I think is plausible), the effect should be to increase these words’ frequencies.

– It’s possible that, in the absence of the controlled digitization of books from libraries that characterized the early period of Google Books digitization, and the work done to manage metadata in creating the N-gram Viewer’s early dataset, massive error has crept into the database.  But again, why would this affect particularly modern words negatively, while leaving other words untouched?

I think I have a better answer.  I think that the N-gram Viewer may be skewed, not because anything significant is being missed, but because something significant is being added.  There is a growing tendency for cheap electronic reprints of public domain books to come out and be immediately included in Google Books, with the publication date listed as the date of the electronic reprinting.  If Levi Leonard Conant’s book The Number Concept (1896) is scanned and reprinted by Echo Books in 2007, the Google Books metadata doesn’t recognize the reprint as an 1896 book at all.  The book thus enters the corpus twice: once (correctly) as an 1896 book and again as a 2007 book.  In fact, because it’s in the public domain, I could make my own e-book version for sale as a 2013 book and have it listed yet again.  And while that one title isn’t likely to have a huge effect, imagine every reprint of A Tale of Two Cities or Wuthering Heights that has flooded the market since the invention of e-books, stimulated and reinforced by projects like Google Books.

Now, I suppose there is a case to be made that the 2007 reprint of Conant is, in some way, a 2007 book.  After all, reprints have never been excluded from Google Books, and there are plenty of pre-electronic 20th-century reprints of Wuthering Heights in the corpus.  But each of those earlier reprints represents a costly decision by a publisher that a particular book is important enough, and will be read widely enough, to warrant its republication.  From a ‘culturomics’ perspective, there’s a case to be made that each such reprint really constitutes a cultural ‘signal’ in the year of its reprinting, and from a linguistic perspective, we presume that lots of readers will read its words, even those that are obsolescent at the time.  But as the cost of producing reprints as e-books (or print-on-demand) declines, the ‘culturomic’ value of these books declines as well, because publishers no longer need to be concerned about whether many (or even any) people buy them.  The author is long dead, so there are no royalties, and up-front publishing costs are minimal or nil.  So Google Books is now being flooded with material that may be largely unread and does not reflect the linguistic or cultural values of the time.  Its primary effect, for the N-gram Viewer, is to skew relative word frequencies in a way that makes 2013 resemble 1913 more than it actually does.  That’s a conservative bias, for those following along at home.
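A toy simulation (all counts invented for illustration) makes the proposed mechanism concrete: folding reprinted nineteenth-century text into a recent year’s corpus depresses the relative frequencies of modern words and inflates those of archaic ones:

```python
# Toy model of reprint skew. All counts are invented for illustration.
# A 'modern' word appears only in genuinely new books; an 'archaic' word
# appears mostly in reprinted public-domain text.

new_tokens = 900_000_000          # tokens from books actually written recently
reprint_tokens = 100_000_000      # tokens from reprints of 19th-century books

modern_in_new = 45_000            # e.g. a word like 'digital' in new text
modern_in_reprints = 0            # absent from 1890s prose
archaic_in_new = 900              # an obsolescent word in new text
archaic_in_reprints = 5_000       # common in 1890s prose

def freq(count, total):
    return count / total

# Frequencies without the reprints...
modern_clean = freq(modern_in_new, new_tokens)
archaic_clean = freq(archaic_in_new, new_tokens)

# ...and with the reprints folded into the same year.
total = new_tokens + reprint_tokens
modern_skewed = freq(modern_in_new + modern_in_reprints, total)
archaic_skewed = freq(archaic_in_new + archaic_in_reprints, total)

assert modern_skewed < modern_clean    # modern word artificially depressed
assert archaic_skewed > archaic_clean  # archaic word artificially inflated
```

The magnitudes here are arbitrary; the direction of the skew is the point, and it matches the patterns described above.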

We can then derive a couple of corollaries to check if this theory is correct:

– There are likely to be some words that, while still increasing in frequency, do not increase quite as much as their actual use would indicate.  These are words that have shot up out of nowhere over the past few years, and are continuing to accelerate, but whose N-gram shows a tapering off.  A great example is a word like transgender, where, right around 2004, there is a clear decline in the acceleration of its frequency, counter to expectations.

– If some word frequencies are artificially depressed, some other word frequencies must be artificially inflated.  But which ones?  There are likely to be other words that were very common in the 19th and early 20th centuries (the period where most of these reprints are going to come from), but have been on the decline for a long time and are now quite rare, that show an apparent ‘rejuvenation’ after 2004.  Again, we find such a word: negro (uncapitalized), which is virtually non-existent in contemporary written English but was at its peak in the period from 1880-1920, and which shows a clear ‘bump’ after 2004 which can’t possibly be real.    You can even see this to a lesser degree with a word like honesty, which (for reasons perhaps best left unanalyzed) had been in decline throughout the 20th century but experiences a bump, again, right around 2004.

In summary, because the Google Books corpus today is derived largely from publisher submissions, and because there is a major signal coming from reprints of public domain books published before 1922, n-grams from 2004 onward (and, to a lesser degree, from 2000-2004) are skewed to make modern words appear less frequent than they actually are, and obsolescent words more common than they are.  The moral is not that Google is evil or conservative, or that culturomics is stupid, or that the N-gram Viewer is fatally flawed.  I do think, nonetheless, that we ought to be aware that the specific kinds of unintentional skewing being produced are ones that tend, in a conservative direction, to replicate the linguistic and cultural values of a century ago.  This problem is not going away, absent a systematic effort to eliminate reprints from some future N-gram dataset, and it may even be getting worse as electronic reprints become more and more common.  Stick to the pre-2000 data, though, just as they advise, and you’ll be in good shape.

Thanks to Julia Pope for her consultation and assistance on aspects of Google Books metadata and cataloging practices.

Milestones

Yesterday, my post, Cistercian number magic of the Boy Scouts, was the 200th post on Glossographia since its inception.

Today, around 8:45pm EDT (roughly 15 minutes ago as I write), some lucky visitor to the site rolled over the odometer, marking 100,000 views of the site all-time.

In honour of these milestones, here is an actual milestone, from Thanjavur (Tanjore), Tamil Nadu, India, which features the numerals ’14’ and ‘8’ in both Tamil and Western numerals.

[Image: old milestone at Thanjavur]

Neolithic Chinese sign-systems: writing or not writing?

The Guardian just reported today on a find from Zhuangqiao (near Shanghai) of artifacts bearing writing-like symbols that date back over 5,000 years.  If substantiated, this would push the history of Chinese writing back an additional millennium or more before the earliest attested ‘oracle-bones’ and other inscriptions of the Shang dynasty.

The article reports that the artifacts in question were excavated between 2003 and 2006, but the information is both slight and non-specific, and doesn’t link to any specific publication as yet, so it’s difficult to know how, if at all, this relates to the host of other reports of writing or writing-like material from Chinese Neolithic sites (the Wikipedia page on Neolithic Chinese signs is quite extensive).  The signs from Jiahu are much older than those of the newly reported find, for instance.

I think that the difference at issue, discussed in the Guardian piece, is the presence on some of these artifacts of series of several signs in a row, suggesting sentence-like structure rather than, say, ownership marks or clan emblems or mere decoration, which is what most of the other Neolithic signs have been determined to be.  I have to say that, if the stone axe pictured in the article is representative of the new finds, then I’m dubious of the entire enterprise – those marks do not look, to me, to have a writing-like nature, and some of them may not be ‘signs’ at all.  I hate to be so negative, but the tendency to announce finds in the media that never come to anything in publication is so great that we should indeed be highly skeptical when such announcements are made in the absence of a published site report or article.

Cistercian number magic of the Boy Scouts

“You know that in ancient times religion, astronomy, medicine, and magic were all mixed up so that it was difficult to tell the beginning of one and the ending of the other and to-day the Gypsies, hoboes, free masons, astronomers, scientists, almanacs, and physicians still use some of the old magical emblems.  So there is no reason why the boys of to-day should be debarred from using such of the signs as may suit their games or occupations and we will crib for them the table of numerals from old John Angleus, the astrologer.  He learned them from the learned Jew, Even Ezra, and Even Ezra learned them from the ancient Egyptian sorcerers, so the story goes; but the reader may learn them from this book.” (Beard 1918: 91)

So begins the chapter, “Numerals of the Magic: Ancient System of Secret Numbers”, by Daniel Carter Beard in his 1918 volume The American boys’ book of signs, signals, and symbols, which you can download from Google Books for free.  Beard was one of the founders of the Sons of Daniel Boone in the early 20th century, which merged with the Boy Scouts of America (of which Beard was a key founder) in 1910 when that famous group was formed.  Beard wrote a number of popular books intended for boys in the Scouting movement, including this one.   Scouting books today do not, as a rule, make reference to esoteric Egyptian sorcery or Freemasonry or ‘John Angleus’ (who is Johannes Engel (1453-1512)) or ‘the learned Jew, Even Ezra’ (Abraham ibn Ezra (1089-1164)), or, for that matter, have a chapter on number magic at all.   At least, I never heard about it, and I was in Scouts for over a decade.  But we are fortunate that this one did, because it has a couple of real treasures inside, not previously recognized as such.

Let’s take the second one first.  It appears on p. 92 immediately following the passage I just quoted:

(Beard 1918: 92)

For those of you familiar with my book Numerical Notation, these are the numerals used primarily by Cistercian monks from the 13th to 15th centuries, and thereafter described in early modern numerology and astrology for several centuries, though largely at that point as an intellectual curiosity rather than a practical notation.  David King’s wonderfully detailed Ciphers of the Monks (King 2001), which is one of the few books at that price point (somewhere around $150, if I recall) that may be worth it, lists every example the author could find of these numerals, from medieval astrolabes to Belgian wine barrels to 20th-century German nationalist texts.  It’s extremely comprehensive.  However, it does not mention Beard’s book – and why should it? What a bizarre place to find such a numerical system!  It’s what I describe as a ciphered-additive system, which is to say that there is no zero because none is needed: there is a distinct sign for each of 1-9, 10-90, 100-900, and 1000-9000.  The Cistercian numerals are a little anomalous typologically; another interpretation would be that they are positional, but use rotational rather than linear position – the signs for 9, 90, 900, and 9000 (e.g.) are rotations or flips of one another, so we could consider them the same sign (9) in four different orientations.  Zero is superfluous (and thus not present) because, unlike in linear texts, there is no ‘gap’ to be accounted for by an empty place-value.
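The ciphered-additive principle can be sketched in a few lines of code.  The sign labels below (e.g. ‘U6’ for units-6) are my own invented stand-ins for the actual graphic forms:

```python
# Ciphered-additive decimal decomposition: one distinct sign per value in
# 1-9, 10-90, 100-900, and 1000-9000, with no zero sign needed.
# Labels like 'U7' (units-7) are invented stand-ins for graphic signs.

def cistercian_components(n: int) -> list[str]:
    """Decompose 1 <= n <= 9999 into its ciphered-additive components."""
    if not 1 <= n <= 9999:
        raise ValueError("Cistercian numerals cover 1-9999")
    prefixes = ["U", "T", "H", "M"]  # units, tens, hundreds, thousands
    parts = []
    for power, prefix in enumerate(prefixes):
        digit = (n // 10**power) % 10
        if digit:                     # a zero digit simply leaves no sign
            parts.append(f"{prefix}{digit}")
    return parts

print(cistercian_components(1896))  # ['U6', 'T9', 'H8', 'M1']
print(cistercian_components(5007))  # ['U7', 'M5'] - no zero placeholder
```

Note how 5007 needs no zero: the absent hundreds and tens simply contribute no sign, which is exactly why such systems have no empty place-value to mark.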

I became curious and tried to figure out why Beard attributed these to ‘Angleus’ and to ‘Even Ezra’.  Engel’s Astrological Optics was translated into English (1655) but contains no Cistercian numerals, and King doesn’t note him as using or depicting the system.  Similarly, ibn Ezra was not a known user of the system.  I haven’t been able to find any other source that attributes the system to those individuals; rather, it’s almost always Agrippa of Nettesheim or Regiomontanus who are invoked in the scholarship.  We know that Beard was a Freemason, so he may have had access to some Masonic texts that said as much, but I can’t find any such reference, and King doesn’t mention any likely sources either, although he does note that many Masons (especially in France) were familiar with the Cistercian system.  So it’s not entirely clear where Beard learned about the system (although see below), and he’s got a lot of things mixed up in the account.

The other numerical treasure in Beard’s book is even more fascinating, although it appears in the previous chapter, on codes and ciphers, and is less prominent: the ‘tit-tat-toe’ numerals, on p. 85:

(Beard 1918: 85)

So what we see here, again, is a ciphered-additive decimal system in which there is a ‘family resemblance’ between 9, 90, 900, and 9000 (and the other numbers so patterned), but no zero.  The signs are designed after their place in a hash / tic-tac-toe / octothorpe grid, with the power indicated through ornamentation.  As a ciphered-additive system, it’s like the Cistercian numerals (although the signs are completely different), but instead of placing signs around a vertical staff, the signs are constructed into a box.  Note that the signs in each numeral-phrase are not strictly ordered, but are packed compactly in whatever way suits the resulting box aesthetically.  This is one of the advantages of ciphered-additive systems: the signs can be re-ordered, for cryptographic purposes or for any other reason, without loss of numerical meaning.  But I know of no system quite like this, where numerals are arranged in a box-like shape, or where there is such a novel means of forming individual signs.
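The re-ordering property is easy to demonstrate: because each sign carries its full value, the value of a numeral-phrase is just the sum of its signs, in any order.  As before, labels like ‘T4’ (= 40) are invented stand-ins for the graphic signs:

```python
# In a ciphered-additive system each sign carries its full value, so the
# signs of a numeral-phrase can be shuffled without changing the number.
# Labels like 'T4' (= 40) are invented stand-ins for graphic signs.
import itertools

VALUES = {"U3": 3, "T4": 40, "H2": 200, "M7": 7000}

def read_phrase(signs):
    """The value of a numeral-phrase is the sum of its signs' values."""
    return sum(VALUES[s] for s in signs)

phrase = ["M7", "H2", "T4", "U3"]
# Every one of the 24 orderings reads as the same number, 7243.
assert all(read_phrase(p) == 7243 for p in itertools.permutations(phrase))
```

This order-freedom is exactly what lets Beard’s system pack signs into a box however the aesthetics demand.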

Beard is explicit that this system is newly designed: “The tit-tat-toe system of numerals here shown for the first time is entirely new and possesses the advantage of being susceptible of combinations up to four figures which suggests nothing to the uninitiated but a sort of Japanese form of decoration” (Beard 1918: 84).  He claims that ‘Cabala’ is just another name for the tit-tat-toe, which is highly dubious, but he is clearly trying to invoke a connection between his newly developed system and Jewish mysticism – in the hope that Boy Scouts will use it as a numerical code.  Ciphered-additive numerals are rare enough in the modern era – most such systems are obsolescent at best.  So it’s fascinating to see a twentieth-century system right at the moment of its development.  It’s also fascinating to see how mystical, spiritual, and numerological knowledge from early modern authors was incorporated into a manual for Boy Scouts and recommended for use in cryptography.

We’re not quite done, though.  Based on some of the (otherwise uncited) quotations in Beard’s book, I concluded that he was taking some of his ‘insight’ about the ‘Cabala’ from L.W. De Laurence’s Great Book of Magical Art (1915), which was a popular American book of spiritualism and Oriental mysticism at the time.  And, looking into de Laurence’s book, lo and behold, what did I find?

(De Laurence 1915: 174)

De Laurence, whose work is also not noted by King, gives a more standard attribution than does Beard for what we now know to be the Cistercian numerals: he attributes them to the ‘Chaldeans’, which is a very common descriptor for the system and is even found in the scholarly literature.  He doesn’t mention Angelus or Even Ezra or any other of the medieval and early modern authors who use the system, so it’s still a mystery how Beard made that attribution.  But, given that there really are not a lot of texts that discuss this system at all, I suggest that Beard encountered them through De Laurence and possibly confounded their origin with some other understandings he had picked up along the way, possibly through Masonic writings.

It’s not every day that I discover a new numerical notation system, and it’s great to do so even when it’s one that seems to have been developed once and never adopted more widely.  So it was neat to find the ‘tit-tat-toe’ system, even if it never appeared anywhere else.  But I also found it fascinating to track the transmission of the much more widespread (but still under-appreciated) Cistercian numerals along their roundabout path to a Scouting manual for boys.  As King’s book amply demonstrates, the system has a tendency to show up in the oddest places, so perhaps we should (ahem) ‘be prepared’ to find it anywhere.

Beard, Daniel Carter. 1918. The American boys’ book of signs, signals and symbols. Philadelphia: Lippincott.

De Laurence, L. W. 1915. The great book of magical art, Hindu magic and East Indian occultism. Chicago, Ill., U. S. A.: De Laurence Co.

King, David A. 2001. The ciphers of the monks: a forgotten number-notation of the Middle Ages. Stuttgart: F. Steiner.

News roundup

I’m going to try to post groupings of short news pieces on a weekly or biweekly basis.     You may have already seen some of these if you follow me on Twitter.

Serendipitously, shortly after I posted about numerical copyediting and told my story about Indian English numerals, Toyota  announced that it is abandoning its longstanding practice of using Japanese English numerals in favour of international (read: American) ‘million’ and ‘billion’.   Japanese-influenced English uses multiples of 10,000 and 10 million, like the Japanese language, so that what Americans write as one million is “100 ten-thousands” instead.
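The ten-thousands grouping is simple arithmetic; here is a quick sketch (the helper name is my own):

```python
# Western English groups large numbers at powers of 1,000 (thousand,
# million, billion); Japanese-style grouping uses the 'man' unit of
# 10,000 (10^4), so 1,000,000 is read as 100 ten-thousands.

def in_ten_thousands(n: int) -> str:
    """Express n as a multiple of 10,000, Japanese-style."""
    myriads, rest = divmod(n, 10_000)
    if rest == 0:
        return f"{myriads} ten-thousands"
    return f"{myriads} ten-thousands and {rest}"

print(in_ten_thousands(1_000_000))    # 100 ten-thousands
print(in_ten_thousands(250_000_000))  # 25000 ten-thousands
```

The same quantity, two groupings: the mismatch between the two conventions is exactly what makes such figures a copyediting hazard.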

From the Independent: a collection of some of the most highbrow jokes in the world.  #8 and #19 are my favourites, for reasons that will be evident, but there are lots of language-related ones.    Although, to be semantically particular, I think that many of these jokes are ‘nerdish’ rather than ‘highbrow’, strictly speaking.

The Globe and Mail has a great interview with Christine Schreyer who is a linguistic anthropologist at the University of British Columbia-Okanagan, and who was recently employed as a consultant to the film Man of Steel to develop a constructed language for Kryptonian.  I haven’t seen the movie but I wonder whether the writing system the producers employed was this one:

[Image: Kryptonian writing system]

Over at the Doing Good Science blog at Scientific American, there’s a very interesting post on disrespect and sexism in science journalism.  Here’s a tip: when writing about a conference, don’t refer to women (but not men) by first name only, and don’t go out of your way to mention the physical appearance of women (but not men).  The post also has useful tips for how to respond to such incidents when they happen.

Although I object to describing it as ‘tuition free’, it’s very interesting that the Oregon legislature has unanimously passed a law to allow students to initially attend state institutions without paying tuition, and then pay 3% of their earnings for the 24 years after graduation.  This has lots of potential problems but has worked well in Australia and elsewhere.  Perhaps most surprising is that Democrats and Republicans in the legislature could unanimously agree on anything!

Ronald Kephart, who also blogs at The Cranky Linguist, has a nice pedagogical essay at Anthropology News on Illustrating science through language.    Linguistic anthropology sometimes gets a bum rap as being all mushy, and Kephart shows how to add rigour and critical analysis to students’ toolkits when thinking about language and culture.

There’s an interesting piece on so-called ‘helicopter parents’ over at CNN.com, whose over-attentiveness to their adult children in academia or in employment causes negative repercussions. I have to say – and maybe this is a function of where I work – that while I have had one or two parents call or come for a meeting regarding their child’s graduate education, I have not found this to be a big problem at Wayne State.

Over at Tenth Letter of the Alphabet, there’s a very interesting post for typography geeks and SF geeks (highly overlapping sets, to be sure) on the history of the STAR WARS logo.  Over the past week, I’ve been watching Episodes IV and V with my son, who is eight and hasn’t seen them before (we’re watching them in Machete Order), so it’s been on my mind.

Finally, there’s a thoughtful (if somewhat gloomy) essay at the Chronicle of Higher Education on attrition in PhD programs.  As the graduate director of a mid-sized social sciences program, I often have reason to think about this.    Just about the only thing everyone agrees on is that 0% attrition is too low and 100% is too high – but what is appropriate?   The essay led me to the PhD Completion Project, which has a ton of interesting quantitative information on PhD completion and attrition rates across multiple institutions, along with policy recommendations.