“He said, she said” — do men and women use different words when they write?

Do men and women use words in different ways? A group of Israeli artificial-intelligence experts think so. They crunched a bunch of English texts by men and women, both fiction and nonfiction, and looked for interesting patterns. The results? In this paper, they argue that it’s possible to figure out the gender of an author merely by paying attention to a few everyday words — and their guesses are accurate 80 per cent of the time, or higher.

For example, they discovered that in fiction, men are more likely than women to use the words a, the, and as; meanwhile, women are more likely than men to use the words she, for, with, and not. In nonfiction, men are more likely than women to use that and one. Women, however, are more likely than men to use for, with, not, and, and in.
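To make the idea concrete, here is a minimal sketch of how one might count those marker words in a passage of fiction. The word lists come from the paragraph above; everything else (the function names `tokenize` and `marker_rates`, the per-1,000-token scaling, the sample sentence) is my own assumption for illustration, not the researchers' actual classifier, which surely weighs far more features than this.

```python
import re
from collections import Counter

# Marker words quoted in the post for fiction. Treating them as two flat sets
# is an assumption of this sketch, not the researchers' method.
MALE_FICTION_MARKERS = {"a", "the", "as"}
FEMALE_FICTION_MARKERS = {"she", "for", "with", "not"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def marker_rates(text):
    """Return (male_rate, female_rate): marker words per 1,000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    male = sum(counts[w] for w in MALE_FICTION_MARKERS)
    female = sum(counts[w] for w in FEMALE_FICTION_MARKERS)
    scale = 1000 / len(tokens)
    return male * scale, female * scale

# Hypothetical sample sentence, written only to exercise the counter.
sample = "She left with the dog, not for the walk but as a favor."
male_rate, female_rate = marker_rates(sample)
print(f"male markers: {male_rate:.1f}, female markers: {female_rate:.1f}")
```

A single sentence tells you nothing, of course. The rates only become meaningful when they are computed over whole books and compared across many authors.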

Here’s another weird data point: Men use the pronoun he with roughly the same frequency as women, but women use the total set of all pronouns (he, she, they, etc.) more than men do.

Interestingly, there are also some differences between the way everyone uses language in fiction and nonfiction. All authors — both male and female — used pronouns and negation more in fiction than nonfiction.

Did this technique make any mistakes? Yep. The professors crunched 920 English-language texts, and misclassified 12 texts, which were:

Possession, by A. S. Byatt
The Remains of the Day, by Kazuo Ishiguro
Now We Are Thirty-Somethings, by Charles Jennings
Now Then Davos, by Martin Wiley, David Harmer, and Ian McMillan
The Siege of Krishnapur, by J. G. Farrell
A Landing on the Sun, by Michael Frayn

Thank You for Having Me, by Maureen Lipman
A Crowd Is Not Company, by Robert Kee
T.S. Eliot: A Friendship, by Frederick Tomlin
Walking on Water, by Andy Martin
Unpublished Letters and manuscripts, by an Unlisted Female Author
Falling for Love: How Teenaged Mothers Talk, by Sue Sharp

As the scientists note, of the six misclassified non-fiction documents, all are biographical or diary-like. That’s intriguing, insofar as one might expect that people would write most “like” their gender when they’re writing about personal experience. Meanwhile, of the six misclassified fiction documents, all are by men, except for Possession. What’s up with that? Are these men writing “like” women? (Heh. Maybe this is a subterranean reason why Jonathan Franzen freaked out so badly when Oprah picked The Corrections for her book club.) On the other hand, decades of gender theory have ably pointed out that gender is an insanely slippery thing: Men can so often act “like” women, and vice versa, that the whole idea of drawing hard lines around what’s male and what’s female is sort of bonkers. It’d be interesting to replicate this study with texts solely by gay men, lesbians, or transgender people (the folks who often mess directly with society’s concepts of male and female roles) to see if it generates any different results.

The scientists don’t offer any theories as to why these differences exist. But for me, what’s most interesting is that the words they’re focussing on (the ones that create the “fingerprint” identifying the document) are very common, throwaway words like at, she, but, or that. You wouldn’t expect such simple words to be so important in determining meaning.

Actually, almost all artificial-intelligence research into language backs this up. A decade ago, Thomas Landauer pioneered Latent Semantic Analysis — a way of automatically figuring out the “content” of a piece of writing by looking at a fingerprint of its words. Again, you’d expect that the most “important” words in a document, in terms of identifying what it’s about, would be the ones most individually freighted with meaning. For example, if you looked at this blog entry, you might think the words artificial, intelligence, gender, fiction, nonfiction, men and women would be significant. But what Landauer found is that you could strip out those big-meaning words, leaving all the other stuff behind — the buts, ands, ors, whiches, etc. — and you could still figure out what the document was about. Spooky, eh?

It’s also like the epiphany of Donald Foster — the professor who analyzes word occurrence to determine the author of texts that have been left anonymous by history. He’s the one, you may recall, who figured out that Joe Klein wrote the book Primary Colors. As he noted in his book on the subject, the words that are most revealing of one’s identity are not the high-meaning words — because those are the ones we pay attention to, and sculpt like clay. The ones that reveal our identity are the low-meaning ones — the ifs, the ands, the buts — because we use them unconsciously. They aren’t as subject to our will, and thus are a lot harder to obfuscate.
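For the curious, here is a rough sketch of what comparing texts by their low-meaning words could look like. To be clear, this is not Foster's method, which I haven't seen spelled out in code. It simply measures how often a handful of function words turn up in each text and compares the resulting profiles with cosine similarity. The word list, the function names `profile` and `cosine`, and the three sample texts are all hypothetical.

```python
import math
import re
from collections import Counter

# A small, hand-picked list of low-meaning words (an assumption of this sketch).
FUNCTION_WORDS = ["if", "and", "but", "the", "a", "of", "or", "that", "in", "with"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical texts, invented for this example.
known = "If the vote holds, and the money comes in, but the press stays quiet, that is a win."
disputed = "If the polls hold, and the donors call, but the rumor dies, that is a start."
unrelated = "Stocks fell sharply Tuesday. Traders blamed weak earnings reports."

same_habits = cosine(profile(known), profile(disputed))
different_habits = cosine(profile(known), profile(unrelated))
print(same_habits, different_habits)
```

The first two texts share almost no content words, yet their function-word profiles line up closely. The third uses none of the listed words, so its similarity to the first drops to zero. Real attribution work would need far longer texts and a far longer word list.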

Maybe I should just stop writing blog entries in full sentences. I’ll just use pronouns and conjunctions.

“I in and the but the they or and.”

(Thanks to Rachel for pointing out this study to me!)

I'm Clive Thompson, the author of Smarter Than You Think: How Technology is Changing Our Minds for the Better (Penguin Press). You can order the book now at Amazon, Barnes and Noble, Powells, Indiebound, or through your local bookstore! I'm also a contributing writer for the New York Times Magazine and a columnist for Wired magazine. Email is here or ping me via the antiquated form of AOL IM (pomeranian99).

Collision Detection: A Blog by Clive Thompson