Mark Liberman at Language Log posted a nice heads-up today to this article in the Economist, about linguists’ use of the internet to collect linguistic data. There is far more data there, the article points out, than in any existing linguistic corpus, and it is fairly easily searchable, at least if you know exactly what you’re looking for and don’t want to search by something abstract like, say, parts of speech. But…
The web still has its drawbacks. Most of it is in English, limiting its use for other languages (although Dr Resnik is working on a Chinese version of the LSE). And it is mostly written, not spoken, making it tougher to gauge people’s spontaneous use. But since much web content is written by non-professional writers, it more clearly represents informal and spoken English than a corpus such as the North American News Text Corpus does.
There are so many interesting points here. I’ll just follow two: that of internet content as written/spoken, and that of internet content as a viable source for usage evidence.
In a course last semester, we read one paper (on optimality theory, if you must know) that used text from web pages to harvest examples of syntactic alternation for the dative (e.g., Sara gave him the ball vs. Sara gave the ball to him, not that it matters here!). In the abstract, the authors claim that their corpus is of “spoken English.” But it’s retrieved from text, which is seen and read, visually! On a computer screen! Just like this blog, which is surely written!!
This of course spurred a class discussion of how much the web texts really did count as examples of spontaneous, spoken English vs. how much they counted as premeditated, written English. While I agree with the Economist article that web content is likely to be more speech-like than some text-based corpora, I’m wary of the reasoning that that’s because it’s written by “non-professional writers.” It has more to do with a website’s genre, purpose, readership, etc. If I wanted to elicit a speech sample from an ordinary person, regardless of whether they were a professional writer, I wouldn’t ask them to give me something that they had written; I would have a conversation with them. Right? Don’t get me wrong, I think the internet is an extremely valuable source of information for linguistic usage (among gazillions of other things, and maybe even foremost among them all, since validity of the content matters not to the linguist), but let’s not kid ourselves: even when people are writing online, they’re writing. It’s like IM: IM is a pretty good approximation to speech, but it’s not speech. “Speech” on IM is influenced by the medium. So is communication on the internet. I don’t think we can ignore the modalities; there are too many of them now, and we negotiate them all differently. So much of linguistics is focused on speech rather than writing, and that distinction is critical. It is one we need to continue to refine and build on as new media proliferate.
[OK, OK, you're saying: we get it. The internet is not sound waves is not a pen and paper is not the phone. But this is important! At a theoretical level it's crucial. Sure it's okay to use samples from the web, but don't claim that they're of spontaneous, natural, spoken language, because they're not.]
That said, the internet is, as I already said (is this post redundant yet?), an extremely valuable resource for linguists, and also for lexicographers. I wrote a term paper last semester about the potential of online dictionaries (if anyone cares to actually read it, I’m happy to send it on) as sites of linguistic democratization. What’s curious to me about dictionaries is that they claim to represent spoken language (and of course written language), yet they accept only written, meaning published, examples as evidence for usage. One would think, hey, the internet’s writing! Internet’s perfect, because it’s written but it’s not necessarily written by “professionals.” But noooo. The OED won’t accept internet sources yet. Websites are too “ephemeral,” they say; authors can’t be contacted, the page might not be there tomorrow. But that’s exactly why the internet is so valuable: it offers that gray area of usage between speaking and writing and, moreover, it offers usage from the non-elite.
Maybe it all makes sense, according to the following syllogistic logic, which is undoubtedly fallacious, as it was formulated in haste to complete this too-long post:
The OED accepts written sources.
Material on the internet is written.
The OED does not accept internet sources.
So maybe…
Material on the internet is spoken.
Linguists use spoken material.
Linguists can use the internet!