Wiktionary:Frequency lists

Counting words and lemmas: The following frequency lists count distinct orthographic words, including inflected forms and sometimes capitalised spellings (such as those used at the beginning of sentences). For example, the verb "to be" is represented by its conjugated forms "is", "are", "were", etc., each counted separately.

English

TV and movie scripts

Most common words in TV and movie scripts: Here are frequency lists comparable to the Gutenberg ones, but based on 29,213,800 words from TV and movie scripts and transcripts.

Here's a fuller explanation of how the list was generated and its limitations: Wiktionary:Frequency lists/TV/2006/explanation.

Here are the top hundred words (from TV scripts) in alphabetical order:

a · about · all · and · are · as · at · back · be · because · been · but · can · can't · come · could · did · didn't · do · don't · for · from · get · go · going · good · got · had · have · he · her · here · he's · hey · him · his · how · I · if · I'll · I'm · in · is · it · it's · just · know · like · look · me · mean · my · no · not · now · of · oh · OK · okay · on · one · or · out · really · right · say · see · she · so · some · something · tell · that · that's · the · then · there · they · think · this · time · to · up · want · was · we · well · were · what · when · who · why · will · with · would · yeah · yes · you · your · you're

Here they are in frequency order:

1-1000 · 1001-2000 · 2001-3000 · 3001-4000 · 4001-5000 · 5001-6000 · 6001-7000 · 7001-8000 · 8001-9000 · 9001-10000
Top 1,000 words cover 85.5% of all words (24,981,922/29,213,800).
Top 10,000 words cover 97.2% of all words (28,398,152/29,213,800).
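
The coverage figures above are simple ratios of running words. As a quick check, the arithmetic can be reproduced with the short sketch below; the constant and function names are made up for illustration, and only the three totals come from this page.

```python
# Totals quoted on this page; the names are illustrative only.
TOTAL_WORDS = 29_213_800        # all running words in the script corpus
TOP_1000_TOKENS = 24_981_922    # running words covered by the top 1,000
TOP_10000_TOKENS = 28_398_152   # running words covered by the top 10,000


def coverage(covered: int, total: int) -> float:
    """Return the share of running words covered, as a percentage."""
    return 100.0 * covered / total


print(f"Top 1,000 words:  {coverage(TOP_1000_TOKENS, TOTAL_WORDS):.1f}%")
print(f"Top 10,000 words: {coverage(TOP_10000_TOKENS, TOTAL_WORDS):.1f}%")
```

Rounded to one decimal place, both results agree with the percentages given above.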

From the 10,001st to the 40,000th:

10001-12000 · 12001-14000 · 14001-16000 · 16001-18000 · 18001-20000 · 20001-22000 · 22001-24000 · 24001-26000 · 26001-28000 · 28001-30000 · 30001-32000 · 32001-34000 · 34001-36000 · 36001-38000 · 38001-40000
40001-41284 (the dregs that were tied for the final place)

The lists will probably stop here. They cover about a third of all the unique words; the rest were used five or fewer times each.
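
The cutoff described above (dropping words used five or fewer times) can be sketched roughly as follows. This is only an illustration, not the script that produced the lists: the function name, the `min_count` parameter and the sample counts are all invented here.

```python
from collections import Counter


def frequent_words(counts, min_count=6):
    """Return (word, count) pairs used at least min_count times.

    min_count=6 mirrors "the rest were used 5 or fewer times each";
    the parameter itself is an assumption made for this sketch.
    """
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Most frequent first; ties are ordered alphabetically for stability.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Invented sample counts, for illustration only.
sample_counts = Counter({"the": 40, "hey": 12, "okay": 6, "dregs": 5, "zymurgy": 1})
print(frequent_words(sample_counts))
```

Because a cutoff of this kind works on counts rather than ranks, every word tied at the lowest kept count stays in, which is one way a final list could end at an uneven rank such as 41284.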

Specific TV Series Dictionaries

Project Gutenberg

Most common words in Project Gutenberg:

These lists show the most frequent words from a simple, straight frequency count of all the books found on Project Gutenberg. The list of books was downloaded in July 2005 and "rsynced" monthly thereafter. These are mostly English words, with other languages represented to a lesser extent. Many Project Gutenberg books are scanned once their copyright expires, typically editions published before 1923, so the language does not necessarily represent current usage. For example, "thy" is listed as the 280th most common word. Also, the Project Gutenberg boilerplate warning appears in each of the 24,000+ books, so its words are counted once per book.
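
A straight frequency count of this kind can be sketched in a few lines. The sketch below is not the script used for these lists: the token pattern (runs of ASCII letters with optional internal apostrophes), the function name and the sample sentences are assumptions made for illustration. Capitalisation is deliberately left untouched, so "Thy" and "thy" are counted as separate orthographic words.

```python
import re
from collections import Counter

# Assumed token pattern: letters, optionally joined by internal apostrophes
# (so "can't" stays one token). The original tokenisation is not documented here.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_words(texts):
    """Count every orthographic word across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text))
    return counts


# Invented sample text, for illustration only.
sample_texts = ["Thy will be done.", "I can't say what thy will is."]
word_counts = count_words(sample_texts)
print(word_counts.most_common(3))
```

A count like this makes no attempt to skip repeated boilerplate, which is why text that appears in every book is counted once per book.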

Here are the top 100 words from Project Gutenberg texts in alphabetical order:

These wikified terms can be copied to other language wiktionaries; this is what they are intended for. If you do, please add an interwiki link onto the page here.

Frequency lists as of 2006-04-16:

Frequency lists as of 2005-10-10:

1001-2000 · 2001-3000 · 3001-4000 · 4001-5000 · 5001-6000 · 6001-7000 · 7001-8000 · 8001-9000 · 9001-10000

  • More to come...

Frequency lists as of 2005-08-16:

Words that already had entries in the then-current copy of Wiktionary were removed from the straight frequency count. Words whose entries were only redirects were removed as well.
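
The filtering step above amounts to dropping every ranked word whose title already exists. A minimal sketch, assuming the ranked words and the existing page titles (redirects included) are already available as plain Python collections; the function name and sample data are hypothetical:

```python
def missing_entries(ranked_words, existing_titles):
    """Keep, in rank order, only the words that have no page yet.

    existing_titles is assumed to include redirect pages as well,
    so a word with only a redirect is also dropped.
    """
    existing = set(existing_titles)
    return [word for word in ranked_words if word not in existing]


# Invented sample data, for illustration only.
ranked = ["the", "thy", "hath", "thou"]
titles = {"the", "thou"}
print(missing_entries(ranked, titles))  # ['thy', 'hath']
```

Using a list comprehension rather than a plain set difference keeps the frequency order of the remaining words.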

With somewhat different filtering/selection criteria:

The location of the latest version:

Contemporary fiction

The 2,000 most common words in contemporary fiction can be found here:

The 2,000 most common words in contemporary fiction can be found here divided into 60 subject categories.

Unlike most of these lists, this one lumps the regular forms of the same word together under one lemma.
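
To illustrate the difference from the orthographic counts used elsewhere on this page, the sketch below merges per-form counts into per-lemma counts. The lookup table `LEMMA_OF`, the function name and the numbers are all invented for this example; how the fiction list actually grouped its forms is not stated here.

```python
from collections import Counter

# Hypothetical toy table mapping regular inflected forms to their lemma.
LEMMA_OF = {"walks": "walk", "walked": "walk", "walking": "walk"}


def lemma_counts(form_counts):
    """Merge per-form counts into per-lemma counts using LEMMA_OF."""
    merged = Counter()
    for form, n in form_counts.items():
        # Forms missing from the table are treated as their own lemma.
        merged[LEMMA_OF.get(form, form)] += n
    return merged


# Invented sample counts, for illustration only.
form_counts = Counter({"walk": 4, "walks": 3, "walked": 5, "walking": 2, "door": 6})
print(lemma_counts(form_counts))
```

With these sample numbers "door" outranks every individual form of "walk", but the merged lemma "walk" (14) outranks "door" (6), which is why lemma-based and form-based lists can order words differently.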

Contemporary poetry

The 2,000 most common words in contemporary poetry can be found here:

Another lemma-based list.

Top English words lists

Word families

Other Languages