|
| [this page is still
under construction, because Manuel is too
lazy to contribute to the project.] |
|
| 1. Searching the dictionary |
|
| After realizing how
time-consuming it is to type each word a hundred times into Google with
different spellings (and to write down the results on paper), we
(Alexandre) wrote a script to search for every word in the dictionary with
each number of repeated letters. This was done for each vowel in each
word (consonants have been found to be generally less interesting), so
for "cheese" we tried "cheeese", "cheeeese" and so on, as well as
"cheesee", "cheeseee", up to 100 e's. For each combination of correctly
spelled word and substituted vowel, a data file was created. |
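The variant generation described above can be sketched as follows. This is a minimal Python sketch, not the original script: the function name `long_spellings`, the vowel set and the `max_i` parameter are assumptions made for illustration, and the web search itself is left out.

```python
VOWELS = "aeiou"  # assumed vowel set; the original script may differ


def long_spellings(word, max_i=100):
    """Map each vowel position in `word` to a list of (i, spelling) pairs.

    i is the number of letters put in place of the original vowel,
    so i=1 reproduces the correct spelling.
    """
    variants = {}
    for pos, letter in enumerate(word):
        if letter.lower() not in VOWELS:
            continue  # consonants are skipped, as in the text
        variants[pos] = [
            (i, word[:pos] + letter * i + word[pos + 1:])
            for i in range(1, max_i + 1)
        ]
    return variants


if __name__ == "__main__":
    for pos, spellings in long_spellings("cheese", max_i=3).items():
        print(pos, [s for _, s in spellings])
```

Each vowel position is treated separately here, so the two adjacent e's in "cheese" produce the same strings; a real script would probably want to merge such duplicates before searching.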
|
| 2. Data filtering |
|
| Let i denote the number of letters in place of the original vowel (so
for both "cheese" and "fish", i=1, but for "helloo", i=2) and let n(i) denote the number of
hits for this "word". We then define the so-called "interest
parameter" p, which measures (quite arbitrarily, but effectively) how frequently long spellings with high i occur compared to the correct
spelling. It is defined by p = sum_{i=3}^{i_max} i*n(i)/n(0),
i.e. the sum of the ratios of the hits for each long spelling to the hits n(0) for the original one,
weighted by i. To exclude (unintentional) misspellings, we start the sum at i=3. We can now select the word-letter combinations giving the highest values of p; this gives us a reasonably robust sample of popular long spellings on the web. On the right is a list of some of the most popular words. In general, word-letter combinations with low p values do not have absorption features. |
Examples
of words with high values of p
|
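As an illustration of the definition of p, here is a minimal Python sketch. The function name `interest_parameter`, its arguments and the hit counts in the usage example are hypothetical; the hits for the correct spelling are passed in separately as `n_correct`.

```python
def interest_parameter(hits, n_correct, i_min=3):
    """Sum of i * n(i) / n_correct over all i >= i_min.

    hits      -- dict mapping i (letters in place of the vowel) to hit count n(i)
    n_correct -- hit count for the correct spelling
    i_min     -- first i included in the sum (3 in the text)
    """
    if n_correct <= 0:
        raise ValueError("the correct spelling must have at least one hit")
    return sum(i * n for i, n in hits.items() if i >= i_min) / n_correct


if __name__ == "__main__":
    # Invented hit counts, for illustration only.
    example_hits = {2: 500, 3: 100, 4: 50, 5: 10}
    print(interest_parameter(example_hits, n_correct=1000))
```

Sorting all word-letter combinations by this value in descending order and keeping the top entries then gives the sample of popular long spellings.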
|
|
|
| 3. Data analysis |
|
To find out why
cheese absorptions, like the one shown here, occur, we first quantify
the distribution of hits as a function of the number of letters with a set of
characteristic parameters. To this end, we fit a power law to the
"continuum", excluding the absorption (if present) and the correct
spelling (which, for obvious reasons, gives many more hits). These are
some of the parameters we focus on:
|
![]() |
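A continuum fit of this kind could look roughly like the Python sketch below. It assumes the form n(i) = c * i^(-alpha) and an ordinary least-squares fit in log-log space. The exponent name `alpha`, the function name `fit_power_law` and the way the absorption is passed in (as a set of i values to skip) are all assumptions, since the page does not say how the fit was done.

```python
import math


def fit_power_law(hits, exclude=()):
    """Fit n(i) = c * i**(-alpha) by least squares in log-log space.

    hits    -- dict mapping i to hit count n(i)
    exclude -- i values to leave out (e.g. the absorption feature)
    The correct spelling (i=1) is always left out.
    Returns the pair (c, alpha).
    """
    points = [
        (math.log(i), math.log(n))
        for i, n in hits.items()
        if i > 1 and i not in exclude and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two points to fit")
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope
```

Fitting in log-log space weights all i equally on a logarithmic scale. A fit that accounts for the counting noise at low hit numbers would need a different estimator.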
|
|
|
| 5. Results |
|
| Taking into account the 200
words with the highest values of the "interest parameter" p, we find a tight correlation between
the power law amplitude c and
the position i_abs of the
absorption line, as shown on the right. Taking more words into account
does not improve the statistics, since the absorption lines become less
well-defined as the number of repeated-letter spellings goes down, thereby
increasing the chance of an erroneous fit to the absorption feature in
our algorithm. The solid line is the best-fit power law to this distribution. In short, for words where long spellings are very popular, such as "cheese" or "smooth", absorption features occur at high numbers of extra vowels, i.e. 20-40. For words where long spellings are reasonably popular, absorption features still occur, but at lower numbers of extra vowels. |
![]() |
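The page does not describe how the position i_abs is determined. One simple possibility, shown purely as an assumption, is to take the i at which the observed hits fall furthest below a continuum of the form c * i^(-alpha). The function name `absorption_position` and its parameters are hypothetical.

```python
def absorption_position(hits, c, alpha, i_min=3):
    """Return the i >= i_min with the lowest ratio of hits to continuum.

    hits     -- dict mapping i to hit count n(i)
    c, alpha -- continuum parameters, with continuum(i) = c * i**(-alpha)
    """
    ratios = {
        i: n / (c * i ** -alpha)
        for i, n in hits.items()
        if i >= i_min
    }
    if not ratios:
        raise ValueError("no data points at or above i_min")
    return min(ratios, key=ratios.get)
```

With few hits, noise alone can produce the deepest dip, which is consistent with the remark above that fits become unreliable for less popular long spellings.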
|
|
|
| 6. Conclusions |
|