The Cheese Pages - Cheese Science

 


[this page is still under construction, because Manuel is too lazy to contribute to the project.]



1. Searching the dictionary

After realizing how time-consuming it is to type each word a hundred times into Google with different spellings (and to write down the results on paper), we (Alexandre) wrote a script to search for every word in the dictionary with each number of repeated letters. This was done for each vowel in each word (consonants were found to be generally less interesting), so for "cheese" we tried "cheeese", "cheeeese" and so on, as well as "cheesee", "cheeseee", up to 100 e's. For each combination of correctly spelled word and substituted vowel, a data file was created.
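The variant generation described above could be sketched as follows. This is a minimal Python sketch, not Alexandre's actual script: the function name `elongations`, the vowel set "aeiou" and the rule for skipping duplicate slots are assumptions made for illustration, and the search-engine queries themselves are left out.

```python
VOWELS = "aeiou"  # assumption: "y" is not treated as a vowel here


def elongations(word, max_repeat=100):
    """Return {(position, vowel): [spellings for 1..max_repeat letters]}.

    Positions are 1-based. Adjacent identical vowels (the "ee" in
    "cheese") produce the same stretched spellings, so only the lowest
    position is kept.
    """
    variants = {}
    seen = set()
    for pos, letter in enumerate(word, start=1):
        if letter not in VOWELS:
            continue
        # The doubled spelling identifies the slot; a repeat means this
        # slot is equivalent to one already covered at a lower position.
        probe = word[:pos - 1] + letter * 2 + word[pos:]
        if probe in seen:
            continue
        seen.add(probe)
        variants[(pos, letter)] = [
            word[:pos - 1] + letter * count + word[pos:]
            for count in range(1, max_repeat + 1)
        ]
    return variants


if __name__ == "__main__":
    for slot, spellings in elongations("cheese", max_repeat=4).items():
        print(slot, spellings)
```

Each key of the returned dictionary corresponds to one word-letter combination, i.e. to one data file in the description above.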



2. Data filtering

Let i denote the number of letters in place of the original vowel (so for both "cheese" and "fish", i=1, but for "helloo" i=2) and let n(i) denote the number of hits for this "word". We then define the so-called "interest parameter" p, measuring (quite arbitrarily, but effectively) the frequency with which long spellings with high i occur, compared to the correct spelling. It is defined by p = sum_{i=3}^{i_max} i*n(i)/n(1), i.e. the sum of the ratios of each long spelling to the original one (i=1), weighted by i. To keep (unintentional) misspellings out of the sum, we start at i=3.
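As a worked illustration, the interest parameter can be computed from a table of hit counts. This is a sketch under stated assumptions: the function name `interest_parameter` and the dictionary layout are hypothetical, the hit counts in the example are made up, and the denominator is taken to be the hit count of the correct spelling (i=1 under the definition above).

```python
def interest_parameter(hits, i_min=3):
    """Sum of i * n(i) / n(correct spelling) over all i >= i_min.

    `hits` maps i (number of letters in place of the original vowel)
    to n(i), the number of hits for that spelling. The correct
    spelling is the entry with i = 1.
    """
    n_correct = hits[1]
    if n_correct <= 0:
        raise ValueError("the correct spelling needs at least one hit")
    weighted = sum(i * n for i, n in hits.items() if i >= i_min)
    return weighted / n_correct


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    example = {1: 1000, 2: 50, 3: 10, 4: 5}
    print(interest_parameter(example))  # (3*10 + 4*5) / 1000
```

The i=2 entry is ignored by construction, since the sum only starts at i=3.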

We can now select the word-letter combinations giving the highest value of p; this will give us a reasonably robust sample of popular long spellings on the web. Some of the most popular words are listed below.

In general, word-letter combinations with low p values will not have absorption features.

Examples of words with high values of p

  • heeeeeeeeeeeel
  • ooooooooooh
  • hooooooooooo
  • sooooooooooo
  • gooooooooood
  • haaaaaaaaaaaa
  • hellooooooooo
  • heeeeeeeeeeey
  • booooooooobs
  • doooooooooom
  • hoooooooooog
  • booooooooring


3. Data analysis


To find out why cheese absorptions, like the one shown here, occur, we first quantify the distribution of hits as a function of the number of letters i with a set of characteristic parameters. To this end we fit a power law to the "continuum", excluding the absorption, if present, and the correct spelling (which for obvious reasons gives many more hits). These are some of the parameters we focus on:


  1. The amplitude c of the power law, the latter having the form n=c*i^a
  2. The slope a of the power law fit
  3. The central position i_abs of the absorption feature
  4. The width of the absorption feature
  5. The number of hits for the original word
  6. The number of letters in the original word
  7. The position of the substituted letter in the original word (for "cheese", for example, it can be 3 or 4; we always take the lowest value for consistency).
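The first two parameters come from the continuum fit n=c*i^a. The page does not say how the fit was carried out; the sketch below assumes an ordinary least-squares line in log-log space, where log n = log c + a*log i. The function name `fit_power_law` and the `exclude` argument (the i values covered by an absorption feature) are hypothetical, and the data in the example are synthetic.

```python
import math


def fit_power_law(hits, exclude=()):
    """Fit n = c * i**a to the continuum and return (c, a).

    `hits` maps i to n(i). The correct spelling (i = 1), any i listed
    in `exclude`, and spellings without hits are left out of the fit.
    """
    xs, ys = [], []
    for i, n in hits.items():
        if i == 1 or i in exclude or n <= 0:
            continue
        xs.append(math.log(i))
        ys.append(math.log(n))
    if len(xs) < 2:
        raise ValueError("need at least two continuum points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                        # slope of the log-log line
    c = math.exp(mean_y - a * mean_x)    # amplitude from the intercept
    return c, a


if __name__ == "__main__":
    # Synthetic continuum n = 1000 * i**-2 with a dip at i = 5.
    data = {i: 1000 * i ** -2 for i in range(1, 11)}
    data[1] = 10 ** 6   # the correct spelling has far more hits
    data[5] = 1         # artificial "absorption"
    print(fit_power_law(data, exclude={5}))
```

Fitting in log space gives every continuum point equal weight regardless of its hit count; a fit weighted by hit counts would be an equally defensible choice.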


[figure: "smooth"]



5. Results


Taking into account the 200 words with the highest values of the "interest parameter" p, we find a tight correlation between the power law amplitude c and the position i_abs of the absorption line, as shown in the figure. Taking more words into account does not improve the statistics, since the absorption lines become less well-defined as the number of repeated spellings goes down, thereby increasing the chance of an erroneous fit to the absorption feature in our algorithm.

The solid line is the best-fit power law to this distribution.

In short, for words where long spellings are very popular, such as "cheese" or "smooth", absorption features occur at high numbers of extra vowels, i.e. 20-40. For words where long spellings are reasonably popular, absorption features still occur, but at lower numbers of extra vowels.

[figure: correlation of amplitude and absorption]



6. Conclusions