Iterative Conlang Design, or, Build One to Throw Away

My Conlang #13, tentatively called {säb zjed'a}, is primarily interesting for the lexicon development methodology I'm using for it. I started by generating a set of phonologically redundant words using Perl scripts I had written for that purpose, assigned meanings to a few dozen of them to get a small starter lexicon, and started writing some sentences in the language. At first I had only a vague notion of its grammar (VSO, prepositional, ergative, mostly isolating, but with semantic category and part-of-speech marking). For ideas about the seed lexicon I used frequency analyses of my Toki Pona corpus and my gjâ-zym-byn corpus, the list of "semantic primes" from the Wikipedia article on "Natural semantic metalanguage", and Rick Harrison's Universal Language Dictionary. Any concept I could readily represent with a phrase rather than a root word, I did represent that way (there are no compounds in this language, at least in phase 1). For instance, I make more radical use of opposite-derivation than Esperanto does, except that the equivalent of Esperanto's "mal-" morpheme is a preposition rather than a prefix, forming phrases like "opposite-of hot" = "cold".
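The generation scripts themselves aren't reproduced here, but one reading of "phonologically redundant" (every word differs from every other word in at least two segments, so that a single misheard phoneme can't turn one word into another) lends itself to a sketch like the following; the phoneme inventory and the single CVC template are placeholders, not the actual phonology of {säb zjed'a}:

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder phoneme inventory and a single CVC template -- not the
# actual phonology of {säb zjed'a}.
my @onsets = qw(b d g z s m n l);
my @vowels = qw(a e i o u);
my @codas  = qw(b d m n l);

# Two candidate words are "too close" if they differ in fewer than
# two segment slots, i.e. a single misheard phoneme could confuse them.
sub too_close {
    my ($x, $y) = @_;
    my $diff = grep { $x->[$_] ne $y->[$_] } 0 .. $#$x;
    return $diff < 2;
}

my @accepted;
for my $o (@onsets) {
    for my $v (@vowels) {
        for my $c (@codas) {
            my @candidate = ($o, $v, $c);
            push @accepted, [@candidate]
                unless grep { too_close(\@candidate, $_) } @accepted;
        }
    }
}

print join('', @$_), "\n" for @accepted;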

So I went on writing sample sentences, coining new words as needed when I couldn't figure out a way to express ideas with existing words. Many of these sentences came from the Conlang Test Sentences.

My plan is to go on translating the Conlang Test Sentences, and write at least one example sentence for most or all of the words in the lexicon; then translate a couple of longer texts (the Tower of Babel story, for instance) — still coining new words only when I can't figure out how to make do with existing ones.

When I've got a corpus of a few thousand words (or perhaps sooner), I'll do a frequency analysis on it: not just the relative frequency of individual words, but of two- and three-word sequences as well. When I modify the phonology format files and regenerate the list of root words, I may also use my experience with the language so far to rule out some consonant clusters that seemed feasible at first but have proven too difficult to pronounce consistently.
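The frequency analysis itself is simple to script. A minimal sketch, assuming the corpus is plain text already tokenized by whitespace (a real script would also strip punctuation), could be:

#!/usr/bin/perl
use strict;
use warnings;

# Count word, two-word, and three-word sequence frequencies in a
# whitespace-tokenized corpus, then print them by descending frequency.
my %freq;
while (my $line = <>) {
    my @t = split ' ', $line;
    for my $i (0 .. $#t) {
        $freq{ $t[$i] }++;
        $freq{ "$t[$i] $t[$i+1]" }++          if $i + 1 <= $#t;
        $freq{ "$t[$i] $t[$i+1] $t[$i+2]" }++ if $i + 2 <= $#t;
    }
}

for my $item (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    print "$freq{$item}\t$item\n";
}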

Then I'll relex the language, applying the following rules:

  1. If a word occurs more often in the corpus than another word, it should be at least as short as that other word in the relex.
  2. Any sequence of two or more words that occurs often enough in the corpus will get its own word in the relex, which will be at least as short as the words for other words and phrases of equal or lesser frequency.
  3. The above rules may need to be bent to allow part-of-speech marking. For instance, if the 100th most common word or phrase in the corpus is a noun and the 101st most common is a verb, and I've run out of monosyllabic noun roots but still have some monosyllabic verb roots left, then a slightly more common word will get a word that's longer than a less common word's. But if this happens a lot, I'll adjust the part-of-speech marking system to allow more roots for the kinds of words that occur more often. (A rough sketch of this assignment process follows the list.)
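In effect the relex is a greedy matching of the frequency-sorted list of words and phrases against a length-sorted list of generated roots, with a separate pool of roots for each part of speech. A minimal sketch under those assumptions follows; the data here are invented placeholders, and the real lists would come from the frequency analysis and from the word-generation script:

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder data. Each item: [ old form, part of speech ],
# sorted by descending corpus frequency.
my @items_by_freq = (
    [ 'water',           'noun' ],
    [ 'go',              'verb' ],
    [ 'opposite-of hot', 'adj'  ],
);

# Generated roots per part of speech, each list pre-sorted by weighted length.
my %roots_by_pos = (
    noun => [ 'ba', 'zed', 'mila' ],
    verb => [ 'do', 'gan' ],
    adj  => [ 'so', 'leb' ],
);

# Greedy assignment: the most frequent item takes the shortest remaining
# root in its part-of-speech pool (rule 3 is why the pools are separate).
my %relex;
for my $item (@items_by_freq) {
    my ($form, $pos) = @$item;
    my $root = shift @{ $roots_by_pos{$pos} || [] };
    if (defined $root) {
        $relex{$form} = $root;
        print "$form => $root\n";
    } else {
        warn "out of $pos roots at '$form'; the POS marking needs rethinking\n";
    }
}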

Then I'll convert the corpus to the new lexicon, spend some time familiarizing myself with the relex, and write and translate more text. When the corpus gets larger and more representative, I'll do another frequency analysis and relex again.
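Converting the corpus is then mostly mechanical substitution. A minimal sketch, assuming a hypothetical tab-separated mapping file of old forms to new forms (phrases included) and a whitespace-tokenized corpus:

#!/usr/bin/perl
use strict;
use warnings;

# Usage (file names are hypothetical):
#   perl relex_corpus.pl mapping.tsv old_corpus.txt > new_corpus.txt
my $map_file = shift @ARGV
    or die "usage: relex_corpus.pl mapping.tsv [corpus files...]\n";

# Load the tab-separated old-form -> new-form mapping.
my %map;
open my $fh, '<', $map_file or die "can't open $map_file: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($old, $new) = split /\t/, $line;
    $map{$old} = $new if defined $new;
}
close $fh;

# Substitute longer (multi-word) entries before shorter ones, so phrases
# that earned their own root win over their component words.
my @old_forms = sort { length($b) <=> length($a) } keys %map;

while (my $line = <>) {
    for my $old (@old_forms) {
        # \b is a naive word boundary; a real script would tokenize properly.
        $line =~ s/\b\Q$old\E\b/$map{$old}/g;
    }
    print $line;
}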

Obviously I'm not going to make a serious attempt to become fluent in this language until it's gone through several such relexes (each probably less drastic than the last).

Based on advice from David J. Peterson, I'm counting the lengths of words with a weighting algorithm that assigns a weight to each kind of phoneme in the word. Vowels count the most toward a word's length, then nasals, then liquids, then fricatives, and plosives least of all; but the exact weights are still uncertain. The draft script I used to sort the generated words by length, before assigning meanings to them in the seed lexicon, used these weights:

# Weight of each phoneme class, in descending order of its
# contribution to a word's length.
my $vowel_weight = 1;
my $nasal_weight = 0.5;
my $liquid_weight = 0.375;
my $fricative_weight = 0.25;
my $plosive_weight = 0.1;

But the weighting will probably change; for instance, I may make the difference between nasals and liquids less drastic.
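For concreteness, here is roughly how a word's weighted length would be computed under that scheme. The phoneme-to-class table below is a stand-in, not the real inventory, and a real script would have to tokenize multi-character phonemes like "zj" rather than splitting letter by letter:

#!/usr/bin/perl
use strict;
use warnings;

# Draft weights from the sorting script above.
my %class_weight = (
    vowel     => 1,
    nasal     => 0.5,
    liquid    => 0.375,
    fricative => 0.25,
    plosive   => 0.1,
);

# Stand-in phoneme classification, not the real inventory.
my %phoneme_class = (
    a => 'vowel',  e => 'vowel',  i => 'vowel',  o => 'vowel',  u => 'vowel',
    m => 'nasal',  n => 'nasal',
    l => 'liquid', r => 'liquid',
    s => 'fricative', z => 'fricative', j => 'fricative',
    b => 'plosive', d => 'plosive', g => 'plosive',
);

# Weighted length = sum of the class weights of a word's phonemes.
sub weighted_length {
    my ($word) = @_;
    my $total = 0;
    for my $ph (split //, $word) {
        my $class = $phoneme_class{$ph};
        $total += $class_weight{$class} if defined $class;
    }
    return $total;
}

printf "%s\t%.3f\n", $_, weighted_length($_) for qw(zed mila dugan);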

Ensuring that the corpus is representative is going to be tricky. The first few hundred words of it are made up of sample sentences that are probably not very representative of continuous text in any language. I could try to duplicate the Brown Corpus in miniature, i.e. have texts of the same genres in roughly similar proportions; but that seems like a daunting task. Maybe I should simply exclude the grammar example sentences from the corpus analysis, once I have enough connected text (narratives, articles, etc.)?

I'm not sure what the criterion should be for a phrase occurring "often enough" in the corpus to deserve its own root word. My rule of thumb is that if a phrase accounts for more than 0.2% of the corpus (roughly ten occurrences in a 5,000-word corpus), it probably deserves its own word; but after doing an actual frequency analysis I may set the threshold lower than that.

As the lexicon is remade for each phase of the language's development, the grammar may change as well. So I'll document the grammar and phonology in separate documents for each phase.


Last updated July 2009