Avoiding bias in comparative creole studies: Stratification by lexifier and substrate

One major research question in creole studies has been whether the social/diachronic circumstances of the creolization processes are unique, and if so, whether this uniqueness of the evolution of creoles also leads to unique structural changes, which are reflected in a unique structural profile. Some creolists have claimed that indeed the answer to both questions is yes, e.g. Bickerton (1981), McWhorter (2001), and more recently Peter Bakker and Aymeric Daval-Markussen. But these authors have generally overlooked that cross-creole generalizations require representative sampling, especially when working quantitatively. Sampling for genealogical and areal control has been a much discussed topic within world-wide typology, but not yet in comparative creolistics. In all available comparative creole studies, European-based Atlantic creoles are strongly overrepresented, so that typical features of these languages are taken as “pan-creole” features, e.g. serial verbs, double-object constructions, or obligatory use of overt pronominal subjects. But many of these Atlantic creoles have the same genealogical/areal profile, i.e. European (lexifier) + Macro-Sudan (substrate). I therefore propose a new sampling method that controls for genealogical/areal relatedness of both the substrate and the lexifier, which I call “bi-clan” control (where “clan” is a cover term for linguistic families and convergence areas).


Introduction 1
One major research question in creole studies has been whether the social/diachronic circumstances of the creolization processes are unique, and if so, whether this uniqueness of the evolution of creoles also leads to unique structural changes, which are reflected in a unique structural profile. Some creolists have claimed that indeed the answer to both questions is yes, e.g. Bickerton (1981), McWhorter (2001), and most recently Bakker et al. (2011) and Daval-Markussen (2014).
To demonstrate the unique structural profile, these creolists have proposed that creoles share a set of "pan-creole" features. A unique structural profile implies that creoles are internally uniform through their pan-creole features and that they are externally distinctive with respect to non-creoles world-wide. But to show that creoles uniformly share pan-creole features, one needs not only to find a set of such features, but also to examine a representative sample of creoles, i.e. a sample of languages that are historically and areally as independent of each other as possible. Sampling has been a much discussed topic within typology (e.g. Dryer 1989, 1992; Rijkhoff & Bakker 1997; Perkins 2001; Bickel 2008; D. Bakker 2011), where it has been widely recognized that a biased picture can result if language samples contain languages that are not independent from each other, either because they are descended from a common ancestor or because they are neighbouring languages that may have influenced each other. Thus, one needs to control for genealogical and areal bias when looking for universal features in languages world-wide. However, in studies that look for creole universals, we still lack such a discussion of what a representative sample of creoles should look like.
The present paper is an attempt to fill that gap by suggesting a new method for sampling in contact linguistics, which involves introducing the notion of BI-CLAN (a set of languages that share the same lexifier clan and substrate clan, where clan is a cover term for families and convergence areas).
The paper is organized as follows: In §2, I briefly look at existing samples in comparative creole studies and show that they are biased towards one areal group, Atlantic creoles. In §3, I give an overview of the APiCS database, before I introduce the notion of bi-clan as a way of stratifying samples of creoles in §4. In §5, I look at the implications of bi-clan sampling for pan-creole features and creole universals. In §§6-7, various structural features in (pidgin)creoles will be checked against the bi-clan distribution, and §8 concludes the paper.
The main claim of this article is the following: For far too long creolists have concentrated on the analysis of a specific areal type of (pidgin)creole languages, namely Atlantic (pidgin)creoles, and have extrapolated from this narrow profile to creole languages in general (e.g. Bickerton 1981). But now that we have an important new source of systematic comparable data in the Atlas of Pidgin and Creole Language Structures (Michaelis et al. 2013a), we are in a position to assess the impact of the oversampling of the Atlantic creoles and to introduce a new sampling method via bi-clans. This is a prerequisite of the discovery of true creole universals.
Many of these overrepresented Atlantic creole languages happen to have the same contributor profile: European lexifiers and Sub-Saharan West African substrate(s). Table 1 shows some of the most recent comparative creole studies, the number of pidgins and creoles analyzed, and the percentage of Atlantic pidgins and creoles out of these languages.
[...] show areal patterns of convergence, which led Güldemann (2010) to speak of the so-called "Macro-Sudan belt" (see Güldemann 2010, zone III in Figure 1). Güldemann shows a converging feature profile which cuts across various language families. These converging features allow him to propose a core zone (brown colour) and peripheral zones of this large Sub-Saharan macro area (orange and yellow colours).
[...] Alleyne 1980, Boretzky 1983, Holm 1988, and Parkvall 2000 on African substrate languages in Atlantic creoles).
The idea is now that the languages of these different families share certain linguistic features, presumably due to long-standing contact. This view on West and Central African languages implies that even though potential substrates of Atlantic creoles belong to different substrate families, they may still show convergent structures.
Therefore, speakers of these different African languages may have initiated similar linguistic changes in the different contact situations with colonists speaking European languages during creolization processes in Africa and the Caribbean, leading to similar outcomes in the resulting creoles.
Once one is aware of the Atlantic bias in all available samples of contact languages, can one claim something about the typological profile of creole languages in general on the basis of this sample? My answer is clearly no. If one wants to generalize over the class of creoles as such, the first step is to better balance one's sample, and this is in my view possible even without collecting new data.

The APiCS database
The present paper is based on the large-scale comparative database of pidgins and creoles,

The problem of bias in language typology
Typologists have for some time been aware that sample bias may be a serious problem (e.g. Bell 1978; Dryer 1989; Rijkhoff & Bakker 1998; Perkins 2001). If one considers data from a range of languages that are not historically independent of each other, then one may get a skewed picture, even if one looks at a large number of languages. For example, if one's sample has many languages from Eurasia, one may wrongly conclude that the order of adjective and noun correlates with the order of possessor and noun, whereas this is in fact not the case (Dryer 1989). According to Rijkhoff & Bakker (1998: 264-265), there are basically two kinds of samples, variety samples (which display the greatest possible variety) and probability samples (which are designed to be quantitatively representative of the entire population). Variety samples are most suitable for exploratory research, when little is known about the phenomenon. By contrast, when one is interested in any kind of quantitative evaluation, one needs a probability sample (cf. also Bickel 2008: 222). This also applies to comparative creole studies that make quantitative statements with universal scope.
In language typology, genealogical bias and areal bias are the best-known kinds of sample bias, i.e. too many languages from a well-described family (e.g. Indo-European) are chosen, or too many languages from a well-described area (e.g. Europe). Such biases can be avoided by stratification, i.e. by creating mutually exclusive subgroups of languages (families or areas) which have equal status and are the basis for the selection of languages. Since the great majority of universals are statistical trends rather than exceptionless generalizations, a stratified world-wide sample is a necessary ingredient of any large-scale study that makes universal claims. There are of course many practical problems (such as determining the right families, and determining areas within which contact-induced convergence has taken place, cf. Song 2001: §1.5.3), but there is no doubt that stratified sampling is the least that one needs to support universal claims. 5
If one is interested in universal features of creole languages, one needs a stratified sampling method, too, but there are two possible sources of bias: the substrate and the lexifier. Therefore, I would like to propose a sampling method that controls for genealogical and areal relatedness of both the substrate(s) and the lexifier, which I call BI-CLAN SAMPLING. A CLAN 6 is a language or a family or a linguistic area, and a BI-CLAN is a combination of a lexifier clan and a substrate/adstrate clan 7. For example, the lexifier clan "English" combined with the substrate clan "Macro-Sudan" gives rise to the bi-clan "English/Macro-Sudan". Nigerian Pidgin, Jamaican and Saramaccan are for instance members of this bi-clan. The lexifier clan "Portuguese" combined with the substrate/adstrate clan "Indic" constitutes the bi-clan "Portuguese/Indic". Languages that belong to this bi-clan are Korlai, Diu Indo-Portuguese and Sri Lanka Portuguese.
While we often know very well which lexifier is at the base of a given creole, the identification of the relevant substrates is a much more difficult matter. Therefore, we have the option of lumping different entities into a clan: A clan can either be a single language (e.g. English or French), a family (e.g. Indic, Malay) or a linguistic area (e.g. Macro-Sudan). The important issue here is that we try to keep potentially historically related creoles in the same bi-clan, whereas historically unrelated creoles should be in different bi-clans.
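The grouping described above can be pictured as a simple mapping from languages to (lexifier clan, substrate clan) pairs. The following is only a minimal sketch in Python: the names `BiClan`, `ASSIGNMENTS` and `group_by_biclan` are hypothetical and not part of APiCS, and the six assignments are just the examples given in the text.

```python
from collections import defaultdict
from typing import NamedTuple


class BiClan(NamedTuple):
    """A lexifier clan paired with a substrate/adstrate clan (hypothetical helper)."""
    lexifier: str   # e.g. "English"
    substrate: str  # e.g. "Macro-Sudan"

    def label(self) -> str:
        return f"{self.lexifier}/{self.substrate}"


# Assignments taken from the examples in the text; not a full APiCS listing.
ASSIGNMENTS = {
    "Nigerian Pidgin": BiClan("English", "Macro-Sudan"),
    "Jamaican": BiClan("English", "Macro-Sudan"),
    "Saramaccan": BiClan("English", "Macro-Sudan"),
    "Korlai": BiClan("Portuguese", "Indic"),
    "Diu Indo-Portuguese": BiClan("Portuguese", "Indic"),
    "Sri Lanka Portuguese": BiClan("Portuguese", "Indic"),
}


def group_by_biclan(assignments):
    """Return a dict mapping each bi-clan label to the list of its member languages."""
    groups = defaultdict(list)
    for language, biclan in assignments.items():
        groups[biclan.label()].append(language)
    return dict(groups)


groups = group_by_biclan(ASSIGNMENTS)
for label, members in groups.items():
    print(label, len(members), members)
```

Six languages collapse into two sampling units here, which is the point of the stratification: historically related creoles stay together in one bi-clan.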
I interpreted Güldemann's Macro-Sudan belt (see above §2) narrowly and only took the core families of the Macro-Sudan belt to be part of the clan "Macro-Sudan", most importantly Mande, Kru, Gur, Kwa, Benue-Congo (except Narrow Bantu), whereas the families in the periphery (Atlantic, Ijoid, Narrow Bantu, and Nilotic) each make up their own clan, giving us bi-clans such as Dutch/Ijoid (with its member Negerhollands), English/Bantu (with its members Pichi and Cameroon Pidgin English), or Portuguese/Atlantic (with its members Cape Verdean creole varieties, Casamancese Creole and Guinea-Bissau Kriyol). Note that the term "Atlantic" here refers to a specific language family of West Africa (with e.g. Wolof and Balanta, see also footnote 3).
The 76 APiCS languages fall into 34 bi-clans, out of which 20 are represented by only one language. Many pidgins and all mixed languages in the sample happen to constitute a bi-clan of their own, as their areal/genealogical profile is unique. For example, Chinese Pidgin Russian is the only member of the bi-clan Russian/Sinitic, the mixed language Gurindji Kriol belongs to the bi-clan Gurindji (a language of Northern Australia) + Kriol (a creole language which arose from the contact between English and languages of North Australia), and Media Lengua is in the bi-clan Spanish+Quechua 8 .
In the present paper, I will concentrate on the 59 creoles in APiCS, whose bi-clan distribution is shown in Table 2. The granularity and the classification of the proposed bi-clans are open to discussion 10. But the present approach should be taken as a first attempt to do justice to the different genealogical/areal linguistic profiles of creoles and at the same time to reduce the impact of typologically uniform languages of the same bi-clan, in order to achieve the ultimate goal, namely to assess potential universals in creole languages.
I will now turn to the discussion of various structural features in the context of the bi-clan distributions.

Implications of bi-clan sampling for pan-creole features and creole universals
In the next section (§6), I will examine various grammatical features and discuss their cross-creole distribution in APiCS. One of the leading questions will be whether a given feature is widespread enough among the different creole languages so that we can call it a pan-creole feature. The bi-clan sampling will help us to address this question.
For a feature to qualify as a pan-creole feature, it should be
• widespread in an unbiased sample of creoles, i.e. in a maximal number of bi-clans, not just in the majority of creoles surveyed.
If this feature additionally is
• more likely to be found in creoles than in non-creoles, and
• not found in the contributing lexifier/substrates of a given creole
then we will have good reasons to classify this feature as a CREOLE UNIVERSAL, i.e. a feature that has arisen through special cognitive and/or social conditions of the creolization process.
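The three criteria can be read as a simple decision procedure. The sketch below is one possible rendering in Python, not a procedure given in the paper: the `FeatureEvidence` record, the function names, and above all the numeric `threshold` for "widespread" are assumptions made for illustration, and the example counts are invented.

```python
from dataclasses import dataclass


@dataclass
class FeatureEvidence:
    """Hypothetical summary of what is known about one structural feature."""
    biclans_with_feature: int        # bi-clans in which the feature occurs
    biclans_total: int               # bi-clans with data for the feature
    more_likely_in_creoles: bool     # criterion 2: commoner in creoles than in non-creoles
    in_lexifier_or_substrates: bool  # criterion 3 fails if this is True


def is_pan_creole(ev: FeatureEvidence, threshold: float = 0.9) -> bool:
    """Criterion 1: widespread across bi-clans (the threshold is an assumed cut-off)."""
    return ev.biclans_with_feature / ev.biclans_total >= threshold


def is_creole_universal_candidate(ev: FeatureEvidence, threshold: float = 0.9) -> bool:
    """All three criteria must hold for a feature to count as a candidate."""
    return (
        is_pan_creole(ev, threshold)
        and ev.more_likely_in_creoles
        and not ev.in_lexifier_or_substrates
    )


# Invented example: widespread and creole-typical, but also present in the contributors.
example = FeatureEvidence(
    biclans_with_feature=23,
    biclans_total=25,
    more_likely_in_creoles=True,
    in_lexifier_or_substrates=True,
)
print(is_pan_creole(example))                  # True
print(is_creole_universal_candidate(example))  # False
```

The example illustrates why the criteria are ordered: a feature can be pan-creole in the distributional sense and still fail as a creole universal because it may simply have been inherited from the lexifier or the substrates.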
In §6.1, I will look at features that seem widespread in creoles and therefore at first glance look like good candidates for pan-creole features, but on closer inspection turn out to have a clear areal distribution. In §6.2, I will examine features that occur rarely in creoles.
In §7, I will consider pan-creole features and ask whether the additional criteria are fulfilled so that they can be regarded as universal creole features.
[...] lexifiers together as "Western European" would drastically reduce the number of bi-clans, as the great majority of well-described creoles has a Western European lexifier. The sample of bi-clans would then be too small for quantitative evaluation.

Areally restricted features
As the focus of creole studies has long been on the major Atlantic creoles such as Jamaican, Haitian Creole, Santome and Krio, it does not come as a surprise that the grammatical features used for creole comparison have often been those which are typical of Atlantic creoles.

Features that seem widespread in creoles
The serial verb construction is a prominent type of construction which is widespread in Atlantic creoles, but which has also been claimed by some authors to belong to the core features of creole languages in general (e.g. Bickerton 1989, 1996; Byrne 1985).
At first glance, the expectation that the majority of the creoles in APiCS show directional serial verb constructions seems to be fulfilled (see Table 3): 34 (59%) out of 58 creoles with data for this feature show a type of directional serial verb construction, whereas 24 creoles (41%) lack this construction. Even a cursory look shows that the large Macro-Sudan bi-clans (English-, French-, Ibero-Romance/Macro-Sudan) all feature this type of directional serial verb construction. [...] Therefore, this bi-clan is counted twice, once for the existence of this construction and once for its absence. In this way, we capture the linguistic diversity within and across bi- [...]
Such a patterning indicates that the construction is likely to originate in the lexifier or in the substrates, and is not due to the cognitive or social conditions of creolization. And indeed, it has long been noted that this type of serial construction is found in a wide area of sub-Saharan Africa (see, e.g., Boretzky 1983). Interestingly, the [...]
It has been claimed that creoles typically show double-object constructions (Bickerton 1995, Bruyn et al. 1999 [...]
Indeed a clear majority of creoles (69%) feature the double-object construction, but again if we apply the bi-clan distribution, the majority shrinks and we have a nearly equal split between languages with exclusive double-object constructions (56%) and those with exclusive indirect-object constructions (44%). Here the bi-clan subdivision helps us to realize that the indirect-object construction in the non-Atlantic creoles, mainly in South and Southeast Asia and the Pacific, also constitutes a widely represented construction type of the world's creoles.
In ditransitive constructions, creoles also clearly reflect their substrate/adstrate pattern against possibly conflicting patterns in their lexifiers. This can be detected from a comparison with the corresponding WALS map and the information on areal patterning of the constructions in question (for a detailed discussion see Haspelmath 2003 and APiCS Consortium 2013a).
So here again, the narrow perspective on Atlantic creoles has considerably blurred the picture on creoles world-wide.
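The difference between counting languages and counting bi-clans can be made concrete with a small calculation. The sketch below follows my reading of the counting rule mentioned for the serial verb feature (a bi-clan whose members disagree is counted once for each value). The data are an invented toy sample, not APiCS figures, and the function names are hypothetical.

```python
from collections import defaultdict

# Invented toy sample: (language, bi-clan, feature present?). Not APiCS data.
TOY_SAMPLE = [
    ("A1", "English/Macro-Sudan", True),
    ("A2", "English/Macro-Sudan", True),
    ("A3", "English/Macro-Sudan", True),
    ("A4", "French/Macro-Sudan", True),
    ("A5", "French/Macro-Sudan", True),
    ("B1", "Portuguese/Indic", False),
    ("B2", "Portuguese/Indic", True),
    ("C1", "Spanish/Philippinic", False),
]


def language_share(rows):
    """Proportion of languages that have the feature (every language counts once)."""
    return sum(1 for _, _, present in rows if present) / len(rows)


def biclan_share(rows):
    """Proportion of bi-clan counts that have the feature.

    Each bi-clan contributes one count per distinct value found among its
    members, so a bi-clan with both values is counted twice.
    """
    values_by_biclan = defaultdict(set)
    for _, biclan, present in rows:
        values_by_biclan[biclan].add(present)
    counts = [value for values in values_by_biclan.values() for value in values]
    return sum(counts) / len(counts)


print(f"by language: {language_share(TOY_SAMPLE):.0%}")  # 6 of 8 languages
print(f"by bi-clan:  {biclan_share(TOY_SAMPLE):.0%}")    # 3 of 5 bi-clan counts
```

In this toy sample, the two large Macro-Sudan bi-clans dominate the language count, while under bi-clan counting each contributes only a single data point, so the apparent majority shrinks.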
Another feature in this context is the expression of pronominal subjects in creoles.
This feature has to my knowledge not been put forward as being a typical creole feature, but somehow creolists seem to assume that a creole language has obligatory pronominal subjects, as in (5) [...]
In addition, a striking areal pattern arises: All APiCS languages in Africa, the Atlantic and the Americas show obligatory pronoun words/affixes, as well as Australian and Pacific languages, whereas the languages of the Indian Ocean, Southeast Asia and New Guinea allow for optional pronoun words, as in (7), where there is no pronoun expressed: [...] contact language 14.
When one restricts the view to the group of creoles in APiCS (see Table 5), 79% of the languages show obligatory pronominal subjects and 21% have optional pronoun words. In the bi-clan distribution, the figures shift to 29% of creoles featuring optional marking against only 71% with obligatory marking. Even if the figures do not change dramatically, the bi-clan perspective again reduces the weight of uniformly marked large bi-clans (here again European/Macro-Sudan) and at the same time enhances the weight of bi-clans which are represented by fewer languages (e.g. Portuguese/Indic, Spanish/Philippinic). This method thus gives a much more realistic picture of the diversity in creoles world-wide. Obligatory pronoun words are just one strategy of creoles world-wide. It so happens that Atlantic creoles overwhelmingly show this feature, but as we have seen, this does not imply that the feature is therefore a pan-creole feature.
As with the other areal features, we also suspect substrate/adstrate influence as the driving force for this clear-cut areal distribution. When we compare the corresponding WALS map (Dryer 2005a), the facts are striking: West African substrate languages show a very strong tendency to have obligatory subject pronoun words or affixes (see also Creissels 2005), and even the Portuguese-based creoles of the Atlantic consistently show obligatory subject words whereas their lexifier Portuguese has no such strategy. For the corresponding data of South Asian and Asian substrate languages, see Haspelmath & APiCS Consortium (2013b).

Features that seem rare in creoles
Finally, I will discuss another type of feature, namely those features which seem to be rare in creoles world-wide and therefore apparently negligible for the discussion of typical creole or pan-creole features (see also Bakker et al.'s (2011) [...]
However, in the bi-clan distribution, the percentage of languages with dual forms more than doubles from 10% to 23%. This means that nearly a quarter of the creole bi-clans in APiCS do have dual forms in independent personal pronouns. The presence of dual pronouns is thus a feature that is well represented in creoles, but only in a restricted area of the world. As this area is a non-Atlantic area and comprises relatively few languages, this grammatical phenomenon has not found its way into other cross-creole comparisons (not present in Holm & Patrick 2007 nor in eWAVE). We see once again that the bi-clan [...]
[...] INDEPENDENT PERSONAL PRONOUNS (Haspelmath, Michaelis & APiCS Consortium 2013b). An inclusive pronoun means 'we including the hearer, i.e. you and me', and an exclusive pronoun means 'we excluding the hearer, i.e. me excluding you', as in:
(9) Tok Pisin (Smith & Siegel 2013)
    yumi 1PL.INCL 'we' = 'you and me'
    vs.
    mipela/mipla 1PL.EXCL 'we' = 'me (excluding you) and he/she/they'
This APiCS feature, which was inspired by WALS (Cysouw 2005), shows a similar distribution in the APiCS creoles as does the preceding feature on dual pronouns.

Map 7: Inclusive/exclusive distinction in independent personal pronouns in 59 creoles of APiCS (Haspelmath, Michaelis & APiCS Consortium 2013b)

Here again, the overwhelming majority of creoles (88%) does not make the inclusive/exclusive distinction, whereas 12% of the creoles make it. In the bi-clan distribution, the inclusive/exclusive distinction again more than doubles to 26%, i.e. more than a quarter of the creole bi-clans worldwide have this distinction. Thus, this feature cannot be said to be rare in creoles in general.
Both these features, dual and inclusive/exclusive pronouns, are clearly areally restricted features worldwide. But this areal restriction is in principle of the same nature as directional serial verbs or double-object constructions in Atlantic creoles. As can be seen from the WALS map on inclusive/exclusive distinction in independent personal pronouns (Cysouw 2005), areas where such a distinction is widespread are the Philippines, Australia, and Melanesia. Thus it is clear that the presence of these features in the creoles of Australia and Melanesia is due to similar patterns in the substrates/adstrates of these contact languages (see Keesing 1988 and subsequent scholars).
For arbitrary historical reasons, these two features have never made it onto any list of pan-creole features. They are prevalent in a region of the world that has not led to a large number of well-established and well-described creole languages. But I showed earlier that directional serial verbs, double-object constructions and obligatory subject pronoun words show the same areal restrictedness, even though in a different area of the world, the Atlantic. Again for arbitrary historical reasons the Atlantic features have made it on several lists of pan-creole features even though they, too, are just areal features, but present in bi-clans with the largest number of creoles. Qualitatively, they must be treated in the same way as duals and inclusive/exclusive pronouns. Thus, none of the features discussed in this section can be considered a pan-creole feature.

Candidates for creole universals
As mentioned earlier (§5), candidates for creole universals should fulfill three requirements. They should be (i) pan-creole features, (ii) more likely to be found in creoles than in non-creoles, and (iii) not found in the contributing lexifier or substrates of a given creole.
None of the features presented in §6.1 meets even the first requirement, as they turn out to be areally restricted features. So which features are widespread enough over most bi-clans and could thus satisfy the first and potentially also the two other conditions for creole universals? I will consider four APiCS features here: comitative/instrumental identity, SVO order, prepositions, and occurrence of nominal plurality. We will see that only one of them is a possible creole universal.

Pan-creole features which are not creole universals
The

Map 8: Comitatives and instrumentals in 59 creoles of APiCS (Maurer & APiCS Consortium 2013)

The figures in Table 8 illustrate the cross-creole pattern. Not only does the vast majority of creoles show identity or overlap of the two functions in question (95%), but the bi-clan distribution also speaks in favor of a pan-creole feature: 92% of the creole bi-clans in APiCS can mark comitative and instrumental in the same way. When we compare these data with the corresponding WALS map (Stolz et al. 2005), the second condition cited above also seems to be fulfilled: twice as many languages and genera world-wide have different words to refer to comitative and instrumental, whereas creoles seem to prefer identical expression of both concepts.
But is the third condition also fulfilled? When we examine the lexifiers and substrates of the creoles, we see that it is clearly not fulfilled: All European lexifiers and some important African substrates, too, show the identity or overlap pattern. Therefore it is quite possible that the creoles have simply retained this polysemous marking from either lexifier or substrate languages, which weakens the idea of a creole universal that has arisen through the special cognitive and socio-cultural conditions of creolization.
The same is true for two other features which are widespread in the creoles of [...]
Why is this feature important in the discussion of creole universals? Variable plural marking in Jamaican or Nigerian Pidgin points to diachronic processes by which new grammatical categories are on their way to being grammaticalized. Much of the old plural-marking morphology of the lexifiers was lost during the creolization process. Therefore new strategies are being created and are gradually grammaticalized. Variability is one of the key properties of new plural (and other grammatical) markers during the grammaticalization process, where constructions have been fixed to a certain degree, but have not reached invariance in each plural context. Therefore, the behavior of plural markers in creoles is one salient feature which points in the direction where we should systematically look for universal creole features: features which reflect diachronic processes in creolization. Many of the grammatical features which I have discussed in this paper result from language change processes where essentially the lexifier's and/or substrate's structural pattern prevails in the new creole language. But the features unique to creoles, i.e. creole universals, are really diachronic universals (cf. Bybee 2006). Thus, occurrence of plural markers seems to be one of the most promising creole universals. But this is a topic for another paper.

Conclusion
If we want to generalize over creole languages, we need to avoid bias and consider cases that are as independent of each other as possible. Just counting creole languages in a large database (such as APiCS) irrespective of their genealogical and areal relatedness is not enough. Thus, I suggest that groups of creoles which are historically closely connected and share both the lexifier and the substrate type should be counted only once. In other words, rather than counting languages, one should count bi-clans.
Furthermore, I showed that features which seem wide-spread in creoles may turn out not to have a pan-creole status once the bi-clan distribution is considered. Likewise, features which seem rare in creoles world-wide turn out to be not rare, but just areally restricted, where areal restriction often points to substrate/adstrate influence.
Finally I suggested that creole universals are really diachronic universals: The loss of much grammatical marking (not only inflectional marking) and the subsequent restructuring and renewal processes in creole languages have left their unique footprints: the unusual amount of newly grammaticalized structures often entails variable marking, which then is one good diagnostic of creole grammars.