GAUTIER Gérard
METHY Daniel
March 1998
Pre-project v. 2
If such an observation must not frighten the researcher, it imposes on him rigor, ever-renewed vigilance, and full consciousness of his responsibilities.
Building a corpus of the Kurdish language will therefore meet difficult problems; among them, the collecting of usable texts (scarce, scattered, difficult to access), the concurrent use of several writing systems, and the defining of significant linguistic limits are the most arduous to solve.
The aim of the work we begin here is to build a dependable tool, which will encourage linguists to do research on Kurdish language, by putting at their disposal a very large set of data to which they presently cannot easily gain access.
The presence of this corpus, which will have to conform to the internationally recognized linguistic standards as they apply to such projects, will also act as a resource for researchers already working in the Kurdish field or more generally interested in the Middle East and the role Kurds play in this area. It will indeed offer standardized access to a set of very diverse data, comprising linguistic, historical, political and anthropological elements, which cannot for now be accessed easily because of the lack of a Kurdish national archive.
Concretely, this corpus will put all these data together in computerised form on a single medium, and will use as far as possible an encoding common to all Kurdish dialects, while offering a large choice at presentation level (writing systems...), adaptable to each researcher's needs. This aspect is particularly important for the Kurdish language: because of the political division endured by the different areas of Kurdistan, sources (when they have fortunately been kept in good condition) are stored in different places and use different writing systems with which researchers, and sometimes Kurds themselves, are not always familiar.
Computerised publishing will offer high-speed access not available with ordinary methods, as well as new possibilities of on-line research in fields such as Kurdish lexicography and terminology: queries over the whole corpus by word of the texts, place of publishing, category of text, historical period, author, and even, depending upon the type of tagging applied to the texts, topics belonging to specialised fields (anthropological, political, etc.).
The building of such a corpus is a very long-term work, which will go on for several years. It will begin with the "electronic publishing" of some "test texts", work on which will give a better picture of the concrete difficulties of the project. Then several texts, considered as "founder texts", will be encoded. As a researcher's companion to the texts, a set of software products specifically adapted for use with the corpus and the Kurdish language in general, in its different writing systems, will be progressively built and distributed: on-screen presentation, transliteration, processing, configurable query software, and configurable building of secondary sets of data (KWIC concordances, frequencies, topics...), which will increase tenfold the potential use of the corpus.
In this respect, the distribution of the Kurdish corpus through the standardised channels for linguistic sources (international linguistic agencies...) will give Kurdish, in its different dialects, a place among the great research languages of the next millennium. Who would not rejoice at this prospect?
Alain RAY [RAY, 1977] sets a lower limit of one million words, but corpora may be much bigger, and the present trend is towards ever larger sizes. For instance, the Cobuild dictionary [Cobuild, 1987] is based upon a corpus ten times bigger than this figure.
If we adopt the lower limit of one million words just mentioned, which means about eight to ten books of medium size, and if the corpus gathers texts of different categories (news, novel, poetry...), while admitting that the designation of those categories itself has to be investigated as a research topic in its own right, it seems logical to decide that this limit should apply to each of them.
The diachronic corpus tries to follow the evolution of the language during the chosen period, which may well be several centuries long. A good example of this type of corpus is the ARTFL corpus of the French language, built through a collaboration between the French national research body C.N.R.S. and the University of Chicago. It runs from the XVIth century up to now.
Some corpora are built just by gathering texts, without any tagging. They are far from useless: a great amount of raw text makes it possible, for instance, to study the frequencies of words and phrases in the language, to study collocations, or even to build concordances.
Concordances may be obtained through an automatic search across the whole corpus for all the occurrences of a given word. In a synchronic corpus, they are of obvious interest to the lexicographer. But even in a diachronic corpus, which potentially presents a history of the studied language (for instance from the XVIIth century to now), a set of dated concordances may well allow the historian of ideas to trace the appearance and rise of a concept or of a new meaning. A common example is a study using the ARTFL corpus, which revealed that the use of the word "revolution" in its meaning of "political upheaval" increased significantly in French texts during the years preceding the French Revolution.
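To make the idea of a dated concordance concrete, here is a minimal sketch in Python of a keyword-in-context (KWIC) search. It is only an illustration: the function name kwic, the two sample sentences and their dates are hypothetical and come from no actual corpus, and the tokenisation (runs of word characters) is deliberately naive.

```python
import re

def kwic(texts, keyword, width=3):
    """Return (date, left context, keyword, right context) for each occurrence.

    `texts` is a list of (date, text) pairs; `width` is the number of
    context words kept on each side. Matching ignores case.
    """
    lines = []
    for date, text in texts:
        tokens = re.findall(r"\w+", text)  # naive tokenisation
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((date, left, token, right))
    # Sorting by date yields the dated concordance of a diachronic corpus.
    return sorted(lines, key=lambda line: line[0])

# Hypothetical sample data, for illustration only.
sample = [
    (1789, "The revolution of the state has begun"),
    (1750, "the revolution of the planets around the sun"),
]
for date, left, word, right in kwic(sample, "revolution"):
    print(date, "|", left, "[" + word + "]", right)
```

Counting the lines returned for each period would then give the kind of frequency curve on which the "revolution" study relied.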
In the case of a multilingual corpus, it has already been mentioned that a common practice is to use tags to align the different language versions of the same text. Aligning may be accomplished through the use of identical numbers for the corresponding sections (sentences, paragraphs). Even for a monolingual corpus, this structural tagging is commonly used to mark the divisions of a text (pages, chapters, stanzas, etc.).
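Alignment through identical numbers can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing alignment tool: each version is assumed to be stored as a mapping from section number to text, and the function name align and the sample sentences are hypothetical.

```python
def align(version_a, version_b):
    """Pair the sections of two versions of a text that carry the same number.

    Each version is a dict mapping a section number to its text. A section
    present in only one version is paired with None.
    """
    numbers = sorted(set(version_a) | set(version_b))
    return [(n, version_a.get(n), version_b.get(n)) for n in numbers]

# Hypothetical example: two language versions tagged with the same numbers.
english = {1: "The house is big.", 2: "He is going.", 3: "Good night."}
french = {1: "La maison est grande.", 2: "Il s'en va."}

for number, en, fr in align(english, french):
    print(number, en, "<->", fr)
```

Pairing unmatched sections with None, rather than dropping them, keeps the gaps between versions visible to the researcher.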
However, the first information expected about a text is probably bibliographical: author, date of publication, publisher, type of text, commentary... There is a trend towards some standardization in this field [Text Encoding Initiative, 1996] [URL : Corpus Encoding Standard, 1997].
Linguistic tagging may also be applied to the words of the text, by part of speech (type of word). This is called part-of-speech or grammatical tagging (French étiquetage grammatical or assignation grammaticale [DE LOUPY, 1995]). A semantic tagging of the type long used in margins by anthropologists, to tag parts of the text by subject, may also be used. This proposal will be developed in the next section.
And in conclusion, as LANGE and GAUSSIER note [LANGE et GAUSSIER, 1995], "Multilingual corpora also facilitate reflection on each of the languages involved."
The lexical dictionary, which is concerned with language itself, is typically built in two ways: introspection, and the search for citations, that is, typical sentences. Corpus work may replace the first of these methods and notably speed up the second, as the Cobuild example clearly shows. A corpus may indeed make it possible to fine-tune, through collocation studies, the extent of the semantic field of words. Moreover, an aligned corpus, through simultaneous bilingual queries for collocations, makes it possible to specify the differences between the linguistic fields of words in the two languages. This leads to translation studies, which would be a great help to a multilingual lexicography of Kurdish.
The French review Tribune des Industries de la Langue et de l'Information Electronique (Forum for Language Industries and Electronic Information) wrote: "The border between Assisted Translation and tools to help translation is little by little becoming more blurred. We may bet with a certain amount of certainty that multilingual corpora will have an important role in Assisted Translation in the forthcoming years." [TLIE, 1995, p. 23].
And, if data on and in the Kurdish language become available, it will be possible to re-train for Kurdish some of the already existing computer tools. Section 1.2.3.5. will show that a corpus may allow the development of many "intermediate" tools that are the roots of practical computer applications which will have a real bearing on the defense of the language.
With the ultimate aim of building such tools, texts may be submitted to a semantic tagging, which could for instance refer any architectural term found to "house". Another example is the richness of the Kurdish vocabulary for clothes, which furthermore vary from one area to another. Suppose a researcher is studying Kurdish clothes. He may use the corpus to find all references to the names of garments he knows, but if there are names he is not aware of (or does not think of at the moment), they will be missed. Now, if the texts of the corpus have been tagged to refer any garment name to the term "clothes", then a computerized query by topic becomes possible.
The technique of annotating texts in the margin according to a set of standardized criteria, which later evolved into computer-aided annotation, has long been used by anthropologists. But this type of marking is complex to use: what level of precision is required for the portion of text to mark? The word, the sentence, the paragraph? Furthermore, the same section may well be concerned by several topics (for instance "cooking" + "taboos" + Bahdinan area), which is difficult to mark using what seems to be the dominant practice now in corpus tagging systems.
Furthermore, such a tagging needs to be able to express topics by keywords or standardized codes, which in turn means linking the query function to a thesaurus tree in which words are distributed into nested subsets (hyponymy/hypernymy). So a query on the word "djel û berg" (clothes) would also retrieve, if the researcher requests it, the word "djemedenî".
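The link between topic tagging and a thesaurus tree can be sketched as below. This is a minimal illustration, assuming that each section of text carries a set of topic tags (so that one section may bear several topics at once) and that the thesaurus is a tree without cycles. Only the pair "djel û berg" / "djemedenî" comes from the text; the function names, the section numbers and the other tags are hypothetical.

```python
# Hypothetical thesaurus: each term maps to its narrower terms (hyponyms).
THESAURUS = {
    "djel û berg": ["djemedenî"],  # the two terms quoted in the text
    "djemedenî": [],
}

def expand(term, thesaurus):
    """Return the term followed by every term nested below it in the tree."""
    result = [term]
    for narrower in thesaurus.get(term, []):
        result.extend(expand(narrower, thesaurus))
    return result

def query(sections, term, thesaurus, use_thesaurus=True):
    """Return the numbers of the sections tagged with the term.

    `sections` maps a section number to the set of topic tags attached
    to it. With `use_thesaurus`, narrower terms are searched as well.
    """
    wanted = set(expand(term, thesaurus)) if use_thesaurus else {term}
    return sorted(n for n, tags in sections.items() if tags & wanted)

# Hypothetical tagged sections.
sections = {
    1: {"djemedenî", "Bahdinan"},
    2: {"cooking", "taboos"},
    3: {"djel û berg"},
}
print(query(sections, "djel û berg", THESAURUS))         # sections 1 and 3
print(query(sections, "djel û berg", THESAURUS, False))  # section 3 only
```

Making the expansion optional reflects the condition stated above: narrower terms are retrieved only if the researcher requests it.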
Finally, we must note that an encyclopedic work, or a specialized one, will automatically require a study of interdialectal or regional lexical differences (see Technical Document, section 1.2 - "Data normalisation"), as the research of S. MOHSENI shows in the case of clothes [MOHSENI, to be published].
Hence the question of encyclopedic tagging can be seen to be very complex, and cannot be solved within the limits of the present document.
However, the reflection on this problem as conducted here yields at least one positive result: the work on how to build a Kurdish corpus, which will need ancillary software to be accessed, may in itself generate other research on the computer processing of the Kurdish language in general. This point will be treated more precisely in the Technical Document, section 1.4, "Possible uses of data", but I will first mention in the next section some possibilities which, in my opinion, open new room for the defense of the Kurdish language.
Such a lemmatised lexicon may be of great use for very practical software applications contributing to the safeguarding of the language. The most obvious, and the easiest to build, is automatic spelling correction of Kurdish in the word processing applications available on the market. Then, such a lexicon, fed to Optical Character Recognition software, will increase its recognition rate, and will start a cumulative process in the availability of electronic texts in Kurdish (it will become easier to scan and recognize Kurdish texts). This cumulative effect will apply to research itself, which will become more attractive.
Such a lexicon could be distributed quite independently from the commercial software for which it would act as a plug-in. Last, it could be used in conjunction with a computerized grammar of the Kurdish language to start research on Computer Assisted Translation and, more immediately, for the semi-automatic tagging of corpus texts.
This last point needs explanation. Suppose for instance that the word derom has been referred in the lexicon to the first person singular of the present indicative of the verb royishtin; whatever means have been used for that, by hand or automatically through grammar rules including verb inflection, does not concern us here. The consequence is that, each time this form is matched during a search through the text, it may be tagged in the proper manner by consulting the lexicon. Having a lexicon of inflected forms also makes it possible to search simultaneously for all the forms of a given word, as a query in the lexicon will retrieve all the forms linked to a particular form (i.e. royishtin from derom as well as derom from royishtin).
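The two uses of the lexicon just described (tagging a matched form, and retrieving all the forms linked to a word) can be sketched as a two-way index. This is a minimal illustration only: the class name Lexicon, its methods and the tag notation "V.PRES.IND.1SG" are hypothetical; the single entry is the derom / royishtin example given above.

```python
from collections import defaultdict

class Lexicon:
    """Two-way index between inflected forms and their lemma."""

    def __init__(self):
        self.analyses = {}             # form -> (lemma, grammatical tag)
        self.forms = defaultdict(set)  # lemma -> set of inflected forms

    def add(self, form, lemma, tag):
        self.analyses[form] = (lemma, tag)
        self.forms[lemma].add(form)

    def tag_text(self, tokens):
        """Attach the lexicon analysis to each token (None if unknown)."""
        return [(token, self.analyses.get(token)) for token in tokens]

    def all_forms(self, word):
        """Return every form sharing the lemma of `word` (a form or a lemma)."""
        lemma = self.analyses.get(word, (word, None))[0]
        return {lemma} | self.forms.get(lemma, set())

lexicon = Lexicon()
# The only entry taken from the text; the tag string is an invented notation.
lexicon.add("derom", "royishtin", "V.PRES.IND.1SG")

print(lexicon.tag_text(["derom"]))
print(sorted(lexicon.all_forms("derom")))
```

A form that belongs to several lemmas would need a list of analyses per form instead of a single one; the sketch leaves that case aside.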
This section will give a minimal description of the different dialects and writing systems in use now or in the past. Besides the three big groups of dialects (northern, central, southern), it will also mention a Kurdish "dialect" which was long the literary standard, Goranî, now disappeared as a living language.
In the same manner, besides the three big groups of writing systems (Roman, or Hawar, modified Arabic-Persian, modified Cyrillic), some writing systems now out of use will be presented, and specifically the "traditional" Kurdish writing system, close to the Ottoman notation of Turkish. For each of the present writing systems, a quick presentation of its evolution will be drawn if necessary.
Finally, as this presentation is made from the point of view of the construction of a corpus of texts, minimum elements, diachronic as well as synchronic, will be given concerning popular and literary expression in each of the dialects concerned.
Depending on the area, those dialects are written in a Roman writing system (the majority), but also in Cyrillic and Arabic-Persian scripts (see below).
MACKENZIE [MACKENZIE, 1961 (1990)] distinguishes in this "Kurmandjî" group the following subdialects: Sûrčî, Akre, Amadiye, Barwârî-žor, Gullî, Zakho and Sheikhan.
The publishing of the grammar of Djeladet BEDIR KHAN and Roger LESCOT [BEDIR KHAN and LESCOT, 1970] marks a very important date in the process of the theoretical definition of a northern Kurdish language, based on the written, standardized language which was used in the review named Hawar.
MACKENZIE [MACKENZIE, 1961 (1990)] distinguishes in this group the following subdialects: Suleimaniye, Wârmâwa, Bingird, Piždar, Mukrî, Arbil, Rewandiz and Xôšnâw.
In Sulaymania, capital city of the Baban, the Ottoman Empire had created a secondary school (Rushdiye), the graduates of which could go to Istanbul to continue their studies. This allowed Soranî, which was spoken in Sulaymania, to progressively replace Goranî as the literary vehicle. MACKENZIE writes that the present Kurdish standard called Soranî is in fact an idealized version of the Sulaymania dialect, which uses the phonemic system of the Piždar and Mukrî dialects.
In any case, we lack information on the exact present situation of those dialects, and would be very happy to know more before considering including samples of them in the corpus, especially about the extent to which there is publishing in them...
Substantial emigration to Germany and Sweden gave birth to several publications. There is some difficulty with Dimîlî because of this apparent divorce between linguistics stricto sensu and the political and cultural fields. Its place in a corpus of the Kurdish language must therefore be discussed, and work on Dimîlî seems, in any case, to call for an independent project whose links with the building of the corpus are yet to be decided precisely.
Indeed, linguists put it in the north-western Iranian group of languages, together with Bâjalânî of Mosul and Kandulai in the Zagros areas. But, if they distinguish it very clearly from Kurdish itself, it is spoken in an indisputably Kurdish setting, at all levels. Moreover, it has been used as a literary language, especially for poetry, in vast areas of central Kurdistan. It was notably the literary language of the Hawraman principality, and is still spoken in this area (though the Iraqi government's policy there has almost emptied Hawraman, posing a real threat to the survival of Hawramî).
It is then easy to see the difficult problems that this very specific situation raises for the building of a Kurdish corpus. Those problems are of the same order as the ones already mentioned for Dimîlî.
Josif M. ORANSKIJ [ORANSKIJ] writes about them :
Beyond superficial affinities with southern Kurdish dialects, and especially Faylî, it is then clearly established that Lorî must be considered apart, from a linguistic as well as a cultural point of view. Concerning the specific problems of corpus building, the rather recent character of written publications is also an argument for this position.
One of the first texts we have concerning the Kurds is the Sharaf Nameh, a description of Kurdish tribes and princes from the XVth century. It is a Persian text. Many classical Kurdish writers used Arabic, Turkish or Persian, even when they wrote about Kurds. As has already been mentioned, when a literary standard proper to the Kurds and different from those languages did emerge, it was Goranî, a linguistically non-Kurdish language, which became the culturally Kurdish literary standard.
Last, the traditional Kurdish texts essentially consist of poetry (the prose novel will rise only during the modern period, and even very late in it).
In Turkey, as early as 1909, the Union and Progress Committees forbade Kurdish clubs. Twenty years later, Atatürk forbade the use of the Kurdish language and ordered the destruction of each and every Kurdish text the authorities were able to lay their hands on. In the Soviet Kurdish areas, there have indeed been many publications in Kurmandjî from the 1920s to now (reviews, novels, poetry), but unfortunately they are very little known in the West.
If it has very recently seemed more possible than before to publish in Kurmandjî in Turkey, it is at the cost of the ever-present physical danger, a danger of death, to which authors, journalists, sellers and owners are exposed. So it is not surprising that it is in the emigration, and specifically in the press, that this dialect is most alive.
If the status of literary standard still seems to belong to its southern neighbour, Soranî, the situation could well evolve quickly.
In Iraq, after 1926, when cultural autonomy was granted to the Kurds within the Iraqi framework and Kurdish schools were allowed in the Kurdish area, several Kurdish newspapers began to be published. But after 1930 the autonomy was suppressed, and there followed the revolt of Shaikh Mahmud, the revolt of the Barzan brothers (1943-45) and their crushing. After the collapse of the Kurdish nationalist movement in 1975, at the very time when the Kurdish language was officially recognized and the Kurdish Academy in Baghdad was allowed to publish numerous studies (see Annex A), the scorched-earth policy practiced by the central government in Kurdistan itself, and operation Anfal, which resulted in nearly 200 000 deaths, suggest that the real situation of the language was perhaps not what first met the eye.
However, from the point of view of written expression, this very date of 1975 still marks a very important moment of transition, which has no equivalent in the history of Kurmandjî. It could be considered as a possible boundary between a synchronic and a diachronic part of our corpus (for Kurmandjî, perhaps the date of the adoption of the Hawar alphabet, in 1932, could be considered such a boundary between two parts of a period running from the middle of the XIXth century to nowadays, but this choice remains mainly conventional).
In Iran, even if publication in Kurdish has at times been allowed, it was never a language of teaching. During the short-lived republic of Mahabad, in 1945, a Kurdish press came to life, specifically for women and children, but it did not survive beyond the year the republic itself lasted. More recently, since the Islamic revolution, some cultural reviews have again been published, using the standard writing system put forward by the Kurdish Academy in Baghdad (see section 2.4. "The different writing systems"), for northern as well as central dialects. There are also publications in Lorî.
Considered from the point of view of the different types of texts, at the beginning of the modern period poetry keeps its preeminence, inherited from the classical period. The poetic vocabulary evolves considerably under the influence of the nationalist movement. For instance, up to a rather late period (the sixties in Soranî), it contained many Arabic words (the literati often being Arabic-trained mullahs), before undergoing a relative "kurdisation".
So there is a marked imbalance within written literary expression in modern Kurdish, with very few works in prose and a great deal of poetry. It was therefore the press which played a very important role in creating a modern written Kurdish language, in whatever dialect. First created in the thirties, it still has a very big role in the contemporary life of this language, notably through a real explosion in the number of newspapers published in emigration during the 1980s and 1990s. So press texts and papers should obviously be present in any endeavour to create a Kurdish corpus.
From the point of view of language standardization, regional characteristics are more pronounced in Kurmandjî than in Soranî, the latter having witnessed in the seventies the creation by the Kurdish Academy of a standard "Ideal Soranî", to use the expression put forward by MACKENZIE.
Any corpus work will require deeper studies of this mechanism of linguistic standardization and the problems it poses, and of how corpus research might help to work on them. The work of Amir HASSANPOUR would be of great use here.
Last, there are also some academic publications, and technical texts. Those would constitute a very important part of a Kurdish corpus.
The future evolution seems to lead the Kurdish language towards the setting up of a bidialectal standard, Soranî and Kurmandjî, even if this fact does not seem to be easily recognized even by the most politicized Kurds...
Djeladet Bedir Khan devised this notation so that it would be rather close to the alphabet adopted by the Turkish government to replace the Ottoman letters in literacy work when those were forbidden in 1928. As it does not allow a precise notation of the realisations of speakers coming from different areas of the Kurmandjî zone, this writing is not totally phonetic, but it is precisely a good compromise which allows a writing common to all those different speakers. For instance, Hawar cannot mark the difference between aspirated and accented consonants " ç k p r t " [RIZGAR, 1993], a difference which is clearly written in Cyrillic notation through the use of a diacritic (a following quote mark).
There are several versions of the Hawar alphabet [ÇELIKER, 1996]: one uses the opposition between the short (mute) vowel written " ı " and a long one written " i ", while another notation uses respectively " i " and " î " to express the same opposition.
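Converting a text from one of these notations to the other is a simple case of the data normalisation such a corpus requires. The sketch below is only an illustration under stated assumptions: the short vowel of the first notation is taken to be the dotless i, the capital letters are assumed to follow the same correspondence, and the names A_TO_B, B_TO_A, to_notation_b and to_notation_a are hypothetical. The sample string is there only to show the substitution.

```python
import unicodedata

# Assumed correspondence between the two notations:
#   notation A: short vowel "ı" (dotless i), long vowel "i"
#   notation B: short vowel "i",             long vowel "î"
# The capital short vowel is "I" in both, so it needs no entry.
A_TO_B = str.maketrans({"ı": "i", "i": "î", "İ": "Î"})
B_TO_A = str.maketrans({"i": "ı", "î": "i", "Î": "İ"})

def to_notation_b(text):
    """Rewrite a text from notation A into notation B."""
    # NFC makes sure accented letters are single characters before mapping.
    return unicodedata.normalize("NFC", text).translate(A_TO_B)

def to_notation_a(text):
    """Rewrite a text from notation B into notation A."""
    return unicodedata.normalize("NFC", text).translate(B_TO_A)

sample = "jın sir"  # illustrative string written in notation A
print(to_notation_b(sample))
print(to_notation_a(to_notation_b(sample)) == sample)
```

Because each character is mapped independently, the two substitutions do not interfere with each other, and the conversion can be reversed as long as a text uses a single notation throughout.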
The technical problems stemming from this "lack of phoneticity" in Hawar writing will be mentioned again in the Technical Document's section 1.2., concerned with the "Problems of data normalisation", notably to address the question of transliteration to the Cyrillic writing which is still used nowadays by the majority of the Kurds in the ex-U.S.S.R.
The Kurdish texts published using this alphabet are for the most part written in Kurmandjî dialect.
In Iraq, contrary to Turkey, the authorities always fought with the utmost energy any attempt to romanize Kurdish alphabets. In 1928 the first grammar written by a Kurd, Sa'îd Sidqî Kâbân, was published in Baghdad. In it, the author put forward a proposal for a better adaptation of the Arabic alphabet to Kurdish, mainly through adding one diacritic. The present writing is in part descended from those proposals.
So this writing system is identical neither to the defective alphabet used during their history by the Ottomans nor to the present Persian alphabet. In a way which perhaps took its inspiration precisely from Ottoman Turkish [DENY, 1921], it gives a new value to the isolated and final forms of the letter heh to express the sound which is written in Kurmandjî as an " e " ("open a"), in phonological opposition with alif. This writing also uses a diacritic (in the shape of a little "v", in Kurdish hewt by reference to the figure) to distinguish two vowels which exist only in Kurdish, waw hewt and yeh hewt, as well as two consonants characteristic of the central group of dialects, lam hewt and ra hewt.
This writing is defective only for the sound written in Kurmandjî " ı " (mute i, written as a dotless i).
Kurdish speakers of northern dialects living in Iran and Iraq also make use of this writing system.
This notation has evolved during the last thirty years, notably for the long vowels, and . They were first written as and . Then this writing was forsaken for and . Last, the opposition wu / wi itself lost part of its significance in oral Soranî, as mentioned in the introduction of his dictionary by H. HAKIM, who chose to write wirch - - and not wurch .
The notation of the vibrant ra has also been subject to variation : at the present moment it is mostly omitted at initial, as this letter is always vibrant there, it has been written as well as as , depending on the text, this last notation being globally much less used than the other .
At the time when Kurdistan was divided between the Ottoman and Persian empires, Kurds, whatever dialect they spoke, used this modified Arabic alphabet, as Persians and Ottomans used it themselves.
In Kurmandjî, the Kurdish-French dictionary of Auguste JABA (1879), like the texts published by Basile NIKITINE at the beginning of this century and the texts of Mahmud BAYAZIDI, as republished in 1963 in the U.S.S.R. by M. RUDENKO, all use this notation. Soranî also used this writing; Abstracta Iranica even mentions, in order to criticize it, a present use of it in Iran (Ibrhimpur, M.T., Dastur-e zaban-e kordi-ye sanandai, 1979, notice 124, A.I. III)!
The most notable characteristic of this writing is a relative defectiveness in the notation of vowels. In his introduction to the dictionary of JABA, Ferdinand JUSTI notes orthographic variations in the texts which were used to build the dictionary, but also wrong orthographic practices...
This first introduction will need to be completed after a thorough study of the writing practices in the BAYAZIDI texts as well as in those presented by NIKITINE, if these are to be integrated into the corpus.
None of the writing systems used for the different Kurdish dialects is exact from a phonological point of view. This should not surprise us, as it is the case for almost all the writing systems of almost all languages.
So MACKENZIE, for the phonetic transcription of Kurdish speech he published in 1962, used his own transcription [MACKENZIE, 1961, 1962]. Unfortunately, transcriptions for Kurdish are numerous. Apart from IPA, WAHBY and EDMONDS used in their Soranî dictionary a transcription perhaps inspired by Hawar, but which, if it does transcribe the - djim - letter with " c ", differs from Hawar in its use of the digraphs "sh" and "ch" to represent and letters, respectively. The use of those digraphs has been criticized, and it seems indeed preferable, with the aim of obtaining a unified transliteration for the Arabic as well as the Cyrillic writing systems for Kurdish, to try to extend the Hawar system to the sounds specific to central dialects. That is what is put forward (but not always put into practice) in Abstracta Iranica: ö to transcribe the sound written in Soranî (a notable difference with Wahby), and " l " or " " for , and then " r " or "
For accented consonants in Kurmandjî, RIZGAR puts forward: ç k p r t , and, for letters with "Arabic pronunciation", he writes " e " for , " h " for and " x " for , whereas it is sometimes written " " or " ". It would certainly be possible to devise a system combining those notations to encode speech for which phonetic data is available.
It has been mentioned in Section 1 that the way a corpus is organised and possibly tagged depends partly on the aims of its constitution. Section 2 clarified some of the available data about the Kurdish language. This section will try to draw conclusions from the two preceding ones, first in terms of the internal structure of a corpus of the Kurdish language, which is the final object of the work. Then the methodology to be used to build the corpus progressively will be studied, with a specific emphasis on feasibility problems.
It is the notion of reusability, as put forward by Marie-Paule PERY-WOODLEY :
This classification could if necessary be refined through the use of subdialectal groups. Then, when the author tries to conform to an ideal standard ("Sulaymania model" for Soranî, "Bedir Khan" model for Kurmandjî), the subdialectal group or the area to which he belongs should be mentioned.
Clearly, this periodisation is to be used with care, and its relevance is itself a question to be answered if the aim is to make it evolve towards a more relevant distribution of the data. This question must certainly be studied as the building of the corpus progresses. In any case, the final corpus must offer any user the possibility of selecting texts himself according to his own criteria.
Last, D. BIBER even put forward a methodology in which a typology proper to each corpus is elaborated through a statistical treatment of several grammatical parameters of the texts it contains, which could in the future allow automatic text classification [BIBER, 1989] (though one may wonder whether a classification made from a corpus which has since received new texts is still valid...).
A provisional classification into eight categories is put forward here, which will be used in Annex A of this document:
This mode of classification must be considered as used only for immediate convenience, or better, as a minimum of information about texts delivered to the corpus user. Further research will have to be conducted during the project, at least to choose the best way to make it more complete, and at best to devise alternative classification(s) based on parameters coming from the study of the texts themselves.
For now, however, these categories will be the ones used in Annex A of the present document. It is also from this practical classification that the next section, "3.2. Development priorities", will begin to tackle the problem of the availability of Kurdish texts.
It must be mentioned that those categories are not conceived as mutually exclusive: it should not be forbidden to put a text into several categories simultaneously, some papers of the political press, for instance, being also put in the "Essay" class. Last, specialised texts, technical or pedagogical, which could also constitute a category of their own, are not defined here, only because counting this type of text is difficult: they are usually not referenced in the West even if they reach it.
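A text belonging to several categories at once, and a user selecting texts by his own criteria, can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields, the function name select, the titles, years and the category names other than "Essay" are hypothetical placeholders, not the categories of Annex A.

```python
from dataclasses import dataclass, field

@dataclass
class TextRecord:
    title: str
    dialect: str
    year: int
    categories: set = field(default_factory=set)  # several categories allowed

def select(records, category=None, dialect=None, period=None):
    """Keep the records matching every criterion that is given.

    `period` is an inclusive (first_year, last_year) pair.
    """
    chosen = []
    for record in records:
        if category is not None and category not in record.categories:
            continue
        if dialect is not None and record.dialect != dialect:
            continue
        if period is not None and not (period[0] <= record.year <= period[1]):
            continue
        chosen.append(record)
    return chosen

# Hypothetical records; titles, years and category names are placeholders.
records = [
    TextRecord("Text A", "Kurmandjî", 1935, {"Press", "Essay"}),
    TextRecord("Text B", "Soranî", 1978, {"Poetry"}),
    TextRecord("Text C", "Kurmandjî", 1990, {"Press"}),
]
print([r.title for r in select(records, category="Press")])
print([r.title for r in select(records, category="Essay", period=(1930, 1940))])
```

Storing the categories as a set, rather than as a single field, is what lets the same press paper be found under both classes.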
Concerning the size of the corpus, it has already been mentioned that one million words was commonly considered minimum whatever the sought application. But this size is to look at in relation withnthe different categories of text comprising the corpus, supposing each one could be used as a base for a specific study. If a researcher is only interested into one period, or only in newspapers, he should still find in the corpus enough texts of this category or period to work.
Concerning the type of texts to include, the experience of the developping of other corpora shows that, if the researchers at the beginning took care of respecting specific proportions between the different types of texts they defined, in the name of corpus representativity, this last notion tends now to be questioned by linguistic arguments : considering the great number of "sub-languages" corresponding for instance to specific socio-professional fields, a true representativity would only be obtained through the integration of all the uses, which is obviously impossible.
So it seems that we are now witnessing a move from representativity towards reusability - a concept we have already met. Lastly, researchers take more and more into account the real availability of this or that type of text already in electronic form. For the "great languages" such as English, this "reality principle" leads to the building of bigger and bigger corpora : the British National Corpus contains 100 million words, whereas the corpora of the preceding generation were considered "big" with ten times fewer...
For less widely used languages, this same principle has led to a pragmatic questioning of the notion of representativity : for lack of means, corpora simply had to be built with whatever texts were at hand ! It is probable that a Kurdish corpus will be subject to exactly this type of choice.
Lastly, the political situation of the Kurds is in itself a parameter which must be taken into account in the choices for the corpus development. This situation generates an "enclosing phenomenon" for their language compared to other languages in the world : few studies exist on the Kurdish language, be it in linguistics or in bilingual lexicography. Moreover, we have already seen that the lack of unity of Kurdish populated areas resulted in the use of different writing systems, which in turn reproduces the enclosing phenomenon on a smaller scale, between the different dialects.
So it seems that, to fight against this double enclosing, it would be important to accomplish first :
The Project itself will be drawn up after this second circulation. But it is very clear in the mind of its authors that it consists only of - and only has the ambition to be - proposals for orientations, which will still be, at this stage, open to modification and amendment. Indeed we hope our initiative will give rise to other initiatives, which we would wish to be coordinated with ours in a way still to be defined, solely for the sake of the best overall efficiency. But in any case, it is not our aim to control anyone else's work.
A first list, which already shows that these materials are in the end rather numerous, will be found in Annex A, section 5.3. This preparatory census will accompany the successive versions of the project so as to gather proposals to supplement it and critical remarks concerning the choices made and the criteria used (defects of the chosen edition of a given text, supplementary information about other available translations etc).
However, one important point is the attribution of a unique reference number to each text, whose precise definition remains to be decided and which, like numerous technical matters, will require the preparation of technical documents. Lastly, a guide for reflection and/or conversation, a sort of oral version of this form, could be devised. It would help interested persons - be they researchers, Kurdish language speakers or both - to contribute directly or to discuss with others the texts which, according to them, should definitely be incorporated first into the corpus, and the criteria for selecting such texts.
Lastly, before really thinking of building a full corpus, it seems essential first to produce electronic versions of some short texts. This will make it possible to take the measure of the conceptual and technical difficulties which are in store for the producers, and to give them a way to train themselves with the tools they will have to use. The Technical Document will give some hints about those tools.
As the knowledge of the persons taking part in the project increases, as other persons volunteer to participate, and perhaps as specific requests concerning the use of the corpus emerge, it should be possible to proceed to a new type of marking, using other criteria, notably encyclopedic ones (see section 1.2.3.1).
Simultaneously :