GAUTIER Gérard
METHY Daniel
March 1998
Pre-project v. 2

Preliminary reflections for the constitution of
a national corpus of Kurdish language

1. Potential utility of a text corpus for Kurdish language

    1.1. General Preamble

      Any study in the Kurdish field immediately meets serious difficulties. One of the greatest is, needless to say, political in nature. Building a text corpus for the Kurdish language cannot be seen as a neutral act: to do so is first to assert the existence of a language, of a territory, of a nation; it is also to confront the division of this nation and of this language, a division which is as much imposed and suffered as it is perpetuated by Kurds themselves.

      If such an observation must not frighten the researcher, it does impose on him rigor, an ever-renewed vigilance, and full awareness of his responsibilities.

      The building of a corpus of the Kurdish language will therefore meet difficult problems; among them, the collection of usable texts (scarce, scattered, difficult to access), the concurrent use of several writing systems, and the definition of significant linguistic boundaries are the most arduous to solve.

      The aim of the work we begin here is to build a dependable tool which will encourage linguists to do research on the Kurdish language by putting at their disposal a very large set of data to which they cannot at present easily gain access.

      This corpus, which will have to conform to the internationally recognized linguistic standards that apply to such projects, will also act as a resource for researchers already working in the Kurdish field or more generally interested in the Middle East and the role Kurds play in this area. It will offer standardized access to a set of very diverse data, comprising linguistic, historical, political and anthropological elements, which cannot for now be accessed easily because of the lack of a Kurdish national archive.

      Concretely, this corpus will bring all these data together in computerised form on a single medium and will use, as far as possible, an encoding common to all Kurdish dialects, while offering a wide choice at presentation level (writing systems...), adaptable to each researcher's needs. This aspect is particularly important for the Kurdish language because, owing to the political division endured by the different areas of Kurdistan, sources (when they have fortunately been kept in good condition) are stored in different places and use different writing systems with which researchers, and sometimes Kurds themselves, are not always familiar.

      Computerised publishing will offer a speed of access not available with conventional methods, as well as new possibilities for online research in fields such as Kurdish lexicography and terminology: queries over the whole corpus by word of the texts, place of publication, category of text, historical period, author, and even, depending upon the type of tagging applied to the texts, topics belonging to specialised fields (anthropological, political, etc.).

      The building of such a corpus is a very long-term task, which will go on for several years. It will begin with the "electronic publishing" of some "test texts", work on which will give a better picture of the concrete difficulties of the project. Then several texts, considered as "founding texts", will be encoded. As a researcher's companion to the texts, a set of software tools specifically adapted for use with the corpus and with the Kurdish language in general, in its different writing systems, will be progressively built and distributed: on-screen presentation, transliteration, processing, configurable query software, and configurable building of secondary data sets (KWIC concordances, frequencies, topics...), which will increase tenfold the potential use of the corpus.

      In this respect, the distribution of the Kurdish corpus through the standard channels for linguistic sources (international linguistic agencies...) will give Kurdish, in its different dialects, a place among the great research languages of the next millennium. Who would not rejoice at this prospect?

    1.2. Introduction : corpora techniques

      1.2.1 Basic necessities

        1.2.1.1 Minimum size

          The notion that a text corpus can only be used for research on those properties of language that belong to statistics (word or phrase frequencies, semantic fields...) misses the truth. However, if the aim is to allow lexicographic research able to disclose characteristics belonging to what is conventionally called the "genius" of a language, the corpus must give access to rather infrequent realisations. This makes it necessary to reach a minimum size, as there are realisations that occur only once or twice in one million words!

          Alain RAY [RAY, 1977] sets a lower limit of one million words, but corpora may be much bigger, and the present trend is towards larger sizes. For instance, the Cobuild dictionary [Cobuild, 1987] is based upon a corpus ten times this size.
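
          The size argument can be made concrete with a small calculation. The sketch below is only an illustration: it assumes, as a simplification the text does not make, that occurrences of an item are independent and follow a Poisson model, and `chance_of_seeing` is a hypothetical helper name.

```python
import math

def chance_of_seeing(rate_per_million: float, corpus_words: int, at_least: int = 1) -> float:
    """Probability that an item occurs `at_least` times in a corpus.

    Assumption (not from the text): occurrences follow a Poisson model,
    the item appearing on average `rate_per_million` times per million words.
    """
    expected = rate_per_million * corpus_words / 1_000_000
    # P(X < at_least) is the sum of the first `at_least` Poisson terms.
    below = sum(math.exp(-expected) * expected**k / math.factorial(k)
                for k in range(at_least))
    return 1.0 - below

# An item that occurs on average once per million words:
for size in (100_000, 1_000_000, 10_000_000):
    print(size, round(chance_of_seeing(1, size), 3))
```

          Under this model, an item that occurs on average once per million words has roughly a 63% chance of appearing at all in a one-million-word corpus, and less than a 10% chance in a corpus of 100 000 words, which is one way of seeing why a lower limit is needed.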

        1.2.1.2. Consistency and care in realization

          The chosen texts must meet quality criteria (error rate), and must also be consistent in the manner they are organized (character sets used). These problems will be studied in more detail in the first Technical Document of the project.

        1.2.1.3 Data representativity

          There is now a debate within the corpus research community about the concept of the representativity of the data gathered inside text corpora, which is partially challenged [PERY-WOODLEY, 1995]. However, a more intuitive notion of representativity remains useful since, for example, Lamartine's poems cannot embody all of the French language of the 1840s... The problem of how to choose the data to put in the corpus will be taken up in Section 3, "Methodology of the corpus building", which will put forward an orientation.

          If we adopt the lower limit of one million words just mentioned, which means about eight to ten books of average size, and if the corpus gathers texts of different categories (news, novel, poetry...), while admitting that the designation of those categories itself has to be investigated as a research topic in its own right, it seems logical to decide that this limit should apply to each of them.

      1.2.2 The different types of corpora

        There are numerous types of corpora, and I will not try to describe them all here. I will rather concentrate on giving some basic elements useful for reflection on the most consistent aims for a corpus of the Kurdish language. For this purpose, I will classify corpora by opposing characteristics, as does C. BALL [URL : BALL, 1997].

        1.2.2.1 Synchronic or diachronic corpus

          This opposition concerns the choice of texts. The synchronic corpus gathers texts from the same period (in practice, one ten-year window, or a few such windows at most), and bears more or less precise witness to the state of a language. A good example is the corpus on which the Cobuild dictionary is based.

          The diachronic corpus tries to follow the evolution of the language during the chosen period, which may well be several centuries long. A good example of this type of corpus is the ARTFL corpus of the French language, built through a collaboration between the French national research body C.N.R.S. and the University of Chicago. It runs from the 16th century up to now.

        1.2.2.2 Monolingual or multilingual corpus

          This opposition seems clear enough. The European Union offers several examples of projects in the field of multilingual corpora, such as the Lund University English-Swedish corpus, the parallel Bergen English-Norwegian corpus... Multilingual corpora may be parallel, that is, gather different translations of the same texts (or sometimes texts written in different languages about the same subject). A typical example (well known in the corpus research community) is the bilingual English-French corpus of the debates in the Canadian Parliament. Such corpora may furthermore be aligned, which means that corresponding position labels are inserted in the texts of each language, making it possible to relate each segment of text (paragraphs, even sentences) across the languages of the corpus. This allows research in translation or lexicography. In the case of Kurdish, it could very well be decided to align texts across the different Kurdish dialects.

        1.2.2.3 Tagged or untagged corpus

          It has just been mentioned that labels can be inserted into a text. This is called tagging the text, a practice which is steadily gaining currency in corpus research, with a precise aim that each project must determine for itself.

          Some corpora are built just by gathering texts, without any tagging. They are far from useless, as a large amount of raw text makes it possible, for instance, to study word and phrase frequencies and collocations in the language, or even to build concordances.
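
          As an illustration of what raw, untagged text already allows, here is a minimal sketch that counts word frequencies and adjacent-word collocations. The function names and the English sample sentence are invented for the example; a real tokenizer for Kurdish would need more care (writing systems, clitics).

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lower-case and keep runs of letters; the pattern is Unicode-aware,
    # so letters such as î, û or ê are kept inside words.
    return re.findall(r"[^\W\d_]+", text.lower())

def frequencies(tokens):
    """Word frequency list."""
    return Counter(tokens)

def collocations(tokens, window=1):
    """Count ordered pairs of words at most `window` positions apart."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1
    return pairs

# Invented sample text, for illustration only.
sample = "the mountain road and the mountain village and the road"
tokens = tokenize(sample)
print(frequencies(tokens).most_common(3))
print(collocations(tokens).most_common(2))
```

          With a wider `window`, the same count gives a crude picture of which words tend to occur near each other, which is the raw material of the collocation studies mentioned above.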

          Concordances may be obtained through an automatic search across the whole corpus for all the occurrences of a given word. In a synchronic corpus they are of obvious interest to the lexicographer; but even in a diachronic corpus, which potentially presents a history of the studied language (for instance from the 17th century to now), a set of dated concordances may well allow the historian of ideas to trace the appearance and rise of a concept or of a new meaning. A common example is a study using the ARTFL corpus, which revealed that the use of the word "revolution" in its meaning of "political upheaval" increased significantly in French texts during the years preceding the French Revolution.
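
          A dated concordance of this kind can be sketched in a few lines. The `Text` record, the `kwic` function and the two sample sentences below are all invented for illustration; they are not taken from ARTFL or from any real corpus.

```python
import re
from dataclasses import dataclass

@dataclass
class Text:
    title: str
    year: int
    body: str

def kwic(texts, keyword, width=3):
    """Key Word In Context: one row per occurrence, sorted by year.

    Each row is (year, title, left context, keyword as found, right context).
    """
    rows = []
    for text in texts:
        tokens = re.findall(r"[^\W\d_]+", text.body)
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((text.year, text.title, left, tok, right))
    return sorted(rows)

# Invented sample data, for illustration only.
texts = [
    Text("sample B", 1788, "a revolution in the state is feared"),
    Text("sample A", 1780, "the revolution of the planets is slow"),
]
for year, title, left, word, right in kwic(texts, "revolution"):
    print(f"{year}  {left:>20} [{word}] {right}")
```

          Because every row carries the date of its source text, counting rows per decade is enough to follow the rise of a word or of one of its meanings over time.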

          In the case of a multilingual corpus, it has already been mentioned that a common practice is to use tags to align the different language versions of the same text. Alignment may be accomplished by giving identical numbers to the corresponding sections (sentences, paragraphs). Even for a monolingual corpus, this structural tagging is commonly used to mark the divisions of a text (pages, chapters, stanzas, etc.).
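
          Alignment by identical section numbers can be pictured as follows. This is a minimal sketch: each version is assumed to be stored as a mapping from section number to text, and the sentences are invented, merely in the style of parliamentary debates.

```python
def align(version_a: dict[int, str], version_b: dict[int, str]):
    """Pair the segments of two versions that carry the same number.

    Segments numbered in only one version are left out.
    """
    shared = sorted(version_a.keys() & version_b.keys())
    return [(n, version_a[n], version_b[n]) for n in shared]

# Invented sample segments, for illustration only.
english = {
    1: "Mr. Speaker, I rise today.",
    2: "The motion is carried.",
}
french = {
    1: "Monsieur le Président, je prends la parole aujourd'hui.",
    2: "La motion est adoptée.",
    3: "La séance est levée.",
}

for number, en, fr in align(english, french):
    print(number, "|", en, "|", fr)
```

          The same numbering could relate versions of a text in different Kurdish dialects rather than in different languages.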

          However, the first information expected about a text is probably bibliographical: author, date of publication, publisher, type of text, commentary... There is a trend towards some standardization in this field [Text Encoding Initiative, 1996] [URL : Corpus Encoding Standard, 1997].
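
          To fix ideas, the bibliographical fields listed above can be held in a small record and used to select texts. The sketch below is not the header format of the Text Encoding Initiative or of the Corpus Encoding Standard; the `Header` class, the `select` function and the sample records are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Header:
    author: str
    year: int            # date of publication
    publisher: str
    text_type: str       # e.g. "poetry", "press", "novel"
    commentary: str = ""

def select(headers, **criteria):
    """Return the headers whose fields equal every given criterion."""
    return [h for h in headers
            if all(getattr(h, field) == value for field, value in criteria.items())]

# Placeholder records, for illustration only.
headers = [
    Header("Author A", 1932, "Publisher X", "press"),
    Header("Author B", 1975, "Publisher Y", "poetry"),
    Header("Author A", 1975, "Publisher Y", "poetry", "second edition"),
]

print(select(headers, text_type="poetry"))
print(select(headers, author="Author A", year=1975))
```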

          Linguistic tagging may also be applied to the words of the text, by part of speech (type of word). This is called part-of-speech or grammatical tagging (French étiquetage grammatical or assignation grammaticale [DE LOUPY, 1995]). Semantic tagging, of the type long used in margins by anthropologists to tag parts of the text by subject, may also be used. This proposal will be developed in the next section.

          And in conclusion, as mentioned by LANGE and GAUSSIER [LANGE et GAUSSIER, 1995], "Multilingual corpora also facilitate reflection about each of the languages involved."

      1.2.3 Some usages of corpora

        1.2.3.1 Language studies

          A number of studies that can already be done without corpora would be much easier if one existed. This is the case for style or authorship studies (presence of a given syntactic characteristic in an author, study of the use of a given grammatical construct...). A corpus query would then replace the manual browsing of thousands of pages on paper. An example related to the Kurdish language would be the use of a Sorani corpus to extract all the occurrences of "constructed" propositions in order to study the possible connotations of their use.

        1.2.3.2 Building of linguistic dictionaries

          As the Kurdish language lacks dictionaries, it is quite obvious that one of the first uses of corpora will be lexicographic, as soon as a Kurdish language corpus gathers enough texts.

          The lexical dictionary, which is concerned with language itself, is typically built in two ways: introspection, and the search for citations, that is, typical sentences. Corpus work may replace the first and notably speed up the second of these methods, as the Cobuild example clearly shows. A corpus may indeed make it possible to fine-tune, through collocation studies, the extent of the semantic field of words. Moreover, an aligned corpus, through simultaneous bilingual queries for collocations, makes it possible to specify the differences between the linguistic fields of the words in the two languages. This leads to translation studies, which would be a great help to a multilingual lexicography of Kurdish.

        1.2.3.3 Applications to Computer Assisted Translation

          Beyond the building of paper translation tools, we may dream of the day when computerised translation tools begin to exist for Kurdish.

          The French review Tribune des Industries de la Langue et de l'Information Electronique (Forum for Language Industries and Electronic Information) wrote: "The border between Assisted Translation and tools to help translation is little by little becoming more blurred. We may bet with some certainty that multilingual corpora will have an important role in Assisted Translation in the forthcoming years." [TLIE, 1995, p. 23].

          And, if data on and in the Kurdish language become available, it will be possible to re-train for Kurdish some of the already existing computer tools. Section 1.2.3.5 will show that a corpus may allow the development of many "intermediate" tools that are the roots of practical computer applications with a real bearing on the defense of the language.

        1.2.3.4 Building of encyclopedic dictionaries

          The encyclopedic dictionary (or for that matter the specialized dictionary) is not directly concerned with language, but rather brings the reader semantic data about a specific field. An example would be the entry for "house": in such a dictionary, the "definition" would in fact put the Kurdish house in its anthropological context, taking as a basis for instance the studies by M. MOKRI [MOKRI, 1961 (1970)] or the paper by Leszek DZIEGEL [DZIEGEL, 1988]. Such a dictionary would probably also show what a Kurdish house looks like in different areas through photographs or drawings, and would propose a cross-reference to "village" or "architecture"...

          With the ultimate aim of building such tools, texts may be submitted to semantic tagging, which could for instance refer any architectural term found to "house". Another example is the richness of Kurdish vocabulary for clothes, which furthermore vary from one area to another. Suppose a researcher is studying Kurdish clothes. He may use the corpus to find all references to the names of clothes he knows, but if there are names he is not aware of (or does not think of at the moment), they will be missed. Now, if the texts of the corpus have been tagged to refer any name of a garment to the term "clothes", then a computerized query by topic becomes possible.

          The technique of coding texts in the margin according to a set of standardized criteria, which later evolved into computer-aided coding, has long been used by anthropologists. But this type of marking is complex to use: what level of precision is required for the place in the text to be marked? The word, the sentence, the paragraph? Furthermore, the same section may well be concerned by several topics (for instance "cooking" + "taboos" + Bahdinan area), which is difficult to mark using what seems to be the dominant practice now in corpus tagging systems.
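
          One possible way around the problem of overlapping topics is to keep the topic marks outside the text, each mark pointing to a range of sentences. This is only an assumption made for the sketch below, not a choice made by the project; the `TopicMark` record and the sentence ranges are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopicMark:
    topic: str
    first_sentence: int   # inclusive, counted from 0
    last_sentence: int    # inclusive

def sentences_about(marks, *topics):
    """Indices of the sentences covered by every one of the requested topics."""
    covered = None
    for topic in topics:
        hits = set()
        for mark in marks:
            if mark.topic == topic:
                hits.update(range(mark.first_sentence, mark.last_sentence + 1))
        covered = hits if covered is None else covered & hits
    return sorted(covered or [])

# Invented marks: three topics overlapping on the same passage.
marks = [
    TopicMark("cooking", 0, 4),
    TopicMark("taboos", 2, 3),
    TopicMark("area:Bahdinan", 0, 9),
]

print(sentences_about(marks, "cooking", "taboos"))
```

          Since the marks are independent of one another, any number of topics may cover the same sentence, and the level of precision (word, sentence, paragraph) becomes a matter of which unit the ranges count.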

          And then, such tagging needs to be able to express topics by keywords or standardized codes, which in turn means linking the query function to a thesaurus tree in which words are distributed into nested subsets (hyponymy/hypernymy). So a query on the word "djel û berg" (clothes) would also retrieve, if the researcher requests it, the word "djemedenî".
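
          The link between query and thesaurus can be sketched as a simple expansion of the query term. In the example below only the pair "djel û berg" / "djemedenî" comes from the text; the function names and the sample sentences are invented, and the thesaurus is assumed to be a tree (no cycles).

```python
# Broader term -> narrower terms. Only this pair comes from the text.
THESAURUS = {
    "djel û berg": ["djemedenî"],
}

def expand(term, thesaurus, include_narrower=True):
    """Return the term plus, on request, all its narrower terms (recursively)."""
    terms = [term]
    if include_narrower:
        for child in thesaurus.get(term, []):
            for narrower in expand(child, thesaurus):
                if narrower not in terms:
                    terms.append(narrower)
    return terms

def query(sentences, term, thesaurus, include_narrower=True):
    """Sentences containing the term or, on request, one of its narrower terms."""
    wanted = expand(term, thesaurus, include_narrower)
    return [s for s in sentences if any(w in s for w in wanted)]

# Invented sentences, for illustration only.
sentences = [
    "Their djel û berg were new.",
    "He wore a djemedenî.",
    "The house was empty.",
]

print(query(sentences, "djel û berg", THESAURUS))
print(query(sentences, "djel û berg", THESAURUS, include_narrower=False))
```

          The `include_narrower` switch corresponds to the researcher requesting, or not, that narrower terms be included in the query.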

          We must also note that an encyclopedic work, or a specialized one, will automatically require a study of interdialectal or regional lexical differences (see Technical Document, section 1.2, "Data normalisation"), as the research of S. MOHSENI [MOHSENI, to be published] shows in the case of clothes.

          Hence the question of encyclopedic tagging can be seen as very complex, and not possible to solve within the limits of the present document.

          However, the reflection on this problem as conducted here yields at least one positive result: the work on how to build a Kurdish corpus, which will need ancillary software to be accessed, may in itself generate other research on the computer processing of the Kurdish language in general. This point will be treated more precisely in the Technical Document, section 1.4, "Possible uses of data", but in the next section I will first mention some possibilities which, in my opinion, open new room for the defense of the Kurdish language.

        1.2.3.5 Uses that may contribute to the defense of Kurdish language

          From a corpus, it is possible to create data sets such as a word frequency list or a lemmatised lexicon (a lexicon of all the inflected forms of words: plurals, verbal inflexions, "constructed" forms, and lemmas in general), which could be stored in a relational database.
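
          As an indication of what such storage could look like, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table layout and function names are assumptions made for the example; the single entry is the derom / royishtin example discussed later in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemma (
    id       INTEGER PRIMARY KEY,
    headword TEXT UNIQUE NOT NULL
);
CREATE TABLE form (
    surface   TEXT NOT NULL,                        -- inflected form as written
    lemma_id  INTEGER NOT NULL REFERENCES lemma(id),
    analysis  TEXT NOT NULL,                        -- grammatical description
    frequency INTEGER NOT NULL DEFAULT 0,           -- occurrences in the corpus
    PRIMARY KEY (surface, lemma_id, analysis)
);
""")

def add_form(conn, surface, headword, analysis, frequency=0):
    """Record an inflected form, creating its lemma if needed."""
    conn.execute("INSERT OR IGNORE INTO lemma (headword) VALUES (?)", (headword,))
    (lemma_id,) = conn.execute(
        "SELECT id FROM lemma WHERE headword = ?", (headword,)).fetchone()
    conn.execute("INSERT OR REPLACE INTO form VALUES (?, ?, ?, ?)",
                 (surface, lemma_id, analysis, frequency))

def analyses_of(conn, surface):
    """All (headword, analysis) pairs recorded for a written form."""
    return conn.execute(
        "SELECT l.headword, f.analysis FROM form f "
        "JOIN lemma l ON l.id = f.lemma_id WHERE f.surface = ?",
        (surface,)).fetchall()

add_form(conn, "derom", "royishtin", "indicative present, 1st person singular")
print(analyses_of(conn, "derom"))
```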

          Such a lemmatised lexicon may be of great use for very practical software applications contributing to the safeguarding of the language. The most obvious, and the easiest to build, is automatic spelling correction for Kurdish in commercial word processing software. Then, such a lexicon, fed to Optical Character Recognition software, will increase its recognition rate and will start a cumulative process in the availability of electronic texts in Kurdish (it will become easier to scan and recognize Kurdish texts). This cumulative effect will extend to research itself, which will become more attractive.
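
          A very simple form of lexicon-based spelling correction proposes the lexicon words that lie one editing operation away from an unknown word. The sketch below illustrates that general technique and is not the method of any particular word processor; the two forms are those used as an example later in this section, and their frequencies are invented.

```python
def edits1(word, alphabet):
    """All strings one deletion, swap, replacement or insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)

def suggest(word, lexicon):
    """Known words are returned as they are; otherwise the lexicon words
    one edit away, most frequent first."""
    if word in lexicon:
        return [word]
    alphabet = set("".join(lexicon))   # letters actually seen in the lexicon
    candidates = edits1(word, alphabet) & set(lexicon)
    return sorted(candidates, key=lambda w: (-lexicon[w], w))

# Form -> frequency; the frequencies are invented for the example.
lexicon = {"derom": 12, "royishtin": 5}

print(suggest("deorm", lexicon))   # two letters swapped
print(suggest("derm", lexicon))    # one letter missing
```

          Deriving the alphabet from the lexicon itself keeps the sketch independent of any one writing system; the same lookup could serve to rank the hypotheses of a character recognition program.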

          Such a lexicon could be distributed quite independently of the commercial software for which it would act as a plug-in. Lastly, it could be used in conjunction with a computerized grammar of the Kurdish language to start research on Computer Assisted Translation and, more immediately, for the semi-automatic tagging of corpus texts.

          This last point needs explanation: suppose for instance that the word derom has been linked in the lexicon to the first person singular of the present indicative of the verb royishtin; whatever means have been used for that, by hand or automatically through grammar rules including verb inflection, does not concern us here. The consequence is that, each time this form is matched during a search through the text, it may be tagged in the proper manner by consulting the lexicon. Having a lexicon of inflected forms also makes it possible to search simultaneously for all the forms of a given word, as a query in the lexicon will retrieve all the forms linked to a particular form (i.e. royishtin from derom as well as derom from royishtin).
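
          The two uses just described, tagging by lookup and retrieval of all related forms, can be sketched with two small indexes. Only the derom / royishtin entry comes from the text; the data layout and function names are assumptions made for the example.

```python
from collections import defaultdict

# (written form, lemma, analysis); only this entry comes from the text.
ENTRIES = [
    ("derom", "royishtin", "indicative present, 1st person singular"),
]

by_form = defaultdict(list)    # written form -> [(lemma, analysis), ...]
by_lemma = defaultdict(set)    # lemma -> {written forms}
for surface, lemma, analysis in ENTRIES:
    by_form[surface].append((lemma, analysis))
    by_lemma[lemma].add(surface)

def tag(tokens):
    """Attach to each token every analysis the lexicon offers.

    Unknown tokens get an empty list and ambiguous ones several analyses:
    both are left for a human to settle, hence "semi-automatic".
    """
    return [(tok, by_form.get(tok, [])) for tok in tokens]

def all_forms(word):
    """All forms sharing a lemma with `word`, whether it is a lemma or an inflected form."""
    lemmas = {word} if word in by_lemma else set()
    lemmas |= {lemma for lemma, _ in by_form.get(word, [])}
    forms = set()
    for lemma in lemmas:
        forms.add(lemma)
        forms |= by_lemma[lemma]
    return forms

print(tag(["derom"]))
print(all_forms("derom"), all_forms("royishtin"))
```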

2. Context : Kurdish language and expression

    2.1. General introduction

      The first facts to confront when initiating a Kurdish corpus project are the coexistence of several dialects and the concurrent use of several writing systems.

      These facts will have a great influence on the way the work on the textual corpus will have to be carried out. Specifically, the use of different writing systems will bear on the technical choices to be introduced in the Technical Document which will follow the present Preliminary Reflections.

      This section will give a minimal description of the different dialects and writing systems in use now or in the past. Besides the three big groups of dialects (northern, central, southern), it will also mention a Kurdish "dialect" which was long the literary standard, Goranî, now disappeared as a living language.

      In the same manner, besides the three big groups of writing systems (Roman, or Hawar; modified Arabic-Persian; modified Cyrillic), some writing systems now out of use will be presented, specifically the "traditional" Kurdish writing system, close to the Ottoman notation of Turkish. For each of the present writing systems, a brief outline of its evolution will be given if necessary.

      And then, as this presentation is made from the point of view of the construction of a corpus of texts, minimum elements, diachronic as well as synchronic, will be given concerning popular and literary expression for each of the dialects concerned.

    2.2. The different dialects

      2.2.1 Kurmandjî

        This group of dialects is also called "northern". It is indeed spoken in the northernmost parts of the Kurdistan of Iraq and Iran, in the Kurdish areas of Syria, and in almost all of the Kurdistan of Turkey (in fact everywhere except where Dimîlî (Zaza) speakers live), as well as by the Kurds of the former Soviet Union and of Khorassan. Unlike the dialects of the central group, it has genders and cases, which makes it a linguistically more archaic dialect than Soranî, from which it also differs in several phonological respects (presence of the "v" sound).

        These dialects are written, depending on the area, in a Roman writing system (the majority), but also in Cyrillic and Arabic-Persian scripts (see further).

        MACKENZIE [MACKENZIE, 1961 (1990)] distinguishes in this "Kurmandjî" group the following subdialects: Sûrčî, Akre, Amadiye, Barwârî-žôr, Gullî, Zakho and Sheikhan.

        The publication of the grammar of Djeladet BEDIR KHAN and Roger LESCOT [BEDIR KHAN and LESCOT, 1970] marks a very important date in the theoretical definition of a northern Kurdish language, based on the written, standardized language used in the review Hawar.

      2.2.2 Soranî

        This group of dialects, generically called "Soranî", is considered to be geographically "central", in contrast with the "northern" and "southern" dialects. It is spoken in the greater part of the Kurdistan of Iraq and Iran, and has neither genders nor cases. It is for the most part written in an Arabic-Persian type of script (see further).

        MACKENZIE [MACKENZIE, 1961 (1990)] distinguishes in this group the following subdialects: Suleimaniye, Wârmâwa, Bingird, Piždar, Mukrî, Arbil, Rewandiz and Xôšnâw.

        In Sulaymania, capital city of the Baban, the Ottoman Empire had created a secondary school (Rushdiye), whose graduates could go to Istanbul to continue their studies. This allowed Soranî, which was spoken in Sulaymania, to progressively replace Goranî as the literary vehicle. MACKENZIE writes that the present Kurdish standard called Soranî is in fact an idealized version of the Sulaymania dialect, which uses the phonemic system of the Piždar and Mukrî dialects.

      2.2.3 Southern Kurdish dialects

        Kurdish linguists tend to group these dialects with Lorî (see further). They comprise essentially Kermanshahî and the dialect of the Faylî Kurds, who formerly lived in the Baghdad area but have mainly been expelled to Iran by the Iraqi regime, and few of whom have really returned. As for Kermanshahî, it seems that the Kurdish people of this area, although they do speak their dialect, write mainly in Persian.

        In any case, we lack information on the exact situation of these dialects now, and would be very happy to know more before considering the inclusion of samples of them in the corpus, especially about the extent to which there is publishing in them...

      2.2.4 Dimîlî (Zaza)

        Dimîlî is traditionally counted among the north-western Iranian dialects. There are problems about its precise origin and classification. However, the majority of the linguists who have studied it agree in not considering it a Kurdish dialect [ORANSKIJ]. This is a real paradox, as its speakers, whose cradle is in the western part of the Kurdistan of Turkey, are beyond discussion Kurdish people, culturally as well as through the eminent role they have played in the national movement. It will be enough to mention the Dersim uprising of 1937 and the place this area still holds in the movements to date.

        Large-scale emigration to Germany and Sweden has given birth to several publications. There is some difficulty with Dimîlî because of this apparent divorce between linguistics stricto sensu and the political and cultural fields. Its place in a corpus of the Kurdish language must therefore be discussed, and work on Dimîlî seems, in any case, to call for an independent project whose links with the building of the corpus are yet to be decided precisely.

      2.2.5 Goranî, the first literary standard

        The situation of Goranî is not unlike that of Dimîlî. Its cradle is the Awraman (or Hawraman) area, which extends over both Iranian and Iraqi Kurdistan. With its complex and "double" characteristics, it bears witness to the length and complexity of the process of formation of Kurdish populations.

        Indeed, linguists put it in the north-western Iranian group of languages, together with Bâjalânî of Mosul and Kandulai in the Zagros areas. But, if they distinguish it very clearly from Kurdish itself, it is spoken in an indisputably Kurdish setting, at all levels. Moreover, it has been used as a literary language, especially for poetry, in vast areas of central Kurdistan. It was notably the literary language of the Hawraman principality, and is still spoken in this area (though the Iraqi government's policy there has almost emptied Hawraman, posing a real threat to the survival of Hawramî).

        It is then easy to see the difficult problems this very specific situation raises for the building of a Kurdish corpus. These problems are of the same order as those already mentioned for Dimîlî.

      2.2.6 Lorî

        Lorî is traditionally associated with Bakhtiarî. These two dialects are spoken by several million people living in an area situated at the southern limit of the Kurdistan of Iran. This geographic and human proximity, as well as the strategic importance of their area of diffusion, which extends down to the Persian Gulf, has sparked endless debate about whether or not Lors and Bakhtiars belong to the Kurdish nation; this question can only be decided, if it has to be, by the people concerned themselves. Here the difference with Dimîlî (Zaza) speakers is very clear.

        Josif M. ORANSKIJ [ORANSKIJ] writes about them:

          "Together with Persian and Tadjik (with their own subdialects), the south-western subgroup also includes the Tatî, Bakhtiarî and Lorî dialects, and the vast majority of local languages spoken in Fârs."

        Beyond superficial affinities with southern Kurdish dialects, and especially Faylî, it is then clearly established that Lorî must be considered apart, from a linguistic as well as a cultural point of view. As regards the specific problems of corpus building, the rather recent character of written publications is a further argument for this position.

    2.3. Expression in Kurdish language

      2.3.1 Introduction

        The choice of the texts to put inside a Kurdish corpus must be guided by two considerations:

        1. some minimum data concerning Kurdish culture and its evolution. This is not the place for a history of Kurdish literature, but for defining which types of texts came into being and in which period;

        2. a census of all the available texts, especially texts existing in bilingual form, without for now looking at copyright problems, which, although they risk making distribution difficult, should not hamper collection.

        This section will concentrate on the first of these two approaches; a first list of possible candidates for inclusion in a corpus will be found in Appendix A.

      2.3.2 Kurdish writing during the classical period

        The importance of oral literature in traditional Kurdish culture must first be emphasized. Some of the great founding texts of Kurdish culture find their inspiration in themes already treated in the oral tradition, such as Mem u Zin, which is inspired by Mame Alan. The orientalists of the end of the 19th century and their informants (Mahmud Bayazîdî) left us several transcripts and translations of Kurdish tales.

        One of the first texts we have concerning the Kurds is the Sharaf Nameh, a description of Kurdish tribes and princes written at the end of the 16th century. It is a Persian text. Many classical Kurdish writers used Arabic, Turkish or Persian, even when they wrote about Kurds. As already mentioned, when a literary standard distinct from those languages and proper to the Kurds did emerge, it was Goranî, a linguistically non-Kurdish language, which became the culturally Kurdish literary standard.

        Lastly, traditional Kurdish texts consist essentially of poetry (the prose novel arose only during the modern period, and even very late in it).

      2.3.2 The legal status of Kurdish language in modern times

        Even for modern times, building a Kurdish textual corpus meets specific problems compared to other languages. Indeed, the history of prohibition and repression to which the Kurdish language has been subjected in modern times has had a deep bearing on its development.

        In Turkey, as early as 1909, the Union and Progress Committees banned Kurdish clubs. Twenty years later, Atatürk forbade the use of the Kurdish language and ordered the destruction of each and every Kurdish text the authorities were able to lay their hands on. In the Soviet Kurdish areas, there have indeed been many publications in Kurmandjî from the 1920s to now (reviews, novels, poetry), but unfortunately they are very little known in the West.

        If, very recently, it seems to have become more possible than before to publish in Kurmandjî in Turkey, it is at the cost of the ever-present physical danger, a danger of death, to which authors, journalists, sellers and owners are exposed. So it is not surprising that it is in emigration, and specifically in the press, that this dialect leads its most active life.

        If the status of literary standard still seems reserved for its southern neighbour, Soranî, the situation could well evolve quickly.

        In Iraq, after 1926, when cultural autonomy was granted to the Kurds within the Iraqi framework and Kurdish schools were allowed in the Kurdish area, several Kurdish newspapers began to be published. But after 1930 the autonomy was suppressed; there followed the revolt of Shaikh Mahmud, then the revolt of the Barzan brothers (1943-45) and their crushing. After the collapse of the Kurdish nationalist movement in 1975, at the very time when the Kurdish language was officially recognized, the Kurdish Academy in Baghdad was allowed to publish numerous studies (see Appendix A). In Kurdistan itself, however, with the scorched earth policy practiced by the central government and operation Anfal, which resulted in nearly 200 000 deaths, we may guess that the real situation of the language was perhaps not what first met the eye.

        However, from the point of view of written expression, this date of 1975 still marks a very important moment of transition, which has no equivalent in the history of Kurmandjî. It could be considered as a possible dividing line between a synchronic and a diachronic part of our corpus (for Kurmandjî, perhaps the date of the adoption of the Hawar alphabet, in 1932, could be considered such a dividing line between two parts of a period running from the middle of the 19th century to the present, but this choice remains mainly conventional).

        In Iran, even if publication in Kurdish has at times been allowed, Kurdish was never a language of instruction. During the short-lived republic of Mahabad, in 1945, a Kurdish press came to life, notably for women and children, but it did not outlive the single year the republic itself lasted. More recently, since the Islamic revolution, some cultural reviews are again being published, using the standard writing system put forward by the Kurdish Academy in Baghdad (see section 2.4. "The different writing systems"), for northern as well as central dialects. There are also publications in Lorî.

      2.3.3 Kurdish modern expression

        From the point of view of the different Kurdish dialects, present-day publications are mainly in Kurmandjî and Soranî, but there are also some texts in Dimîlî, and academic reprints of classical Goranî texts. The precise state of Lorî publishing is unknown to me.

        From the point of view of the different types of texts, at the beginning of the modern period poetry keeps the preeminence it inherited from the classical period. The poetic vocabulary evolves considerably under the influence of the nationalist movement. For instance, up to a rather late period (the sixties in Soranî), it contained many Arabic words (the literati often being Arabic-trained mullahs), before undergoing a relative "kurdisation".

        So written literary expression in modern Kurdish is rather unbalanced : there are very few works in prose and a great deal of poetry. It was therefore the press which played a very important role in creating a modern written Kurdish language, in whatever dialect. First created in the thirties, the press still plays a major role in the contemporary life of the language, notably through a real explosion in the number of newspapers published in emigration during the 1980s and 1990s. So press texts and papers should obviously be present in any endeavour to create a Kurdish corpus.

        From the point of view of language standardization, regional characteristics are more pronounced in Kurmandjî than in Soranî, the latter having witnessed in the seventies the creation by the Kurdish Academy of a standard "Ideal Soranî", to use the expression put forward by MACKENZIE.

        Any corpus work will require deeper study of this mechanism of linguistic standardization and the problems it poses, and of how corpus research might help to address them. The work of Amir HASSANPOUR would be of great use here.

        Lastly, there are also some academic publications and technical texts. These would constitute a very important part of a Kurdish corpus.

        The future evolution seems to lead the Kurdish language towards a bidialectal standard, Soranî and Kurmandjî, even if this fact does not seem to be readily acknowledged even by the most politicized Kurds...

    2.4. The different writing systems

      It is necessary to mention these matters here, as the existence of several writing systems bears heavily on the practical difficulties of building a Kurdish corpus.

      2.4.1 The "Hawar" alphabet

        This Roman alphabet takes its name from the review Hawar (The Call), published in French-administered Syria, where it was first put forward in 1932 by the group of Kurdish intellectuals gathered around Djeladet Bedir Khan. It is now dominant among the Kurds of Turkey, be they Kurmandjî or Dimîlî speakers. Many reviews published in Europe in northern Kurdish use the Hawar alphabet.

        Djeladet Bedir Khan devised this notation so that it would be rather close to the alphabet adopted by the Turkish government in 1928 to replace the Ottoman letters, which were then forbidden, in literacy work. Since it does not allow a precise notation of the pronunciations of speakers from different areas of the Kurmandjî zone, this writing is not totally phonetic ; but it is precisely a good compromise, which gives all those speakers a common writing. For instance, Hawar cannot mark the difference between aspirated and accented consonants " ç k p r t " [RIZGAR, 1993], a difference which is written in the Cyrillic notation through the use of a diacritic (posterior quote).

        There are several versions of the Hawar alphabet [ÇELIKER, 1996] : one uses the opposition between the short (mute) vowel written " ı " and a long one written " i ", another uses respectively " i " and " î " to express the same opposition.

        The technical problems stemming from this "lack of phoneticity" in Hawar writing will be mentioned again in section 1.2. of the Technical Document, concerned with the "Problems of data normalisation", notably in connection with transliteration to the Cyrillic writing which is still used today by the majority of the Kurds in the ex-U.S.S.R.

      2.4.2 The notation using the Armenian alphabet

        At the beginning of the twenties, the Kurds living in Soviet Armenia started using the Armenian alphabet and published a spelling book in it for Kurdish schools. Some years later, this notation was replaced with a Roman one (see next section). It seems however that very few Kurdish texts in the Armenian alphabet have come down to us.

      2.4.3 The different Roman writing systems of Kurmandjî

        When a Roman writing system for Kurdish is mentioned, it is generally Hawar. But in the U.S.S.R. too, in 1927, the studies of the Assyrian philologist Q. Maragulov and of Ereb Şemo, inspired by the Roman alphabet put forward in the 1910s by I. A. Orbeli, a scholar of Armenian origin who was studying Iranian languages, led to another Roman notation of thirty-seven letters, adapted to the phonology of Kurmandjî. Before World War II it was replaced, by decision of the Soviet authorities, with the Cyrillic alphabet still in use now [BLAU, 1989, p. XI].

      2.4.4 The Cyrillic alphabet

        Since World War II, the Kurds of the U.S.S.R. have used a Cyrillic alphabet of 39 letters (in fact 32 and a posterior quote used as a diacritic), 6 of which are in fact Roman letters (plus an inverted " e "), added to render sounds specific to Kurdish.

        The Kurdish texts published using this alphabet are for the most part written in Kurmandjî dialect.

      2.4.5 The modern Arabic-Persian alphabet

        This notation, used in the Kurdish areas of Iran and Iraq, is based on a modified Arabic alphabet. It gives the Kurdish dialects of the central group (Soranî), for which it was designed, an almost non-defective writing, in which (with one exception) all vowels are written on the line, as consonants are.

        In Iraq, unlike in Turkey, the authorities have always fought with the utmost energy any attempt to romanize Kurdish alphabets. In 1928 the first grammar written by a Kurd, Sa'îd Sidqî Kâbân, was published in Baghdad. In it, the author put forward a proposal for a better adaptation of the Arabic alphabet to Kurdish, mainly by adding one diacritic. The present writing is in part descended from those proposals.

        So this writing system is identical neither to the defective alphabet used by the Ottomans throughout their history nor to the present Persian alphabet. In a way perhaps inspired precisely by Ottoman Turkish [DENY, 1921], it gives a new value to the isolated and final forms of the letter heh   to express the sound written in Kurmandjî as " e " ("a ouvert"), in phonological opposition with alif . This writing also uses a diacritic (shaped like a small "v", called hewt in Kurdish by reference to the numeral) to distinguish two vowels which exist only in Kurdish, waw hewt and yeh hewt , as well as two consonants characteristic of the central group of dialects, lam hewt and ra hewt .

        This writing is defective only for the sound written in Kurmandjî " ı " (mute i, written as a dotless i).

        Speakers of northern Kurdish dialects living in Iran and Iraq also use this writing system.

        This notation has evolved over the last thirty years, notably for the long vowels, and . They were first written as and . Then this writing was abandoned for and . Lastly, the opposition wu / wi itself lost part of its significance in spoken Soranî, as mentioned by H. HAKIM in the introduction to his dictionary, where he chose to write wirch - - and not wurch .

        The notation of the vibrant ra has also been subject to variation : at present it is mostly omitted in initial position, as this letter is always vibrant there ; it has been written as well as as , depending on the text, this last notation being on the whole much less used than the other .

      2.4.6 The traditional writing of Kurdish

        This section gives information about a writing system which is now almost out of use : the "traditional" writing, which mainly originated in the Ottoman Turkish system.

        At the time when Kurdistan was divided between the Ottoman and Persian empires, Kurds, whatever dialect they spoke, used this modified Arabic alphabet, as Persians and Ottomans used it themselves.

        In Kurmandjî, the Kurdish-French dictionary of Auguste JABA (1879), the texts published by Basile NIKITINE at the beginning of this century, and the texts of Mahmud BAYAZIDI, as republished in 1963 in the U.S.S.R. by M. RUDENKO, all use this notation. Soranî also used this writing ; an Abstracta Iranica notice even mentions, to criticize it, a present-day use of it in Iran (Ibrhimpur, M.T., Dastur-e zaban-e kordi-ye sanandai, 1979, notice 124, A.I. III) !

        The most notable characteristic of this writing is a relative defectivity of vowels. In his introduction to the dictionary of JABA, Ferdinand JUSTI notes orthographic variations in the texts which were used to build the dictionary, but also wrong orthographic practices...

        This first introduction will need to be completed after a thorough study of the writing practices in the BAYAZIDI texts as well as in those presented by NIKITINE, if these are to be integrated into the corpus.

      2.4.7. Phonetic transcription of linguistic data

        In the case of texts consisting of transcriptions of recorded speech, radio broadcasts, or even sound tracks of audiovisual documents, the question arises of which notation to use. Indeed, both the Kurmandjî and the Soranî groups include dialects with different phonological systems, different pronunciations for the same word, and even morphological differences.

        None of the writing systems used for the different Kurdish dialects is exact from a phonological point of view. This should not surprise us, as it is the case for almost all the writing systems of almost all languages.

        So MACKENZIE, for the phonetic transcription of Kurdish speech he published in 1962, used a transcription of his own [MACKENZIE, 1961, 1962]. Unfortunately, transcriptions for Kurdish are numerous. Apart from IPA, WAHBY and EDMONDS used in their Soranî dictionary a transcription perhaps inspired by Hawar, but which, although it does transcribe the - djim - letter with " c ", differs from Hawar in its use of the digraphs "sh" and "ch" to represent and letters, respectively. The use of those digraphs has been criticized, and it seems indeed preferable, in order to obtain a unified transliteration for both the Arabic and the Cyrillic writing systems for Kurdish, to try to extend the Hawar system to the sounds specific to central dialects. That is what is put forward (but not always put into practice) in Abstracta Iranica : ö to transcribe the sound written in Soranî (a notable difference with Wahby), and " l " or " " for , and then " r " or " " for the .

        For accented consonants in Kurmandjî, RIZGAR puts forward : ç k p r t , and, for letters with "Arabic pronunciation", he writes " e " for , " h " for and " x " for , whereas it is sometimes written " " or " ". It would certainly be possible to devise a system combining those notations to encode speech for which phonetic data is available.
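
        As a hedged illustration of the digraph question discussed in this section, the following Python sketch rewrites the "sh" and "ch" digraphs as the single Hawar letters " ş " and " ç ". The function name digraphs_to_hawar and the mapping table are assumptions made for this example only, not part of any existing tool. A real converter would also have to recognise genuine s + h or c + h sequences, which this sketch does not attempt.

```python
import re

# Hypothetical mapping for this sketch: digraph -> single Hawar letter.
DIGRAPHS = {"sh": "ş", "ch": "ç"}
_PATTERN = re.compile("sh|ch", re.IGNORECASE)


def digraphs_to_hawar(text):
    """Replace the digraphs 'sh' and 'ch' with single Hawar letters.

    If the first letter of the digraph is upper case, the result is
    upper case as well.
    """
    def replace(match):
        found = match.group(0)
        single = DIGRAPHS[found.lower()]
        return single.upper() if found[0].isupper() else single

    return _PATTERN.sub(replace, text)


# "wirch" is the spelling quoted above from the dictionary of H. HAKIM.
print(digraphs_to_hawar("wirch"))
```

        Keeping the mapping in a table makes it easy to extend the same approach to other notations, for instance the letters specific to central dialects, once their transliteration has been agreed upon.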

      2.4.8 Recapitulative table

        The following table attempts to summarize all the writing systems used for the different Kurdish dialects since the beginning of the modern period. It shows clearly that, because of the unceasing evolution of writing systems and the fact that dialectal limits do not match the present political borders, the situation is rather complicated.

3. Methodological proposals for realisation

    3.1. Introduction

      All the sections of this part of the project are only proposals, made to be submitted to criticism and above all to constructive alternatives.

      It was mentioned in Section 1 that the way a corpus is organised, and possibly marked, depends partly on the aims for which it is built. Section 2 clarified some of the available data about the Kurdish language. This section will try to draw conclusions from the two preceding ones, first in terms of the internal structure of a corpus of Kurdish language, which is the final object of the work. The methodology for building the corpus progressively will then be studied, with a specific emphasis on feasibility problems.

    3.2. Descriptors structuring the corpus

      3.2.1. For a "variable geometry" corpus

        It has been mentioned that the choice of texts to work on depends on the researchers' aims. As the corpus is built precisely to encourage research on the Kurdish language, it is difficult to forecast their fields in advance. The corpus should therefore not be limited to one type of choice, but should offer researchers the possibility to select for themselves, using clearly specified descriptors, the coherent subset of data on which they choose to base their (different) researches (for instance : all the Kurmandjî press papers published between 1900 and 1920). This section tries to put forward such descriptors.

        This is the notion of reusability, as put forward by Marie-Paule PÉRY-WOODLEY :

          " The multiplication of the fields of use, combined with the cost of collecting and preparing corpora, has led to the emergence of a new criterion : reusability. The question of reusability makes it possible to ask the question of representativity in another way. The notion of reusable resources implies "variable geometry" corpora, capable of adaptation to different methods and aims. From the available resources, researchers should be able to build - select, reorganise - corpora answering specific needs. " [Marie-Paule PÉRY-WOODLEY, p. 218, "Quels corpus ?", T.A.L., 1995, vol. 36, n° 1-2, pp. 213-232]

      3.2.2 Classification from dialect and subdialect

        The first descriptor is the dialectal class, based on the standards which gradually emerged from history : the group of northern dialects ("Kurmandjî"), the group of central dialects ("Soranî"), and the group of southern dialects (essentially Lorî).

        This classification could, if necessary, be refined through the use of subdialectal groups. Then, when the author tries to conform to an ideal standard ("Sulaymania model" for Soranî, "Bedir Khan" model for Kurmandjî), the subdialectal group or the area to which he belongs should be mentioned.

      3.2.3 Diachronic classification

        For each of the Kurdish dialects, the classification should of course specify the date of publication, a very important parameter. Users will thus be able to choose the period on which they want to work. The problems of how to build coherent subsets of the corpus will be mentioned in the next section, but as a working hypothesis, we put forward here several turning points which could be used to define diachronic subsets :

        • from the beginning of the XIXth century to 1920, approximate date of the end of the Ottoman empire,

        • from 1920 to 1970, date of the agreement through which the Kurds wrested some cultural rights from the Iraqi government,

        • from 1970 to now, a period which witnesses the development of Kurdish language publishing, through the work of Kurdish Academy in Iraq for one part, and in emigration for another.

        Clearly, this periodisation is to be used with care and its pertinence is itself a question to be answered, if the aim is to make it evolve towards a more pertinent distribution of the data. This question must certainly be studied as the building of the corpus progresses. In any case, the final corpus must offer any user the possibility to select texts himself according to his own criteria.
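
        Since the periodisation is only a working hypothesis, it can be kept as data rather than fixed in the structure of the corpus. The following Python sketch assigns a publication year to one of the periods listed above. The names period_of, TURNING_POINTS and LABELS are invented for this illustration, and the decision to place a turning-point year in the later period is an assumption which the proposal itself does not settle.

```python
import bisect

# Turning points taken from the list above; the labels are illustrative.
TURNING_POINTS = [1920, 1970]
LABELS = ["up to 1920", "1920-1970", "1970 to now"]


def period_of(year, turning_points=TURNING_POINTS, labels=LABELS):
    """Return the label of the period containing the given year.

    Assumption of this sketch: a year equal to a turning point is
    assigned to the later of the two periods it separates.
    """
    if len(labels) != len(turning_points) + 1:
        raise ValueError("need exactly one more label than turning points")
    return labels[bisect.bisect_right(turning_points, year)]


print(period_of(1913))
print(period_of(1975))

# Another periodisation can be tried without changing the function, for
# instance the 1932 adoption of the Hawar alphabet for Kurmandjî.
print(period_of(1950, [1932], ["before 1932", "from 1932"]))
```

        Because the turning points are ordinary parameters, each user remains free to select texts according to his own criteria.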

      3.2.4 Text typology

        Besides the classification by "literary genre", which is the most "classical", numerous textual typologies coexist, be they based on situation (situations of production...), sociology, function of text chunks (descriptive - narrative), linguistic parameters, discourse analysis, or others. The "Typology of documents" subgroup of the EAGLE research group has produced proposals on this problem which fit within the Text Encoding Initiative (TEI) framework.

        Lastly, D. BIBER even put forward a methodology in which a typology proper to each corpus is elaborated through a statistical treatment of several grammatical parameters of the texts it contains, which could in the future allow automatic text classification [BIBER, 1989] (though one may wonder whether a classification derived from a corpus which has since received new texts is still valid...).

        A provisional classification into eight categories is put forward here ; it will be used in Annex A of this document :

        1. Popular literature
        2. Novel
        3. Poetry
        4. Press
        5. Theatre
        6. Essays
        7. Dictionaries
        8. Transcription or recording of speech

        This mode of classification must be considered only as an immediate convenience, or better as a minimum of information about texts delivered to the corpus user. Further research will have to be conducted during the project, at the very least to choose the best way to make it more complete, at best to devise alternative classification(s) based on parameters drawn from the study of the texts themselves.

        For now, however, these categories will be the ones used in Annex A of the present document. It is also from this practical classification that the next section, "3.3. Development priorities", will begin to tackle the problem of the availability of Kurdish texts.

        It must be mentioned that these categories are not conceived as mutually exclusive : it should not be forbidden to put a text into several categories simultaneously, some papers of the political press being for instance also put in the "Essays" class. Lastly, specialized texts, technical or pedagogical, which could also constitute a category of their own, are not defined here, only because counting this type of text is difficult : they are usually not referenced in the West even if they reach it.
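
        The non-exclusive nature of the eight categories can be sketched in Python as a set of labels attached to each text, checked against the provisional list. The function name check_categories is hypothetical and the list is expected to change as the project refines its typology ; this is only one possible way to record the information.

```python
# The eight provisional categories listed in this section.
CATEGORIES = (
    "Popular literature",
    "Novel",
    "Poetry",
    "Press",
    "Theatre",
    "Essays",
    "Dictionaries",
    "Transcription or recording of speech",
)


def check_categories(assigned):
    """Return the assigned categories as a frozenset.

    A text may carry several categories at once; an empty assignment
    or a label outside the provisional list is rejected.
    """
    chosen = frozenset(assigned)
    if not chosen:
        raise ValueError("a text needs at least one category")
    unknown = chosen - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return chosen


# A paper of the political press, also put in the "Essays" class.
political_paper = check_categories(["Press", "Essays"])
print(sorted(political_paper))
```

        A set, rather than a single field, avoids forcing an arbitrary choice when a text belongs to two classes, and a new category (for instance for specialized texts) only requires adding one entry to the list.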

      3.2.5 Contextual information

        The research of D. BIBER and the already mentioned paper by M.-P. PÉRY-WOODLEY, as well as the recommendations of the Text Encoding Initiative, lead to the conclusion that, to achieve maximum reusability, each of the texts of the corpus will have to be documented very precisely. For instance, TEI mentions that the original price of a publication may give the researcher information about the intended (or reached) public, all of which are important socio-linguistic elements. As mentioned by M.-P. PÉRY-WOODLEY :

          "To be reusable in a rational manner in studies with different methods and aims, the different elements of a corpus must bear precise indications concerning their production ; they must be able to take their place, in a situated and describable manner, in a model of textual variation. Although this aim is much more modest than the building of a corpus representative of all the texts produced by a language-culture, it still requires a theorisation of variation able to produce a sort of matrix against which the chosen texts may be situated." (ital. G. G.) [PÉRY-WOODLEY, op. cit., p. 224].

      3.2.6 Conclusion

        So the following minimum descriptors appear :

        1. dialect and sub-dialect
        2. publication date
        3. type of text

        To these must be added contextual information, still to be specified, which will most probably be defined through a study of the EAGLE and TEI proposals.
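
        To make the idea of a "variable geometry" corpus concrete, the following Python sketch stores the three minimum descriptors for each text and selects a subset from them. The class TextRecord, the function select, the field names and the sample records are all invented for this illustration ; the contextual information, which is still to be specified, is represented here only by a free-form dictionary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextRecord:
    """Minimum descriptors for one text (hypothetical field names)."""
    ref: str                  # placeholder reference label
    dialect: str              # e.g. "Kurmandjî", "Soranî"
    year: int                 # publication date, reduced to a year
    categories: frozenset     # one or several types of text
    sub_dialect: str = ""     # optional refinement of the dialect
    context: dict = field(default_factory=dict, compare=False)


def select(records, dialect=None, category=None, year_from=None, year_to=None):
    """Return the records matching every descriptor that was given."""
    result = []
    for record in records:
        if dialect is not None and record.dialect != dialect:
            continue
        if category is not None and category not in record.categories:
            continue
        if year_from is not None and record.year < year_from:
            continue
        if year_to is not None and record.year > year_to:
            continue
        result.append(record)
    return result


# Invented sample records, for illustration only.
SAMPLE = [
    TextRecord("t1", "Kurmandjî", 1913, frozenset({"Press"})),
    TextRecord("t2", "Kurmandjî", 1932, frozenset({"Press"})),
    TextRecord("t3", "Soranî", 1915, frozenset({"Poetry"})),
    TextRecord("t4", "Kurmandjî", 1919, frozenset({"Press", "Essays"})),
]

# "All the Kurmandjî press papers published between 1900 and 1920".
subset = select(SAMPLE, dialect="Kurmandjî", category="Press",
                year_from=1900, year_to=1920)
print([record.ref for record in subset])
```

        Every criterion is optional, so the same function serves a researcher interested only in one period, only in one dialect, or only in one type of text.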

    3.3. Development priorities

      3.3.1 Choice of the first contents : a compromise between ideal and reality

        The preceding section presented what a general corpus of Kurdish language could be if means allowed. Unfortunately, in the current state of things they are rather limited, and choices must be made in terms of development priorities, specifically to take into account the texts that are really available.

        Concerning the size of the corpus, it has already been mentioned that one million words is commonly considered a minimum whatever the intended application. But this size must be considered in relation to the different categories of text making up the corpus, supposing each one could be used as a base for a specific study. If a researcher is only interested in one period, or only in newspapers, he should still find in the corpus enough texts of this category or period to work on.

        Concerning the type of texts to include, the experience of developing other corpora shows that, while researchers at first took care to respect specific proportions between the different types of texts they defined, in the name of corpus representativity, this notion now tends to be questioned on linguistic grounds : considering the great number of "sub-languages" corresponding for instance to specific socio-professional fields, true representativity could only be obtained by integrating all uses, which is obviously impossible.

        So it seems that we are now witnessing a move from representativity towards reusability, a concept we have already met. Moreover, researchers increasingly take into account the real availability of this or that type of text already in electronic form. For "great languages" such as English, this "reality principle" leads to bigger and bigger corpora : the British National Corpus has 100 million words, whereas corpora of the preceding generation were considered "big" with ten times fewer...

        For less widespread languages, this same principle led to a pragmatic questioning of the notion of representativity : for lack of means, corpora simply had to be built with whatever texts were at hand ! A Kurdish corpus will probably be subject to exactly this type of choice.

        Lastly, the political situation of the Kurds is in itself a parameter which must be taken into account in the choices for the corpus development. This situation generates an "enclosing phenomenon" for their language compared to other languages in the world : few studies exist on the Kurdish language, be it in linguistics or in bilingual lexicography. Moreover, we have already seen that the lack of unity of the Kurdish-populated areas resulted in the use of different writing systems, which reproduces the enclosing phenomenon on a smaller scale, between the different dialects.

        So it seems that, to fight against this double enclosing, it would be important to accomplish first :

        1. the electronic edition of bilingual texts (Kurdish dialect - another language)
          For reasons of competence, we would start with texts having an English or French translation, but it is clear that regional languages (Turkish, Arabic, Persian) should also be included in the longer term, which could serve as the basis for a call for collaboration.

        2. the electronic edition of Kurdish bidialectal texts, and in priority for the two "main" dialects, Kurmandjî - Soranî.
          For obvious reasons of means and workforce, we would first focus on the two present main dialects, Kurmandjî and Soranî, which have the most numerous published texts. Obviously, this is not a theoretical stand, only a pragmatic one, and any initiative concerning other dialects would be received with great pleasure. Here again, a call for collaboration should be envisaged, once technical and methodological questions have been decided upon.

        3. Simultaneously, it seems important to produce electronic editions of texts considered as "founding texts" of Kurdish culture. This may involve texts originally written in Goranî, but in conformity with the priorities just proposed, it would be better to work first on editions in modern dialects, with translations in French or English.

        Lastly, it will probably be easier at the beginning to work on Kurmandjî texts, purely for reasons of writing system (see Technical Document).

      3.3.2 Making known the Kurdish Corpus Initiative

        The present document, which is only a pre-project version, should be circulated to elicit remarks concerning the viability of the project. Once corrected in the light of those remarks, it should be translated into English, Kurmandjî and Soranî, and possibly into other (oriental) languages, to be made known more broadly. This task alone will require a considerable commitment of work.

        The Project itself will be elaborated after this second circulation. But it is very clear in the minds of its authors that it consists only of proposals for orientations, and has no ambition to be more ; these will still be open, at this stage, to modification and amendment. Indeed we hope our initiative will give rise to other initiatives, which we wish to be coordinated with ours, in a way still to be defined, solely for the best overall efficiency. But in any case, it is not our aim to control anyone else's work.

      3.3.3 Making a first census of texts that might be integrated into the corpus

        One of the first tasks needed to evaluate the orientations to give to the project is a first census of what is theoretically available as published Kurdish texts, and thus, whatever their legal status, liable to be incorporated into the corpus. The aim of this census is not to determine at this early stage which texts to put into the corpus, but rather to get a general picture of what is known and available.

        A first list, which already shows that these materials are in the end rather numerous, will be found in Annex A, section 5.3. This preparatory census will accompany the successive versions of the project, so as to gather proposals to supplement it and critical remarks concerning the choices and the criteria of choice used (defects of the chosen edition of a given text, supplementary information about other available translations, etc.).

      3.3.4 Proposing to everyone to contribute with their knowledge of the texts

        Simultaneously, the Annex contains a proposal for a standardized form which could help anyone to contribute the references of texts which he knows. This work is necessary to progress, but above all, we hope it will help us to think together about how the corpus could be built, and nurture a natural reflection about it. In its present state, this form itself must be submitted to criticism and will accompany the successive versions of the project.

        One important point, however, is the attribution of a unique reference number to each text ; its precise definition is still to be decided and, like numerous technical matters, will require the elaboration of technical documents. Lastly, a guide for reflection and / or conversation, a sort of oral version of this form, could be devised. It would help interested persons, be they researchers, Kurdish language speakers or both, to contribute directly or to discuss with others the texts which, according to them, should definitely be incorporated first into the corpus, and the criteria for selecting such texts.

      3.3.5 Beginning to pragmatically gather texts

        While beginning to gather data about what has been published (this is the census which has just been mentioned), it must be taken into account that the field is quite new. The form and the census should also be used simultaneously to gather as many texts as possible that are already in electronic form ("pragmatic" approach).

      3.3.6 The question of the marking of texts

        We saw that the choice of the type of marking depends on the aims for which the corpus is built. Should the latter have one or several aims ? In relation with what was put forward in the preceding section, it seems that, because of the ambition of the project, it would be better to limit the marking of the texts at first to a structural one, a choice which is moreover coherent with the approach of working with bilingual texts and on translation.

        Lastly, before really thinking of constituting a full corpus, it seems essential first to produce electronic versions of some short texts. This will make it possible to gauge the conceptual and technical difficulties in store for the producers, and give them a way to train themselves with the tools they will have to use. The Technical Document will give some hints about those tools.

        As the knowledge of the persons taking part in the project increases, as other persons volunteer to participate, and perhaps as specific requests concerning the use of the corpus emerge, it should be possible to proceed to a new type of marking, using other criteria, notably encyclopedic ones (see section 1.2.3.1).

    3.5. Chronology

      3.5.1 Work which will have to go on during the whole project

      3.5.2 July 97 to January 98

      3.5.3 January to June 98

    Simultaneously :

    4. Bibliography