GAUTIER Gérard
METHY Daniel
March 1998
Pre-project v. 2

Preliminary reflections for the constitution of
a national corpus of Kurdish language

1. Potential utility of a text corpus for Kurdish language

1.1. General Preamble

Such an observation need not frighten the researcher, but it imposes on him rigor, ever-renewed vigilance, and full awareness of his responsibilities.

Building a corpus of the Kurdish language will therefore meet difficult problems. The most arduous are the collection of usable texts, which are scarce, scattered and difficult to access; the concurrent use of several writing systems; and the definition of meaningful linguistic boundaries.

The aim of the work we begin here is to build a dependable tool that will encourage linguists to do research on the Kurdish language, by putting at their disposal a very large set of data to which they cannot at present easily gain access.

This corpus, which will have to conform to the internationally recognized linguistic standards that apply to such projects, will also serve as a resource for researchers already working in the Kurdish field or more generally interested in the Middle East and the role Kurds play in that area. It will offer standardized access to a set of very diverse data, comprising linguistic, historical, political and anthropological elements, which cannot for now be accessed easily because there is no Kurdish national archive.

Concretely, this corpus will bring all these data together in computerised form on a single medium. It will use, as far as possible, an encoding common to all Kurdish dialects, while offering a wide choice at the presentation level (writing systems...), adaptable to each researcher's needs. This aspect is particularly important for the Kurdish language: because of the political division endured by the different areas of Kurdistan, sources (when they have fortunately been kept in good condition) are stored in different places and use different writing systems with which researchers, and sometimes Kurds themselves, are not always familiar.

Computerised publishing will offer high-speed access not available with conventional methods, as well as new possibilities for on-line research in fields such as Kurdish lexicography and terminology: queries across the whole corpus by word, place of publication, category of text, historical period, author, and even, depending on the type of tagging applied to the texts, topics belonging to specialised fields (anthropological, political, etc.).

Building such a corpus is very long-term work, which will go on for several years. It will begin with the "electronic publishing" of some "test texts", work on which will give a better picture of the concrete difficulties of the project. Then several texts, considered as "founding texts", will be encoded. As a researcher's companion to the texts, a set of software products specifically adapted for use with the corpus and with the Kurdish language in general, in its different writing systems, will be progressively built and distributed: on-screen presentation, transliteration, processing, configurable query software, and configurable building of secondary data sets (KWIC concordances, frequencies, topics...), which will increase the potential use of the corpus tenfold.

In this respect, distributing the Kurdish corpus through the standard channels for linguistic sources (international linguistic agencies...) will give Kurdish, in its different dialects, a place among the great research languages of the next millennium. Who would not rejoice at this prospect?

1.2. Introduction : corpora techniques

1.2.1 Basic necessities

1.2.1.1 Minimum size

Alain RAY [RAY, 1977] sets a lower limit of one million words, but corpora may be much bigger, and the present trend is towards larger sizes. For instance, the Cobuild dictionary [Cobuild, 1987] is based on a corpus ten times bigger than this figure.

1.2.1.2. Consistency and care in realization

Technical Document

1.2.1.3 Data representativity

If we adopt the lower limit of one million words just mentioned, which means about eight to ten books of medium size, and if the corpus gathers texts of different categories (news, novels, poetry...), while admitting that the definition of those categories itself has to be investigated as a research topic in its own right, it seems logical to decide that this limit should apply to each of them.

1.2.2 The different types of corpora

1.2.2.1 Synchronic or diachronic corpus

Cobuild

The diachronic corpus tries to follow the evolution of the language during the chosen period, which may well be several centuries long. A good example of this type of corpus is the ARTFL corpus of the French language, built through a collaboration between the French national research body C.N.R.S. and the University of Chicago. It runs from the XVIth century up to now.

1.2.2.2 Monolingual or multilingual corpus

parallel

aligned

align texts across the different Kurdish dialects

1.2.2.3 Tagged or untagged corpus

tag

Some corpora are built just by gathering texts, without any tagging. They are far from useless: having a large amount of raw text makes it possible, for instance, to study the frequencies of words and phrases in the language, to study collocations, or even to build concordances.

Concordances may be obtained through an automatic search across the whole corpus for all the occurrences of a given word. In a synchronic corpus they are of obvious interest to the lexicographer. But even in a diachronic corpus, which potentially presents a history of the studied language (for instance from the XVIIth century to now), a set of dated concordances may well allow the historian of ideas to trace the appearance and rise of a concept or of a new meaning. A common example is a study using the ARTFL corpus, which revealed that the use of the word "revolution" in its meaning of "political upheaval" increased significantly in French texts during the years preceding the French Revolution.
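The frequency counts and concordances described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tokenizer is deliberately naive, the sample sentence is invented, and the function names tokenize and kwic are hypothetical rather than part of any existing corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Split raw text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, used only to exercise the functions.
sample = "The revolution began. Before the revolution, few wrote of revolution."
tokens = tokenize(sample)
frequencies = Counter(tokens)                      # word frequencies
concordance = kwic(tokens, "revolution", width=2)  # keyword in context
```

In a dated corpus, each hit could additionally carry the publication year of its source text, which is what would allow the rise of a meaning to be traced over time.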

In the case of a multilingual corpus, it has already been mentioned that a common practice is to use tags to align the versions of the same text in different languages. Alignment may be accomplished by giving identical numbers to the corresponding sections (sentences, paragraphs). Even in a monolingual corpus, this structural tagging is commonly used to mark the divisions of a text (pages, chapters, stanzas, etc.).

However, the first information expected about a text is probably bibliographical: author, date of publication, publisher, type of text, commentary... There is a trend towards some standardization in this field [Text Encoding Initiative, 1996] [URL : Corpus Encoding Standard, 1997].

Linguistic tagging may also be applied to the words of the text, by part of speech (type of word). This is called part-of-speech or grammatical tagging (French étiquetage grammatical or assignation grammaticale [DE LOUPY, 1995]). Semantic tagging of the type long used in margins by anthropologists, to tag parts of the text by subject, may also be used. This proposal will be developed in the next section.

In conclusion, as mentioned by LANGE and GAUSSIER [LANGE et GAUSSIER, 1995], "Multilingual corpora also facilitate reflection on each of the languages involved."
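As a rough sketch of alignment through identical numbers, the two versions of a text can be held as mappings from a section identifier to the section itself, and paired wherever the identifier is shared. The identifiers ("s1", "s2"...), the placeholder segments and the function name align are all assumptions made for illustration.

```python
def align(version_a, version_b):
    """Pair the segments that carry the same identifier in both versions."""
    shared = sorted(set(version_a) & set(version_b))
    return [(seg_id, version_a[seg_id], version_b[seg_id]) for seg_id in shared]

# Placeholder segments; real entries would hold the sentences themselves.
kurmandji = {"s1": "<Kurmandjî sentence 1>", "s2": "<Kurmandjî sentence 2>"}
french = {
    "s1": "<French sentence 1>",
    "s2": "<French sentence 2>",
    "s3": "<French sentence 3>",  # no counterpart, so it is left unpaired
}
pairs = align(kurmandji, french)
```

Identifiers are compared as plain strings here, so a real numbering scheme would need zero-padded or numeric identifiers to keep sections in reading order.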

1.2.3 Some usages of corpora

1.2.3.1 Language studies

1.2.3.2 Building of linguistic dictionaries

The lexical dictionary, which is concerned with language itself, is typically built in two ways: introspection, and the search for citations, that is, typical sentences. Corpus work may replace the first and notably speed up the second of these methods, as the Cobuild example clearly shows. A corpus makes it possible to fine-tune, through collocation studies, the extent of the semantic field of words. Moreover, an aligned corpus, through simultaneous bilingual queries for collocations, makes it possible to specify the differences between the linguistic fields of words in the two languages. This leads to translation studies, which would be a great help to a multilingual lexicography of Kurdish.

1.2.3.3 Applications to Computer Assisted Translation

paper

computerised

The French review Tribune des Industries de la Langue et de l'Information Electronique (Forum for Language Industries and Electronic Information) wrote: "The border between Assisted Translation and translation aid tools is little by little becoming more blurred. We may bet with some certainty that multilingual corpora will play an important role in Assisted Translation in the coming years." [TLIE, 1995, p. 23].

And if data on and in the Kurdish language become available, it will be possible to re-train some of the existing computer tools for Kurdish. Section 1.2.3.5 will show that a corpus may allow the development of many "intermediate" tools that are the foundation of practical computer applications with a real bearing on the defense of the language.

1.2.3.4 Building of encyclopedic dictionaries

With the ultimate aim of building such tools, texts may be given semantic tagging, which could for instance refer any architectural term found to "house". Another example is the rich vocabulary of the Kurdish language for clothes, which furthermore vary from one area to another. Suppose a researcher is studying Kurdish clothes. He may use the corpus to find all references to the names of clothes he knows, but names he is not aware of (or does not think of at the moment) will be missed. Now, if the texts of the corpus have been tagged to refer any clothing name to the term "clothing", then a computerized query by topic becomes possible.

The technique of annotating texts in the margin according to a set of standardized criteria, which later evolved into computer-aided annotation, has long been used by anthropologists. But this type of marking is complex to use: how precisely must the marked passage be located? To the word, to the sentence, to the paragraph? Furthermore, the same section may well concern several topics (for instance "cooking" + "taboos" + Bahdinan area), which is difficult to mark with what now seems to be the dominant practice in corpus tagging systems.

Moreover, such tagging needs to be able to express topics by keywords or standardized codes, which in turn means linking the query function to a thesaurus tree in which words are distributed into nested subsets (hyponymy/hypernymy). So a query on the word "djel û berg" (clothes) would also retrieve, if the researcher requests it, the word "djemedenî".

We must also note that an encyclopedic work, or a specialized one, will necessarily require a study of interdialectal or regional lexical differences (see Technical Document, section 1.2 - "Data normalisation"), as the research of S. MOHSENI [MOHSENI, to be published] shows in the case of clothes.

Hence the question of encyclopedic tagging can be seen to be very complex, and it cannot be solved within the limits of the present document.

However, the reflection on this problem conducted here yields at least one positive result: the work on how to build a Kurdish corpus, which will need ancillary software to be accessed, may in itself generate other research on the computer processing of the Kurdish language in general. This point will be treated more precisely in the Technical Document, section 1.4, "Possible uses of data", but in the next section I will first mention some possibilities which, in my opinion, open new room for the defense of the Kurdish language.
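A query linked to a thesaurus tree might look like the following sketch. Only the pair "djel û berg" / "djemedenî" comes from the text; the thesaurus structure, the section identifiers, the tags attached to them and the function names expand and query are hypothetical, and the thesaurus is assumed to be a true tree (no cycles).

```python
# Hypothetical thesaurus: each term maps to its direct hyponyms (narrower terms).
THESAURUS = {
    "djel û berg": ["djemedenî"],  # pair taken from the text
    "djemedenî": [],
}

def expand(term, thesaurus):
    """Return the term followed by every narrower term beneath it."""
    result = [term]
    for narrower in thesaurus.get(term, []):
        result.extend(expand(narrower, thesaurus))
    return result

def query(tagged_sections, term, thesaurus, use_thesaurus=True):
    """Return the ids of sections tagged with the term or, on request, any narrower term."""
    wanted = set(expand(term, thesaurus)) if use_thesaurus else {term}
    return [sec_id for sec_id, tags in tagged_sections.items() if wanted & set(tags)]

# Invented tags; note that one section may carry several topics at once.
sections = {
    "p1": ["djemedenî", "Bahdinan"],
    "p2": ["cooking", "taboos"],
    "p3": ["djel û berg"],
}
with_narrower = query(sections, "djel û berg", THESAURUS)
exact_only = query(sections, "djel û berg", THESAURUS, use_thesaurus=False)
```

Storing the tags of a section as a list is one simple way to let a single passage belong to several topics, which is the difficulty raised above.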

1.2.3.5 Uses that may contribute to the defense of Kurdish language

lemmatised lexicon

lemmas

Such a lemmatised lexicon may be of great use for very practical software applications contributing to the safeguarding of the language. The most obvious, and the easiest to build, is automatic spelling correction for Kurdish in commercial word processing software. Then such a lexicon, fed to Optical Character Recognition software, will increase its recognition rate and start a cumulative process for the existence of electronic texts in Kurdish (it will become easier to scan and recognize Kurdish texts). This cumulative effect will extend to research itself, which will become more attractive.

Such a lexicon could be distributed quite independently of the commercial software for which it would act as a plug-in. Last, it could be used in conjunction with a computerized grammar of the Kurdish language to start research on Computer Assisted Translation and, more immediately, for the semi-automatic tagging of corpus texts.

This last point needs explanation. Suppose for instance that the word derom has been linked in the lexicon to the first person singular of the present indicative of the verb royishtin; whatever means have been used for that, by hand or automatically through grammar rules covering verb inflection, does not concern us here. The consequence is that each time this form is matched during a search through the text, it may be tagged in the proper manner by consulting the lexicon. Having a lexicon of inflected forms also makes it possible to search simultaneously for all the forms of a given word, as a query in the lexicon will retrieve all the forms linked to a particular form (i.e. royishtin from derom as well as derom from royishtin).
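The two lookups just described (from a form to its lemma, and from a lemma to its forms) can be sketched with a pair of dictionaries. Only the derom / royishtin entry comes from the text; the wording of the grammatical tag and the names FORM_TO_LEMMA, LEMMA_TO_FORMS, tag_tokens and all_forms are assumptions made for illustration.

```python
from collections import defaultdict

# Each entry: inflected form -> (lemma, grammatical description).
# Only this entry comes from the text; the tag wording is an assumption.
FORM_TO_LEMMA = {
    "derom": ("royishtin", "verb, present indicative, 1st person singular"),
}

# Reverse index, so that a lemma retrieves all of its known forms.
LEMMA_TO_FORMS = defaultdict(list)
for form, (lemma, _tag) in FORM_TO_LEMMA.items():
    LEMMA_TO_FORMS[lemma].append(form)

def tag_tokens(tokens):
    """Attach (lemma, tag) to each token found in the lexicon, else None."""
    return [(tok, FORM_TO_LEMMA.get(tok)) for tok in tokens]

def all_forms(word):
    """Given a lemma or any of its forms, return the lemma plus every linked form."""
    lemma = FORM_TO_LEMMA.get(word, (word, None))[0]
    return [lemma] + LEMMA_TO_FORMS.get(lemma, [])
```

Tokens left with None would be the ones handed to a human annotator, which is what makes the tagging semi-automatic rather than automatic.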

2. Context : Kurdish language and expression

2.1. General introduction

from a synchronic point of view, the division of this language between different dialects and writing systems (without any exact matching between those two parameters),
from a diachronic point of view, the appearance at different periods of different standards for literary written expression, and a modification (sometimes radical) of the writing systems in use, and their constant evolution.

Technical Document

Preliminary Reflexions

This section will give a minimal description of the different dialects and writing systems in use now or in the past. Besides the three big groups of dialects (northern, central, southern), it will also mention a Kurdish "dialect" which was long the literary standard, Goranî, now disappeared as a living language.

In the same manner, besides the three big groups of writing systems (Roman, or Hawar, modified Arabic-Persian, modified Cyrillic), some writing systems now out of use will be presented, specifically the "traditional" Kurdish writing system, close to the Ottoman notation of Turkish. For each of the present writing systems, a brief outline of its evolution will be given where necessary.

Finally, as this presentation is made from the point of view of building a corpus of texts, minimal elements, diachronic as well as synchronic, will be given concerning popular or literary expression in each of the dialects concerned.

2.2. The different dialects

2.2.1 Kurmandjî

Depending on the area, those dialects are written in a Roman writing system (the majority), but also in Cyrillic and Arabic-Persian scripts (see below).

MACKENZIE [MACKENZIE, 1961 (1990)] distinguishes in this "Kurmandjî" group the following subdialects: Sûrčî, Akre, Amadiye, Barwârî-Žor, Gullî, Zakho and Sheikhan.

The publication of the grammar of Djeladet BEDIR KHAN and Roger LESCOT [BEDIR KHAN and LESCOT, 1970] marks a very important date in the theoretical definition of a northern Kurdish language, based on the standardized written language used in the review Hawar.

2.2.2 Soranî

MACKENZIE [MACKENZIE, 1961 (1990)] distinguishes in this group the following subdialects: Suleimaniye, Wârmâwa, Bingird, Piždar, Mukrî, Arbil, Rewandiz and Xôšnâw.

In Sulaymania, capital city of the Baban, the Ottoman Empire had created a secondary school (Rushdiye), whose graduates could go to Istanbul to continue their studies. This allowed Soranî, which was spoken in Sulaymania, to progressively replace Goranî as the literary vehicle. MACKENZIE writes that the present Kurdish standard called Soranî is in fact an idealized version of the Sulaymania dialect, which uses the phonemic system of the Piždar and Mukrî dialects.

2.2.3 Southern Kurdish dialects

speak

In any case, we lack information on the exact present situation of those dialects, and would be very happy to know more before considering including samples of them in the corpus, especially about the extent to which there is publishing in them...

2.2.4 Dimîlî (Zaza)

A sizeable emigration to Germany and Sweden gave birth to several publications. There is some difficulty with Dimîlî because of this apparent divorce between linguistics stricto sensu and the political and cultural fields. Its place in a corpus of the Kurdish language must therefore be discussed, and work on Dimîlî should in any case be carried out as an independent project whose links with the building of the corpus are yet to be decided precisely.

2.2.5 Goranî, the first literary standard

Indeed, linguists place it in the north-western Iranian group of languages, together with Bâjalânî of Mossul and Kandulai in the Zagros areas. But while they distinguish it very clearly from Kurdish itself, it is spoken in an indisputably Kurdish setting, at all levels. Moreover, it has been used as a literary language, especially for poetry, in vast areas of central Kurdistan. It was notably the literary language of the Hawraman principality, and is still spoken in this area (though the Iraqi government's policy there has almost emptied Hawraman, posing a real threat to the survival of Hawramî).

It is then easy to see the difficult problems that this very specific situation poses for the building of a Kurdish corpus. Those problems are of the same order as the ones already mentioned for Dimîlî.

2.2.6 Lorî

Josif M. ORANSKIJ [ORANSKIJ] writes about them:

"Along with Persian and Tadjik (with their own subdialects), the south-western subgroup also includes the Tatî, Bakhtiarî and Lorî dialects, and the vast majority of local languages spoken in Fârs."

Beyond superficial affinities with southern Kurdish dialects, especially Faylî, it is thus clearly established that Lorî must be considered separately, from a linguistic as well as a cultural point of view. As for the specific problems of corpus building, the rather recent character of written publications is also an argument for this position.

2.3. Expression in Kurdish language

2.3.1 Introduction

some minimum data concerning Kurdish culture and its evolution. This is not the place for a history of Kurdish literature, but rather for defining which types of texts came into being and in which period;
a census of all the available texts, especially texts existing in bilingual form, without for now considering copyright problems which, while they risk making distribution difficult, should not hamper collection.

2.3.2 Kurdish writing during the classical period

Mem u Zin

Mame Alan

One of the first texts we have concerning the Kurds is the Sharaf Nameh, a description of Kurdish tribes and princes from the XVth century. It is a Persian text. Many classical Kurdish writers used Arabic, Turkish or Persian, even when they wrote about Kurds. As has already been mentioned, when a literary standard distinct from those languages and proper to the Kurds emerged, it was Goranî, a linguistically non-Kurdish language, which became the culturally Kurdish literary standard.

Last, traditional Kurdish texts consist essentially of poetry (the prose novel arose only during the modern period, and even then very late).

2.3.2 The legal status of Kurdish language in modern times

In Turkey, as early as 1909, the Union and Progress Committees banned Kurdish clubs. Twenty years later, Atatürk forbade the use of the Kurdish language and ordered the destruction of every Kurdish text the authorities could lay their hands on. In the Soviet Kurdish areas, there have indeed been many publications in Kurmandjî from the 1920s to now (reviews, novels, poetry), but unfortunately they are largely unknown in the West.

If it has very recently seemed more possible than before to publish in Kurmandjî in Turkey, it is at the cost of the ever-present physical danger, a danger of death, to which authors, journalists, sellers and owners are exposed. So it is not surprising that it is in the emigration, and specifically in the press, that this dialect is most alive.

If the status of literary standard still seems to belong to its southern neighbour, Soranî, the situation could well evolve quickly.

In Iraq, after 1926, when cultural autonomy was granted to the Kurds within the Iraqi framework and Kurdish schools were allowed in the Kurdish area, several Kurdish newspapers began to be published. But after 1930 the autonomy was suppressed, and there followed the revolt of Shaikh Mahmud, the revolt of the Barzan brothers (1943-45) and their crushing. After the collapse of the Kurdish nationalist movement in 1975, at the very time when the Kurdish language was officially recognized, the Kurdish Academy in Baghdad was allowed to publish numerous studies (see Annex A). In Kurdistan itself, however, given the scorched earth policy practiced by the central government and operation Anfal, which resulted in nearly 200 000 deaths, we may guess that the real situation of the language was perhaps not what first met the eye.

However, from the point of view of written expression, this very date of 1975 still marks a very important moment of transition, which has no equivalent in the history of Kurmandjî. It could be considered as a possible transition between a synchronic and a diachronic part of our corpus (for Kurmandjî, perhaps the date of the adoption of the Hawar alphabet, in 1932, could be considered such a transition between two parts of a period running from the middle of the XIXth century to the present, but this choice remains mainly conventional).

In Iran, even if publishing in Kurdish has at times been allowed, it was never a language of instruction. During the short-lived republic of Mahabad, in 1945, a Kurdish press came to life, specifically for women and children, but it did not outlive the single year the republic itself lasted. Since the Islamic revolution, and more recently, some cultural reviews are again being published, using the standard writing system put forward by the Kurdish Academy in Baghdad (see section 2.4. "The different writing systems"), for northern as well as central dialects. There are also publications in Lorî.

2.3.3 Kurdish modern expression

Considered from the point of view of the different types of texts, poetry keeps, at the beginning of the modern period, the preeminence inherited from the classical period. The poetic vocabulary evolved a great deal under the influence of the nationalist movement. For instance, up to a rather late period (the sixties in Soranî), it contained many Arabic words (the literati often being Arabic-trained mullahs), before undergoing a relative "kurdisation".

So written literary expression in modern Kurdish is rather unbalanced, with very few works in prose and a great deal of poetry. It was therefore press publications that played a very important role in creating a modern written Kurdish language, in whatever dialect. First created in the thirties, the press still plays a very big role in the contemporary life of this language, notably through a real explosion in the number of newspapers published in emigration during the 1980s and 1990s. So press texts and articles should obviously be present in any endeavour to create a Kurdish corpus.

From the point of view of language standardization, regional characteristics are more marked in Kurmandjî than in Soranî, the latter having witnessed in the seventies the creation by the Kurdish Academy of a standard "Ideal Soranî", to use the expression put forward by MACKENZIE.

Any corpus work will require deeper studies of this mechanism of linguistic standardization and the problems it poses, and of how corpus research might help to work on them. The work of Amir HASSANPOUR would be of great use here.

Last, there are also some academic publications, and technical texts. Those would constitute a very important part of a Kurdish corpus.

The future evolution seems to lead the Kurdish language towards the establishment of a bidialectal standard, Soranî and Kurmandjî, even if this fact does not seem to be easily recognized even by the most politicized Kurds...

2.4. The different writing systems

2.4.1 The "Hawar" alphabet

Hawar

Djeladet Bedir Khan devised this notation to be rather close to the alphabet adopted by the Turkish government to replace the Ottoman letters in literacy work when those were forbidden in 1928. As it does not allow a precise notation of the pronunciations of speakers coming from different areas of the Kurmandjî zone, this writing is not totally phonetic, but it is precisely a good compromise which allows a writing common to all those different speakers. For instance, Hawar cannot note the difference between the aspirated and accented consonants " ç k p r t " [RIZGAR, 1993], a difference which is clearly written in the Cyrillic notation through the use of a diacritic (a following quote mark).

There are several versions of the Hawar alphabet [ÇELIKER, 1996]: one uses the opposition between the short (mute) vowel written " ı " and a long one written " i ", while another uses " i " and " î " respectively to express the same opposition.
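Converting between the two versions of the alphabet is a simple character-for-character substitution, provided both replacements are made in one pass so that a converted letter is not converted a second time. The sketch below assumes that the short vowel of the first version is written with the dotless "ı" (the glyph did not survive in this copy of the text), handles lowercase letters only, and uses the hypothetical names a_to_b and b_to_a; the sample strings are used purely as character sequences.

```python
# Version A: short vowel "ı" (assumed dotless i), long vowel "i".
# Version B: short vowel "i", long vowel "î".
# str.translate maps every character in a single pass, so "ı" -> "i"
# is never re-converted into "î".
A_TO_B = str.maketrans("ıi", "iî")
B_TO_A = str.maketrans("iî", "ıi")

def a_to_b(text):
    """Rewrite lowercase version-A vowels in version-B notation."""
    return text.translate(A_TO_B)

def b_to_a(text):
    """Rewrite lowercase version-B vowels in version-A notation."""
    return text.translate(B_TO_A)

sample_a = "dıl jin"
sample_b = a_to_b(sample_a)
round_trip = b_to_a(sample_b)
```

Capital letters are deliberately left out, since their treatment depends on conventions this document does not specify.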

The technical problems stemming from this "lack of phoneticity" in Hawar writing will be mentioned again in section 1.2 of the Technical Document, concerned with the "Problems of data normalisation", notably in connection with transliteration to the Cyrillic writing which is still used today by the majority of the Kurds in the ex-U.S.S.R.

2.4.2 The notation using Armenian alphabet

2.4.3 The different Roman writing systems of Kurmandjî

Hawar

2.4.4 The Cyrillic alphabet

The Kurdish texts published using this alphabet are for the most part written in Kurmandjî dialect.

2.4.5 The modern Arabic-Persian alphabet

In Iraq, contrary to Turkey, the authorities always fought with the utmost energy any attempt to romanize Kurdish alphabets. In 1928 the first grammar written by a Kurd, Sa'îd Sidqî Kâbân, was published in Baghdad. In it, the author put forward a proposal for a better adaptation of the Arabic alphabet to Kurdish, mainly by adding one diacritic. The present writing is in part descended from those proposals.

So this writing system is identical neither to the defective alphabet used by the Ottomans throughout their history nor to the present Persian alphabet. In a way perhaps inspired precisely by Ottoman Turkish [DENY, 1921], it gives a new value to the isolated and final forms of the letter heh to express the sound written in Kurmandjî as " e " ("a ouvert"), in phonological opposition to alif. This writing also uses a diacritic (in the shape of a little "v", in Kurdish hewt by reference to the figure) to distinguish two vowels which exist only in Kurdish, waw hewt and yeh hewt, as well as two consonants characteristic of the central group of dialects, lam hewt and ra hewt.

This writing is defective only for the sound written in Kurmandjî " ı " (mute i, written as a dotless i).

Kurdish speakers of northern dialects living in Iran and Iraq also use this writing system.

This notation has evolved during the last thirty years, notably for the long vowels, and . They were first written as and . Then this writing was abandoned in favour of and . Last, the opposition wu / wi itself lost part of its significance in spoken Soranî, as H. HAKIM mentions in the introduction to his dictionary, where he chose to write wirch - - and not wurch .

The notation of the vibrant ra has also been subject to variation : at the present moment it is mostly omitted at initial, as this letter is always vibrant there, it has been written as well as as , depending on the text, this last notation being globally much less used than the other .

2.4.6 The traditional writing of Kurdish

At the time when Kurdistan was divided between the Ottoman and Persian empires, Kurds, whatever dialect they spoke, used this modified Arabic alphabet, as Persians and Ottomans used it themselves.

In Kurmandjî, the Kurdish-French dictionary of Auguste JABA (1879), the texts published by Basile NIKITINE at the beginning of this century, and the texts of Mahmud BAYAZIDI, as republished in 1963 in the U.S.S.R. by M. RUDENKO, all use this notation. Soranî also used this writing; an Abstracta Iranica notice even mentions, in order to criticize it, a present-day use of it in Iran (Ibrhimpur, M.T., Dastur-e zaban-e kordi-ye sanandai, 1979, notice 124, A.I. III)!

The most notable characteristic of this writing is a relative defectiveness in the notation of vowels. In his introduction to JABA's dictionary, Ferdinand JUSTI notes orthographic variations in the texts used to build the dictionary, but also incorrect orthographic practices...

This first introduction will need to be completed after a thorough study of the writing practices in the BAYAZIDI texts as well as in those presented by NIKITINE, if these are to be integrated into the corpus.

2.4.7. Phonetic transcription of linguistic data

None of the writing systems used for the different Kurdish dialects is exact from a phonological point of view. This should not surprise us, as it is the case for almost all the writing systems of almost all languages.

So MACKENZIE, for the phonetic transcription of Kurdish speech he published in 1962, used a transcription of his own [MACKENZIE, 1961, 1962]. Unfortunately, transcriptions for Kurdish are numerous. Apart from IPA, WAHBY and EDMONDS used in their Soranî dictionary a transcription perhaps inspired by Hawar but which, while it does transcribe the - djim - letter with " c ", differs from Hawar in its use of the digraphs "sh" and "ch" to represent and letters, respectively. The use of those digraphs has been criticized, and it does seem preferable, with the aim of obtaining a unified transliteration for both the Arabic and the Cyrillic writing systems of Kurdish, to try to extend the Hawar system to the sounds specific to central dialects. That is what is put forward (but not always put into practice) in Abstracta Iranica : ö to transcribe the sound written in Soranî (a notable difference with Wahby), and " l " or " " for , and then " r " or " " for the .

For accented consonants in Kurmandjî, RIZGAR puts forward : ç k p r t , and, for letters with "Arabic pronunciation", he writes " e " for , " h " for and " x " for , whereas it is sometimes written " " or " ". It would certainly be possible to devise a system combining those notations to encode speech for which phonetic data is available.

2.4.8 Recapitulative table

3. Methodological proposals for realisation

3.1. Introduction

It was mentioned in Section 1 that the way a corpus is organised and possibly tagged depends partly on the aims for which it is built. Section 2 clarified some of the available data about the Kurdish language. This section will try to draw conclusions from the two preceding ones, first in terms of the internal structure of a corpus of the Kurdish language, which is the final object of the work. It will then study the methodology to use to build the corpus progressively, with specific emphasis on feasibility problems.

3.2. Descriptors structuring the corpus

3.2.1. For a "variable geometry" corpus

It is the notion of reusability, as put forward by Marie-Paule PERY-WOODLEY :

T.A.L.

3.2.2 Classification from dialect and subdialect

dialectal

This classification could, if necessary, be refined through the use of subdialectal groups. When the author tries to conform to an ideal standard ("Sulaymania model" for Soranî, "Bedir Khan" model for Kurmandjî), the subdialectal group or the area to which he belongs should then be mentioned.

3.2.3 Diachronic classification

from the beginning of the XIXth century to 1920, the approximate date of the end of the Ottoman empire,
from 1920 to 1970, the date of the treaty through which the Kurds wrested some cultural rights from the Iraqi government,
from 1970 to now, a period which has witnessed the development of Kurdish language publishing, through the work of the Kurdish Academy in Iraq for one part, and in emigration for another.

Clearly, this periodisation is to be used with care, and its pertinence is itself an open question if the aim is to make it evolve towards a more pertinent distribution of the data. This question must certainly be studied as the building of the corpus progresses. In any case, the final corpus must offer any user the possibility of selecting texts himself, according to his own criteria.
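The periodisation above, and the requirement that users may substitute their own, can be illustrated by a short Python sketch. The names, the labels, the starting year 1800 and the choice to assign a boundary year (1920, 1970) to the later period are all assumptions made for this illustration; none of them is fixed by the project.

```python
from typing import List, Optional, Tuple

# (label, first year included, first year excluded or None for "until now")
Period = Tuple[str, int, Optional[int]]

# Provisional periods from the list above; labels are hypothetical.
DEFAULT_PERIODS: List[Period] = [
    ("1800-1920", 1800, 1920),
    ("1920-1970", 1920, 1970),
    ("1970-now", 1970, None),
]


def period_of(year: int, periods: List[Period] = DEFAULT_PERIODS) -> Optional[str]:
    """Return the label of the period containing `year`, or None if none does."""
    for label, start, end in periods:
        if year >= start and (end is None or year < end):
            return label
    return None


print(period_of(1932))  # 1920-1970

# A user may replace the provisional periodisation with his own criteria.
own_periods = [("before 1950", 1800, 1950), ("1950 onwards", 1950, None)]
print(period_of(1932, own_periods))  # before 1950
```

Keeping the periods as data rather than as fixed code is one way to let the periodisation evolve while the corpus is being built.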

3.2.4 Texts typology

Last, D. BIBER even put forward a methodology in which a typology proper to each corpus is elaborated through a statistical treatment of several grammatical parameters of the texts it contains, which could in the future allow automatic text classification [BIBER, 1989] (though one may wonder whether a classification derived from a corpus which has since received new texts is still valid...).

A provisional classification into eight categories is put forward here; it will be used in Annex A of this document :

Popular literature
Novel
Poetry
Press
Theatre
Essays
Dictionaries
Transcription or recording of speech

This mode of classification must be considered as adopted only for immediate convenience or, better, as a minimum of information about the texts delivered to the corpus user. Further research will have to be conducted during the project, at the very least to choose the best way to make it more complete, and at best to devise alternative classification(s) based on parameters coming from the study of the texts themselves.

For now, however, these categories will be the ones used in Annex A of the present document. It is also from this practical classification that the section "3.3. Development priorities" will begin to tackle the problem of the availability of Kurdish texts.

It must be mentioned that those categories are not conceived as mutually exclusive : it should not be forbidden to put a text into several categories simultaneously, some papers of the political press, for instance, being put in the "Essays" class as well. Last, specialised texts, technical or pedagogical, which could also constitute a category of their own, are not defined here, only because taking a count of this type of text is difficult - they are usually not referenced in the West even if they reach there.

3.2.5 Contextual information

Text Encoding Initiative

Although this aim is much more modest than the building of a corpus representative of all the texts produced by a language-culture, it still requires a theorisation of variation able to produce a sort of matrix against which the chosen texts may be situated.

3.2.6 Conclusion

dialect and sub-dialect
publication date
type of text
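The three descriptors listed above could be attached to each text as a small record, which users then filter according to their own criteria. The following Python sketch is only an illustration under stated assumptions : the class name, the field names and the sample records are invented, dates are reduced to a year, and the identifiers are placeholders, not the project's reference numbers.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class TextRecord:
    """Minimal description of one corpus text (field names are hypothetical)."""
    ref: str                    # placeholder identifier
    dialect: str                # e.g. "Kurmandjî" or "Soranî"
    year: int                   # publication date, reduced here to a year
    categories: FrozenSet[str]  # a text may belong to several categories
    subdialect: Optional[str] = None


def select(records: Iterable[TextRecord],
           criterion: Callable[[TextRecord], bool]) -> List[TextRecord]:
    """Return the records satisfying a criterion chosen by the user."""
    return [r for r in records if criterion(r)]


# Invented placeholder records, for illustration only.
records = [
    TextRecord("t1", "Kurmandjî", 1932, frozenset({"Press", "Essays"})),
    TextRecord("t2", "Soranî", 1975, frozenset({"Poetry"})),
    TextRecord("t3", "Soranî", 1985, frozenset({"Press"})),
]

sorani_press = select(records, lambda r: r.dialect == "Soranî" and "Press" in r.categories)
print([r.ref for r in sorani_press])  # ['t3']
```

Storing the categories as a set reflects the fact that they are not exclusive of each other, and passing the criterion as a function leaves the choice of sub-corpus entirely to the user.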

3.3. Development priorities

3.3.1 Choice of the first contents : a compromise between ideal and reality

Concerning the size of the corpus, it has already been mentioned that one million words is commonly considered a minimum whatever the application sought. But this size must be considered in relation to the different categories of text making up the corpus, supposing each one could be used as a base for a specific study. If a researcher is interested in only one period, or only in newspapers, he should still find in the corpus enough texts of this category or period to work with.
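A rough way to monitor this point while texts accumulate is to count words per category and list the categories that remain below a chosen minimum. The Python sketch below assumes that words can be counted by splitting on white space, which is a simplification, and the function names and the minimum passed as a parameter are hypothetical; the only figure given in this document is the overall minimum of one million words.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def words_per_category(texts: Iterable[Tuple[Set[str], str]]) -> Counter:
    """Count words per category; a text counts once in each of its categories."""
    counts: Counter = Counter()
    for categories, content in texts:
        n_words = len(content.split())  # naive white-space tokenisation
        for category in categories:
            counts[category] += n_words
    return counts


def undersized(counts: Counter, minimum: int) -> List[str]:
    """Return, in alphabetical order, the categories below `minimum` words."""
    return sorted(c for c, n in counts.items() if n < minimum)


# Placeholder contents, for illustration only.
sample = [({"Press", "Essays"}, "a b c"), ({"Poetry"}, "d e")]
counts = words_per_category(sample)
print(undersized(counts, minimum=3))  # ['Poetry']
```

The same count could be made per dialect or per period, since each of these may serve as the base of a specific study.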

Concerning the type of texts to include, the experience gained in developing other corpora shows that, although researchers at the beginning took care to respect specific proportions between the different types of texts they defined, in the name of corpus representativity, this notion now tends to be questioned on linguistic grounds : considering the great number of "sub-languages" corresponding for instance to specific socio-professional fields, a true representativity would only be obtained through the integration of all uses, which is obviously impossible.

So it seems that we are now witnessing a move from representativity towards reusability - a concept we already met. And last, researchers increasingly take into account the real availability of a given type of text already in electronic form. For the "great languages" such as English, this "reality principle" leads to the building of bigger and bigger corpora : the British National Corpus contains 100 million words, whereas the corpora of the preceding generation were considered "big" with ten times fewer...

For less widely used languages, this same principle led to a pragmatic questioning of the notion of representativity : for lack of means, corpora simply had to be built with whatever texts were at hand ! It is probable that a Kurdish corpus will be subject to exactly this type of choice.

And last, the political situation of the Kurds is in itself a parameter which must be taken into account in the choices made for the development of the corpus. This situation generates an "enclosing phenomenon" for their language compared to other languages in the world : few studies exist on the Kurdish language, be it in linguistics or in bilingual lexicography. Moreover, we already saw that the lack of unity of Kurdish populated areas resulted in the use of different writing systems, which reproduces the enclosing phenomenon on a smaller scale, between the different dialects.

So it seems that, to fight against this double enclosing, it would be important to accomplish first :

the electronic edition of bilingual texts (Kurdish dialect - another language)
For reasons of competence, we would start with texts having an English or French translation, but it is clear that regional languages (Turkish, Arabic, Persian) should also be included in the longer term, which could be used as the basis for a call for collaboration.
the electronic edition of Kurdish bidialectal texts, with priority given to the two "main" dialects, Kurmandjî - Soranî.
For obvious reasons of means and workforce, we would first focus on the two present main dialects, Kurmandjî and Soranî, which have the most numerous published texts. Obviously, this is not a theoretical stand, only a pragmatic one, and any initiative concerning other dialects would be received with great pleasure. Here again, a call for collaboration should be envisaged, once technical and methodological questions have been decided upon.
Simultaneously, it seems important to produce electronic editions of texts considered as "founding texts" of Kurdish culture. This may involve texts originally written in Goranî but, in conformity with the priorities just proposed, it would be better to work first on editions in modern dialects, with translations in French or English.

Technical Document

3.3.2 Making known the Kurdish Corpus Initiative

pre-project version

The Project itself will be elaborated after this second diffusion. But it is very clear in the minds of its authors that it consists only of proposals for orientations, and has no ambition to be anything else; at this stage these proposals will still be open to modification and amendment. Indeed we hope that our initiative will give rise to other initiatives, which we wish to see coordinated with ours, in a way still to be defined, only for the best overall efficiency. But in any case, it is not our aim to control anyone else's work.

3.3.3 Make a first census of texts susceptible to be integrated into the corpus

published Kurdish texts

A first list, which already shows that those materials are in the end rather numerous, will be found in Annex A, section 5.3. This preparatory census will be circulated together with the successive versions of the project so as to gather proposals to supplement it and critical remarks concerning the choices made and the criteria used (defects of the chosen edition of a given text, supplementary information about other available translations, etc.).

3.3.4 Proposing to everyone to contribute with their knowledge of the texts

Annex

However, one important point is the attribution of a unique reference number to each text; its precise definition is still to be decided and, like numerous technical matters, will require the elaboration of technical documents. Last, a guide for reflection and / or conversation, a sort of oral version of this form, could be devised. It would help interested persons - be they researchers, Kurdish language speakers or both - to contribute directly or to discuss with others the texts which, according to them, should definitely be incorporated first into the corpus, and the criteria for selecting such texts.

3.3.5 Beginning to pragmatically gather texts

3.3.6 The question of the marking of texts

aims

structural

And last, before really thinking of constituting a full corpus, it seems essential to first produce electronic versions of some short texts. This will make it possible to take the measure of the conceptual and technical difficulties which are in store for the producers, and will give them a way to train themselves with the tools they will have to use. The Technical Document will give some hints about those tools.

As the knowledge of the persons taking part in the project increases, as other persons volunteer to participate, and perhaps as specific requests concerning the use of the corpus emerge, it should be possible to proceed to a new type of marking, using other criteria, notably encyclopedic ones (see section 1.2.3.1).

3.5. Chronology

3.5.1 Work which will have to go on during the whole project

pragmatic accumulation of texts
get documentation about the way other researchers work in the corpus field :
- how corpora are structured
- syntactic and morphologic analysis
- building of lemma lexicons
- Part-of-Speech tagging
- processing techniques
- available software

3.5.2. July 97 to January 98

writing of a first text, a draft version of the project, for "internal diffusion"
producing a first census of available texts
beginning to address the question of the mode of constitution of the corpus
beginning to work on copyright problems, drawing on the experience of other researchers (types of agreements made, etc.)

3.5.3 : January to June 98

producing a pre-project text which will be circulated more openly for criticism
taking into account the reactions and preparing the project itself
translating the final version and publishing it on the net

Simultaneously :

beginning self-training in the text gathering and marking techniques, through :
- technological survey
- internal seminars
- producing technical notes
- experimental electronic acquisition and encoding of some texts
preparing an interview guide to be used with Kurdish speakers for collecting data about what they consider as interesting texts to include in the corpus
working on the ratio between the different types of texts, and on the dialectal and diachronic questions
working in more depth on the problems of obtaining texts (copyright, understandings with publishing houses, newspaper owners, etc.) :
- making first contacts with copyright owners (especially newspapers)
- making a study of the possibility of public help and sponsoring

4. Bibliography

BEDIR KHAN, Djeladet and LESCOT, Roger, Grammaire Kurde (Dialecte Kurmandji) (Kurdish grammar), Librairie d'Amérique et d'Orient, Paris, 1970.

BIBER, D., "A typology of English texts", Linguistics, no 27, 3-43, 1989.

BLAU, Joyce, "Introduction", Les Kurdes et le Kurdistan, bibliographie critique 1977-1986 (Kurds and Kurdistan, a bibliography with commentaries), 1989.

ÇELIKER, Celadet, Çend Pirsên Alfabeya Kurdi, Weşanên Roja Nû (Some questions about Kurdish alphabet), Stockholm, 1996.

Cobuild, Collins Cobuild English Language Dictionary, Collins, London, 1987.

DE LOUPY, Claude, "La méthode d'étiquetage d'Eric BRILL" (Eric BRILL's tagging method), T.A.L., vol. 36, no 1-2, pp. 37-46, 1995

DENY J., Grammaire de la langue turque (Grammar of Turkish language), P.U.F., Paris, 1921 (Wiesbaden, 1971).

DZIEGEL, Leszek, "Villages et petites villes kurdes dans l'Irak actuel" (Kurdish villages and small cities in present-day Iraq), Studia Kurdica, no 1-5, 1988, pp. 127-156.

LANGE and GAUSSIER, "Alignement de corpus multilingues" (Multilingual corpus alignment), T.A.L., vol. 36, no 1-2, pp. 67-80, 1995.

MACKENZIE, D. N., Kurdish Dialect Studies, School of Oriental and African Studies, London Oriental Studies, vol. 9 & 10, S.O.A.S., 1961 (1990) & 1962 (1990).

MOKRI, M., "Le foyer Kurde" (Kurdish hearth and home), Ethnographie, Paris, 1961, pp. 79-95, reprinted in Contribution Scientifique aux Etudes Iraniennes (A scientific contribution to Iranian studies), Klincksieck, Paris, 1970.

PERY-WOODLEY, Marie-Paule, "Quels corpus pour quels traitements automatiques ?" (Which corpora for which automatic processing), T.A.L., vol. 36, no 1-2, pp. 213-232, 1995.

REY, Alain, Le lexique : images et modèles, du dictionnaire à la lexicologie (Lexicon : images and models, from dictionary to lexicology), Armand Colin, 1977.

RIZGAR, Baram, Kurdish-English English-Kurdish Dictionary, London, 1993.

Text Encoding Initiative, TEI P3, Guidelines for Electronic Text Encoding and Interchange, C. M. Sperberg-McQueen, Lou Burnard, eds., Chicago, Oxford, 1996.

URL : Tribune des Industries de la Langue et de l'Information Electronique, (Forum of language industries and electronic information) no 17-18-19, février-août, 1995.

URL : BALL, Cathy, Concordances and Corpora, Personal www page of Cathy BALL, e-mail : <cball@guvax.georgetown.edu>, 1997.

URL : Corpus Encoding Standard <http://www.cs.vassar.edu/CES/>, 1997.

Annex A : First census of available texts
