SYMBOLIC DATA ANALYSIS:
A MATHEMATICAL FRAMEWORK AND TOOL FOR DATA MINING*.
Edwin DIDAY
University Paris 9 Dauphine, Ceremade. Pl. Du Mle de L. de Tassigny. 75016 Paris, FRANCE
Summary: knowledge extraction from large data bases is called "Data Mining". In Data Bases the descriptions of the units are more complex than the standard ones due to the fact that they can contain internal variation and be structured. Moreover, symbolic data happen from many sources in order to summarise huge sets of data. They need more complex data tables called "symbolic data tables" because a cell of such data table does not necessarily contain as usual, a single quantitative or categorical values. For instance, a cell can contain several values linked by a taxonomy . The need to extend standard data analysis methods (exploratory, clustering, factorial analysis, discrimination,...) to symbolic data table is increasing in order to get more accurate information and summarise extensive data sets contained in Data Bases. We define "Symbolic Data Analysis" (SDA) the extension of standard Data Analysis to such tables. "Symbolic objects" are defined, in order to describe in an explanatory way classes of such units. They constitute an explanatory output of a SDA and they can be used as queries of the Data Base. A symbolic object is "complete" if its "extent" covers exactly the class that it describes. The set of complete symbolic objects constitutes a Galois lattice. The SDA tools developed in the European Community project "SODAS" are finally mentioned.
Key words: Symbolic Data Analysis, Conceptual Analysis, Complex data, Data Mining,
Partitioning, Hierarchical and Pyramidal clustering, Lattices.
1) Introduction
In a census of a country, each individual of each region is described by a set of numerical or
categorical variables given in several relations of a Data Base. In order to study the regions, we can describe each of them summarising the values taken by the individuals which live in, by means, intervals, subsets of categorical values, histograms, probability distributions, depending on the concerned variable. In such a way we obtain a "symbolic data table" where each row defines the "description" of a region and each column is associated to an original variable. An extension of standard Data Analysis to such data table is one of the aim of what we have called "Symbolic Data Analysis".
Another important aim is to obtain (or "mine") explanatory results by extracted, which we have called "symbolic objects". The description of a region is called "intent", the set of
*IFCS'98, Springer Verlag.
individuals which satisfy this intent is called "extent". A "symbolic object" associated to a
region is defined by its intent and by a way of finding its extent. Its syntax has an explanatory power. For
instance, the symbolic object defined by the following expression:
a(w) = [age(w) [30, 35] ] [Number of children(w) 2]. It gives at the same time the intent of a class of individuals by the description d = ([30, 35], 2} and allows to calculate its extent. It means that an individual "w" satisfies this intent if his age is between 30 and 35 years old
and he has less than 2 children. If we have a set of such classes of individuals, their associated descriptions define a higher level symbolic data table. Using this new data table we can obtain classes (of classes) which can be associated to higher level symbolic objects. For instance, the following symbolic object b(w) = [age(w) [25, 40] ] [Number of children(w) 3] contains in its extent the preceding class of individuals considered as a higher level unit. In this paper, we show how Data Bases constitute a natural source of symbolic data tables. Thus symbolic objects are introduced in order to describe classes of units described by rows of such data tables. We give a formal definition of symbolic objects and some properties of them. Finally, some tools developed in the "Sodas" European Project are presented.
2)The input of a symbolic data analysis
"Symbolic data tables" constitute the main input of a Symbolic Data Analysis . They are defined in the following way. Columns of the input data table are “ variables ” which are used in order to describe a set of units called "individuals". Rows are called “ symbolic descriptions ” of these individuals because they are not as usual, only vectors of single quantitative or categorical values. Each cell of this “ symbolic data table ” contains data of different types:
(a) Single quantitative value : for instance, if “ height ” is a variable and w is an individual : height(w) = 3.5. (b) Single categorical value: for instance, Town(w) = London.
(c) Multivalued: for instance, in the quantitative case height(w) = {3.5, 2.1, 5} means that the height of w can be either 3.5 or 2.1 or 5. Notice that (a) and (b) are special cases of (c).
(d) Interval: for instance height(w) = [3, 5], which means that the height of w varies in the interval [3, 5].
(e) Multivalued with weights: for instance a histogram or a membership function (notice that (a) and (b) and (c) are special cases of (e) when the weights are equal to 1).
Variables can be: (g) Taxonomic: for instance, “ the colour is considered to be "light" if it is "yellow", "white" or "pink" . (h) Hierarchically dependent : for instance, we can describe the kind of computer of a company only if it has a computer, hence the variable “does the company has computers? “ and the variable “ kind of computer” are hierarchically linked.
(i) With logical dependencies, for instance: “ if age(w) is less than 2 months then height(w) is less than 10 ”.
3) Source of Symbolic Data:
Symbolic data are generated when we summarise huge sets of data from many sources. They result from expert knowledge (scenario of traffic accidents, type of unemployment, species of insects ...), from the probability distribution , the percentiles or the range of any random variable associated to each cell of a stochastic data table, from Data Analysis (factorial analysis, clustering, neural networks, ...) of standard data table in order to summarise and explain the results, from time series (in describing intervals of time), from confidential data (in order to hide the initial data by less accuracy), etc. They result also, from Relational Data Bases in summarising the answer to a query, or during the study of a set of units whose description needs the merging of several relations as is shown in the following example.
Example: We have two relations of a Relational Data Base defined as follows. The first one called "delivery" is given in table 1. It describes five types of deliveries characterised by the name of the supplier, its company and the town from where the supplying is coming.
Delivery

Supplier

Company

Town

Liv1

F1

CNET

Paris

Liv5

F3

EDF

Clamart

Table 1 Relation "Delivery"
The supplying are described by the relation "Supplying" defined in the following table 2.
Supplying

Supplier

Town

FT1

F1

Paris

FT5

F3

Clamart

Table 2: Relation "Supplying"
From these two relations we can deduce the following data table 3, which describes each supplier by his company, his supplying and his towns:
Supplier

Company

Supplying

Town

F1

CNET

FT1, FT3

½ Paris, ½ Lannion

F3

EDF

FT4, FT5

Clamart

Table 3: Relation "Supplier" obtained by merging the relations "Delivery" and "Supplying".
Hence, we can see that in order to study a set of suppliers described by the variables associated to the two relations we are naturally required to take account of the four following conditions which characterise symbolic data:
i) Multivalued: this happen when the variables "Supplying" and "Town" have several values as shown in the table 3.
ii) Multivalued with weights: this is the case for the towns of the supplier F1. The weights ½ means that the town of the supplier F1 is Paris or Lannion with a frequency equal to ½.
iii) Rules: some rules have to be given as input in addition to the data table 3. For instance, "if the town is Paris and the supplier is CNET, then the supplying is FT1.
iv) Taxonomy: by using regions we can replace for instance {Paris, Clamart} by " Parisian Region ".
4) Main output of Symbolic Data Analysis algorithms:
Let be a set of individuals, D a set of descriptions, “ y ” a mapping defined from into D which associates to each w a description d D from a given symbolic data table. We denote by R, a relation defined on D. It is defined by a subset of DD. If (x,y)DxD we say that x and y are connected by R and this is denoted by xRy . The characteristic mapping of R is
h_{R}: DD {0,1} such that h_{R}(x,y) = 1 iff xRy. We generalise the mapping h_{R} by the mapping H_{R}: DxD L and we denote [d' R d]= H_{R}(d,d') the result of the "comparison" of d and d' by H_{R}. We can have L ={true, false}, in this case [d' R d] = true means that there is a connection between d and d'. We can also have L= [0, 1] if d is more or less connected to d'. In this case, [d' R d] can be interpreted as the "true value" of xRy or " the degree to which d is in relation R with d' (see in Bandemer and Nather (1992), the section 5.2 on fuzzy relations).
For instance, R {=, , , } or is an implication, a kind of matching, etc. R can also use a set of such operators.
We say that a description d D is "coherent" if it is compatible with the conditions (h) and (i) (defined in section 2). For instance, a description which expresses the fact that "sex" is "male" and "number of deliveries" is 1, is not coherent. These descriptions constitute a set of objects to which any symbolic data analysis algorithm can be applied. That is why, we use the word “ object ” to denote any coherent description . A coherent description of an individual, is called “ individual object ”. It can be also the description of a class of individuals, and in this case it is an "intensional object". For instance, the coherent description of a scenario of accidents, of a class of failures, etc. is an intensional object. A “ symbolic object ” is defined both by an object and a way of comparing it to individual objects. More formally, its definition is:
Definition of a symbolic object
A symbolic object is a triple s = (a, R, d) where R is a relation between descriptions, d is a description and “a” is a mapping defined from in L depending on R and d.
Symbolic Data Analysis concerns usually classes of symbolic objects where R is fixed, "d" varies among a finite set of coherent descriptions and a(w) = H(y(w),d) where H is a mapping DxD L. For instance, a(w) = [ h(y(w)) R d] where h can be for instance, a filter (see an example thereunder). There are two kinds of symbolic objects:
 “ Boolean symbolic objects ” if [y(w) R d] L = {true, false}. In this case, if y(w) = (y_{1},...,y_{p}), the y_{i} are of type (a) to (d), defined in section 1.
Example:
y = (colour, height), d = ({red, blue, yellow}, [10,15] ), colour(u) = {red, yellow},
height(u) = {21}, R' = ( , ), a(u) = [colour(u) {red, blue, yellow}] [height(u) [10,15] ]= true false = true. If h(y(w)) = (colour(w), ), h is a filter as a(w) = [h(y(w)R d] = [colour(w) {red, blue, yellow}] [ [10,15]]= [colour(w) {red, blue, yellow}] and we get also a(u) = true.
 “ modal symbolic objects ” if [ y(w) R d] L = [0,1]. In this case, the y(w) are of type (e). Thereunder, we give an example of such symbolic object.
Syntax of symbolic objects in the case of "assertions":
If the initial data table contains p variables we denote y(w) = (y_{1}(w),..., y_{p} (w)), D = (D_{1},...,D_{p}), d D: d = (d_{1},..., d_{p}) and R' = (R_{1},...,R_{p}) where R_{i} is a relation defined on D_{i}. We call “ assertion ” a special case of a symbolic object defined by s = (a,R,d) where R is defined by [ d'_{ }R_{ }d_{ }] = _{ i =1, p } [ d'_{ i} R_{ i} d_{ i }] and "a" by: a(w) = [ y(w) R_{ } d]. Notice also that considering the expression a(w) = _{ i =1, p } [ y_{i }(w) R_{ i} d_{ i }] we are able to define the symbolic object s=(a,R,d). Hence, we can say that this explanatory expression defines a symbolic object called "assertion".
For example, a Boolean assertion where SPC means “ socioprofessionalcategory ” is:
a(w) = [age(w) {12, 20 ,28}] [SPC(w) {employee, worker}]. If the individual u is described in the original symbolic data table by age(u)={12, 20} and SPC_{ }(u) = {employee } then: a(u) = [{12, 20 } {12, 20 ,28}] [{employee}{employee, worker}]= true
If the variables are multivalued and weighted, an example is given by:
a(w) = [age(w) R_{1} {(0.2)12, (0.8) [20 ,28]}] [SPC(w) R_{2} {(0.4)employee, (0.6)worker}] where for instance the "matching" of two probability distributions is defined for two discrete probability distributions r and q of k values by: r R_{i} q = _{j=1,k} r_{ j} q_{ j} e ^{(r}_{ j}^{  min (r}_{ j}^{, q}_{ j}^{))} .
Extent of a symbolic object s: in the Boolean case, the extent of a symbolic object is denoted Ext(s) and defined by the extent of a, which is: Extent(a) = {w / a(w) = true}. . In the modal case, given a threshold , it is defined by Ext (s)= Extent (a)= {w / a(w) }.
Order between symbolic objects: if r is a given order on D, then the induced order on the set of symbolic objects denoted by r_{s }is defined by : s_{1} r_{s} s_{2 }iff d_{1 }r d_{2}.
If R is such that [d R d']= true implies d r d', then Ext(s1) Ext(s2) if s1 r_{s} s2 . If R is such that [d R d']= true implies d' r d then Ext(s2) Ext(s1) if s1 r_{s} s2.
Tools for symbolic objects: Tools between symbolic objects (Diday (1995)) are needed such as similarities (F. de Carvalho (1998), K.C. Gowda (1998)), matching, merging by generalisation where a tnorm or a tconorm (Schweizer, Sklar (1983) and Diday, Emilion (1995)) denoted T can be used, splitting by specialisation (Ciampi et al. (1995)). Under some assumption on the choice of R and T it can be shown that the underlying structure of a set of symbolic objects is a Galois lattice (Brito(1994), Polaillon, Diday (1997)), where the vertices are closed sets defined by “ complete symbolic objects ”. More precisely, the associated Galois correspondence is defined by two mappings F and G:
F: from P() (the power set of ) into S (the set of symbolic objects) such that F(C) = s where s = (a, R, d) is defined by d = T_{cC} y(c) and so a(w) = [y(w) R T_{cC} y(c)], for a given R. For example, if T_{cC} y(c) = _{cC} y(c) , R “ ”, y(u) = {pink, blue}, C = {c, c’}, y(c) = {pink, red}, y(c’) = {blue, red}, then
a(u) =[y(w) R T_{cC} y(c)] = [{pink, blue} {pink, red}{blue, red}})={pink, red, blue}] = true and u Ext (s).
G: from S in P() such that: G(s) = Ext (s).
A “ complete symbolic object ” s is such that F(G(s)) = s. Such objects can be selected from the Galois lattice but also, from a partitioning, a hierarchical or a pyramidal clustering, from the most influential individuals to a factorial axis, from a decision tree, etc.
We can observe four kinds of advantages in the use of symbolic objects. First, they give a summary of the original symbolic data table in an explanatory way, (i.e. close to the initial language of the user) by expressing descriptions based on properties concerning the initial variables. Second, they can be easily transformed in term of query of a Data base. Third, by being independent of the initial data table they are able to identify any matching individual described in any data table. Fourth, they are able to give a new symbolic data table of higher level on which a symbolic data analysis of second level can be applied.
5) What standard statistical methods are not able to do which symbolic data analysis can do?
Starting as input with a symbolic data table.
Using generalisation processes during the algorithmsGiving explanation of the results in a language close from the one of the user by symbolic objects as output.
 Giving graphical description taking account on the internal variation of the symbolic objects.
In the European Community Project SODAS (“ Symbolic Official Data Analysis System”) the following methods are developed by its members:
 Principal Component, Correspondence Analysis and Discriminate Factorial Analysis of a symbolic data table,
 extension of elementary descriptive statistic (central object, histograms, dispersion, codispersion, etc. from a symbolic data table) to symbolic data.
 mining symbolic objects from the answers to queries of a relational data base ,
 partitioning, hierarchical or pyramidal clustering of a set of individuals described by a symbolic data table such that each class be associated to a complete symbolic object.
 dissimilarities between boolean or probabilistic symbolic objects,
 extension of decision trees on probabilistic symbolic objects, extension of a Parzen discrimination method to classes of symbolic objects,
 generalisation by a disjunction of symbolic objects of a class of individuals described in a standard way.
 interactive and ergonomic graphical representation of symbolic objects.
Conclusion
Our general aim can be stated in the following way: mining symbolic objects in order to summarise huge data sets and then, analyse them by Symbolic Data Analysis. As the underlying lattice of symbolic objects often becomes too large, other methods which provide also symbolic objects have to be used. The need to extend standard data analysis methods (exploratory, clustering, factorial analysis, discrimination,...) to symbolic data tables is increasing due to the expansion of information technology. This need, has led to an European Community project called SODAS (Hebrail (1996)) for a “ Symbolic Official Data Analysis System” in which 15 institutions of 9 European countries are concerned. Three Official Statistical Institutions are involved in this project: EUSTAT (Spane), INE (Portugal) and ONS (England). An example of application proposed on their Census data consists in finding clusters of unemploymed people and their associated mined symbolic objects in a country, calculating its extent in the census of another country and describing this extent by new symbolic objects in order to compare the behaviour of the two countries.
References
Brito P. (1994) "Order structure of symbolic assertion objects". IEEE TR. on Knowledge and Data Engineering Vol.6, n° 5, October.
Bandemer H., Nather W. (1992) "Fuzzy Data Analysis". Kluwer Academic Publisher.
De Carvalho F.A.T. (1998) "New metrics for constrained boolean symbolic objects" Proc. KESDA'98, Eurostat. Luxembourg.
Ciampi A., Diday E., Lebbe J., Périnel E., Vigne (1995) R. " Recursive partition with probabilistically imprecise data". OSDA'95. Editors: Diday, Lechevallier, Opitz Springer Verlag (1996).
Diday E., Emilion R. (1995) "Lattices and Capacities in Analysis of Probabilist Objects". Proceed. of OSDA'95 (Ordinal and Symbolic Data Analysis). Springer Verlag Editor (1996).
Diday E. (1995) " Probabilist, possibilist and belief objects for knowledge analysis " .Annals of Operations Research . 55, 227276.
Gowda K.C., Diday E. (1992) "Symbolic clustering using a new similarity measure". IEEE Trans. Syst. Man and Cybernet. 22 (2), 368378.
Hebrail G. (1996) " SODAS (Symbolic Official Data Analysis System) ". Proceedings of IFCS’96, Kobe , Japan. Springer Verlag.
Pollaillon G., Diday E. (1997) " Galois lattices of symbolic objects " Rapport du Ceremade University Paris9 Dauphine (February).
Schweizer B. , Sklar A. (1983) " Probabilist metric spaces ". Elsever NorthHolland, NewYork. 