- Clustering by symbolic relations
Given a list of terms, TermWatch first identifies variation relations between
these terms and builds a graph of term variants. It then clusters the
graph of term variants in two stages. First, it builds connected
components with a subset of user-specified relations. We will call this
set "COMP" relations. Next, it
clusters the connected components using the second subset of relations
called "CLAS" relations. It is
up to the user to specify the role each relation will play (COMP or
CLAS) depending on the linguistic significance of each relation and on
the target application.
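The two-stage process can be pictured with a short Python sketch. It is an illustration under simplifying assumptions, not TermWatch's actual implementation: the function names are hypothetical, the edges are hand-picked from the bone marrow example given further down, and the second stage simply joins any two components connected by a CLAS link, ignoring weights, thresholds and iterations.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def two_stage(terms, comp_edges, clas_edges):
    """Stage 1: components from COMP edges. Stage 2 (simplified): join
    components that share at least one CLAS edge."""
    components = connected_components(terms, comp_edges)
    comp_of = {t: i for i, comp in enumerate(components) for t in comp}
    links = {(comp_of[a], comp_of[b]) for a, b in clas_edges
             if comp_of[a] != comp_of[b]}
    groups = connected_components(range(len(components)), links)
    clusters = [set().union(*(components[i] for i in g)) for g in groups]
    return components, clusters


terms = [
    "CD11b+ bone marrow cell", "immature bone marrow cell",
    "bone marrow", "adult bone marrow",
    "bone marrow transplantation", "autologous bone marrow transplantation",
]
# Hand-picked COMP edges (modifier relations) and CLAS edges (head relations).
comp_edges = [
    ("CD11b+ bone marrow cell", "immature bone marrow cell"),
    ("bone marrow", "adult bone marrow"),
    ("bone marrow transplantation", "autologous bone marrow transplantation"),
]
clas_edges = [("bone marrow", "bone marrow transplantation")]

components, clusters = two_stage(terms, comp_edges, clas_edges)
print(len(components), "components,", len(clusters), "clusters")
```

Swapping a relation from the COMP list to the CLAS list (or the reverse) changes which terms are grouped early and which are only associated later, which is the choice described above.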
Typically, if you are seeking to obtain
semantically coherent clusters for building knowledge representation
resources (ontologies, thesauri, taxonomies and the like), then you
should use semantically tight relations that suggest
synonymy, hyponymy, etc.
If on the
other hand you are seeking to identify the topics dealt with in the
corpus and how they are connected, you may not require that clusters
are semantically-tight. You can then "relax" the type of variations in
order to obtain
associated topics. This is typically the case if you are using
TermWatch to perform science and technology
watch, information retrieval (query expansion) or question-answering.
In any case, you may also
wish to add co-occurrence information and combine your own relations
according to your needs.
Example of a cluster formed by symbolic relations
A component is formed at the first stage of
clustering by grouping together terms sharing some semantically tight
relations. Below is an example of a cluster formed by four components.
Terms within a component share modifier relations: "CD11b+ bone marrow cell" is a
modifier substitution of "immature bone marrow cell". Components are linked by head variation
relations following an edge differentiation coefficient, e.g., "bone marrow transplantation" is a
head expansion of "bone marrow".
Cluster
Comp1:
CD11b+
bone marrow cell; immature bone marrow cell; mouse bone marrow cell;
normal bone marrow cell; normal bone marrow myeloid cell; normal CD34+
bone marrow cell; transgenic bone marrow cell; murine bone marrow
cell; primary murine bone marrow cell.
Comp2:
bone marrow transplantation;
autologous bone marrow transplantation
Comp3:
bone marrow; adult bone marrow;
normal bone marrow
Comp4:
bone marrow derived macrophage;
murine bone marrow derived macrophage
What this cluster suggests is that research work around bone
marrow deals with the following topics (the added or substituted head
words): transplantation, cell,
macrophage, whereas the modifier relations suggest the different
"types" of bone marrow being studied (CD11b+, immature, mouse, transgenic,
murine, autologous, normal, adult, etc.). If available, explicit
semantic relations can be added.
The other types of symbolic relations available in TermWatch are listed
in the "Check available variations" section.
- Clustering by statistical information
This needs little introduction: it is the age-old co-occurrence relation commonly used for
clustering.
In the current implementation, instead of counting raw
occurrences, we compute sentence-level frequencies within a limited text
window. Thus we are interested in the presence or absence of a term in a
sentence, regardless of the number of times it appears in that
particular sentence. The strength of association of
two terms is given by an equivalence coefficient formulated thus:
$E_{ij} = \frac{f_{ij}^2}{f_i \times f_j}$
where $E_{ij}$ indicates the strength of the association between term i and
term j;
$f_{ij}$ = number of sentences in which terms i and j both appear (current text window)
$f_i$ = number of sentences in which term i appears
$f_j$ = number of sentences in which term j appears
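As a hedged illustration of the coefficient defined above, the following Python sketch computes it for every pair of terms that co-occur in at least one sentence. The function name and the naive substring test used to detect a term in a sentence are assumptions made for the example; TermWatch's own matching and text window handling are not described here.

```python
from collections import Counter
from itertools import combinations


def equivalence_coefficients(sentences, terms):
    """Return {(term_i, term_j): E_ij} with E_ij = f_ij^2 / (f_i * f_j).

    Frequencies are sentence counts: a term is counted once per sentence,
    however many times it occurs in that sentence.
    """
    term_count = Counter()   # f_i
    pair_count = Counter()   # f_ij
    for sentence in sentences:
        # Naive presence test (assumption): plain substring matching.
        present = sorted(t for t in terms if t in sentence)
        term_count.update(present)
        pair_count.update(combinations(present, 2))
    return {
        (a, b): n ** 2 / (term_count[a] * term_count[b])
        for (a, b), n in pair_count.items()
    }


sentences = [
    "bone marrow cell and macrophage",
    "bone marrow cell",
    "macrophage activation",
]
scores = equivalence_coefficients(sentences, ["bone marrow cell", "macrophage"])
print(scores)
```

Here each term appears in two sentences and the pair in one, so the coefficient is 1 / (2 x 2) = 0.25.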
After computing the association
scores, two levels are distinguished: strong associations (> 0.5)
and weak associations (>= 0.05 and < 0.5). To perform clustering,
the user can assign strong associations to role 1
(to form connected components) and weak associations to role
2 (to gather connected components into clusters). You can also set
a threshold to consider only association links above that threshold.
If you only wish to use
co-occurrence for clustering, you can disable the symbolic
relations (variations) by setting their role to 3 (see Clustering parameters for more
detailed instructions).
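The mapping from association score to role can be sketched as follows. This is only an illustration: the function name is hypothetical, and because the text leaves a score of exactly 0.5 unassigned, the sketch treats it as a weak association.

```python
def association_role(score, strong=0.5, weak=0.05):
    """Map an association score to a clustering role.

    1 = strong association (forms connected components)
    2 = weak association (gathers components into clusters)
    3 = ignored
    Assumption: a score of exactly `strong` is treated as weak.
    """
    if score > strong:
        return 1
    if score >= weak:
        return 2
    return 3


for score in (0.8, 0.5, 0.05, 0.01):
    print(score, "-> role", association_role(score))
```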
N.B.
The text window used and its size can easily be modified via the MySQL
interface.
For instance, a user may wish to search for co-occurrences in the whole
text, in a paragraph, in a fixed-size window...
TermWatch is currently
available as an online server application. To use it, you'll
need an account on the
server. For
this, contact TW-webmaster.
Once your account has been created:
Log in to the system at the following address : https://stid-bdd.iut.univ-metz.fr/TermWatch/
Type in your login and password.
The typical stages to
perform text data analysis with TermWatch are:
- Upload your corpus
into the system
- Launch the term
extraction module
- upload
your list of terms into the database
- perform
variation and clustering
- export
results to MySQL database
- navigate
among the results
We implemented term extraction
using LTchunker and some handcrafted term
extraction rules. LTchunker is external to the system and
is distributed freely for research purposes by the LP Group, University
of Edinburgh. We have been able to optimize TermWatch's internal
programs (variation identification and clustering) but not the term
extraction phase, so the process can take time (from 1 hour upwards for
a corpus of 455 000 words). If you are in a hurry (how could you be?),
we suggest you extract your terms with any term extractor available to
you prior to using TermWatch.
You can also
implement your own extraction using any existing tagger and
pattern-matching
rules. If your terms are already extracted, go straight to the "Uploading your terms" section
and follow from there. If, however, you want to extract terms via
TermWatch, here's how:
- upload
your corpus into the
system following the steps indicated in that section
- Select Run local term candidate extractor in
TermWatch phase menu
- Read all the warnings carefully before proceeding.
The new candidates will be automatically inserted in the terms_tbl
table.
The process could take more than one hour for a large corpus of abstracts.
If this is the case for your corpus, you should split corpus_tbl into
several smaller tables using SQL commands via the phpmyadmin interface.
If you want to be able to
navigate in the source texts after clustering, you can also upload
your corpus into the database. Before filling out the import form,
check the
structure of your corpus: identify fields and field separators, text
separators, special characters you want to keep or remove,
etc. A headache here is making sure your corpus has regular field
separators, so that fields are recognized correctly during import.
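One way to check the regularity of field separators before import is to count the fields in every record and list the records that deviate from the most common count. The sketch below assumes tab-separated fields and one record per line; both separators and the function name are illustrative choices, not TermWatch requirements.

```python
from collections import Counter


def irregular_records(text, field_sep="\t", record_sep="\n"):
    """Return (record_number, record) pairs whose field count differs
    from the most common field count in the corpus."""
    records = [r for r in text.split(record_sep) if r]
    if not records:
        return []
    counts = Counter(r.count(field_sep) + 1 for r in records)
    expected = counts.most_common(1)[0][0]
    return [
        (number, record)
        for number, record in enumerate(records, start=1)
        if record.count(field_sep) + 1 != expected
    ]


sample = "Title 1\tAbstract 1\nTitle 2\tAbstract 2\textra\nTitle 3\tAbstract 3"
problems = irregular_records(sample)
print(problems)
```

Records reported this way are the ones most likely to be split into the wrong columns during import.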
To
upload your corpus:
- Click on "corpus_tbl"
in the left pane (max. upload size: 2 MB). Split your corpus
into several files if it is bigger than this.
- If your corpus is
structured with fields (author, title, year, abstract,...), you can
decide on the fields to maintain. The
two mandatory fields are "Title"
and "abstract".
- To discard the
other
fields, tick the corresponding boxes in front of their names and click
on the drop icon "X"
just under the table.
- Go to the bottom of
the page and click the hypertext "Insert data from a textfile
into table".
- Specify your field
separator in "Fields terminated by"
- Specify your text
separator in "Lines terminated by"
- In "Fields escaped by",
specify a
control character to handle the special characters used in the other
fields.
- Hit the "Submit" button.
- This imports your
corpus into the MySQL database.
- Select phase (2) "Re-load
corpus of terms"
from the TermWatch phase,
- Hit GO!. The
system tells you it could not load the table of terms in the
database.
- Click on the
"database" hypertext in this
message. This opens the
PhPMyAdmin interface
- On the left pane,
select "terms_tbl"
table.
- If your term list is
bigger than the maximum size set (2 MB), you have to split it into several
files and upload them successively into the
same table (terms_tbl or corpus_tbl depending on what you
are uploading).
- Click on the menu "Insert data
from a textfile into table".
This link is found at the bottom of the current window. It is
displayed in the language chosen for your browser.
- In the first
row of the
Input list of TermWatch table: use browse button to
select the location of the textfile
- Leave the default options
in the cells "Replace table data with
file" and "fields terminated by".
- In the row "Fields
enclosed by" , tick
the box "optionally"
- In the "LOAD
method" row, leave "DATA LOCAL"
and press "Submit". The
system will first try this method to upload terms, if this fails, an
error message will be displayed. In this case, try the "DATA" option
for uploading your terms.
- The system tags the list of terms using WordNet lemmas. This
can take a few seconds or
a minute depending on the size of your term list.
- Once you have uploaded your list of terms in the MySQL database,
return to the TermWatch window
- From the "TermWatch
phase" pane, select (2) "Re-load
corpus of terms". As
TermWatch keeps the last loaded corpus in memory, this step is called
"Re-load". It is advisable to
do "Re-load corpus of terms"
in
between sessions to make sure
you are working on your latest list of terms.
- Hit the GO!
button.
- TermWatch loads the list of terms in the MySQL database and
prints the
message "List of terms successfully
loaded from the table terms_tbl in your database using the same login
information."
- If you wish
to view the list of terms, click on
the hypertext "database".
Normally, this opens the list of terms in a MySQL PhpMyAdmin interface.
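Both the corpus and the term list are subject to the upload size limit mentioned above, so larger files have to be split first. The following sketch splits a list of records into chunks that each stay under a byte budget, cutting only at record boundaries. The function name is hypothetical, and reading the limit as 2 x 1024 x 1024 bytes is an assumption.

```python
def split_records(records, max_bytes=2 * 1024 * 1024, encoding="utf-8"):
    """Split records into consecutive chunks whose encoded size, counting
    one newline per record, does not exceed max_bytes.

    A single record larger than max_bytes is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for record in records:
        record_size = len(record.encode(encoding)) + 1  # +1 for the newline
        if current and size + record_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        chunks.append(current)
    return chunks


# Tiny budget so the effect is visible: each record costs 10 bytes.
chunks = split_records(["a" * 9] * 5, max_bytes=25)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be written to its own file and uploaded successively into the same table.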
From here, you can run TermWatch in either default mode or interactive mode.
Default mode is advisable if you are not familiar
with work on the terminological variation relations used by TermWatch for
clustering and their significance, or if this is your
first time using the system.
Before you run the
system on your own corpus, you have to upload your list of terms into
the MySQL database used by TermWatch (see help on uploading
terms into TermWatch).
Default parameters
- The "default
parameters" set for TermWatch were chosen from empirical evidence in
the case where
TermWatch is being used for science and technology watch. Thus
the resulting clusters may not be semantically constrained, i.e., the
terms in the
same cluster can be from different semantic categories, provided they
share some lexical elements in common.
- In the
default mode, the relations considered are the following: modifier
expansions
(left-expansion, insertion) and modifier
substitution on terms of length >=3; head expansions and head
substitution on terms of length >=3. The different variation
relations available in TermWatch are explained in the "available
variations" section.
- Default
threshold is set to "0" and the number of Iterations = 2.
- Default weight is set to "1"
for the
following relations: Exp-l, Ins, M_sub_3
- Default
weight is set to "2"
for Exp-2,
Exp-r, sub_head_3.
Explanations on the
significance of the weights are given in Weight.
- Hit GO! button to run TermWatch.
N.B. In the default mode,
one-word terms like "analysis,
activity, system, blood" and binary substitution
relations like "fine
bran, coarse bran, rice bran, wheat bran, defatted bran" will be
ignored. If you
need to consider them, then you should run TermWatch with your own
parameters.
In this mode, the user can select:
- the precise TermWatch module s/he wants to run,
- the relations used for clustering,
- the role each one plays,
- their weights
- the number of iterations of the clustering algorithm.
Step 3: Variation extraction and clustering
- Return to the TermWatch Interface.
- From the "TermWatch phase"
pane, choose (step
3) "Process variation extraction and
clustering"
- Hit
the GO!
This opens another window
with two major zones: Programs and
parameters, statistics on previous execution. Before running
step (3),
the "Programs
and Parameter"
pane has to be set.
Programs and parameters
Select the program
This pull-down menu shows
the available TermWatch modules. At the initial stage, the "check
available variation" pane is
empty. You have
to run the variation relations successively in order for the "available
variations" to appear in the right
pane. The modules available in this menu are:
- Overall: this executes the whole
process
of variation and clustering in default mode.
- step1_var_exp searches for
expansion, insertions and spelling variants
- step1_var_sub searches for lexical substitution variants
- step1_var_sub_wn searches for
WordNet substitution variants (semantic variants)
- step2_cpcl computes the connected
components and the clusters.
- var_reset reinitializes the
variation search and empties the right hand pane.
Important
You have to run each program separately by selecting its name and
hitting the GO! button.
The step1 programs do not need
the clustering parameters to be set.
So leave this zone untouched if you are only interested in running the
variations.
Execution
takes... a few
seconds (normally ;-)).
Once a variation program
is executed, the available
variation pane is
filled up with the corresponding variation types. Repeat this process
for all the programs you want to run.
If you run the
clustering program (step2_cpcl),
then you MUST also set the clustering
parameters. This is explained below.
Clustering parameters
Important
Threshold
Threshold is the minimal
weight
at which relations between connected components can be considered for
clustering.
- a threshold of "0" means all links are considered.
- a threshold
of
0.01 means links of that level and above are considered, etc.
You can try several threshold values experimentally.
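The effect of the threshold can be sketched as a simple filter over weighted links between components. The data structure (a dictionary keyed by component pairs) and the function name are assumptions made for illustration.

```python
def links_above_threshold(links, threshold=0.0):
    """Keep only the links whose weight is at or above the threshold.

    With the default threshold of 0, every link is kept.
    """
    return {pair: weight for pair, weight in links.items() if weight >= threshold}


links = {("Comp1", "Comp2"): 0.004, ("Comp1", "Comp3"): 0.01, ("Comp2", "Comp4"): 0.2}
print(links_above_threshold(links))          # all three links
print(links_above_threshold(links, 0.01))    # links of weight 0.01 and above
```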
Iterations
Specify here the number of
times the
clustering algorithm iterates the second level of clustering (i.e.
grouping connected components into clusters). Since TermWatch is
based on a hierarchical algorithm, the more iterations there are, the bigger
the clusters become: clusters are merged in subsequent
iterations
or new components are integrated.
Different values work for
different corpora. You can only determine your ideal value
experimentally. Let's just say that we have often got optimal results
at the 2nd iteration...
Once the program,
the threshold and iterations are set, hit the GO! button below.
Important
For clusters to be computed,
you must run step2_cpcl. However, if you wish to
obtain only paradigmatic classes of terms (groups of terms with the
same head word), run the step1_var programs only.
Check available variations
in
TermWatch
There
are many ways in which these relations can be presented. For reasons of
efficiency, we will present them according to the two stages of
clustering in the TermWatch algorithm. For a
linguistically-motivated presentation, see Fidelia Ibekwe-SanJuan's homepage.
Weight
This determines how the
clustering algorithm should "weigh" the total number of each variation
type in the graph. Weight
takes two values, either "1" or "2".
Weight=1 means the total number of
relations is taken
"as is" during clustering, i.e, if there are 500 insertions in the
graph, then this number is used as is in computing the strength of the
links between components.
Weight=2 means the system will take
the inverse of the
total number of that particular type of relations. If
your graph has 200 sub_head_3 relations, then only half of this number
will be used in the index for weighting variation links between two
components. The idea is to handicap very prolific
relations like lexical modifier or head substitutions, which may "drown"
the rarer
variation types, thus making the information they carry invisible in
the final
clusters.
In essence, you should assign a weight of
"1" to relations
that are of prime importance to your target application and assign a
weight of "2" to more subsidiary relations.
Default weights and roles have
been set. You could first try them and evaluate the results.
Role
Relations used for clustering can be assigned
a role of 1, 2 or 3.
1 = relation is used at the first
level of clustering to form connected components (COMP)
2 = relation is used for clustering
connected components to obtain the topic clusters
3 = the relation is ignored
Role "1" (COMP) relations:
these are typically the relations to which you want to
assign a prime role. They form groups of related terms at the first
level, i.e. connected components in formal terms. In different
experiments, we have chosen the following relations as COMP:
Ins
(Insertion) : wheat germ effects
/ wheat germ enrichment effects
Exp-l
(Left-expansion) : flour
fractionation / wheat flour
fractionation
spelling
: on-line database / online
database
sub_modifier_2 : fine bran / edible
bran
sub_modifier_3: raw wheat germ
/ stabilized wheat germ
sub_head_wn (WordNet head substitutions
from the same synset): immune reactivity / immune responsiveness.
Role "2" (CLAS) relations :
these will cluster connected components using a hierarchical clustering
algorithm. Here, the weight
assigned to each variation type will be taken into consideration for
computing the strength of the link between two components. Relations
which could be considered at this stage are:
Exp-r
(head-expansion) :
wheat bran / wheat bran incorporation
Exp-2
(head-modifier
expansion) : rye bran /
wheat flour rye bran
supplementation
sub_head_2 (Head-substitution on binary terms): flour fractionation /
flour type
sub_head_3 (Head-substitution
on terms
>=3): wheat flour fractionation / wheat flour
supplementation
These relations also
work between terms with two different syntactic structures, for instance "query
language access / access to
structured query language".
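To make the relation types listed above more concrete, here is a deliberately simplified Python sketch that labels the relation between two terms by comparing their word sequences. It is not TermWatch's variation extractor: the function name is hypothetical, there is no lemmatization, no spelling or WordNet variant detection, no handling of terms with different syntactic structures, and no distinction between binary terms and terms of length >= 3.

```python
def variation_type(base, variant):
    """Label the relation from `base` to `variant`, or return None.

    The last word of a term is taken to be its head.
    """
    b, v = base.split(), variant.split()

    if len(b) == len(v):
        # Substitution: exactly one word differs.
        diffs = [i for i, (x, y) in enumerate(zip(b, v)) if x != y]
        if len(diffs) != 1:
            return None
        return "sub_head" if diffs[0] == len(b) - 1 else "sub_modifier"

    if len(v) < len(b):
        return None

    n = len(b)
    # Expansion: the base appears as a contiguous block in the variant.
    for start in range(len(v) - n + 1):
        if v[start:start + n] == b:
            if start == 0:
                return "Exp-r"   # words added after the head
            if start + n == len(v):
                return "Exp-l"   # words added on the left
            return "Exp-2"       # words added on both sides

    # Insertion: one contiguous block of words inserted inside the base.
    extra = len(v) - n
    for k in range(1, n):
        if v[:k] == b[:k] and v[k + extra:] == b[k:]:
            return "Ins"
    return None


examples = [
    ("wheat germ effects", "wheat germ enrichment effects"),
    ("flour fractionation", "wheat flour fractionation"),
    ("wheat bran", "wheat bran incorporation"),
    ("rye bran", "wheat flour rye bran supplementation"),
    ("flour fractionation", "flour type"),
    ("fine bran", "edible bran"),
]
for base, variant in examples:
    print(f"{base} / {variant}: {variation_type(base, variant)}")
```

The example pairs are the ones given in the lists above, so the labels can be compared directly with the relation names used there.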
Example of a default setting for variation "weight" and "role"
Variation | exp-2 | exp-l | ins | spelling | sub_head2 | sub_head3 | sub_head_wn | sub_mod2 | sub_mod3 | sub_mod_wn |
Weight | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 2 | 1 |
Role | 2 | 1 | 1 | 1 | 3 | 2 | 1 | 3 | 2 | 1 |
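The example setting can also be written down as a small Python mapping, which makes it easy to see which relations end up in each role. The variable and function names are hypothetical; the weight and role values are copied from the table.

```python
# relation name -> (weight, role), as in the example setting
EXAMPLE_SETTINGS = {
    "exp-2": (1, 2),
    "exp-l": (1, 1),
    "ins": (1, 1),
    "spelling": (1, 1),
    "sub_head2": (2, 3),
    "sub_head3": (2, 2),
    "sub_head_wn": (1, 1),
    "sub_mod2": (2, 3),
    "sub_mod3": (2, 2),
    "sub_mod_wn": (1, 1),
}


def relations_by_role(settings):
    """Group relation names by role: 1 = COMP, 2 = CLAS, 3 = ignored."""
    groups = {1: [], 2: [], 3: []}
    for relation, (_weight, role) in settings.items():
        groups[role].append(relation)
    return groups


groups = relations_by_role(EXAMPLE_SETTINGS)
print("COMP:", groups[1])
print("CLAS:", groups[2])
print("ignored:", groups[3])
```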
N.B.
- Relations
with role set to "3" are ignored in the clustering.
- WordNet substitutions are semantic variants, and are thus more
meaningful than the lexical variants.
- The user could make a distinction
between these two categories by using only semantic
substitutions at the COMP level (role=1), then using the lexical
substitutions
in the second stage of clustering (role=2).
- Lexical substitutions can
be suitable for identifying co-hyponyms of a parent concept (edible
bran, fine bran are both
types of "bran") although this
can sometimes lead to noise.
- Also, binary substitutions
are very prolific. It is advisable to check the effect of adding them
on the final clusters built. The
user can run the program again and, if necessary, deselect them by
assigning them the role of 3.
- Note also that
the variations are
constrained in that added words have to be consecutive to avoid
generating
"accidental relations", although this possibility is not entirely ruled
out when using the default mode. In this mode, no semantic variant
is included.
- These choices have been
made empirically. They may not
be ideally suited to your corpus. You have to test different values
until the cluster contents suit your application needs (see step4
and step5).
Finally,
the choices you make depend
on the target application.
- Once
the step2_cpcl program, the
clustering parameters, and the variation weights and roles have been
specified, hit
the GO! button under this pane
(bottom of the page) for the system to build the clusters. This
re-initializes the
"Statistics on previous execution" area.
Statistics on
previous execution
This zone shows the statistics
issuing from the last execution of TermWatch.
Statistics
on
variation execution
This shows the
number of terms and relations
found for each variation type;
Statistics on CPCL
clustering
Shows the same for the clustering :
the total number of terms in
clusters, total number of components, the maximum size of a component;
total number of clusters and size of biggest cluster.
It is good to take note of these figures as they will indicate if the
results are meaningful or not. Especially, take note of the size of the
biggest cluster ("Max"
column), number of clusters, total number of terms in clusters.
At this stage, you can either view the results of the clustering or
redo the clustering using other parameters.
The "parameters,
variations,
Weight & Role"
tables recall the parameters used to
obtain these results
(see Programs
and parameters
section).
You have to export the
results of the clustering to the database before they can be explored.
- For this, go to the
top of the page.
- From the pull-down
menu, select (4) Export results to
the MySQL database,
- From the line "What do you want to export", select
"send all results" to view the
variations and clustering; "send only
the variations used for clustering" or "only clustering results" (only
those connected components that are integrated into clusters).
- Hit GO! A message is displayed to tell
you that your results have been successfully exported to the database.
At this stage, you can
either view the results via the PhPMyAdmin interface if you wish or
navigate them via TermWatch's interface.
For the former, click on the hypertext "database" to go over to the
database and click on the "cluster_tbl". For the latter, see step 5
below.
Select this option
from the TermWatch phase pull-down menu and hit GO!.
This brings you to a window where you have two possibilities for
navigating the results:
- through the html
interface via the "Class"
table, this enables you to view class contexts in navigational mode;
- Through the Aisee
visualization interface. This requires
installing Aisee on your computer, getting a
license (free for research purposes and obtainable within a day!) and
loading the "GDL" file, which
is the format required by this graphic tool.
N.B. We recommend you try both,
as Aisee offers a graphic and suggestive display of the map of
clusters. This is quite useful if you want to grasp the layout of
research topics (usually
more fun than lists!), and also relevant if you are
using TermWatch for science and technology watch. Well, let's start
with the navigational mode first.
Navigational mode
This is immediately
available in the "Class" table. Clusters are
presented in descending order of size.
- To view a cluster's
content, select the cluster (radio button in front) and hit the nearest
GO! button.
- TermWatch displays
the cluster contents: first the connected components in the cluster
with their size.
- You can click a connected component to view the
terms within and the total number of variants for that term.
- Select a term and hit GO! , TermWatch generates a four
column table displaying the following information:
- the term's different variants,
- the components in which it occurs,
- the clusters which contain the components,
- the variation type
Navigation operates
on three hierarchical levels: cluster, component and term levels. Any
of the information here
is clickable:
- a click on a
variant + GO! displays the
variants of this term and the documents in which the term
appeared. The term is highlighted in bold.
- You can do
multiple selections but only
across different columns. You can select only one item per
column.
- For instance,
select a term variant, a component and a cluster and hit the GO!
button. However, TermWatch will
only consider the rightmost selection, i.e. will only display the
cluster selected.
- Hit RESET to return to the list of clusters
-
Alternatively, you may want to navigate results by searching specific
terms. Enter the keyword in the search zone on top and hit GO!
- Depending on the results, you may wish to re-do the clustering
with different parameters.
Graphic display mode
For this you need to install the
Aisee visualization tool on your
computer.
TermWatch outputs a "GDL" file
which is ready to be loaded into the Aisee interface. Here are some very
succinct hints on using Aisee to navigate TermWatch's clusters. More
complete information on Aisee's functions can be found on its website.
- After installation, launch Aisee from your desktop.
- From TermWatch's interface, click on "gdl file" in the hypertext message
"You can download the gdl file to be
vizualized using AiSee." on the first page showing list
of clusters (use RESET to return to this page).
- The gdl file is loaded in a browser window. Save this file on
your computer.
- Return to the Aisee interface and click the Open_File menu (or corresponding
icon), load the file.
- Aisee displays the message "Warning:
graph is not connected. Components will be divergent. Force directed
placement will require maximal time". Just ignore it.
- Click "Toggle text window/graph
window" icon (second from right). This actually displays the
graph of clusters. Press OK to "switch
to graph window" message.
- You can also display the Aisee panner which enables you to
position the cursor on a particular zone of the graph. The panner
remains active until you de-activate this option.
- Use the horizontal bar to regulate the zoom: moving it to your
left reduces the size of the display thus enabling you see the whole
layout and vice versa.
- Displaying information on
clusters: To display information on a cluster, first select the
cluster, then hit one of the letters printed on the message bar of
Aisee. This will highlight the required information. You can click on
other clusters to highlight the same type of information. Right click
to deselect the function.
- Fold/unfold clusters: To
unfold a cluster, select the cluster, then the Folding menu, and choose "Wrap/unwrap graph".
- You will get the same message as above ("Warning: graph is not connected.
Components will be divergent. Force directed placement will require
maximal time"). Ignore it and click "Toggle text window / graph window"
to see the unfolded graph. Use the panner to position the cursor on the
area (sometimes you have to search for it a bit...).
- The unfolded graph shows the components contained in the graph
and how they are linked. The idea is to view the internal structure of
a cluster.
- You can display the same types of information as on a cluster:
number of terms, most active variants. Just click on the component, hit
one of the letters displayed below. Right click to deselect.
- To fold a cluster, go
through the same menu as for unfolding.
- Expose/hide links: To
avoid the image being cluttered, you can choose to view only clusters
with outgoing links above a certain threshold. In our adaptation of
Aisee, we specified four thresholds for outgoing links.
- To deselect these links,
go to Folding menu, choose "Expose/hide links" and
de-select "1/4 weakest links"
and OK.
- If the image is still cluttered, you can try deselecting "1/3
weakest links", and so on...
until you are satisfied with the image.
- Center node: To position
on a specific cluster, go to "Position_center
node" menu, scroll down to the class or component required.
Click center. Aisee positions
the cursor on that element. Ok
to close the window.
- Follow edges: In a
cluttered graph, it is sometimes useful to follow all the links of a
cluster. To do this, select a cluster, click "Position_follow edge" menu. A link
will be outlined in bold, continue clicking on the cluster to view its
other links
- When positioned on a link and you wish to view the links from the
cluster pointed to, right click. This second cluster becomes the
starting point to view external links.. and so on. To de-activate this
function, right click twice.
- Exporting Aisee graphs:
It is also possible to export parts or the whole of an Aisee graph as image
files.
- For this, set the horizontal bar at the required level of
legibility.
- Click "File_Print export part"
or "Print/export graph"
depending.
- Set the parameters as follows for a white/black image:
- tick "Maxspect"
- color mode: black+white
- orientation: portrait (or other)
- Paper size: choose yours
- OK button. Choose file
name and "ps" (postscript)
type.
- Open the file with Ghostview (installed on your computer of
course !)
- Use the "File_convert"
option to convert image either to jpg (recommended) or to other image
formats. Set the resolution and OK.
THE END...