TermWatch's help


Table of Contents

  1. Introduction
  2. Overview of the system
  3. Getting started with TermWatch
  4. Running TermWatch in default mode
  5. TermWatch in interactive mode
  6. Export results to the MySQL database
  7. Navigate among the results

Introduction

TermWatch is a clustering system offering two clustering principles: symbolic and statistical.
In its original design, TermWatch primarily focuses on clustering meaningful domain terms (no limit on the length of multi-word terms) based on different linguistic relations (lexical, syntactic and semantic). The aim is to identify the main topics contained in a corpus of texts. To achieve this, TermWatch integrates research from three fields: NLP (terminology engineering), exploratory data analysis (clustering algorithms) and information visualization.
Alternatively (and to satisfy all needs ;-)), co-occurrence information can be added in order to detect associations based on the co-presence of domain terms in the same texts.
This makes TermWatch a comprehensive text analysis platform integrating both statistical and symbolic relations.

When using solely the symbolic relations, the terms have to share some variation relations (i.e., the lexical, syntactic and semantic relations upon which clustering is based) for TermWatch to find clusters. In the ideal case, the corpus of terms given to TW should be coherent (deal with one domain or field), although there can be several sub-specialties or topics within the same field. There is no point giving TW little bits of text that have no connection with one another: it may simply not find the required relations and thus be unable to find interesting clusters. In other words, for TW to be efficient on the linguistic relations, the terms in the corpus should share some lexical elements (common words) or be semantic variants (synonyms, hypernyms/hyponyms). Semantic variants need an external resource. In the current implementation, WordNet is used for this purpose, but a user having a specialised thesaurus or ontology can make use of it instead.
The user is free to choose the exact relations (symbolic or statistical) to use at a particular stage of clustering.

Target Applications

TermWatch is ideally suited to science and technology watch, text mining and other knowledge-intensive tasks such as ontology population from texts, knowledge acquisition from texts, terminology structuring, and thesaurus building and maintenance. It can also be useful in question-answering (Q-A), information extraction (IE) and information retrieval (IR) systems, especially for query expansion.
Different publications have been made on this system. We refer interested readers to Fidelia Ibekwe-SanJuan's homepage or Eric SanJuan's page.
We'll be happy to hear of your experiments with TermWatch and what other applications you used it for.


Overview of the system

Given a list of terms, TermWatch first identifies variation relations between these terms and builds a graph of term variants. It then clusters the graph of term variants in two stages. First, it builds connected components with a subset of user-specified relations. We will call this set "COMP" relations. Next, it clusters the connected components using the second subset of relations called "CLAS" relations. It is up to the user to specify the role each relation will play (COMP or CLAS) depending on the linguistic significance of each relation and on the target application.
Typically, if you are seeking to obtain semantically coherent clusters for building knowledge representation resources (ontologies, thesauri, taxonomies and the like), then you should use semantically-tight relations that suggest synonymy, hyponymy, etc.
If on the other hand you are seeking to identify the topics dealt with in the corpus and how they are connected, you may not require that clusters are semantically-tight. You can then "relax" the type of variations in order to obtain associated topics. This is typically the case if you are using TermWatch to perform science and technology watch, information retrieval (query expansion) or question-answering.
In any case, you may also wish to add co-occurrence information and combine your own relations according to your needs.
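
The two-stage principle described above can be sketched as follows. This is only an illustration, not TermWatch's actual code: the function names are hypothetical, relations are reduced to plain term pairs, and the second function merely counts the CLAS links between components (the hierarchical clustering that TermWatch then applies to these links is not reproduced here).

```python
from collections import defaultdict

def connected_components(terms, comp_edges):
    """Stage 1: group terms linked by COMP relations (union-find)."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in comp_edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in terms:
        groups[find(t)].add(t)
    return list(groups.values())

def component_links(components, clas_edges):
    """Stage 2 input: count CLAS relations between distinct components."""
    index = {t: i for i, comp in enumerate(components) for t in comp}
    links = defaultdict(int)
    for a, b in clas_edges:
        i, j = index[a], index[b]
        if i != j:
            links[(min(i, j), max(i, j))] += 1
    return dict(links)

terms = ["bone marrow", "adult bone marrow", "normal bone marrow",
         "bone marrow transplantation", "autologous bone marrow transplantation"]
# Modifier relations play the COMP role in this toy example.
comp_edges = [("bone marrow", "adult bone marrow"),
              ("bone marrow", "normal bone marrow"),
              ("bone marrow transplantation", "autologous bone marrow transplantation")]
# A head expansion plays the CLAS role.
clas_edges = [("bone marrow", "bone marrow transplantation")]

components = connected_components(terms, comp_edges)
links = component_links(components, clas_edges)
print(components)
print(links)
```

Here the three "bone marrow" variants form one component, the two "transplantation" terms form another, and a single CLAS link connects the two components.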

Example of a cluster formed by symbolic relations

A component is formed at the first stage of clustering by grouping together terms sharing some semantically-tight relations. Below is an example of a cluster formed by four components. Terms within a component share modifier relations: "CD11b+ bone marrow cell" is a modifier substitution of "immature bone marrow cell". Components are linked by head variation relations following an edge differentiation coefficient, e.g., "bone marrow transplantation" is a head expansion of "bone marrow".

Cluster
Comp1: CD11b+ bone marrow cell; immature bone marrow cell; mouse bone marrow cell; normal bone marrow cell; normal bone marrow myeloid cell; normal CD34+ bone marrow cell; transgenic bone marrow cell; murine bone marrow cell; primary murine bone marrow cell
Comp2: bone marrow transplantation; autologous bone marrow transplantation
Comp3: bone marrow; adult bone marrow; normal bone marrow
Comp4: bone marrow derived macrophage; murine bone marrow derived macrophage

What this cluster suggests is that research work around bone marrow deals with the following topics (the added or substituted head words): transplantation, cell, macrophage, whereas the modifier relations suggest the different "types" of bone marrow which are being studied (CD11b+, immature, mouse, transgenic, murine, autologous, normal, adult, etc.). If available, explicit semantic relations can be added.

The other types of symbolic relations available in TermWatch are listed in the "Check available variations" section.


The statistical relation does not need presenting: it is the age-old co-occurrence relation commonly used for clustering.
In the current implementation, instead of computing occurrences, we computed within-sentence frequencies in a limited text window. Thus we are interested in the presence/absence of a term in a sentence, regardless of the number of times it appears in that particular sentence. The strength of association of two terms is given by an equivalence coefficient formulated thus:
 
E_ij = (f_ij)^2 / (f_i × f_j)

where E_ij indicates the strength of the association between term i and term j;
f_ij = number of sentences in which terms i and j both appear (current text window);
f_i = number of sentences in which term i appears;
f_j = number of sentences in which term j appears.
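
As a rough sketch of this computation (not TermWatch's own code), the coefficient can be derived from a list of sentences, each given as the list of terms it contains. The function name and the toy data are assumptions made for the illustration; the text window is taken to be the sentence.

```python
from collections import Counter
from itertools import combinations

def equivalence_coefficients(sentences):
    """Return {(term_i, term_j): E_ij} from sentence-level presence/absence."""
    term_freq = Counter()   # f_i: number of sentences containing term i
    pair_freq = Counter()   # f_ij: number of sentences containing both i and j
    for sentence in sentences:
        present = sorted(set(sentence))  # repeated terms count only once
        term_freq.update(present)
        pair_freq.update(combinations(present, 2))
    return {
        (i, j): (f_ij ** 2) / (term_freq[i] * term_freq[j])
        for (i, j), f_ij in pair_freq.items()
    }

sentences = [
    ["bone marrow", "macrophage", "bone marrow"],
    ["bone marrow", "macrophage"],
    ["bone marrow", "transplantation"],
    ["macrophage"],
]
scores = equivalence_coefficients(sentences)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

In this toy corpus "bone marrow" and "macrophage" each appear in three sentences and share two of them, so their coefficient is 2² / (3 × 3) = 4/9.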

After computing the association scores, two levels are distinguished: strong associations (> 0.5) and weak associations (>= 0.05 and < 0.5). To perform clustering, the user can choose "strong associations" in role 1 (to form connected components) and "weak associations" in role 2 (to gather connected components into clusters). You can also set a threshold to consider only association links above that threshold.
If you only wish to use co-occurrence for clustering, then you can disable the other symbolic relations (variations) by setting their role to 3 (see Clustering parameters for more detailed instructions).
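
The strong/weak distinction can be expressed as a small helper. The function name is hypothetical and the default bounds simply restate the values given above; a score of exactly 0.5 is left unclassified because the text does not say which side it falls on.

```python
def association_level(score, strong=0.5, weak_floor=0.05):
    """Classify an association score as 'strong', 'weak' or None (ignored)."""
    if score > strong:
        return "strong"   # candidate for role 1 (connected components)
    if weak_floor <= score < strong:
        return "weak"     # candidate for role 2 (clusters of components)
    return None           # below 0.05, or exactly 0.5 (undefined in the text)

for value in (0.8, 0.2, 0.05, 0.01):
    print(value, association_level(value))
```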

N.B.
The text window used and its size can easily be modified via the MySQL interface.
For instance, a user may wish to search for co-occurrences in the whole text, in a paragraph, in a fixed-size window, etc.


Getting started with TermWatch

TermWatch is currently available as an online server application. To use it, you'll need an account on the server. For this, contact TW-webmaster. Once your account has been created:
  • Log in to the system at the following address: https://stid-bdd.iut.univ-metz.fr/TermWatch/
  • Type in your login and password.
  • The typical stages to perform text data analysis with TermWatch are:
    1. Upload your corpus into the system
    2. Launch the term extraction module
    3. Upload your list of terms into the database
    4. Perform variation and clustering
    5. Export results to the MySQL database
    6. Navigate among the results

    Term extraction via TermWatch (English only)

    We implemented term extraction using LTchunker and some handcrafted term extraction rules. LTchunker is external to the system and is distributed freely for research purposes by the LP Group, University of Edinburgh. We have been able to optimize TermWatch's internal programs (variation identification and clustering) but not the term extraction phase, so the process can take time (from 1 hour upwards for a corpus of 455 000 words). If you are in a hurry (how could you be?), we suggest you extract your terms with any term extractor available to you prior to using TermWatch. You can also implement your own extraction using any existing tagger and pattern-matching rules. If your terms are already extracted, go straight to the "Uploading your list of terms in TermWatch" section and follow from there. If, however, you want to extract terms via TermWatch, here is how:
    1. Upload your corpus into the system following the steps indicated in the "Uploading your corpus in TermWatch" section.
    2. Select "Run local term candidate extractor" in the "TermWatch phase" menu.
    3. Read all the warnings carefully before proceeding. The new candidates will be automatically inserted in the terms_tbl table. The process could take more than one hour for a big corpus of abstracts. If this is the case for your corpus, you should split corpus_tbl into several smaller tables using SQL commands via the phpMyAdmin interface.

    Uploading your corpus in TermWatch

    If you want to be able to navigate in the source texts after clustering, then you can also upload your corpus into the database. Before filling out the import form, check the structure of your corpus: identify fields and field separators, text separators, special characters you want to keep or remove, etc. A headache here is making sure your corpus has regular field separators, so that fields are correctly recognized during import...
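
    One way to check that field separators are regular before importing is to count the fields of every record. The sketch below assumes one record per line and a tab as field separator; both are assumptions made for the illustration, since your corpus may use other conventions, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter

def field_count_report(text, delimiter="\t"):
    """Map each observed number of fields to the number of records having it."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    return Counter(len(row) for row in reader if row)

sample = (
    "1\tTitle A\tAbstract A\n"
    "2\tTitle B\n"              # irregular record: one field is missing
    "3\tTitle C\tAbstract C\n"
)
report = field_count_report(sample)
print(dict(report))
```

    A corpus with regular separators yields a single field count; any other count points to records worth inspecting before the import.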
    To upload your corpus:

    Uploading your list of terms in TermWatch

    1. Select phase (2) "Re-load corpus of terms" from the "TermWatch phase" pane.
    2. Hit GO!. The system tells you it could not load the table of terms in the database.
    3. Click on the "database" hypertext in this message. This opens the phpMyAdmin interface.
    4. On the left pane, select the "terms_tbl" table.
    5. If your term list is bigger than the maximum size set (2 MB), you have to split it into several files and upload them successively into the same table (terms_tbl or corpus_tbl depending on what you are uploading).
    6. Click on the menu "Insert data from a textfile into table". This link is found at the bottom of the current window. It is displayed in the language chosen for your browser.
    7. In the first row of the Input list of TermWatch table, use the browse button to select the location of the textfile.
    8. Leave the default options in the cells "Replace table data with file" and "fields terminated by".
    9. In the row "Fields enclosed by", tick the box "optionally".
    10. In the "LOAD method" row, leave "DATA LOCAL" and press "Submit". The system will first try this method to upload terms; if this fails, an error message will be displayed. In this case, try the "DATA" option for uploading your terms.
    11. The system tags the list of terms using WordNet lemmas; this could take a few seconds or a minute depending on the size of your corpus.
    Reload corpus of terms

    1. Once you have uploaded your list of terms in the MySQL database, return to the TermWatch window
    2. From the "TermWatch phase" pane, select (2) "Re-load corpus of terms". As TermWatch keeps the last loaded corpus in memory, this step is called "Re-load". It is advisable to do  "Re-load corpus of terms" in between sessions to make sure you are working on your latest list of terms.
    3. Hit the GO! button.
    4. TermWatch loads the list of terms in the MySQL database and prints the message "List of terms successfully loaded from the table terms_tbl in your database using the same login information." 
    5. If you wish to view the list of terms, click on the hypertext "database". Normally, this opens the list of terms in the phpMyAdmin interface.

    From here, you can run TermWatch in either default mode or interactive mode. Default mode is advisable if you are not familiar with work on terminological variations and their significance, or if this is your first time using the system.


    Running TermWatch in default mode

    N.B. It is advisable to use this mode if you are not familiar with work on the terminological variation relations used by TermWatch for clustering.

    Before you run the system on your own corpus, you have to upload your list of terms into the MySQL database used by TermWatch (see help on uploading terms into TermWatch).

    Default parameters

    • The  "default parameters" set for TermWatch were chosen from empirical evidence in the case where TermWatch is being used for science and technology watch. Thus the resulting clusters may not be semantically constrained, i.e., the terms in the same cluster can be from different semantic categories, provided they share some lexical elements in common.
    • In the default mode, the relations considered are the following: modifier expansions (left-expansion, insertion) and modifier substitution on terms of length >=3; head expansions and head substitution on terms of length >=3. The different variation relations available in TermWatch are explained in the "available variations" section.
    • Default threshold is set to "0" and the number of Iterations = 2.
    • Default weight is set to "1" for  the following relations: Exp-l, Ins, M_sub_3
    • Default weight is set to "2" for  Exp-2, Exp-r, sub_head_3. Explanations on the significance of the weights are given in Weight.
    • Hit GO! button to run TermWatch.

    N.B. In the default mode, one-word terms like "analysis, activity, system, blood" and binary substitution relations like "fine bran, coarse bran, rice bran, wheat bran, defatted bran" will be ignored. If you need to consider them, then you should run TermWatch with your own parameters.


    TermWatch in interactive mode

    In this mode, the user can select:

    • the precise TermWatch module s/he wants to run,
    • the relations used for clustering,
    • the role each one plays,
    • their weights
    • the number of iterations of the clustering algorithm.

    Step 3: Variation extraction and clustering

    1. Return to the TermWatch Interface.
    2. From the "TermWatch phase" pane, choose (step 3) "Process variation extraction and clustering"
    3. Hit the GO! button.

    This opens another window with two major zones: "Programs and parameters" and "Statistics on previous execution". Before running step (3), the "Programs and parameters" pane has to be set.

    Programs and parameters

    Scroll down to "Programs and parameters" area.  This is divided into three sub-areas: select the program, clustering parameters and check available variations. Each sub-area is explained in more details below.

    Select the program

    This pull-down menu shows the available TermWatch modules.  At the initial stage, the "check available variation" pane is empty. You have to run the variation relations successively in order for the "available variations" to appear in the right pane.  The modules available in this menu are:

    • Overall: this executes the whole process of variation and clustering in default mode.
    • step1_var_exp searches for expansion, insertions and spelling variants
    • step1_var_sub searches for lexical substitution variants
    • step1_var_sub_wn searches for WordNet substitution variants (semantic variants)
    • step2_cpcl computes the connected components and the clusters.
    • var_reset reinitializes the variation search and empties the right hand pane.

    Important

    You have to run each program separately by selecting its name and hitting the GO! button.
    The step1 programs do not need the clustering parameters area to be set, so leave this zone alone if you are only interested in running the variations.
    Execution takes... a few seconds (normally ;-)).
    Once a variation program is executed, the available variation pane is filled up with the corresponding variation types. Repeat this process for all the programs you want to run.
    If you run the clustering program (step2_cpcl), then you MUST also set the clustering parameters. This is explained below.

    Clustering parameters

    Important

    This is only necessary for the step2_cpcl program, which actually builds the clusters from the variations obtained by the preceding programs.
    For clustering to take place, you have to specify the threshold, the number of iterations and the weight and role of each variation (right pane).

    Threshold

    Threshold is the minimal weight at which relations between connected components can be considered for clustering.
    - a threshold of "0" means all links are considered.
    - a threshold of  0.01 means links of that level and above are considered, etc.
    You can try several thresholds experimentally.
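
    The effect of the threshold can be pictured as a simple filter on the links between connected components. This is only a sketch: the function name, the component labels and the link weights are invented for the example, and weights are assumed to be non-negative.

```python
def links_above_threshold(links, threshold=0.0):
    """Keep only the links whose weight is at or above the threshold."""
    return {pair: weight for pair, weight in links.items() if weight >= threshold}

links = {
    ("comp1", "comp2"): 0.30,
    ("comp1", "comp3"): 0.005,
    ("comp2", "comp4"): 0.01,
}
all_links = links_above_threshold(links, 0)   # threshold 0: every link is kept
kept = links_above_threshold(links, 0.01)     # 0.01: links of that level and above
print(len(all_links), sorted(kept))
```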

    Iterations

    Specify here the number of times the clustering algorithm iterates the second level of clustering (i.e. grouping connected components into clusters). Since TermWatch is based on a hierarchical algorithm, the more iterations there are, the bigger the clusters become, i.e., clusters are merged in subsequent iterations or new components are integrated.
    Different values work for different corpora. You can only determine your ideal value experimentally. Let's just say that we have often got optimal results at the 2nd iteration...
    Once the program, the threshold and iterations are set, hit the GO! button below.

    Important

    For clusters to be computed, you must run step2_cpcl. However, if you wish to obtain only paradigmatic classes of terms (groups of terms with the same head word),  run the step1_var programs only.

    Check available variations in TermWatch

    There are many ways in which these relations can be presented. For reasons of efficiency, we will present them according to the two stages of clustering in the TermWatch algorithm. For a linguistically-motivated presentation, see Fidelia Ibekwe-SanJuan's homepage.     

    Weight

    This determines how the clustering algorithm should "weigh" the total number of each variation type in the graph. Weight takes two values, either "1" or "2".
    Weight=1 means the total number of relations is taken "as is" during clustering, i.e., if there are 500 insertions in the graph, then this number is used as is in computing the strength of the links between components.
    Weight=2 means the system will take the inverse of the total number of that particular type of relations. If your graph has 200 sub_head_3 relations, then only half of this number will be used in the index for weighting variation links between two components. The idea is to handicap very prolific relations like lexical modifier or head substitutions which may "drown" the rarer variation types, thus making the information they carry invisible in the final clusters.
    In essence, you should assign a weight of "1" to relations that are of prime importance to your target application and assign a weight of "2" to more subsidiary relations.
    Default weights and roles have been set. You could first try them and evaluate the results.

    Role

    Relations used for clustering can be assigned a role of 1, 2 or 3.
    1 = relation is used at the first level of clustering to form connected components (COMP)
    2 = relation is used for clustering connected components to obtain the topic clusters
    3 = the relation is ignored

    Role "1" (COMP) relations: these are typically the relations to which you want to assign a prime role. They will form groups of related terms at the first level, i.e., connected components in formal terms. In different experiments, we have chosen the following relations as COMP:
    Ins (Insertion) : wheat germ effects / wheat germ enrichment effects
    Exp-l (Left-expansion) : flour fractionation / wheat flour fractionation
    spelling : on-line database / online database
    sub_modifier_2 :  fine bran / edible bran
    sub_modifier_3: raw wheat germ /  stabilized wheat germ
    sub_head_wn (WordNet head substitutions from the same synset): immune reactivity / immune responsiveness.

    Role "2" (CLAS) relations : these will cluster connected components using a hierarchical clustering algorithm. Here, the weight assigned to each variation type will be taken into consideration for computing the strength of the link between two components. Relations which could be considered at this stage are:
    Exp-r (head-expansion) :  wheat bran / wheat bran incorporation
    Exp-2  (head-modifier expansion) : rye bran /  wheat flour rye bran supplementation
    sub_head_2 (Head-substitution on binary terms): flour fractionation / flour type
    sub_head_3 (Head-substitution on terms >=3):  wheat flour fractionation /  wheat flour supplementation

    These relations also work between terms of two different syntactic structures, for instance "query language access / access to structured query language".

    Example of a default setting for variations "weight" and "role"

    Variation  exp-2  exp-l  ins  spelling  sub_head2  sub_head3  sub_head_wn  sub_mod2  sub_mod3  sub_mod_wn
    Weight     1      1      1    1         2          2          1            2         2         1
    Role       2      1      1    1         3          2          1            3         2         1
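
    For readers who prefer to see this setting as data, the table can be transcribed as a mapping and grouped by role. The dictionary layout and the function name are choices made for this sketch, not a TermWatch file format; the values are copied from the table above.

```python
DEFAULT_SETTINGS = {
    "exp-2":       {"weight": 1, "role": 2},
    "exp-l":       {"weight": 1, "role": 1},
    "ins":         {"weight": 1, "role": 1},
    "spelling":    {"weight": 1, "role": 1},
    "sub_head2":   {"weight": 2, "role": 3},
    "sub_head3":   {"weight": 2, "role": 2},
    "sub_head_wn": {"weight": 1, "role": 1},
    "sub_mod2":    {"weight": 2, "role": 3},
    "sub_mod3":    {"weight": 2, "role": 2},
    "sub_mod_wn":  {"weight": 1, "role": 1},
}

ROLE_NAMES = {1: "COMP", 2: "CLAS", 3: "ignored"}

def relations_by_role(settings):
    """Group variation names under the role they play in clustering."""
    grouped = {name: [] for name in ROLE_NAMES.values()}
    for variation, params in settings.items():
        grouped[ROLE_NAMES[params["role"]]].append(variation)
    return grouped

grouped = relations_by_role(DEFAULT_SETTINGS)
for role, variations in grouped.items():
    print(role, variations)
```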

    N.B.
    - Relations with role set to "3" are ignored in the clustering.
    - WordNet substitutions are semantic variants, and are thus more meaningful than the lexical variants.
    - The user could make a distinction between these two categories by using only semantic substitutions at the COMP level (role=1), then using the lexical substitutions in the second stage of clustering (role=2).
    - Lexical substitutions can be suitable for identifying co-hyponyms of a parent concept (edible bran, fine bran are both types of "bran") although this can sometimes lead to noise.
    - Also, binary substitutions are very prolific. It is advisable to check the effect of adding them on the final clusters built. The user can run the program again and deselect them if necessary by assigning them the role of 3.
    - Note also that the variations are constrained in that added words have to be consecutive to avoid generating "accidental relations", although this possibility is not entirely ruled out when using the default mode. In this mode, no semantic variant is included.
    - These choices have been made empirically. They may not be ideally suited to your corpus. You have to test different values until cluster contents suit your application needs (see step 4 and step 5). Finally, the choices you make depend on the target application.

    Once the step2_cpcl program, the clustering parameters, and the variation weights and roles have been specified, hit the GO! button under this pane (bottom of the page) for the system to build the clusters. This re-initializes the "Statistics on previous execution" area.

    Statistics on previous execution

    This zone shows the statistics issuing from the last execution of TermWatch.

    Statistics on variation execution

    This shows the number of terms and relations found for each variation type.

    Statistics on CPCL clustering

    Shows the same for the clustering : the total number of terms in clusters, total number of components, the maximum size of a component; total number of clusters and size of biggest cluster.
    It is good to take note of these figures as they will indicate if the results are meaningful or not. Especially, take note of the size of the biggest cluster ("Max" column), number of clusters, total number of terms in clusters.
    At this stage, you can either view the results of the clustering or redo the clustering using other parameters.

    The "parameters, variations, Weight & Role" tables recall the parameters used to obtain these results (see Programs and parameters section).


    Step 4: Export results to the MySQL database

    You have to export the results of the clustering to the database before they can be explored.

    1. For this, go to the top of the page.
    2. From the pull-down menu, select (4) Export results to the MySQL database,
    3. From the line "What do you want to export", select "send all results" to view both the variations and the clustering, "send only the variations used for clustering", or "only clustering results" (only those connected components that are integrated into clusters).
    4. Hit GO! A message is displayed to tell you that your results have been successfully exported to the database.

    At this stage, you can either view the results via the phpMyAdmin interface or navigate them via TermWatch's interface.
    For the former, click on the hypertext "database" to go over to the database and click on the "cluster_tbl" table. For the latter, see step 5 below.


    Step 5: Navigate among the results

    Select this option from the TermWatch phase pull-down menu and hit GO!. This brings you to a window where you have two possibilities for navigating the results:

    • through the HTML interface via the "Class" table; this enables you to view class contexts in navigational mode;
    • through the Aisee visualization interface. This requires installing Aisee on your computer, getting a license (free for research purposes and obtainable in 1 day!) and loading the "GDL" file, which is the required format for this graphic tool.

    We recommend you try both as Aisee offers a graphic and suggestive display of the map of clusters. This is quite useful if you want to grasp the layout of research topics (usually more fun than lists !), also relevant if you are using TermWatch for science and technology watch. Well, let's start with the navigational mode first.

    Navigational mode

    This is immediately available in the  "Class" table. Clusters are presented  in descending order of size.

    1. To view a cluster's content, select the cluster (radio button in front) and hit the nearest GO! button.
    2. TermWatch displays the cluster contents: first the connected components in the cluster with their size.
    3. You can click a connected component to view the terms within and the total number of variants for that term.
    4. Select a term and hit GO!; TermWatch generates a four-column table displaying the following information:
      1. the term's different variants,
      2. the components in which it occurs,
      3. the clusters which contain the components,
      4. the variation type

    Navigation operates on three hierarchical levels: cluster, component and term levels. Any of the information here is clickable:

    1. A click on a variant + GO! displays the variants of this term and the documents in which the term appeared. The term is highlighted in bold.
    2. You can do multiple selections but only across different columns. You can select only one item per column.
    3. For instance, select a term variant, a component and a cluster and hit the GO! button. However, TermWatch will only consider the rightmost selection, i.e. will only display the cluster selected.
    4. Hit RESET to return to the list of clusters
    5. Alternatively, you may want to navigate results by searching specific terms. Enter the keyword in the search zone on top and hit GO!
    6. Depending on the results, you may wish to re-do the clustering with different parameters.

    Graphic display mode

    For this you need to install the Aisee visualization tool on your computer.
    TermWatch outputs a "GDL" file which is ready to be loaded into the Aisee interface. Here are very succinct hints on using Aisee to navigate TermWatch's clusters. More complete information on Aisee's functions can be found on its website.

    1. After installation, launch Aisee from your desktop.
    2. From TermWatch's interface, click on "gdl file" in the hypertext message "You can download the gdl file to be vizualized using AiSee." on the first page showing list of clusters (use RESET to return to this page).
    3. The gdl file is loaded in a browser window. Save this file on your computer.
    4. Return to the Aisee interface and click the Open_File menu (or corresponding icon), load the file.
    5. Aisee displays the message "Warning: graph is not connected. Components will be divergent. Force directed placement will require maximal time". Just ignore it.
    6. Click "Toggle text window/graph window" icon (second from right). This actually displays the graph of clusters. Press OK to "switch to graph window" message.
    7. You can also display the Aisee panner which enables you to position the cursor on a particular zone of the graph. The panner remains active until you de-activate this option.
    8. Use the horizontal bar to regulate the zoom: moving it to your left reduces the size of the display thus enabling you see the whole layout and vice versa.
    9. Displaying information on clusters: To display information on a cluster, first select the cluster, then hit one of the letters printed on the message bar of Aisee. This will highlight the required information. You can click on other clusters to highlight the same type of information. Right click to deselect the function.
    10. Fold/unfold clusters: To unfold a cluster, select the cluster, then the Folding menu, and choose "Wrap/unwrap graph".
    11. You will get the same message as above ("Warning: graph is not connected. Components will be divergent. Force directed placement will require maximal time").  Ignore it and click "Toggle text window / graph window" to see the unfolded graph. Use the panner to position the cursor on the area (sometimes you have to search for it a bit...).
    12. The unfolded graph shows the components contained in the graph and how they are linked. The idea is to view the internal structure of a cluster.
    13. You can display the same types of information as on a cluster: number of terms, most active variants. Just click on the component, hit one of the letters displayed below. Right click to deselect.
    14. To fold a cluster, go through the same menu as for unfolding.
    15. Expose/hide links: To avoid the image being cluttered, you can choose to view only clusters with outgoing links above a certain threshold. In our adaptation of Aisee, we specified four thresholds for outgoing links.
    16. To deselect these links, go to Folding menu, choose "Expose/hide links" and de-select  "1/4 weakest links" and OK.
    17. If the image is still cluttered, you can try deselecting "1/3 weakest links", and so on... until you are satisfied with the image.
    18. Center node: To position on a specific cluster, go to "Position_center node" menu, scroll down to the class or component required. Click center. Aisee positions the cursor on that element. Ok to close the window.
    19. Follow edges: In a cluttered graph, it is sometimes useful to follow all the links of a cluster. To do this, select a cluster, click "Position_follow edge" menu. A link will be outlined in bold, continue clicking on the cluster to view its other links
    20. When positioned on a link and you wish to view the links from the cluster pointed to, right click. This second cluster becomes the starting point to view external links.. and so on. To de-activate this function, right click twice.
    21. Exporting Aisee graphs: It is also possible to export parts or the whole of Aisee graphs as image files.
    22. For this, set the horizontal bar at the required level of legibility.
    23. Click "File_Print export part" or "Print/export graph" as appropriate.
    24. Set the parameters as follows for a white/black image:
      • tick "Maxspect"
      • color mode: black+white
      • orientation: portrait (or other)
      • Paper size: choose yours
      • OK button. Choose file name and "ps" (postscript) type.
      • Open the file with Ghostview (installed on your computer of course !)
      • Use the "File_convert" option to convert image either to jpg (recommended) or to other image formats. Set the resolution and OK.

    THE END...