TermWatch interface for statistical analysis

How to use graph and text-mining procedures

This interface allows you to run standard R procedures on data stored in the local MySQL server.

It also allows you to add R procedures and to import/export data via SSH connections.

Eric SanJuan

Contents

  1. Start
  2. Putting a new corpus on the server
  3. Corpus preprocessing steps in TW's master.pl interface
  4. TW's index.pl user interface for terminological variation extraction and clustering
  5. Computing association graphs
  6. SmallWorld programs: Computing indicators between nodes.
  7. desart programs: Graph desarticulation
  8. Computing association rules

Start

Open three tabs in your browser and open:


Putting a new corpus on the server

If a corpus needs to be put on your account on the server or retrieved from it, you need to connect to the server "stid-bdd" and export or import the corpus:

  1. use scp on Linux or open WinSCP on Windows
  2. connect to stid-bdd.iut.univ-metz.fr
  3. log in with your user name and password
  4. to make a corpus available in an account, drag and drop a file from your computer to your directory (CSV format is preferable)

Corpus preprocessing steps in TW's master.pl interface

If you have an ISI corpus or a text in CSV format, go to:

  1. the master.pl page
  2. in the TermWatch phase zone, select phase (1) Home in the program zone and hit "GO!"
  3. scroll down to the field "Select the file you wish to import from your personal directory", choose the CSV file you wish to analyze, and hit "GO!".

Data has to be in tabulated (tab-separated) files.

To put a text ISI corpus into the format required for analysis by TW using the "master.pl" interface:

  1. Go to phase (3) "process data and statistical processing" and hit "GO!"
  2. launch the program 000_ISI2csv.pl; in the parameter zone, type the filename without the ".csv" extension. This program reformats (tabulates) the ISI corpus.
  3. You should obtain a confirmation message. Go to phase (1) to see the output file; you can download it and have a look at the reformatting of the corpus.

Once you have a nicely tabulated data file, you can send it to the database. To do this, the data file name has to be of the form "filename.csv" and a table "filename_csv" with the correct number of columns must exist in the database.
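For instance, before exporting you can quickly check in R that the tabulated file has the number of columns expected by the target table (a minimal sketch; "ISI_corpus.csv" is only an example file name):

    # Minimal check of a tab-separated corpus file before exporting it.
    # "ISI_corpus.csv" is an example file name.
    corpus <- read.delim("ISI_corpus.csv", header = FALSE, sep = "\t",
                         quote = "", stringsAsFactors = FALSE)
    ncol(corpus)   # must match the number of columns of the target "_tbl" table
    nrow(corpus)   # number of records that will be exported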

If the data comes from an ISI text corpus, the table in the database, with optimized inverted files, can be created automatically using the following program:

  1. return to phase (3) and launch the program "001_preproc_ISI_tbl.sql"; no parameter is needed. This program creates the MySQL table of the corpus, i.e. the table whose name has to exist before export.
  2. To see the result, go to the MySQL page of the database and click "reset" to refresh the "DB_username" account. Re-select the database "DB_username"; you should see a "corpus_tbl" table which is still empty (so you cannot yet use the MySQL "browse" button).
    N.B. if a corpus with more fields is given, you will need to create a new table with the corresponding number of fields (in MySQL, click the "Operations" tab)

Now, to send the tabulated data to its corresponding table, do:

  1. Return to the TermWatch/master.pl page, select phase (4) "Export data to MySQL database" and hit "GO!"
  2. in the field "Select what you want to export to the database", check that all target tables ("file_tbl") already exist in your database, and
  3. select the filename of your corpus in CSV format, e.g. ISI_corpus.csv.

In the case of data coming from an ISI corpus, this puts the corpus into the corresponding MySQL table already created by the "001_preproc_ISI_tbl.sql" program; the resulting table will be something like "ISI_corpus_tbl". (If you return to the MySQL interface, reset the page and re-select your database, you should be able to browse "ISI_corpus_tbl" and see how the corpus has been stored.)

From here you can use the phpMyAdmin interface and SQL queries to extract data, or use TermWatch's terminological variation extraction. To extract terminology, see the following section; otherwise, skip it.
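For example, a short R sketch of such an extraction through SQL (a hedged illustration: the driver, credentials and host below are placeholders, not necessarily the exact ones configured in the interface):

    # Sketch: query the corpus table from R through DBI (placeholder credentials).
    library(DBI)
    con <- dbConnect(RMariaDB::MariaDB(), dbname = "DB_username",
                     user = "username", password = "passwd",
                     host = "stid-bdd.iut.univ-metz.fr")
    docs <- dbGetQuery(con, "SELECT * FROM ISI_corpus_tbl LIMIT 10")
    str(docs)          # inspect the fields of the first records
    dbDisconnect(con)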


TW's index.pl user interface for terminological variation extraction and clustering

First you need to prepare, for TermWatch, a table with three columns: doc_number, title, abstract (or text).
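If you build this table by hand rather than with the programs below, a hedged sketch of the schema in R/DBI (column types and credentials are assumptions; adjust them to your corpus):

    # Sketch: create the three-column table expected by TermWatch (assumed types).
    library(DBI)
    con <- dbConnect(RMariaDB::MariaDB(), dbname = "DB_username",
                     user = "username", password = "passwd", host = "localhost")
    dbExecute(con, "CREATE TABLE corpus_tbl (
                      doc_number INT,
                      title      TEXT,
                      abstract   TEXT)")
    dbDisconnect(con)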

As usual, if your data comes from an ISI text corpus, this can be done automatically using the following programs.

  1. Return to TermWatch/master.pl and select phase (3), hit "GO"
  2. launch "002_preproc_corpus_tbl.pl" which prepares the corpus for TW (extracts from the corpus the text fields "abstract, title" with the ref. ID of document) and puts them in "corpus_tbl".
  3. Go to phpmyadmin
    • select "DB_username" and verify that the database is complete
    • ISI_corpus_tbl : browse and check completeness of fields and records
    • corpus_tbl : check the presence of the 3 fields (doc, title, abstract) and check that every field opens (you can see the text)
  4. If all is OK, you need to transfer "corpus_tbl" to TW's MySQL database, named "user_DB": select "corpus_tbl", go to "Operations", select "move to", select the other TW database and hit "go".

Now we go to the native TermWatch interface:

  1. go to the index.pl page
  2. log in
  3. If you don't already have a list of terms from your corpus, start by running the "Optional run local term candidate extractor" program, which extracts from your corpus the terms on which the other programs will run.
    N.B. Check that your "terms_tbl" table in the MySQL database is empty before running term extraction. If not, empty it (just click on "terms_tbl" and then "empty"), because the term extraction program would add the terms extracted from your new corpus to old terms, possibly coming from another corpus.
  4. In the TermWatch interface, reply to the confirmation message and select "re-start". This will take some time depending on your corpus size.
  5. Run the step (2) "Reload corpus of terms"
  6. In step (3), do "var reset" in order to re-initialize (clear) the statistics from the previous run
  7. select and run successively the different variation extraction programs
    N.B. Before running "Step3_class_doc":
    • a- empty your variation tables (var_tbl, cluster_tbl, class_tbl, comp_tbl) in the MySQL database if they are not empty,
    • b- export the results to the MySQL database,
    • c- and do this only after running the other step (3) programs.
  8. select phase (4) "Export results to MySQL database" and choose "export all results". If next time you only re-cluster the terms, then choose the appropriate export option. Then run the usual programs or consult the specialized user guide.
  9. Once you are satisfied with the quality and labels of the obtained classes, go back to step (3) "process variation extraction and clustering" and choose the program step3_class_doc.sh. This will generate in user_DB a new table for association graph analysis.
  10. Finally, go to the MySQL interface and move "class_doc_tbl" from the MySQL database "user_DB" to "DB_user", on which TermWatch/master.pl will work.

Computing association graphs

We come back to master.pl to compute and analyze association graphs.

We now suppose that you have in DB_user a two-column table that represents the hypergraph from which we are going to extract the association graphs.

This hypergraph can be the table class_doc_tbl computed in the previous section. If this is the case, the following program allows you to enrich this table with author names.
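For illustration, a hypothetical fragment of such a two-column table as it might be loaded in R (the values are invented; the real class_doc_tbl links cluster labels to document identifiers):

    # Invented fragment of a two-column hypergraph table, as seen from R.
    hypergraph <- data.frame(
      class = c("class_1", "class_1", "class_2"),
      doc   = c("doc_12",  "doc_45",  "doc_12"),
      stringsAsFactors = FALSE)
    # Rows sharing a document (here doc_12) will produce an association between
    # class_1 and class_2 in the association graphs extracted below.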

  1. Go to phase (2) "Import data from MySQL database" and select the table "class_doc_tbl".
  2. Go to phase (3) "Process statistics and clustering":
  • a- run the 010_hyper_doc_comp_year.sql program. Go to the MySQL database DB_user, reset the page and browse the newly added tables: doc_year_tbl (links each document with its year) and hypergraph_tbl (associates classes with documents and author names).
  • b- the other programs will work on this hypergraph_tbl, which can include other fields (keywords, etc.).
  • c- go to phase (3) and run "011_exploring_tbl", which creates a new table in MySQL for exploring the associations in the ISI corpus.

Once the table containing the hypergraph is ready (a two-column table with only symbolic values):

  1. go to phase (2) and import hypergraph_tbl. It is loaded into the master.pl home directory.
  2. Launch successively the programs beginning with "10xxxx.pl". They work directly on files in the master.pl home, no longer in the MySQL database. Always go to (1) Home to see the files produced by each program executed.
    • a- 100_asso_val_graph: computes the association coefficient between terms and other information units in an association file such as hypergraph_tbl or class_doc_tbl (now a .csv file in the master.pl home). An association is found if a term, one of its variants, or a chain of its variants appears in a document (author, keyword). View the results in step (1) of the interface. The user can specify co-occurrence and association thresholds; just add the values separated by a blank (see the sketch after this list).
    • b- 101_connect.pl: computes the most prominent connected components (at least 10% of the whole graph) in the graph sense. The input file could be, for example, "hypergraph.asso", but also any file with pairwise associations. The resulting file is named "filename_comp_xxx_0", where xxx is the number of units. Watch this number: if it is too high (i.e. > 900), the graph will be difficult to separate (desarticulation), so if you get a high number in the filename, just run the program again with higher thresholds.
    • c- before launching the next program, first import "doc_year_tbl" from MySQL via step (2) and select the corresponding file in master.pl's home.
    • d- 102_vertex_period.pl: associates a time period with each node in the initial association file (e.g. hypergraph). It needs as input this association file (e.g. hypergraph) and another file giving, for each document, the year in which it appeared (doc_year_tbl).
    • e- gdl_builder.pl: generates from any association file (e.g. hypergraph_asso) the AiSee map by applying the CPCL clustering module. The output consists of three files: filename.gdl, filename_CPCL_graph (graph of clusters) and filename_clusters (contents of each cluster).
    • f- gdl_vertex_coloring: takes as input the output of gdl_builder (a GDL file) and the file which contains the color of each time slice (e.g. hypergraph_period), and colors the nodes of the output map; specify the two filenames separated by a blank.
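The exact association coefficient computed by 100_asso_val_graph is set by the interface; purely as an illustration, the sketch below (in R, with assumed file names "hypergraph_tbl.csv" and "hypergraph.asso" and an arbitrary co-occurrence threshold) shows the kind of computation involved: counting, for each pair of units, the documents they share.

    # Illustrative sketch only: a simple co-occurrence based association between
    # classes sharing documents, from a two-column (class, doc) table.
    # The coefficient and thresholds actually used by 100_asso_val_graph may differ.
    hypergraph <- read.delim("hypergraph_tbl.csv", header = FALSE, sep = "\t",
                             col.names = c("class", "doc"), stringsAsFactors = FALSE)
    pairs <- merge(hypergraph, hypergraph, by = "doc")   # pairs of classes sharing a doc
    pairs <- subset(pairs, class.x < class.y)            # keep each pair once
    cooc  <- aggregate(doc ~ class.x + class.y, data = pairs, FUN = length)
    names(cooc)[3] <- "cooccurrences"
    edges <- subset(cooc, cooccurrences >= 2)            # assumed co-occurrence threshold
    write.table(edges[, c("class.x", "class.y")], "hypergraph.asso", sep = "\t",
                row.names = FALSE, col.names = FALSE, quote = FALSE)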

All graphs are stored as a list of edges (i.e. one pair of names per line, separated by a tab). Thus most of the previous and following programs can be combined.
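Because of this common edge-list format, any of these files can also be inspected outside the interface. For instance (a hedged sketch using the igraph package, which is not part of TermWatch), the connected components of such a file can be examined directly in R:

    # Sketch: load a tab-separated edge list and inspect its connected components
    # with the igraph package (for inspection only; igraph is not part of TermWatch).
    library(igraph)
    edges <- read.delim("hypergraph.asso", header = FALSE, sep = "\t",
                        col.names = c("from", "to"), stringsAsFactors = FALSE)
    g <- graph_from_data_frame(edges, directed = FALSE)
    vcount(g); ecount(g)          # number of nodes and edges
    table(components(g)$csize)    # sizes of the connected components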


SmallWorld programs: Computing indicators between nodes.


desart programs: Graph desarticulation

Any graph can be decomposed into atoms, which are maximal connected subgraphs without a complete separator. The following programs use Bangaly Kaba's C++ programs to compute these atoms. They then compute the graph of atoms and generate a GDL file to be visualized using AiSee.
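For intuition only, the special case where separators are single vertices (articulation points) can be reproduced in R with the igraph package: the biconnected components below play the role of atoms for one-vertex separators, while Kaba's programs handle complete separators of any size. The small graph is invented for illustration.

    # Intuition sketch: atoms with respect to single-vertex separators.
    # Two triangles sharing the vertex "c"; "c" is the articulation point and
    # the two triangles are the resulting "atoms" (biconnected components).
    library(igraph)
    g <- graph_from_data_frame(
      data.frame(from = c("a", "b", "c", "c", "d", "e"),
                 to   = c("b", "c", "a", "d", "e", "c")),
      directed = FALSE)
    articulation_points(g)                                 # -> c
    lapply(biconnected_components(g)$components, as_ids)   # -> {a,b,c} and {c,d,e}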

  1. desart.pl: needs as input a graph file with connected components (i.e., its input should be an output of the "101_connect.pl" program)
    N.B. this file should not have more than 1000 nodes, because the program's complexity is quadratic. It produces three files: the central atom (filename_BA: the most central non-decomposable component), all the atoms (filename_total: reunites the central and peripheral atoms), and all peripheral atoms (filename_dstd). In filename_total and filename_dstd, nodes represent atoms. These files are now in CSV format.
  2. To visualize the three outputs, you need to apply "gdlbuilder.sh":
    • a- first apply "connect.pl" to "filename_dstd" in order to identify the connected components of the peripheral atoms,
    • b- generate the GDL output of each file,
    • c- color the GDL files using "gdl_vertex_coloring"; you need to specify the input file and the year_colour file it takes the color scheme from, e.g. "filename_BA.gdl hypergraph_period". The result will be of the form "filename_BA_period.gdl".

Computing association rules

The following programs use the R software and its arules package.
  1. DM1_item_matrix converts a list of transactions given as a tab-separated two-column table, where the first column contains the transaction IDs and the second one the items (see the R sketch after this list).
    It requires 4 parameters:
    • 1. input file
    • 2. output file
    • 3. minimal item frequency
    • 4. minimal number of items in a transaction
  2. DM2_arules.R is an R program based on the arules package that computes the association rules. By default it works on the file item_matrix.csv, assuming that this is the output of the previous program; this can be changed using the parameter box. The minimal rule confidence (80% by default) and support (0.001 by default) can also be changed using this parameter box (also illustrated in the sketch below).
  3. DM3_closed_sets computes the closed sets based on the previously computed rules. A set of items is said to be closed if, whenever one of these rules applies to it, it produces an item already in the set. A closed set is not necessarily a frequent itemset, since the rules are applied transitively and are approximate (95% confidence by default).
  4. DM4_test_closeness tests whether a clustering with possible overlaps respects a set of rules in the output format of the arules WRITE procedure. It gives a measure of cluster coherence given a set of rules.
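The DM1 and DM2 programs are run from the interface; the following is only a hedged sketch of the equivalent computation done directly in R with the arules package, with an assumed input file name ("transactions.csv"), illustrative filtering thresholds, and the default confidence and support quoted above.

    # Sketch of the DM1/DM2 steps done directly in R with the arules package.
    # "transactions.csv" and the filtering thresholds are illustrative assumptions.
    library(arules)
    # DM1-like step: read a tab-separated (transaction id, item) table
    trans <- read.transactions("transactions.csv", format = "single",
                               sep = "\t", cols = c(1, 2))
    trans <- trans[, itemFrequency(trans, type = "absolute") >= 2]  # minimal item frequency
    trans <- trans[size(trans) >= 2, ]                              # minimal items per transaction
    # DM2-like step: mine association rules with the documented defaults
    rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.8))
    inspect(head(sort(rules, by = "confidence"), 10))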