Landmark optimization
From TNT
Contents |
Overview
The method implemented in TNT seeks those ancestral landmark configurations which imply that the displacement between ancestral and descendant individual landmarks is minimum. The changes are measured as distances in the real space without any kind of decomposition or “transformation” of the data. The input data are the aligned coordinates for each landmark and the results are the coordinates of the landmarks in the ancestors. The method can be used both, to establish ancestral landmark configurations or to choose among trees (searches). While this method is a direct way to establish ancestral landmark configurations, in order to use this data to choose among trees the partial dependence among landmarks should be considered (see below). A deeper discussion of the bases and logics of the method is presented in Catalano et al. 2010. A complete explanation of the algorithms implemented in TNT to optimize landmark data is presented in Goloboff & Catalano 2011.
Input format
The format for landmark data is as follow:
xread 2 4 & [landmark 2d] Sp0 x,y x,y x,y Sp1 x,y x,y x,y Sp2 x,y x,y x,y Sp3 x,y x,y x,y & [landmark 3d] Sp0 x,y,z x,y,z x,y,z Sp1 x,y,z x,y,z x,y,z Sp2 x,y,z x,y,z x,y,z Sp3 x,y,z x,y,z x,y,z ;
The first number after the xread command is the number of landmark configurations NOT the number of landmark (if the matrix contains landmark data and other kind of characters, this number represents the sum of landmark configurations plus the number of other characters). TNT determines automatically the number of landmarks. The information of each landmark configuration is preceded by a line - & [landmark 2d]- that indicates whether the landmarks are in 2d or 3d. The data format is similar to regular characters except that the duplets or triplets of coordinates for each landmark are given separated by a comma and the data of different landmarks is separate by a space. Do not include a “ ; “ between blocks (landmark configurations in this case).
The general command to deal with landmarks is lmark.
Visualizing ancestral landmark configurations (Windows version only)
The option to visualize the ancestral landmark configurations is lmark map N/C N is the number of tree and C is the corresponding landmark configuration (remember that TNT starts counting from 0). When the configurations are shown, pressing “H” opens a helping window. The options in the visualization mode are:
A= Show/hide ancestral configurations T= Show/hide configurations of the terminals N= Display/hide all the configurations P= Show/hide a line representing the change in the position between each node and its ancestor D= Show/hide a line representing the change in the position between each node and its descendants +/-= = Increase the size of the boxes G= Display/hide the grid F= Display/hide the frame of each configuration C= Display/hide a connector between each landmark and the base of the box (only 3d) H= Help F1-F2= and F11-F12= Rotate and tilt the view of the configuration (3d only) F3-F4= make the tree narrower/wider F5-F6= make the tree shorter/longer
When displaying the ancestral landmarks configurations, TNT draws a line connecting the points following the order of the landmarks in the matrix. Alternatively, the user can define different lines using the following format:
lmark connect 0:1 0:6 0:1 0:7 1:2 1:8 ;
This means:
Connect landmark 1 of Configuration 0 with landmark 6 of Configuration 0;
Connect landmark 1 of Configuration 0 with landmark 7 of Configuration 0;
Connect landmark 1 of Configuration 1 with landmark 8 of Configuration 1;
Optimization algorithms
TNT implements two main procedures to optimize landmark data that are used in a combined way. The first step consists of an approximation to the optimal ancestral positions based on analyzing a reduced set of the immense number of possible positions that can take the landmarks in each ancestor. This algorithm is based on Sankoff optimization. The second step iteratively refines the positions derived from the first step.
First step
For each landmark, TNT superpose a grid on the area (or space if it is in 3d) occupied by the landmarks of all the terminals. Each observed point is assigned to one of these cells and the distance between the occupied cells is calculated. These distances are then incorporated into a cost matrix and subsequently used in a Sankoff optimization. As imagined, the results are generally improved significantly when the number of cells is increased. The number of cells per side is defined with lmark cell C (in 2d the total number of cells in the grid will be CxC and CxCxC in 3d). An alternative way to improve the score is to perform this step several times, but each time moving the grid a little bit. This produces that terminals that were assigned to a particular cell in a certain round can be assigned to a different one when a different grid is considered, producing possibly an improvement in the score. The option to set the number of different grids that TNT will use is lmark shake N. TNT implements another way to obtain better scores in this first step. The positions obtained in the first step will be quite near the optimal position. So, if a second optimization is performed sub-diving the optimal cells obtained in the first round plus a few neighbour cells, it is possible to improve the precision of the estimation in a much efficient way than if the number of cell is increased (see section TIME). This nested optimization can be repeated as many times as desired. Doing just one or two levels of nesting is generally enough to improve the score significantly. The option to change the nesting is lmark nest N W, where N is the level of nesting and W is the number of neighbouring apart from the optimal that will be included in each level of nesting.
Second step
Once approximate positions for all the points at the internal nodes have been obtained, they can be improved by visiting each node at a time and calculating the position of the landmark that minimizes the distance between the landmark in the corresponding node and the landmarks in the neighbour nodes (ancestor and descendants). Once all the nodes were visited, the process starts again until a certain number of iterations have been accomplished or the improved in the score between iterations is less than a threshold. The option to turn on the iterative improvement is lmark iter. To set the limit of iterations the option is lmark maxiters N C, where N is the maximum number of iterations to do and C is the threshold value previously described.
Using landmark data to choose among trees
NOTE!!! TNT cannot yet manage landmark data in the runs. This can only be done using scripts. Go here to download a script that allows to perform searches for small datasets (less than 25-30 taxa).
Using landmark data to choose among trees has the obvious problem of dependency among landmarks. Consider that you have you two landmark configurations, one of them represented by six landmarks and the other with twenty five. If data is analyzed as such and the trees are selected according the minimization of the sum of the scores of each landmark, the election of the best tree will be more influenced by the configuration with more landmarks. A quite easy way to try to alleviate this problem is to change the weights of the landmarks in such a way that the relative contribution of each landmark configuration to the total score is similar. TNT performs this by default by multiplying the score calculated for each landmark by a certain factor. The factors are calculated (see below) in such a way that the simultaneous displacement along the range in each of the landmarks is equivalent to one step in a discrete character. TNT also permits to consider different equivalences among different landmarks configurations and/or between landmark configurations and other characters. This can be easily done with the command ccode, the same that defines settings for custom characters.
A different, though related question is how the changes in position of the different landmarks of the same configuration are summed up. One of the possibilities is that two movements comprising the same absolute amount count the same independently of whether the changes represent a “big” change considering the total range of movements for each landmark or represents only a subtle change. Put in other words a large movement in one landmark will count more than a subtle change in a second landmark even if the movement in the second is the maximum possible for it. TNT allows the user to decide which of the treatments will be given to the landmarks. With the option lmark inscale (the default one), the landmarks that have larger absolute displacements will count more than those that have tiny movements. Alternatively, with the option lmark noinscale, changes in two different landmarks count the same if they represent the same relative movement with respect to the maximum possible in each case.
Execution time
By far the Sankoff step is most time consuming. The iterative step generally consumes a tiny fraction of time as the process generally produces a stable result in a few iterations. In Sankoff optimization, the time increases exponentially with the number of cells in the grid and near lineally with the levels of nesting and with the number of different grids considered (shaking). Hence, it is always preferable to use smaller numbers of cells and many levels of nesting. How many cells and levels of nesting are advisable will depend on the dataset and also on whether the goal is to establish ancestral landmark configurations or to use this data in tree searches.
RAM requirements
Landmark optimization requires larger amount of RAM memory than traditional characters. Very thorough analyses (high number of cells and several nesting levels) of medium size matrices can require about 1 gygabyte. However, in most of the cases the memory required is much lesser, in the frame of hundreds of megabytes. Importantly, lmark command has the option of using low-memory algorithms that, although slower, will allow saving in memory requirements. The option is lmark lowmem.
An example
peces.tnt is a matrix of landmark data in 2d extracted from MacLeod (2002) and originally produced by Naylor (1996). The dataset is very simple: it has only nine taxa and 56 landmarks. The file also contains the optimal tree.
First we will define the settings for the optimization, typing in the command line the following option:
lmark cell 20 nes 2 3 iter
The first option (cell 20) determines that the grid to be used in the sankoff step will have 20 divisions (400 cells in total). The second option (nes 2 3) indicates that a nested sankoff will be done with two levels of nesting considering a widows of three cells. Finally the option iter indicates that after the sankoff step, the resulting ancestral landmark positions will be refined using the Fermat iterations.
After defining the settings, we will visualize the ancestral landmark configuration for the
configuration (character in TNT terms) number 0 optimized onto the tree 0. This option also retrieves the tree score.
lmark map 0/0;
Here the green lines (dispayed pressing "d") represent the change in possition bwteen the landmarks in that node and lanmdarks of its corresponding descendants. The blue lines (dispayed pressing "p") represents the changes in possition between that node and its ancestor.
To print the coordinates for the ancestral landmark configurations we will first set TNT to do that:
lmark showhtus ;
After that, type lmark ; to see the htu's coordinates in the screen
If a log file is opened and the text buffer is saved there, these coordinates can be, after modifying the format, used to visualize the ancestral landmark configuration in any MG software.
We will now calulate the score of the the second landmark of the third configuration on tree 0:
lmark lscores 0/2/1 ;
Scripting commands to deal with landmark data
Path: Data/Character Settings

