- Timestamp:
- 07/18/16 12:26:03 (8 years ago)
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
trunk/CrossPare/src/de/ugoe/cs/cpdp/training/WekaLocalFQTraining.java
r99 r135 35 35 36 36 /** 37 * <p> 37 38 * Trainer with reimplementation of WHERE clustering algorithm from: Tim Menzies, Andrew Butcher, 38 39 * David Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann, 39 40 * "Local versus Global Lessons for Defect Prediction and Effort Estimation," IEEE Transactions on 40 41 * Software Engineering, vol. 39, no. 6, pp. 822-834, June, 2013 41 * 42 * With WekaLocalFQTraining we do the following: 1) Run the Fastmap algorithm on all training data, 43 * let it calculate the 2 most significant dimensions and projections of each instance to these 44 * dimensions 2) With these 2 dimensions we span a QuadTree which gets recursively split on 45 * median(x) and median(y) values. 3) We cluster the QuadTree nodes together if they have similar 46 * density (50%) 4) We save the clusters and their training data 5) We only use clusters with > 47 * ALPHA instances (currently Math.sqrt(SIZE)), rest is discarded with the training data of this 48 * cluster 6) We train a Weka classifier for each cluster with the clusters training data 7) We 49 * recalculate Fastmap distances for a single instance with the old pivots and then try to find a 50 * cluster containing the coords of the instance. 7.1.) If we can not find a cluster (due to coords 51 * outside of all clusters) we find the nearest cluster. 8) We classify the Instance with the 52 * classifier and traindata from the Cluster we found in 7. 42 * </p> 43 * <p> 44 * With WekaLocalFQTraining we do the following: 45 * <ol> 46 * <li>Run the Fastmap algorithm on all training data, let it calculate the 2 most significant 47 * dimensions and projections of each instance to these dimensions</li> 48 * <li>With these 2 dimensions we span a QuadTree which gets recursively split on median(x) and 49 * median(y) values.</li> 50 * <li>We cluster the QuadTree nodes together if they have similar density (50%)</li> 51 * <li>We save the clusters and their training data</li> 52 * <li>We only use clusters with > ALPHA instances (currently Math.sqrt(SIZE)), the rest is 53 * discarded with the training data of this cluster</li> 54 * <li>We train a Weka classifier for each cluster with the clusters training data</li> 55 * <li>We recalculate Fastmap distances for a single instance with the old pivots and then try to 56 * find a cluster containing the coords of the instance. If we can not find a cluster (due to coords 57 * outside of all clusters) we find the nearest cluster.</li> 58 * <li>We classify the Instance with the classifier and traindata from the Cluster we found in 7. 59 * </li> 60 * </p> 53 61 */ 54 62 public class WekaLocalFQTraining extends WekaBaseTraining implements ITrainingStrategy { 55 63 64 /** 65 * the classifier 66 */ 56 67 private final TraindatasetCluster classifier = new TraindatasetCluster(); 57 68 69 /* 70 * (non-Javadoc) 71 * 72 * @see de.ugoe.cs.cpdp.training.WekaBaseTraining#getClassifier() 73 */ 58 74 @Override 59 75 public Classifier getClassifier() { … … 61 77 } 62 78 79 /* 80 * (non-Javadoc) 81 * 82 * @see de.ugoe.cs.cpdp.training.ITrainingStrategy#apply(weka.core.Instances) 83 */ 63 84 @Override 64 85 public void apply(Instances traindata) { … … 71 92 } 72 93 94 /** 95 * <p> 96 * Weka classifier for the local model with WHERE clustering 97 * </p> 98 * 99 * @author Alexander Trautsch 100 */ 73 101 public class TraindatasetCluster extends AbstractClassifier { 74 102 103 /** 104 * default serialization ID 105 */ 75 106 private static final long serialVersionUID = 1L; 76 107 77 /* classifier per cluster */ 108 /** 109 * classifiers for each cluster 110 */ 78 111 private HashMap<Integer, Classifier> cclassifier; 79 112 80 /* instances per cluster */ 113 /** 114 * training data for each cluster 115 */ 81 116 private HashMap<Integer, Instances> ctraindata; 82 117 83 /* 118 /** 84 119 * holds the instances and indices of the pivot objects of the Fastmap calculation in 85 120 * buildClassifier … … 87 122 private HashMap<Integer, Instance> cpivots; 88 123 89 /* holds the indices of the pivot objects for x,y and the dimension [x,y][dimension] */ 124 /** 125 * holds the indices of the pivot objects for x,y and the dimension [x,y][dimension] 126 */ 90 127 private int[][] cpivotindices; 91 128 92 /* holds the sizes of the cluster multiple "boxes" per cluster */ 129 /** 130 * holds the sizes of the cluster multiple "boxes" per cluster 131 */ 93 132 private HashMap<Integer, ArrayList<Double[][]>> csize; 94 133 95 /* debug vars */ 134 /** 135 * debug variable 136 */ 96 137 @SuppressWarnings("unused") 97 138 private boolean show_biggest = true; 98 139 140 /** 141 * debug variable 142 */ 99 143 @SuppressWarnings("unused") 100 144 private int CFOUND = 0; 145 146 /** 147 * debug variable 148 */ 101 149 @SuppressWarnings("unused") 102 150 private int CNOTFOUND = 0; 103 151 152 /** 153 * <p> 154 * copies an instance such that is is compatible with the local model 155 * </p> 156 * 157 * @param instances 158 * instance format 159 * @param instance 160 * instance that is copied 161 * @return 162 */ 104 163 private Instance createInstance(Instances instances, Instance instance) { 105 164 // attributes for feeding instance to classifier … … 127 186 128 187 /** 188 * <p> 129 189 * Because Fastmap saves only the image not the values of the attributes it used we can not 130 190 * use the old data directly to classify single instances to clusters. 191 * </p> 192 * <p> 193 * To classify a single instance we do a new Fastmap computation with only the instance and 194 * the old pivot elements. 195 * </p> 196 * </p> 197 * After that we find the cluster with our Fastmap result for x and y. 198 * </p> 131 199 * 132 * To classify a single instance we do a new fastmap computation with only the instance and 133 * the old pivot elements. 134 * 135 * After that we find the cluster with our fastmap result for x and y. 200 * @param instance 201 * instance that is classified 202 * @see weka.classifiers.AbstractClassifier#classifyInstance(weka.core.Instance) 136 203 */ 137 204 @Override … … 169 236 double[][] distmat = new double[2 * FMAP.target_dims + 1][2 * FMAP.target_dims + 1]; 170 237 distmat[0][0] = 0; 171 distmat[0][1] = 172 dist.distance(clusterInstance, 173 this.cpivots.get((Integer) this.cpivotindices[0][0])); 174 distmat[0][2] = 175 dist.distance(clusterInstance, 176 this.cpivots.get((Integer) this.cpivotindices[1][0])); 177 distmat[0][3] = 178 dist.distance(clusterInstance, 179 this.cpivots.get((Integer) this.cpivotindices[0][1])); 180 distmat[0][4] = 181 dist.distance(clusterInstance, 182 this.cpivots.get((Integer) this.cpivotindices[1][1])); 183 184 distmat[1][0] = 185 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 186 clusterInstance); 238 distmat[0][1] = dist.distance(clusterInstance, 239 this.cpivots.get((Integer) this.cpivotindices[0][0])); 240 distmat[0][2] = dist.distance(clusterInstance, 241 this.cpivots.get((Integer) this.cpivotindices[1][0])); 242 distmat[0][3] = dist.distance(clusterInstance, 243 this.cpivots.get((Integer) this.cpivotindices[0][1])); 244 distmat[0][4] = dist.distance(clusterInstance, 245 this.cpivots.get((Integer) this.cpivotindices[1][1])); 246 247 distmat[1][0] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 248 clusterInstance); 187 249 distmat[1][1] = 0; 188 distmat[1][2] = 189 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 190 this.cpivots.get((Integer) this.cpivotindices[1][0])); 191 distmat[1][3] = 192 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 193 this.cpivots.get((Integer) this.cpivotindices[0][1])); 194 distmat[1][4] = 195 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 196 this.cpivots.get((Integer) this.cpivotindices[1][1])); 197 198 distmat[2][0] = 199 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 200 clusterInstance); 201 distmat[2][1] = 202 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 203 this.cpivots.get((Integer) this.cpivotindices[0][0])); 250 distmat[1][2] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 251 this.cpivots.get((Integer) this.cpivotindices[1][0])); 252 distmat[1][3] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 253 this.cpivots.get((Integer) this.cpivotindices[0][1])); 254 distmat[1][4] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][0]), 255 this.cpivots.get((Integer) this.cpivotindices[1][1])); 256 257 distmat[2][0] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 258 clusterInstance); 259 distmat[2][1] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 260 this.cpivots.get((Integer) this.cpivotindices[0][0])); 204 261 distmat[2][2] = 0; 205 distmat[2][3] = 206 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 207 this.cpivots.get((Integer) this.cpivotindices[0][1])); 208 distmat[2][4] = 209 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 210 this.cpivots.get((Integer) this.cpivotindices[1][1])); 211 212 distmat[3][0] = 213 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 214 clusterInstance); 215 distmat[3][1] = 216 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 217 this.cpivots.get((Integer) this.cpivotindices[0][0])); 218 distmat[3][2] = 219 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 220 this.cpivots.get((Integer) this.cpivotindices[1][0])); 262 distmat[2][3] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 263 this.cpivots.get((Integer) this.cpivotindices[0][1])); 264 distmat[2][4] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][0]), 265 this.cpivots.get((Integer) this.cpivotindices[1][1])); 266 267 distmat[3][0] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 268 clusterInstance); 269 distmat[3][1] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 270 this.cpivots.get((Integer) this.cpivotindices[0][0])); 271 distmat[3][2] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 272 this.cpivots.get((Integer) this.cpivotindices[1][0])); 221 273 distmat[3][3] = 0; 222 distmat[3][4] = 223 dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 224 this.cpivots.get((Integer) this.cpivotindices[1][1])); 225 226 distmat[4][0] = 227 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 228 clusterInstance); 229 distmat[4][1] = 230 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 231 this.cpivots.get((Integer) this.cpivotindices[0][0])); 232 distmat[4][2] = 233 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 234 this.cpivots.get((Integer) this.cpivotindices[1][0])); 235 distmat[4][3] = 236 dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 237 this.cpivots.get((Integer) this.cpivotindices[0][1])); 274 distmat[3][4] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[0][1]), 275 this.cpivots.get((Integer) this.cpivotindices[1][1])); 276 277 distmat[4][0] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 278 clusterInstance); 279 distmat[4][1] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 280 this.cpivots.get((Integer) this.cpivotindices[0][0])); 281 distmat[4][2] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 282 this.cpivots.get((Integer) this.cpivotindices[1][0])); 283 distmat[4][3] = dist.distance(this.cpivots.get((Integer) this.cpivotindices[1][1]), 284 this.cpivots.get((Integer) this.cpivotindices[0][1])); 238 285 distmat[4][4] = 0; 239 286 … … 243 290 * distmat[0].length; j++) { if(biggest < distmat[i][j]) { biggest = distmat[i][j]; 244 291 * } } } if(this.show_biggest) { Console.traceln(Level.INFO, 245 * String.format(""+clusterInstance)); Console.traceln(Level.INFO, 246 * String.format("biggest distances: "+ biggest)); this.show_biggest = false; }292 * String.format(""+clusterInstance)); Console.traceln(Level.INFO, String.format( 293 * "biggest distances: "+ biggest)); this.show_biggest = false; } 247 294 */ 248 295 … … 316 363 cnumber = clusternumber.next(); 317 364 for (int i = 0; i < ctraindata.get(cnumber).size(); i++) { 318 if (dist.distance(instance, ctraindata.get(cnumber).get(i)) <= min_distance) 365 if (dist.distance(instance, 366 ctraindata.get(cnumber).get(i)) <= min_distance) 319 367 { 320 368 found_cnumber = cnumber; … … 347 395 } 348 396 397 /* 398 * (non-Javadoc) 399 * 400 * @see weka.classifiers.Classifier#buildClassifier(weka.core.Instances) 401 */ 349 402 @Override 350 403 public void buildClassifier(Instances traindata) throws Exception { … … 421 474 422 475 // Console.traceln(Level.INFO, 423 // String.format("size for cluster ("+small[0]+","+small[1]+") - ("+big[0]+","+big[1]+")")); 476 // String.format("size for cluster ("+small[0]+","+small[1]+") - 477 // ("+big[0]+","+big[1]+")")); 424 478 425 479 // 5. generate quadtree … … 439 493 440 494 // recursive split und grid clustering eher static 441 TREE.recursiveSplit(TREE);495 QuadTree.recursiveSplit(TREE); 442 496 443 497 // generate list of nodes sorted by density (childs only) … … 465 519 } 466 520 else { 467 Console.traceln(Level.INFO, 468 String.format("drop cluster, only: " + current.size() + 469 " instances")); 521 Console.traceln(Level.INFO, String 522 .format("drop cluster, only: " + current.size() + " instances")); 470 523 } 471 524 } … … 505 558 // traindata_count += ctraindata.get(cnumber).size(); 506 559 // Console.traceln(Level.INFO, 507 // String.format("building classifier in cluster "+cnumber +" 560 // String.format("building classifier in cluster "+cnumber +" with "+ 508 561 // ctraindata.get(cnumber).size() +" traindata instances")); 509 562 } … … 516 569 517 570 /** 518 * Payload for the QuadTree. x and y are the calculated Fastmap values. T is a weka instance. 571 * <p> 572 * Payload for the QuadTree. x and y are the calculated Fastmap values. T is a Weka instance. 573 * </p> 574 * 575 * @author Alexander Trautsch 519 576 */ 520 577 public class QuadTreePayload<T> { 521 578 522 public double x; 523 public double y; 579 /** 580 * x-value 581 */ 582 public final double x; 583 584 /** 585 * y-value 586 */ 587 public final double y; 588 589 /** 590 * associated instance 591 */ 524 592 private T inst; 525 593 594 /** 595 * <p> 596 * Constructor. Creates the payload. 597 * </p> 598 * 599 * @param x 600 * x-value 601 * @param y 602 * y-value 603 * @param value 604 * associated instace 605 */ 526 606 public QuadTreePayload(double x, double y, T value) { 527 607 this.x = x; … … 530 610 } 531 611 612 /** 613 * <p> 614 * returns the instance 615 * </p> 616 * 617 * @return 618 */ 532 619 public T getInst() { 533 620 return this.inst; … … 536 623 537 624 /** 538 * Fastmap implementation539 * 540 * Faloutsos, C., & Lin, K. I. (1995). FastMap: A fast algorithm for indexing, data-mining and625 * <p> 626 * Fastmap implementation after:<br> 627 * * Faloutsos, C., & Lin, K. I. (1995). FastMap: A fast algorithm for indexing, data-mining and 541 628 * visualization of traditional and multimedia datasets (Vol. 24, No. 2, pp. 163-174). ACM. 629 * </p> 542 630 */ 543 631 public class Fastmap { 544 632 545 /* N x k Array, at the end, the i-th row will be the image of the i-th object */ 633 /** 634 * N x k Array, at the end, the i-th row will be the image of the i-th object 635 */ 546 636 private double[][] X; 547 637 548 /* 2 x k pivot Array one pair per recursive call */ 638 /** 639 * 2 x k pivot Array one pair per recursive call 640 */ 549 641 private int[][] PA; 550 642 551 /* Objects we got (distance matrix) */ 643 /** 644 * Objects we got (distance matrix) 645 */ 552 646 private double[][] O; 553 647 554 /* column of X currently updated (also the dimension) */ 648 /** 649 * column of X currently updated (also the dimension) 650 */ 555 651 private int col = 0; 556 652 557 /* number of dimensions we want */ 653 /** 654 * number of dimensions we want 655 */ 558 656 private int target_dims = 0; 559 657 560 // if we already have the pivot elements 658 /** 659 * if we already have the pivot elements 660 */ 561 661 private boolean pivot_set = false; 562 662 663 /** 664 * <p> 665 * Constructor. Creates a new Fastmap object. 666 * </p> 667 * 668 * @param k 669 */ 563 670 public Fastmap(int k) { 564 671 this.target_dims = k; … … 566 673 567 674 /** 568 * Sets the distance matrix and params that depend on this 675 * <p> 676 * Sets the distance matrix and params that depend on this. 677 * </p> 569 678 * 570 679 * @param O 680 * distance matrix 571 681 */ 572 682 public void setDistmat(double[][] O) { … … 578 688 579 689 /** 690 * <p> 580 691 * Set pivot elements, we need that to classify instances after the calculation is complete 581 692 * (because we then want to reuse only the pivot elements). 693 * </p> 582 694 * 583 695 * @param pi 696 * the pivots 584 697 */ 585 698 public void setPivots(int[][] pi) { … … 589 702 590 703 /** 704 * <p> 591 705 * Return the pivot elements that were chosen during the calculation 706 * </p> 592 707 * 593 * @return 708 * @return the pivots 594 709 */ 595 710 public int[][] getPivots() { … … 598 713 599 714 /** 600 * The distance function for euclidean distance 601 * 602 * Acts according to equation 4 of the fastmap paper 715 * <p> 716 * The distance function for euclidean distance. Acts according to equation 4 of the Fastmap 717 * paper. 718 * </p> 603 719 * 604 720 * @param x … … 606 722 * @param y 607 723 * y index of y image (if k==0 y object) 608 * @param kdimensionality 609 * @return distance 724 * @param k 725 * dimensionality 726 * @return the distance 610 727 */ 611 728 private double dist(int x, int y, int k) { … … 624 741 625 742 /** 626 * Find the object farthest from the given index This method is a helper Method for 627 * findDistandObjects 743 * <p> 744 * Find the object farthest from the given index. This method is a helper Method for 745 * findDistandObjects. 746 * </p> 628 747 * 629 748 * @param index … … 646 765 647 766 /** 648 * Finds the pivot objects 767 * <p> 768 * Finds the pivot objects. This method is basically algorithm 1 of the Fastmap paper. 769 * </p> 649 770 * 650 * This method is basically algorithm 1 of the fastmap paper. 651 * 652 * @return 2 indexes of the choosen pivot objects 771 * @return 2 indexes of the chosen pivot objects 653 772 */ 654 773 private int[] findDistantObjects() { … … 668 787 669 788 /** 670 * Calculates the new k-vector values (projections) 671 * 672 * This is basically algorithm 2 of the fastmap paper. We just added the possibility to 673 * pre-set the pivot elements because we need to classify single instances after the 674 * computation is already done. 675 * 676 * @param dims 677 * dimensionality 789 * <p> 790 * Calculates the new k-vector values (projections) This is basically algorithm 2 of the 791 * fastmap paper. We just added the possibility to pre-set the pivot elements because we 792 * need to classify single instances after the computation is already done. 793 * </p> 678 794 */ 679 795 public void calculate() { … … 713 829 714 830 /** 831 * <p> 715 832 * returns the result matrix of the projections 833 * </p> 716 834 * 717 835 * @return calculated result
Note: See TracChangeset
for help on using the changeset viewer.