AQUARIUM TESTBED

Removing Outliers (using the percentile-based approach)

The collected signal strength measurements contain outlier values that may affect the result of our algorithm, because outliers alter the statistical information representing the set of collected signal strength values. If those values are removed, the values most likely to be typical for this AP and the specific cell will remain, and the comparison between training and run-time values will be more accurate and trustworthy, because values created by external, unpredicted factors (outliers) are removed from the signal.
To validate the original hypothesis we used two different methods of removing outliers: first, the quantitative approach to outliers, and second, the boxplot method. We chose these two methods because they rely on different statistical data and both are commonly used in the literature for outlier removal.

Quantitative approach to outliers : considers as outliers the values not contained in the interval [mean - 2*std, mean + 2*std], where mean is the mean value and std the standard deviation of the vector of signal strength values.

Boxplot : considers as outliers the values not belonging to the interval [x25 - 1.5*IQR, x75 + 1.5*IQR], where x25 and x75 are the 25th and 75th percentiles (the first and third quartiles), and IQR = x75 - x25.
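The two outlier-removal rules above can be sketched in a few lines of NumPy (a sketch, not the project's actual code; the function names and parameters are ours). The IQR multiplier is exposed as a parameter, since the same rule with a larger multiplier removes only the extreme outliers:

```python
import numpy as np

def remove_outliers_std(values, m=2.0):
    """Quantitative approach: drop values outside [mean - m*std, mean + m*std]."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.mean() - m * v.std(), v.mean() + m * v.std()
    return v[(v >= lo) & (v <= hi)]

def remove_outliers_boxplot(values, k=1.5):
    """Boxplot method: drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 removes mild outliers; k=3.0 removes only the extreme ones.
    """
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return v[(v >= q1 - k * iqr) & (v <= q3 + k * iqr)]
```

Either function is applied per cell and per AP to the vector of collected signal strength values before the statistics are computed.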

The maximum location error is decreased by 2.5 m, whereas the median value of all three methods is 2 m. The percentile-based method using the boxplot is the most accurate one as far as the higher location errors are concerned.
In most cases the boxplot method creates tighter intervals, but the size of the interval containing the statistical information (percentiles) used for each cell and each AP depends on the particular vector of values. This is because the two outlier-removal approaches use different statistical data, both of which depend on the values of each vector: the boxplot uses the first and third quartiles to compute the interval of values containing the percentiles used in the percentile-based method, whereas the quantitative approach uses the mean and the standard deviation.

The boxplot method we have used so far removes mild outliers. Another idea is to remove only the extreme outliers, i.e. to consider as outliers the values not belonging to the interval [x25 - 3*IQR, x75 + 3*IQR]. As shown in the following Figure, with only the extreme outliers removed the median location error decreases to 1.8 m.
To make these results easier to understand, let us focus on one position and assume that the user's position is at the center of a circle.

Empirical removal of outliers : Another, more empirical way to remove outliers is to drop the first and last deciles and keep only the 8 remaining deciles when computing the position of the user. The results of this method do not improve the location error at all, because important information is lost when these two deciles are removed from every cell.
The methods mentioned above remove selected SS values only from the cells where outlier values actually occur; this method, by contrast, removes values from all cells without taking the actual values into consideration.
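The empirical decile trimming can be sketched as follows (a sketch under our own naming; the original implementation is not shown in these notes):

```python
import numpy as np

def trim_deciles(values):
    """Empirical trim: drop the bottom and top deciles, keep the middle 80%.

    Unlike the boxplot or mean/std rules, this removes a fixed fraction of
    the data from every cell regardless of the actual values.
    """
    v = np.sort(np.asarray(values, dtype=float))
    d10, d90 = np.percentile(v, [10, 90])
    return v[(v >= d10) & (v <= d90)]
```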

Changing the Benchmark

Because of the nature of the testbed, we decided to change the benchmark from the location error in meters to whether the correct VITR is found.

                                                                  Percentile-based   Boxplot (mild outliers)
Probability of visual contact with the right VITR, no matter
the room                                                                93%                  89%
Probability of visual contact with the right VITR in the
right room                                                              86%                  82%
Probability of visual contact with the right VITR in a
different room                                                          6.8%                 6.8%
Probability of being found in the wrong room and in front
of the wrong VITR                                                       6.8%                10.3%

This happens because, despite the reduction in the total location error with the boxplot, only the high location errors (above 5-6 m) in some cells are reduced; at the same time, other cells that did not have a high location error in the first place change as well. Moreover, in some cells at the end of a room the location error has decreased, but the estimated position of the user has moved from a far-away cell of the same room to a much closer cell in the neighboring room. This is why we observed a much lower probability of being in the right room according to CLS.

The samples collected in area A do not represent the aquarium testbed in the best way, because the cells we have used so far are in areas with quite large VITRs. The results from area B will be online soon; only then will we be able to tell whether our system has high accuracy.

Finally, we believe that if we use another characteristic with a smaller range (e.g. RFID tags) along with the existing system, the accuracy will improve considerably, because an RFID tag would restrict the search of our system to a single room instead of the whole aquarium or any other testbed.

 

Statistical Processing of the Signal

Positioning is based on the methodology described in the MSWiM07 paper. The procedure consists of two main steps: an offline (training) phase and an online (run-time) phase.

The disadvantage of the above approach is that the original time series are not normalized in the same way, since different training cells may have different APs with a constant signal strength value.

This motivates the use of a Joint-Clustering approach as a possible way to overcome this problem, and also to reduce the computational complexity. 

Joint-Clustering procedure : This approach also consists of the same two steps, namely the offline and the online phase. The difference from the previous method is that before applying the SSA we estimate a joint probability distribution for each training cell and cluster the training cells.

At the end of this process, each training cell is associated with a feature vector consisting, in general, of several statistical models.

The next step consists of clustering the training cells using the q strongest APs (q can be less than or equal to k). That is, two training cells belong to the same cluster if they have the same set of q strongest APs.
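The clustering rule above can be sketched as a simple grouping by the set of q strongest APs (a sketch only; the data layout, a dict mapping each training cell to its mean signal strength per AP, is our assumption):

```python
from collections import defaultdict

def cluster_cells(cell_ss, q):
    """Group training cells that share the same set of q strongest APs.

    cell_ss: dict mapping cell id -> {ap id: signal strength} (hypothetical
    layout for illustration). Returns dict: frozenset of APs -> list of cells.
    """
    clusters = defaultdict(list)
    for cell, ap_ss in cell_ss.items():
        # Take the q APs with the highest signal strength in this cell;
        # the *set* (not the order) defines the cluster key.
        strongest = frozenset(sorted(ap_ss, key=ap_ss.get, reverse=True)[:q])
        clusters[strongest].append(cell)
    return clusters
```

At run time, the online phase then only needs to compare against the cells of the matching cluster, which is what reduces the computational complexity.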

This approach offers higher flexibility, since different APs can be modeled with different statistical models.

Signal Strength Distribution :

We collect a number of signal strength values for each AP (60 values per AP). The resulting time series of 480 samples cannot be fitted by a theoretical distribution directly, so it is transformed using the Box-Cox method.
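The transformation step can be sketched with SciPy's Box-Cox implementation (a sketch on synthetic data; Box-Cox requires strictly positive input, so the shift applied to the dBm-scale values below is our assumption, not necessarily what the original analysis did):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 480-sample signal strength time series (dBm).
rssi = np.random.default_rng(0).normal(-65.0, 5.0, 480)

# Box-Cox needs strictly positive data, so shift the dBm values first
# (the +1.0 offset is an illustrative choice).
shifted = rssi - rssi.min() + 1.0

# boxcox returns the transformed series and the fitted lambda parameter.
transformed, lam = stats.boxcox(shifted)
```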

In deriving the distributions that best fit our data (the "normalized" time series), we repeatedly make use of formal and visual statistical analysis methods and tools, such as quantile plots with simulation envelopes for specific cells.

[Figures: quantile plots with simulation envelopes for one cell]

As we can see in the figures above, the original data for one cell, shown in blue, lie best within the natural variability of the exponential model (maximum likelihood estimate of the location parameter: 0.6020), since they remain within the cyan simulation envelope.

However, using the chi-square test to find the distribution that best fits our data, we obtained the following results:

Distribution     Chi-square error
Normal           1439.875
Lognormal        4521.5
Exponential      1352.375
Gamma            2161.125
Rayleigh         1332.625
Weibull          2691.5
Extreme value    8520

As we can see, the chi-square error for the Rayleigh distribution is 1332.625 and for the exponential distribution 1352.375.

So we tested, for the same cell, whether any distribution hypothesis at all is validated by the chi-square test, in order to confirm our initial intuition. The result was that no distribution hypothesis was validated.

The chi-square test returns two values, A and B: A is the computed chi-square statistic, and B is the critical tabulated value at the corresponding degrees of freedom. In general, if A is less than B, the H0 hypothesis that DATA follows the DIST distribution is accepted. In our case, for this cell we have:

Distribution     Critical value (at d.o.f.)   Chi-square error
Normal           40.113                       1439.875
Lognormal        40.113                       4521.5
Exponential      41.337                       1352.375
Gamma            40.113                       2161.125
Rayleigh         41.337                       1332.625
Weibull          40.113                       2691.5
Extreme value    40.113                       8520
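The acceptance rule (accept H0 only when the statistic A is below the critical value B) can be checked directly on the tabulated values for this cell:

```python
# Chi-square statistic and critical tabulated value per candidate
# distribution, as measured for this cell.
results = {
    "Normal":        (1439.875, 40.113),
    "Lognormal":     (4521.5,   40.113),
    "Exponential":   (1352.375, 41.337),
    "Gamma":         (2161.125, 40.113),
    "Rayleigh":      (1332.625, 41.337),
    "Weibull":       (2691.5,   40.113),
    "Extreme value": (8520.0,   40.113),
}

# H0 is accepted only if the statistic falls below the critical value.
accepted = [dist for dist, (stat, crit) in results.items() if stat < crit]
print(accepted)  # -> [] : no distribution hypothesis is validated
```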

 

It is clear that none of them validates the hypothesis. The exponential distribution, which looks best according to the quantile plot with simulation envelope, has a chi-square error about 20 units higher than the Rayleigh distribution.

Edited on 11/03/08 by Sofia Nikitaki

Joint area A and area B (area A+B) results:

Removing Outliers (using the percentile-based approach)

As described above for the implementation of this algorithm in area A, we obtained the following results for the decile-based implementation, having removed the outlier values from the signal strength using the boxplot method.


Removing outliers and using the decile-based algorithm, the median location error is 2 m.

By removing the outliers from the signal strength measurements we reduce the maximum location error by 2.5 m. The median value of both methods is 2 m.

Using the boxplot method we remove information from the signal strength. Moreover, by removing this information, cells that are close to an AP come to look much more like cells farther away from that AP.

Last edited on 24/03/08 by Sofia Nikitaki

We tested the CLS and Ekahau systems at the same time and at the same positions on the 3rd of June, at about 14.00-14.30, with approximately 40 people in the testbed, which is considered a Normal Condition. In each zone we tested the systems in 3 different positions. The accuracy of CLS is 88.8%, whereas Ekahau has 76% accuracy. In zones 11-15, all wrong results that Ekahau gave were neighboring zones, whereas in the case of CLS, if the error was not a neighboring zone it was an outlier.

Zone   CLS #Correct   CLS #Wrong   Ekahau #Correct   Ekahau #Wrong
0      3              0            3                 0
1      3              0            2                 1
2      3              0            3                 0
3      2              1            2                 1
4      3              0            3                 0
5      2              1            3                 0
6      3              0            2                 1
7      3              0            3                 0
8      3              0            2                 1
9      3              0            2                 1
10     3              0            2                 1
11     3              0            3                 0
12     2              1            0                 3
13     2              1            2                 1
14     1              2            2                 1
15     3              0            1                 2
16     3              0            3                 0
17     3              0            3                 0
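The reported accuracies can be recomputed from the per-zone correct counts (18 zones, 3 test positions each); the results match 88.8% and 76% up to rounding:

```python
# Per-zone correct counts for each system, zones 0 through 17.
cls_correct    = [3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 3, 3, 3]
ekahau_correct = [3, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 3, 0, 2, 2, 1, 3, 3]

tests = 3 * len(cls_correct)              # 18 zones x 3 positions = 54 tests
cls_acc = sum(cls_correct) / tests        # 48/54 = 0.888...
ekahau_acc = sum(ekahau_correct) / tests  # 41/54 = 0.759...
print(f"CLS: {cls_acc:.1%}, Ekahau: {ekahau_acc:.1%}")
```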

 

Last updated on 3/4/2008 by Sofia Nikitaki