Removing Outliers (using the percentile-based approach)
The collected signal strength values contain outliers that can affect the result of our algorithm, because they distort the statistics that represent the set of collected signal strength values. If those values are removed, the values most likely to be typical for a given AP and cell remain, and the comparison between training and run-time values becomes more accurate and trustworthy, because values created by external, unpredictable factors (outliers) are removed from the signal.
In order to validate the original hypothesis we used two different methods of removing outliers: first, the quantitative approach to outliers, and second, the boxplot method. We chose these two methods because they rely on different statistical data and both are commonly used in the literature for outlier removal.
Quantitative Approach to Outliers : considers as outliers the values outside the interval [mean - 2*std, mean + 2*std], where mean is the mean value and std the standard deviation of the vector of signal strength values.
Boxplot : considers as outliers the values outside the interval [x25 - 1.5*IQR, x75 + 1.5*IQR], where x25 and x75 are the 25th and 75th percentiles (i.e., the first and third quartiles), and IQR = x75 - x25.
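A minimal sketch of the two rules in Python (the helper names are ours, and numpy's default percentile interpolation is assumed):

```python
import numpy as np

def remove_outliers_quantitative(ss, n_std=2.0):
    """Quantitative approach: keep values inside [mean - n_std*std, mean + n_std*std]."""
    mean, std = ss.mean(), ss.std()
    return ss[(ss >= mean - n_std * std) & (ss <= mean + n_std * std)]

def remove_outliers_boxplot(ss, k=1.5):
    """Boxplot rule: keep values inside [Q1 - k*IQR, Q3 + k*IQR].
    k = 1.5 removes mild outliers; k = 3.0 removes only extreme outliers."""
    q1, q3 = np.percentile(ss, [25, 75])
    iqr = q3 - q1
    return ss[(ss >= q1 - k * iqr) & (ss <= q3 + k * iqr)]

# Example: RSS vector (dBm) for one AP at one cell, with two spikes.
rss = np.array([-62.0, -61, -63, -60, -62, -61, -90, -62, -63, -45])
print(remove_outliers_quantitative(rss))    # the spikes inflate the std, so only -90 is removed
print(remove_outliers_boxplot(rss, k=1.5))  # the IQR is robust: both -90 and -45 are removed
```

The example also illustrates why the two rules can disagree: the mean and standard deviation are themselves distorted by the outliers, while the quartiles are not.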

The maximum location error decreases by 2.5 m, while the median location error of all three methods is 2 m. The percentile-based method using the boxplot is the most accurate one as far as the higher location errors are concerned.
In most cases the boxplot method produces tighter intervals, but the size of the interval containing the statistical information (percentiles) used for each cell and each AP depends on the particular vector of values. This is because the two outlier-removal approaches use different statistics, both of which depend on the values of each vector: the boxplot uses the first and third quartiles to compute the interval of values that contains the percentiles used in the percentile-based method, whereas the quantitative approach uses the mean and the standard deviation.
The boxplot method we have used so far removes mild outliers. Another idea is to remove only the extreme outliers, i.e., to consider as outliers the values outside the interval [x25 - 3*IQR, x75 + 3*IQR] (k = 3 instead of 1.5 in the sketch above). As shown in the following figure, the median location error after removing only the extreme outliers decreases to 1.8 m.
To make these results easier to understand, let us focus on one position. We assume that the user's position is at the center of a circle.
- If we remove the extreme outliers for the cells inside the circle of radius 1.2 m, we get the same results as if we had removed the mild outliers. Consequently, by removing the higher SS values (a higher SS means that the user is closer to the AP), cells that are close to the AP come to look much more like cells farther out in the circle.
- The SS of cells in the ring between the circle of radius 1.2 m and the circle of radius 2.2 m have similar outliers. This means that by eliminating the extreme values we keep only the information that differentiates the cells from one another.
- Finally, for cells at a greater distance than the circle of radius 2.2 m, outliers have a negative impact, and our goal should be to remove more of them and make the interval tighter.

Empirical removal of outliers : Another, more empirical way to remove outliers is to discard the first and last deciles and keep only the 8 remaining deciles when computing the position of the user. It is obvious that the results of this method do not improve the location error at all, because removing these two deciles from all cells discards important information. The methods mentioned above remove selected SS values only from the cells where outlier values actually occur; this method, on the other hand, removes information (values) from all cells without taking the actual values into consideration (see the sketch below).
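A sketch of this decile-based trimming (helper name ours); unlike the boxplot rule, it discards a fixed 20% of every cell's samples whether or not outliers are present:

```python
import numpy as np

def trim_deciles(ss):
    """Keep only the central 8 deciles: discard every value below the 10th
    or above the 90th percentile of the cell's RSS vector."""
    lo, hi = np.percentile(ss, [10, 90])
    return ss[(ss >= lo) & (ss <= hi)]
```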

Changing the Benchmark
Because of the nature of the testbed, we decided to change the benchmark from the location error in meters to whether the correct VITR is found.
Metric | Percentile-based | With boxplot (mild outliers)
Probability of visual contact with the right VITR, no matter the room | 93% | 89%
Probability of visual contact with the right VITR, in the right room | 86% | 82%
Probability of visual contact with the right VITR, in a different room | 6.8% | 6.8%
Probability of being found in the wrong room and in front of the wrong VITR | 6.8% | 10.3%
This happens because, despite the reduction in the total location error with the use of the boxplot, the truth is that only the high location errors (higher than 5-6 m) in some cells are reduced, while other cells that did not have a high location error in the first place now change. Moreover, the location error in some cells at the end of a room has decreased, but the estimated position of the user has moved from a far-away cell of the same room to a much closer cell in the neighboring room. This is the reason why we noticed a much lower probability of being in the right room according to CLS.
The samples collected in area A do not represent the aquarium testbed in the best way, because the cells we have used so far are in areas with quite large VITRs. The results from area B will be available online soon; only then will we be able to tell whether our system achieves high accuracy.
Finally, we believe that if we use another characteristic with a smaller range (e.g., RFID tags) along with the existing system, the accuracy will improve considerably. This is because the RFID tag would restrict the search of our system to just one room instead of the whole aquarium (or whichever testbed is used).
Statistical Processing of the Signal
Positioning is based on the methodology described in the MSWiM07 paper. The procedure consists of two main steps:
- Training phase: at each cell ck, k = 1, …, N, of the training set, we collect a number of signal strength values for each AP (30 values per AP). The resulting time-series, containing 300 samples, cannot be fitted by a theoretical distribution, so it is first transformed using the Box-Cox method. On the "normalized" time-series we then apply the SSA technique to decompose it into a set of "eigenloads". We observe that in this case each eigenload can be approximated by a probability distribution, such as the Gamma or the Weibull. Thus, the "feature vector" of each training cell contains the estimated parameters of the particular model, fk = {(ai, bi), i = 1, …, L}, where ai, bi are the estimated parameters corresponding to the i-th eigenload, and L is the number of (principal) eigenloads used in the approximation of the time-series (a code sketch of the whole pipeline follows the two phases).
- Runtime phase: at the unknown cell cr we collect a number of measurements and apply the SSA approach as before, after normalizing the time-series. Similarly, the runtime cell is associated with a vector of estimated model parameters fr = {(a'i, b'i), i = 1, …, L}. The decision about the coordinates of the unknown cell is made by measuring the similarity between the feature vectors of cr and each training cell. In our case we use the Kullback-Leibler Divergence (KLD) as the similarity function, which results in a closed-form expression depending on the model parameters. The total similarity between two cells is defined as the sum of the partial KLDs, D(cr, ck) = KLD_1 + … + KLD_L, where KLD_i denotes the KLD between the i-th eigenloads of the training and runtime cells.
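The following sketch puts the training pipeline and the similarity computation together. It is only an illustration under stated assumptions: the paper does not give the Box-Cox shift, the SSA window, the number L of eigenloads, or the exact Gamma parameterization, so those choices (and all function names) are ours. The Gamma/Gamma KLD is the standard closed form for the shape/scale parameterization.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln, psi

def normalize_series(x):
    """Box-Cox-normalize an RSS time-series. Box-Cox needs positive input,
    so the dBm values are shifted first (our assumption; the text does not
    spell this detail out)."""
    x = np.asarray(x, dtype=float)
    y, _lmbda = stats.boxcox(x - x.min() + 1.0)
    return y

def ssa_eigenloads(x, window=20, n_components=3):
    """Basic SSA: embed the series in a (window x K) trajectory matrix,
    take its SVD, and rebuild one component series ("eigenload") per
    principal direction by diagonal averaging."""
    n = len(x)
    K = n - window + 1
    X = np.column_stack([x[c:c + window] for c in range(K)])  # X[r, c] = x[r + c]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    eigenloads = []
    for i in range(n_components):
        Xi = s[i] * np.outer(U[:, i], Vt[i])                  # rank-1 piece
        comp, counts = np.zeros(n), np.zeros(n)
        for r in range(window):                               # average anti-diagonals
            for c in range(K):
                comp[r + c] += Xi[r, c]
                counts[r + c] += 1
        eigenloads.append(comp / counts)
    return eigenloads

def feature_vector(x, window=20, n_components=3):
    """Feature vector {(a_i, b_i), i = 1, ..., L}: Gamma shape/scale fitted
    to each eigenload. Eigenloads may be negative, so each one is shifted
    before fitting (again our assumption, not a detail given in the text)."""
    fv = []
    for comp in ssa_eigenloads(normalize_series(x), window, n_components):
        a, _loc, b = stats.gamma.fit(comp - comp.min() + 1e-6, floc=0)
        fv.append((a, b))
    return fv

def kld_gamma(a1, b1, a2, b2):
    """Closed-form KL(Gamma(a1, b1) || Gamma(a2, b2)), shape/scale form."""
    return ((a1 - a2) * psi(a1) - gammaln(a1) + gammaln(a2)
            + a2 * np.log(b2 / b1) + a1 * (b1 - b2) / b2)

def total_similarity(fv_r, fv_k):
    """Total similarity D(cr, ck) = sum of the partial KLD_i over the L eigenloads."""
    return sum(kld_gamma(ar, br, ak, bk)
               for (ar, br), (ak, bk) in zip(fv_r, fv_k))
```

At runtime, the unknown cell cr is then assigned the coordinates of the training cell ck that minimizes total_similarity(fr, fk). Note that stats.boxcox fails on a constant series, which is precisely the normalization issue with constant-strength AP's raised next.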
The disadvantage of the above approach is that the original time-series are not normalized in the same way, since different training cells may have different AP's with constant signal strength values. This motivates the use of a Joint-Clustering approach as a possible way to overcome this problem, and also to reduce the computational complexity.
Joint-Clustering procedure : This approach also consists of the same two steps, namely the offline and the online phase. The difference from the previous method is that, before applying the SSA, we estimate a joint probability distribution for each training cell and also cluster the training cells.
- Training phase: the first step of this phase is to keep the k strongest AP's (those with the highest mean value) for each training cell, as opposed to the previous method where we take all the AP's into account. Then we estimate the joint probability distribution of each cell based on the selected AP's, P(s1, …, sk | c). Assuming that the signal strength values from different AP's are independent, this expression becomes P(s1, …, sk | c) = P(s1 | c) * … * P(sk | c). In this case the feature vector of each training cell may consist of the estimated parameters of different distributions, since, for instance, the first AP may be modeled by a Gamma distribution while the second one is modeled by a Weibull. At the end of this process, each training cell is associated with a feature vector consisting, in general, of several statistical models. The next step consists of clustering the training cells using the q strongest AP's (q can be less than or equal to k): two training cells belong to the same cluster if they have the same set of q strongest AP's (see the sketch after the two phases).
- Online phase: first, we keep the q strongest AP's of the runtime cell and find its corresponding cluster. Then we only have to measure the similarities between the runtime cell and the training cells of that particular cluster, thus reducing the computational complexity. Notice again that in this case the feature vectors may consist of several models, so in general the partial KLD's are computed between different probability distributions. For instance, KLD_i could be the KLD between a Gamma and a Gaussian distribution, where a(k,i), b(k,i) are the estimated parameters of the Gamma distribution and a(r,i), b(r,i) are the estimated parameters of the Gaussian distribution.
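A minimal sketch of the clustering step, assuming a simple dict-based data layout (cell id -> {AP id: vector of RSS samples}); the names are ours, not the project's:

```python
import numpy as np

def strongest_aps(cell_ss, q):
    """Return the set of the q AP's with the highest mean RSS in a cell."""
    means = {ap: np.mean(v) for ap, v in cell_ss.items()}
    return frozenset(sorted(means, key=means.get, reverse=True)[:q])

def cluster_training_cells(training, q):
    """Group training cells whose sets of q strongest AP's coincide."""
    clusters = {}
    for cell, ss in training.items():
        clusters.setdefault(strongest_aps(ss, q), []).append(cell)
    return clusters
```

In the online phase, clusters.get(strongest_aps(runtime_ss, q), []) then yields the only training cells whose feature vectors need to be compared against the runtime cell.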
This approach offers higher flexibility, since different AP's can be modeled with different statistical models.
Signal Strength Distribution :
We collect a number of signal strength values for each AP (60 values per AP). The time-series, containing 480 samples, cannot be fitted by a theoretical distribution, so it is transformed using the Box-Cox method.
In deriving the distributions that best fit our data (the "normalized" time-series), we repeatedly make use of formal and visual statistical analysis methods and tools, such as quantile plots with simulation envelopes for specific cells.
As we can see in the above figures, the original data for one cell, shown in blue, lie best within the natural variability of the exponential model (the maximum likelihood estimate of the location parameter is 0.6020), since they remain within the cyan simulation envelope.
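The following sketch shows one common way to build such a simulation envelope (it may differ in detail from the tool the figures were produced with): simulate many samples from the fitted model and take, for each order statistic, the minimum and maximum across simulations.

```python
import numpy as np
from scipy import stats

def qq_envelope(data, dist=stats.expon, n_sim=99):
    """Quantile plot with a simulation envelope: the sorted data is compared
    against the band spanned, order statistic by order statistic, by n_sim
    samples drawn from the fitted model. Points inside the band are
    consistent with the model."""
    params = dist.fit(data)
    n = len(data)
    sims = np.sort(dist.rvs(*params, size=(n_sim, n)), axis=1)
    lower, upper = sims.min(axis=0), sims.max(axis=0)
    theo = dist.ppf((np.arange(1, n + 1) - 0.5) / n, *params)
    return theo, np.sort(data), lower, upper
```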
However, using the chi-square test to find the distribution that best fits our data, we obtain the following results:
Distribution | Chi-square error
Normal | 1439.875
Lognormal | 4521.5
Exponential | 1352.375
Gamma | 2161.125
Rayleigh | 1332.625
Weibull | 2691.5
Extreme value | 8520
As we can see, the chi-square error for the Rayleigh distribution is 1332.625 and that for the exponential distribution is 1352.375. So we tested, for the same cell, whether any distribution hypothesis at all is validated by the chi-square test, so that the intuition we initially had would be confirmed. The result was that no distribution hypothesis was validated. The chi-square test returns two values, A and B: A is the computed chi-square statistic and B is the critical tabulated value at the given degrees of freedom. In general, if A is less than B, the H0 hypothesis that DATA follows the DIST distribution is accepted. In our case, for this cell we have:
Distribution | Critical tabulated value at the degrees of freedom | Chi-square error
Normal | 40.113 | 1439.875
Lognormal | 40.113 | 4521.5
Exponential | 41.337 | 1352.375
Gamma | 40.113 | 2161.125
Rayleigh | 41.337 | 1332.625
Weibull | 40.113 | 2691.5
Extreme value | 40.113 | 8520
It is clear that none of the distributions validates the hypothesis. The exponential distribution, which looks better according to the quantile plot with simulation envelope, has a chi-square error about 20 units higher than the Rayleigh distribution.
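Since the exact binning and degrees of freedom used above are not reported, the following is a generic sketch of the A-versus-B comparison: bin the data, compute the expected counts from the fitted model, and compare the statistic to the tabulated critical value.

```python
import numpy as np
from scipy import stats

def chi_square_gof(data, dist, n_bins=30, alpha=0.05):
    """Chi-square goodness of fit. Returns (A, B): A is the computed
    chi-square statistic, B the critical value; H0 'data follows dist'
    is accepted when A < B."""
    params = dist.fit(data)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    observed, _ = np.histogram(data, bins=edges)
    expected = len(data) * np.diff(dist.cdf(edges, *params))
    A = np.sum((observed - expected) ** 2 / np.maximum(expected, 1e-12))
    B = stats.chi2.ppf(1 - alpha, n_bins - 1 - len(params))
    return A, B

# e.g. chi_square_gof(ss_values, stats.rayleigh) or stats.expon, etc.
```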
Edited on 11/03/08 by Sofia Nikitaki
Joint area A and area B (area A+B) results:
Removing Outliers (using the percentile-based approach)
As described for the implementation of this algorithm in area A above, we obtained the following results for the decile-based implementation after removing the outlier values from the signal strength using the boxplot method.
Boxplot : considers as outliers the values outside the interval [x25 - 1.5*IQR, x75 + 1.5*IQR], where x25 and x75 are the 25th and 75th percentiles (first and third quartiles), and IQR = x75 - x25.

Removing the outliers and using the decile-based algorithm, the median location error is 2 m.
By removing the outliers from the signal strength measurements we reduce the maximum location error by 2.5 m. The median value of both methods is 2 m.
Using the boxplot method we remove information from the signal strength. Moreover, by removing this information, cells that are close to one AP come to look much more like cells farther from that AP.
Last edited on 24/03/08 by Sofia Nikitaki
We tested the CLS and Ekahau systems at the same time and at the same positions on the 3rd of June, at about 14.00-14.30, with approximately 40 people in the testbed, which is considered a Normal Condition. In each zone we tested the systems at 3 different positions. The accuracy of CLS is 88.8%, whereas Ekahau has 76% accuracy. In zones 11-15, all the wrong results that Ekahau gave were neighboring zones, whereas in the case of CLS, whenever the error was not a neighboring zone it was an outlier.
Zone | CLS #Correct | CLS #Wrong | Ekahau #Correct | Ekahau #Wrong
0 | 3 | 0 | 3 | 0
1 | 3 | 0 | 2 | 1
2 | 3 | 0 | 3 | 0
3 | 2 | 1 | 2 | 1
4 | 3 | 0 | 3 | 0
5 | 2 | 1 | 3 | 0
6 | 3 | 0 | 2 | 1
7 | 3 | 0 | 3 | 0
8 | 3 | 0 | 2 | 1
9 | 3 | 0 | 2 | 1
10 | 3 | 0 | 2 | 1
11 | 3 | 0 | 3 | 0
12 | 2 | 1 | 0 | 3
13 | 2 | 1 | 2 | 1
14 | 1 | 2 | 2 | 1
15 | 3 | 0 | 1 | 2
16 | 3 | 0 | 3 | 0
17 | 3 | 0 | 3 | 0
Last updated on 3/4/2008 by Sofia Nikitaki