Removing Outliers (using the percentile-based approach)
The collected signal strength values contain outliers that can affect the result of our algorithm, because they distort the statistics that represent the set of collected signal strength values. If those values are removed, the values most likely to be typical for a given AP and cell remain, and the comparison between training and run-time values becomes more accurate and trustworthy, because values created by external, unpredictable factors (outliers) are removed from the signal.
In order to validate the original hypothesis we used two different methods of removing outliers: first, the quantitative approach to outliers, and second, the boxplot method. We chose these two methods because they rely on different statistical data and both are commonly used in the literature for outlier removal.
Quantitative Approach to Outliers : considers as outliers the values outside the interval [mean - 2*std, mean + 2*std], where mean is the mean value and std the standard deviation of the vector of signal strength values.
Boxplot : considers as outliers the values outside the interval [x25 - 1.5*IQR, x75 + 1.5*IQR], where x25 and x75 are the 25th and 75th percentiles (i.e., the first and third quartiles), and IQR = x75 - x25.
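A minimal sketch of the two rules in Python (the helper names are ours, and numpy's default percentile interpolation is assumed):

```python
import numpy as np

def remove_outliers_quantitative(ss, n_std=2.0):
    """Quantitative approach: keep values inside [mean - n_std*std, mean + n_std*std]."""
    mean, std = ss.mean(), ss.std()
    return ss[(ss >= mean - n_std * std) & (ss <= mean + n_std * std)]

def remove_outliers_boxplot(ss, k=1.5):
    """Boxplot rule: keep values inside [Q1 - k*IQR, Q3 + k*IQR].
    k = 1.5 removes mild outliers; k = 3.0 removes only extreme outliers."""
    q1, q3 = np.percentile(ss, [25, 75])
    iqr = q3 - q1
    return ss[(ss >= q1 - k * iqr) & (ss <= q3 + k * iqr)]

# Example: RSS vector (dBm) for one AP at one cell, with two spikes.
rss = np.array([-62.0, -61, -63, -60, -62, -61, -90, -62, -63, -45])
print(remove_outliers_quantitative(rss))    # the spikes inflate the std, so only -90 is removed
print(remove_outliers_boxplot(rss, k=1.5))  # the IQR is robust: both -90 and -45 are removed
```

The example also illustrates why the two rules can disagree: the mean and standard deviation are themselves distorted by the outliers, while the quartiles are not.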

The maximum location error decreases by 2.5 m, while the median location error of all three methods is 2 m. The percentile-based method using the boxplot is the most accurate one as far as the higher location errors are concerned.
In most cases the boxplot method produces tighter intervals, but the size of the interval containing the statistical information (percentiles) used for each cell and each AP depends on the particular vector of values. This is because the two outlier-removal approaches use different statistics, both of which depend on the values of each vector: the boxplot uses the first and third quartiles to compute the interval of values that contains the percentiles used in the percentile-based method, whereas the quantitative approach uses the mean and the standard deviation.
The boxplot method we have used so far removes mild outliers. Another idea is to remove only the extreme outliers, i.e., to consider as outliers the values outside the interval [x25 - 3*IQR, x75 + 3*IQR] (k = 3 instead of 1.5 in the sketch above). As shown in the following figure, the median location error after removing only the extreme outliers decreases to 1.8 m.
To make these results easier to understand, let us focus on one position. We assume that the user's position is at the center of a circle.
- If we remove the extreme outliers for the cells inside the circle of radius 1.2 m, we get the same results as if we had removed the mild outliers. Consequently, by removing the higher SS values (a higher SS means that the user is closer to the AP), cells that are close to the AP come to look much more like cells farther out in the circle.
- The SS of cells in the ring between the circle of radius 1.2 m and the circle of radius 2.2 m have similar outliers. This means that by eliminating the extreme values we keep only the information that differentiates the cells from one another.
- Finally, for cells at a greater distance than the circle of radius 2.2 m, outliers have a negative impact, and our goal should be to remove more of them and make the interval tighter.

Empirical removal of outliers : Another, more empirical way to remove outliers is to discard the first and last deciles and keep only the 8 remaining deciles when computing the position of the user. It is obvious that the results of this method do not improve the location error at all, because removing these two deciles from all cells discards important information. The methods mentioned above remove selected SS values only from the cells where outlier values actually occur; this method, on the other hand, removes information (values) from all cells without taking the actual values into consideration (see the sketch below).
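A sketch of this decile-based trimming (helper name ours); unlike the boxplot rule, it discards a fixed 20% of every cell's samples whether or not outliers are present:

```python
import numpy as np

def trim_deciles(ss):
    """Keep only the central 8 deciles: discard every value below the 10th
    or above the 90th percentile of the cell's RSS vector."""
    lo, hi = np.percentile(ss, [10, 90])
    return ss[(ss >= lo) & (ss <= hi)]
```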

Changing the Benchmark
Because of the nature of the testbed, we decided to change the benchmark from the location error in meters to whether the correct VITR is found.
Metric | Percentile-based | With boxplot (mild outliers)
Probability of visual contact with the right VITR, no matter the room | 93% | 89%
Probability of visual contact with the right VITR, in the right room | 86% | 82%
Probability of visual contact with the right VITR, in a different room | 6.8% | 6.8%
Probability of being found in the wrong room and in front of the wrong VITR | 6.8% | 10.3%
This happens because, despite the reduction in the total location error with the use of the boxplot, the truth is that only the high location errors (higher than 5-6 m) in some cells are reduced, while other cells that did not have a high location error in the first place now change. Moreover, the location error in some cells at the end of a room has decreased, but the estimated position of the user has moved from a far-away cell of the same room to a much closer cell in the neighboring room. This is the reason why we noticed a much lower probability of being in the right room according to CLS.
The samples collected in area A do not represent the aquarium testbed in the best way, because the cells we have used so far are in areas with quite large VITRs. The results from area B will be available online soon; only then will we be able to tell whether our system achieves high accuracy.
Finally, we believe that if we use another characteristic with a smaller range (e.g., RFID tags) along with the existing system, the accuracy will improve considerably. This is because the RFID tag would restrict the search of our system to just one room instead of the whole aquarium (or whichever testbed is used).
Statistical Processing of the Signal
Positioning is based on the methodology described in the MSWiM07 paper. The procedure consists of two main steps:
- Training phase: at each cell ck, k = 1, …, N, of the training set, we collect a number of signal strength values for each AP (30 values per AP). The resulting time-series, containing 300 samples, cannot be fitted by a theoretical distribution, so it is first transformed using the Box-Cox method. On the "normalized" time-series we then apply the SSA technique to decompose it into a set of "eigenloads". We observe that in this case each eigenload can be approximated by a probability distribution, such as the Gamma or the Weibull. Thus, the "feature vector" of each training cell contains the estimated parameters of the particular model, fk = {(ai, bi), i = 1, …, L}, where ai, bi are the estimated parameters corresponding to the i-th eigenload, and L is the number of (principal) eigenloads used in the approximation of the time-series (a code sketch of the whole pipeline follows the two phases).
- Runtime phase: at the unknown cell cr we collect a number of measurements and apply the SSA approach as before, after normalizing the time-series. Similarly, the runtime cell is associated with a vector of estimated model parameters fr = {(a'i, b'i), i = 1, …, L}. The decision about the coordinates of the unknown cell is made by measuring the similarity between the feature vectors of cr and each training cell. In our case we use the Kullback-Leibler Divergence (KLD) as the similarity function, which results in a closed-form expression depending on the model parameters. The total similarity between two cells is defined as the sum of the partial KLDs, D(cr, ck) = KLD_1 + … + KLD_L, where KLD_i denotes the KLD between the i-th eigenloads of the training and runtime cells.
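The following sketch puts the training pipeline and the similarity computation together. It is only an illustration under stated assumptions: the paper does not give the Box-Cox shift, the SSA window, the number L of eigenloads, or the exact Gamma parameterization, so those choices (and all function names) are ours. The Gamma/Gamma KLD is the standard closed form for the shape/scale parameterization.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln, psi

def normalize_series(x):
    """Box-Cox-normalize an RSS time-series. Box-Cox needs positive input,
    so the dBm values are shifted first (our assumption; the text does not
    spell this detail out)."""
    x = np.asarray(x, dtype=float)
    y, _lmbda = stats.boxcox(x - x.min() + 1.0)
    return y

def ssa_eigenloads(x, window=20, n_components=3):
    """Basic SSA: embed the series in a (window x K) trajectory matrix,
    take its SVD, and rebuild one component series ("eigenload") per
    principal direction by diagonal averaging."""
    n = len(x)
    K = n - window + 1
    X = np.column_stack([x[c:c + window] for c in range(K)])  # X[r, c] = x[r + c]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    eigenloads = []
    for i in range(n_components):
        Xi = s[i] * np.outer(U[:, i], Vt[i])                  # rank-1 piece
        comp, counts = np.zeros(n), np.zeros(n)
        for r in range(window):                               # average anti-diagonals
            for c in range(K):
                comp[r + c] += Xi[r, c]
                counts[r + c] += 1
        eigenloads.append(comp / counts)
    return eigenloads

def feature_vector(x, window=20, n_components=3):
    """Feature vector {(a_i, b_i), i = 1, ..., L}: Gamma shape/scale fitted
    to each eigenload. Eigenloads may be negative, so each one is shifted
    before fitting (again our assumption, not a detail given in the text)."""
    fv = []
    for comp in ssa_eigenloads(normalize_series(x), window, n_components):
        a, _loc, b = stats.gamma.fit(comp - comp.min() + 1e-6, floc=0)
        fv.append((a, b))
    return fv

def kld_gamma(a1, b1, a2, b2):
    """Closed-form KL(Gamma(a1, b1) || Gamma(a2, b2)), shape/scale form."""
    return ((a1 - a2) * psi(a1) - gammaln(a1) + gammaln(a2)
            + a2 * np.log(b2 / b1) + a1 * (b1 - b2) / b2)

def total_similarity(fv_r, fv_k):
    """Total similarity D(cr, ck) = sum of the partial KLD_i over the L eigenloads."""
    return sum(kld_gamma(ar, br, ak, bk)
               for (ar, br), (ak, bk) in zip(fv_r, fv_k))
```

At runtime, the unknown cell cr is then assigned the coordinates of the training cell ck that minimizes total_similarity(fr, fk). Note that stats.boxcox fails on a constant series, which is precisely the normalization issue with constant-strength AP's raised next.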
The disadvantage of the above approach is that the original time-series are not normalized in the same way, since different training cells may have different AP's with constant signal strength values. This motivates the use of a Joint-Clustering approach as a possible way to overcome this problem, and also to reduce the computational complexity.
Joint-Clustering procedure : This approach also consists of the same two steps, namely the offline and the online phase. The difference from the previous method is that, before applying the SSA, we estimate a joint probability distribution for each training cell and also cluster the training cells.
- Training phase: the first step of this phase is to keep the k strongest AP's (those with the highest mean value) for each training cell, as opposed to the previous method where we take all the AP's into account. Then we estimate the joint probability distribution of each cell based on the selected AP's, P(s1, …, sk | c). Assuming that the signal strength values from different AP's are independent, this expression becomes P(s1, …, sk | c) = P(s1 | c) * … * P(sk | c). In this case the feature vector of each training cell may consist of the estimated parameters of different distributions, since, for instance, the first AP may be modeled by a Gamma distribution while the second one is modeled by a Weibull. At the end of this process, each training cell is associated with a feature vector consisting, in general, of several statistical models. The next step consists of clustering the training cells using the q strongest AP's (q can be less than or equal to k): two training cells belong to the same cluster if they have the same set of q strongest AP's (see the sketch after the two phases).
- Online phase: first, we keep the q strongest AP's of the runtime cell and find its corresponding cluster. Then we only have to measure the similarities between the runtime cell and the training cells of that particular cluster, thus reducing the computational complexity. Notice again that in this case the feature vectors may consist of several models, so in general the partial KLD's are computed between different probability distributions. For instance, KLD_i could be the KLD between a Gamma and a Gaussian distribution, where a(k,i), b(k,i) are the estimated parameters of the Gamma distribution and a(r,i), b(r,i) are the estimated parameters of the Gaussian distribution.
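A minimal sketch of the clustering step, assuming a simple dict-based data layout (cell id -> {AP id: vector of RSS samples}); the names are ours, not the project's:

```python
import numpy as np

def strongest_aps(cell_ss, q):
    """Return the set of the q AP's with the highest mean RSS in a cell."""
    means = {ap: np.mean(v) for ap, v in cell_ss.items()}
    return frozenset(sorted(means, key=means.get, reverse=True)[:q])

def cluster_training_cells(training, q):
    """Group training cells whose sets of q strongest AP's coincide."""
    clusters = {}
    for cell, ss in training.items():
        clusters.setdefault(strongest_aps(ss, q), []).append(cell)
    return clusters
```

In the online phase, clusters.get(strongest_aps(runtime_ss, q), []) then yields the only training cells whose feature vectors need to be compared against the runtime cell.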
This approach offers higher flexibility, since different AP's can be modeled with different statistical models.
Signal Strength Distribution :
We collect a number of signal strength values for each AP (60 values per AP). The time-series, containing 480 samples, cannot be fitted by a theoretical distribution, so it is transformed using the Box-Cox method.
In deriving the distributions that best fit our data (the "normalized" time-series), we repeatedly make use of formal and visual statistical analysis methods and tools, such as quantile plots with simulation envelopes for specific cells.
As we can see in the above figures, the original data for one cell, shown in blue, lie best within the natural variability of the exponential model (the maximum likelihood estimate of the location parameter is 0.6020), since they remain within the cyan simulation envelope.
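The following sketch shows one common way to build such a simulation envelope (it may differ in detail from the tool the figures were produced with): simulate many samples from the fitted model and take, for each order statistic, the minimum and maximum across simulations.

```python
import numpy as np
from scipy import stats

def qq_envelope(data, dist=stats.expon, n_sim=99):
    """Quantile plot with a simulation envelope: the sorted data is compared
    against the band spanned, order statistic by order statistic, by n_sim
    samples drawn from the fitted model. Points inside the band are
    consistent with the model."""
    params = dist.fit(data)
    n = len(data)
    sims = np.sort(dist.rvs(*params, size=(n_sim, n)), axis=1)
    lower, upper = sims.min(axis=0), sims.max(axis=0)
    theo = dist.ppf((np.arange(1, n + 1) - 0.5) / n, *params)
    return theo, np.sort(data), lower, upper
```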
However, using the chi-square test to find the distribution that best fits our data, we obtain the following results:
Distribution | Chi-square error
Normal | 1439.875
Lognormal | 4521.5
Exponential | 1352.375
Gamma | 2161.125
Rayleigh | 1332.625
Weibull | 2691.5
Extreme value | 8520
As we can see, the chi-square error for the Rayleigh distribution is 1332.625 and that for the exponential distribution is 1352.375. So we tested, for the same cell, whether any distribution hypothesis at all is validated by the chi-square test, so that the intuition we initially had would be confirmed. The result was that no distribution hypothesis was validated. The chi-square test returns two values, A and B: A is the computed chi-square statistic and B is the critical tabulated value at the given degrees of freedom. In general, if A is less than B, the H0 hypothesis that DATA follows the DIST distribution is accepted. In our case, for this cell we have:
Distribution | Critical tabulated value at the degrees of freedom | Chi-square error
Normal | 40.113 | 1439.875
Lognormal | 40.113 | 4521.5
Exponential | 41.337 | 1352.375
Gamma | 40.113 | 2161.125
Rayleigh | 41.337 | 1332.625
Weibull | 40.113 | 2691.5
Extreme value | 40.113 | 8520
It is clear that none of the distributions validates the hypothesis. The exponential distribution, which looks better according to the quantile plot with simulation envelope, has a chi-square error about 20 units higher than the Rayleigh distribution.
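Since the exact binning and degrees of freedom used above are not reported, the following is a generic sketch of the A-versus-B comparison: bin the data, compute the expected counts from the fitted model, and compare the statistic to the tabulated critical value.

```python
import numpy as np
from scipy import stats

def chi_square_gof(data, dist, n_bins=30, alpha=0.05):
    """Chi-square goodness of fit. Returns (A, B): A is the computed
    chi-square statistic, B the critical value; H0 'data follows dist'
    is accepted when A < B."""
    params = dist.fit(data)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    observed, _ = np.histogram(data, bins=edges)
    expected = len(data) * np.diff(dist.cdf(edges, *params))
    A = np.sum((observed - expected) ** 2 / np.maximum(expected, 1e-12))
    B = stats.chi2.ppf(1 - alpha, n_bins - 1 - len(params))
    return A, B

# e.g. chi_square_gof(ss_values, stats.rayleigh) or stats.expon, etc.
```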
Edited on 11/03/08 by Sofia Nikitaki
Joint area A and area B (area A+B) results:
Removing Outliers (using the percentile-based approach)
As described for the implementation of this algorithm in area A above, we obtained the following results for the decile-based implementation after removing the outlier values from the signal strength using the boxplot method.
Boxplot : considers as outliers the values outside the interval [x25 - 1.5*IQR, x75 + 1.5*IQR], where x25 and x75 are the 25th and 75th percentiles (first and third quartiles), and IQR = x75 - x25.

Removing the outliers and using the decile-based algorithm, the median location error is 2 m.
By removing the outliers from the signal strength measurements we reduce the maximum location error by 2.5 m. The median value of both methods is 2 m.
Using the boxplot method we remove information from the signal strength. Moreover, by removing this information, cells that are close to one AP come to look much more like cells farther from that AP.
Last edited on 24/03/08 by Sofia Nikitaki
We tested the CLS and Ekahau systems at the same time and at the same positions on the 3rd of June, at about 14.00-14.30, with approximately 40 people in the testbed, which is considered a Normal Condition. In each zone we tested the systems at 3 different positions. The accuracy of CLS is 88.8%, whereas Ekahau has 76% accuracy. In zones 11-15, all the wrong results that Ekahau gave were neighboring zones, whereas in the case of CLS, whenever the error was not a neighboring zone it was an outlier.
Zone | CLS #Correct | CLS #Wrong | Ekahau #Correct | Ekahau #Wrong
0 | 3 | 0 | 3 | 0
1 | 3 | 0 | 2 | 1
2 | 3 | 0 | 3 | 0
3 | 2 | 1 | 2 | 1
4 | 3 | 0 | 3 | 0
5 | 2 | 1 | 3 | 0
6 | 3 | 0 | 2 | 1
7 | 3 | 0 | 3 | 0
8 | 3 | 0 | 2 | 1
9 | 3 | 0 | 2 | 1
10 | 3 | 0 | 2 | 1
11 | 3 | 0 | 3 | 0
12 | 2 | 1 | 0 | 3
13 | 2 | 1 | 2 | 1
14 | 1 | 2 | 2 | 1
15 | 3 | 0 | 1 | 2
16 | 3 | 0 | 3 | 0
17 | 3 | 0 | 3 | 0
Last updated on 3/4/2008 by Sofia Nikitaki