Computing Statistics - Single Statistics

Single statistics are statistical functions used to perform preliminary descriptive analyses.

They are univariate statistics.

Properties

Properties

Description

General parameters

Statistics on integer variables are continuous

If selected, statistics computed on integer variables are displayed as continuous values (e.g. 12.34 instead of the rounded value 12).

Sample size

Number of total valid samples

Number of rows with valid values, not including missing values.

Concentration measures

Gini concentration index

Computes the Gini concentration index, which is obtained from the following equation:

$$R = \frac{\sum_{i=1}^{N-1} (P_i - Q_i)}{\sum_{i=1}^{N-1} P_i}$$

where:

  • N is the number of valid values, 

  • Q_i represents the cumulative percentage of the attribute X over the first i values in ascending order and 

  • P_i = i/N is the corresponding expected value under the hypothesis of equal distribution.
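
As an illustration, the index can be sketched in Python. This is a minimal sketch, assuming the common form R = Σ(P_i − Q_i) / Σ P_i for i = 1..N−1, where Q_i is the cumulative share of X over the i smallest values and P_i = i/N. The function name `gini_concentration` and the handling of empty or all-zero input are assumptions made for this example, and the values are assumed to be non-negative.

```python
def gini_concentration(values):
    """Gini concentration index, assuming R = sum(P_i - Q_i) / sum(P_i), i = 1..N-1."""
    xs = sorted(values)              # ascending order
    n = len(xs)
    total = sum(xs)
    if n < 2 or total == 0:
        return 0.0                   # assumption: nothing to concentrate
    cumulative = 0.0
    numerator = 0.0
    denominator = 0.0
    for i in range(1, n):            # i = 1 .. N-1
        cumulative += xs[i - 1]
        q_i = cumulative / total     # cumulative share of X
        p_i = i / n                  # share expected under equal distribution
        numerator += p_i - q_i
        denominator += p_i
    return numerator / denominator
```

With this definition, identical values give 0 (equal distribution), while a variable whose total is held by a single case gives 1 (maximum concentration).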

Dispersion and heterogeneity measures

Number of mode elements

Shows the number of cases whose value equals the mode. In a statistical distribution the mode is the most frequently observed value, so the number of mode elements is the number of times that value occurs.

Entropy

A heterogeneity measure, which is obtained, for a categorical variable, from the following equation:

$$H = -\sum_{i=1}^{k} p_i \log p_i$$

where:

  • i indicates each category, 

  • p_i is the corresponding relative frequency,

  • k denotes the number of categories.

Normalized entropy

Value obtained by dividing the entropy value by its maximum theoretical value (i.e. log(k), where k is the number of categories).
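
The entropy and its normalized version can be illustrated with a short Python sketch. It assumes the usual definition H = −Σ p_i log(p_i) with natural logarithms, and takes log(k), with k the number of observed categories, as the maximum. The function names and the convention of returning 0 when only one category is present are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy of a categorical variable (natural logarithm assumed)."""
    n = len(values)
    counts = Counter(values)
    # p_i = c / n is the relative frequency of category i
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def normalized_entropy(values):
    """Entropy divided by its maximum log(k), k = number of categories."""
    k = len(set(values))
    if k < 2:
        return 0.0                   # assumption: a single category has no heterogeneity
    return entropy(values) / math.log(k)
```

For two equally frequent categories the entropy equals log(2) and the normalized entropy equals 1.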

Gini coefficient

Coefficient commonly applied to qualitative ordinal variables to evaluate the dispersion of values across different categories. It is estimated by the following equation:

$$G = 1 - \sum_{i=1}^{k} p_i^2$$

where:

  • i indicates each category

  • p_i is the corresponding relative frequency

  • k denotes the number of categories.

G takes the minimum value of zero when there is only one category (maximum concentration), while if there is only one value in each stratum (maximum dispersion) G takes its maximum value, (k − 1)/k.

Normalized Gini coefficient

Coefficient with the same meaning as the non-normalized statistic, but forced to vary between 0 and 1 by dividing the G value by its maximum theoretical value, (k − 1)/k, where k is the number of categories.
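
A possible Python sketch of both coefficients follows. It assumes the usual heterogeneity form G = 1 − Σ p_i², with maximum (k − 1)/k for k categories. The function names and the value returned when only one category is present are assumptions made for this example.

```python
from collections import Counter

def gini_coefficient(values):
    """Gini heterogeneity coefficient, assuming G = 1 - sum(p_i ** 2)."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def normalized_gini_coefficient(values):
    """G divided by its maximum (k - 1) / k, so that it varies between 0 and 1."""
    k = len(set(values))
    if k < 2:
        return 0.0                   # assumption: a single category gives G = 0
    return gini_coefficient(values) / ((k - 1) / k)
```

Two equally frequent categories give G = 0.5, which is the maximum for k = 2, so the normalized coefficient is 1.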

Range of values

The difference between the highest and the lowest values observed for the attribute X.

Interquartile range

A dispersion measure often applied to continuous variables when their distribution is non-Gaussian. It represents the difference between the 75th and the 25th percentile, corresponding to the 75% and the 25% value of the cumulative distribution of X, respectively.

Standard error of mean

A dispersion measure usually used to evaluate the precision of a mean estimate. It is obtained by the ratio between the standard deviation and the square root of the number of subjects under study.

Standard deviation

The square root of the variance (defined below).

It is often used instead of the Variance as it has the same unit of measure as X.

Standard error of standard deviation

Provides a measure of the uncertainty of the Standard deviation estimate.

It is seldom used.

Variance

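It is among the most commonly used dispersion measures. It is estimated by the following equation (sample variance):

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

where N is the number of valid values, x_i is the i-th observed value of X and \bar{x} is the mean value.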

Standard error of variance

A measure of the precision of the variance estimate.

It is seldom used.

Coefficient of variation

The ratio between the standard deviation and the mean of an observed distribution. It represents a dimensionless measure of dispersion and should be calculated only for continuous attributes that take positive values.
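
The relationship between these dispersion measures can be sketched in Python with the standard library. This is an illustrative sketch, not the product's implementation: it assumes the sample variance (division by N − 1) and takes the coefficient of variation as standard deviation divided by mean. The function name `dispersion_summary` and the dictionary keys are made up for this example.

```python
import math
import statistics

def dispersion_summary(values):
    """Range, variance, standard deviation, standard error of mean and CV."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.variance(values)       # sample variance, divides by N - 1
    std_dev = math.sqrt(variance)
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "standard_deviation": std_dev,
        "standard_error_of_mean": std_dev / math.sqrt(n),
        # meaningful only for positive values; fails if the mean is zero
        "coefficient_of_variation": std_dev / mean,
    }
```

The sketch needs at least two valid values, because the sample variance is undefined for a single observation.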

Mean absolute deviation

The mean of the absolute values of the difference between each value and the average of the attribute X.

It is used less often than the Variance, which has better statistical properties.

Median absolute deviation

The median value of the absolute values of the difference between each value and the average of the attribute X.

It is very rarely used as a measure of dispersion.
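
Both absolute deviations can be sketched in Python as follows. The sketch follows the descriptions above literally, so both measures are centred on the average of X. The function names are assumptions made for this example.

```python
import statistics

def mean_absolute_deviation(values):
    """Mean of the absolute differences between each value and the average."""
    center = statistics.fmean(values)
    return statistics.fmean([abs(x - center) for x in values])

def median_absolute_deviation(values):
    """Median of the absolute differences between each value and the average."""
    center = statistics.fmean(values)
    return statistics.median([abs(x - center) for x in values])
```

For the values 1, 2, 3, 4, 10 the average is 4 and the absolute differences are 3, 2, 1, 0, 6, so the mean absolute deviation is 2.4 and the median absolute deviation is 2.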

Descriptive, location and central tendency measures

Number of distinct values

The number of different values observed, where coincident (tied) values are counted only once.

Number of missing values

The number of missing values.

Mean value

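The simple arithmetic (unweighted) mean of the observed X values:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

where N is the number of valid values and x_i is the i-th observed value of X.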

Absolute mean value

The mean value obtained from the absolute value of each observation.

Geometric mean value

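A central tendency measure that should be calculated for positive values only:

$$\bar{x}_G = \left( \prod_{i=1}^{N} x_i \right)^{1/N}$$

where N is the number of valid values and x_i is the i-th observed value of X.

It is often applied to evaluate the average time needed to perform a specific task or as a measure of average speed.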

Geometric absolute mean value

The geometric mean value obtained from the absolute value of each observation.

It is seldom used.

Harmonic mean value

The harmonic mean value obtained from the value of each observation.

Harmonic absolute mean value

The harmonic mean value obtained from the absolute value of each observation.

It is seldom used.
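
The different means can be compared with a small Python sketch. It assumes the standard definitions: the geometric mean as the N-th root of the product, computed here through logarithms, and the harmonic mean as N divided by the sum of reciprocals. The function names, including the helper `absolute`, are assumptions made for this example.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """N-th root of the product; requires strictly positive values."""
    return math.exp(sum(math.log(x) for x in values) / len(values))

def harmonic_mean(values):
    """N divided by the sum of reciprocals; requires non-zero values."""
    return len(values) / sum(1.0 / x for x in values)

def absolute(values):
    """Helper for the 'absolute' variants: replace each value by its absolute value."""
    return [abs(x) for x in values]
```

The absolute variants are obtained by applying the same function to `absolute(values)`, for example `geometric_mean(absolute([-1, 2, -4]))`.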

Median value

The value that splits the X distribution into two equally sized samples (or almost equally sized, in the presence of an odd number of valid values), corresponding respectively to the values ≤ median and > median.

If there is an odd number of valid values, the median belongs to the observed distribution. For example, if there are 5 valid items of data, the third ordered value corresponds to the median. Otherwise the median value is estimated by a simple interpolation: for example, if there are 6 valid values, the average of the 3rd and the 4th ordered values is used.
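
The rule described above can be sketched in Python as follows. The function name `median_value` is an assumption made for this example, and the input is assumed to contain at least one valid value.

```python
def median_value(values):
    """Median: the middle ordered value, or the average of the two middle ones."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                     # odd count: the median is an observed value
    return (xs[mid - 1] + xs[mid]) / 2     # even count: simple interpolation
```

With 5 values the function returns the third ordered value. With 6 values it returns the average of the 3rd and 4th ordered values.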

Mode value

The most frequent observation.

Index of the mode element

The position of the most frequently observed value in the original (unsorted) X attribute.

Minimum value

The lowest valid value observed.

Index of the minimum element

The position of the lowest observed value in the original (unsorted) X attribute.

Maximum value

The highest valid value observed.

Index of the maximum element

The position of the highest observed value in the original (unsorted) X attribute.
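
The mode, minimum and maximum values and their positions can be illustrated with the Python sketch below. It rests on several assumptions that the descriptions above do not specify: positions are zero-based, the position of the first occurrence is reported, and ties for the mode are resolved in favour of the value seen first. The function name and the dictionary keys are made up for this example.

```python
from collections import Counter

def location_indices(values):
    """Mode, minimum and maximum with their positions in the unsorted data."""
    # most_common keeps first-seen order among equally frequent values
    mode_value, mode_count = Counter(values).most_common(1)[0]
    min_value = min(values)
    max_value = max(values)
    return {
        "mode_value": mode_value,
        "number_of_mode_elements": mode_count,
        "mode_index": values.index(mode_value),   # first occurrence, zero-based
        "min_value": min_value,
        "min_index": values.index(min_value),
        "max_value": max_value,
        "max_index": values.index(max_value),
    }
```

For the list `[3, 1, 4, 1, 5]` the mode is 1 with two mode elements, first found at position 1, and the maximum 5 is at position 4.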

Sum value

The sum of all valid observations.

Absolute sum value

The sum of the absolute value of each observation.

Product value

The product of all valid observations.

Absolute product value

The absolute value of the product of all valid observations.

Lower quartile

The value that separates the quarter of ordered data with the lowest values from the remaining 75% of the observations. It is also called “the 25th percentile” and is used together with the upper quartile and the median value to obtain a box plot.

For further information on boxplots see Plotting Data - Box Plots.

Upper quartile

The value that separates the quarter of ordered data with the highest values from the remaining 75% of the observations. It is also called “the 75th percentile” and is used together with the lower quartile and the median value to obtain a box plot.

Lower whisker for boxplot

The value corresponding to three times the standard deviation below the mean. It is used as a threshold to detect outliers and is part of the box plot.

Upper whisker for boxplot

The value corresponding to three times the standard deviation above the mean. It is used as a threshold to detect outliers and is part of the box plot.
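
The two whiskers and their use as outlier thresholds can be sketched in Python as follows. The sketch assumes the sample standard deviation (division by N − 1). The function names and the parameter `k`, which holds the multiplier 3, are assumptions made for this example.

```python
import statistics

def boxplot_whiskers(values, k=3.0):
    """Lower and upper whiskers: mean minus / plus k standard deviations."""
    mean = statistics.fmean(values)
    std_dev = statistics.stdev(values)     # sample standard deviation
    return mean - k * std_dev, mean + k * std_dev

def outliers(values, k=3.0):
    """Values lying outside the interval delimited by the two whiskers."""
    lower, upper = boxplot_whiskers(values, k)
    return [x for x in values if x < lower or x > upper]
```

For example, with nineteen observations equal to 10 and one equal to 100, the mean is 14.5 and only the value 100 lies above the upper whisker.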

Symmetry and shape measures

Skewness value

A measure of asymmetry.

Symmetric distributions have zero skewness, whereas positive values are associated with right asymmetry.

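It is obtained by the following equation:

$$\text{Skewness} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^3}{\left( \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \right)^{3/2}}$$

where N is the number of valid values, x_i is the i-th observed value of X and \bar{x} is the mean value.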

Standard error of skewness

A measure of the precision of the skewness estimate.

Kurtosis value

A measure of how peaked the data are. For example, the kurtosis of a standard Gaussian distribution is 3.0. Distributions with a higher kurtosis are said to be leptokurtic, while distributions with a lower value are called platykurtic.

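The Kurtosis value is obtained by the following equation:

$$\text{Kurtosis} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^4}{\left( \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \right)^{2}}$$

where N is the number of valid values, x_i is the i-th observed value of X and \bar{x} is the mean value.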

Standard error of kurtosis

A measure of the precision of the Kurtosis estimate.
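
The skewness and kurtosis values can be illustrated with a short Python sketch. It assumes the plain moment-based definitions, with central moments divided by N and no bias correction, which give a kurtosis of 3.0 for a Gaussian distribution. The estimator actually used may differ, and the function names are assumptions made for this example.

```python
def central_moment(values, order):
    """Central moment of the given order, dividing by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Third central moment divided by the 3/2 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by the square of the second (not excess)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2
```

Both functions fail on constant data, where the second central moment is zero. A symmetric sample such as 1, 2, 3, 4, 5 has zero skewness, while a sample with a long right tail has positive skewness.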