Basic Statistics and Data Wrangling

Revision of Basic Statistics

Types of Data

  • Continuous Data→ Data that can take any value in an interval (also float, numeric, interval data)
  • Discrete Data→ Data that can take only integer values (also integer, count)→ is our most used data type
  • Categorical Data→ Data that can take only predefined values representing a set of categories (also enums, enumerated, nominal data)
    • Special case: binary data (also boolean, logical, indicator)
    • Ordinal Data: categorical data with explicit ordering (e.g. small, medium, large)

! in our models, we have to deal with the type of data. Continuous and discrete data are easier to handle than ordinal data
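As a small illustration (a sketch with made-up column names, assuming pandas is available), the types can be represented like this, including an ordered categorical for ordinal data:

```python
import pandas as pd

# hypothetical example data illustrating the data types
df = pd.DataFrame({
    "temperature": [21.5, 19.8, 23.1],          # continuous (float)
    "num_visitors": [120, 98, 143],              # discrete (integer count)
    "is_weekend": [False, True, False],          # binary (boolean)
    "shirt_size": ["small", "large", "medium"],  # ordinal
})

# categorical data with an explicit ordering (ordinal data)
df["shirt_size"] = pd.Categorical(
    df["shirt_size"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(df.dtypes)
print(df["shirt_size"].min())  # ordering makes comparisons meaningful -> "small"
```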

Simple Statistical Characteristics

  • mean / average→ trimmed mean: removes potential outliers→ weighted mean: a weight (factor) is given by some prior information, e.g. trust in the source
  • median (middle element of the sorted set)→ much more expensive to calculate (need to sort first)
  • variance→ measure for variability, how far a set of numbers is spread out from its mean→ it is mandatory for us to look at the variance
  • Standard Deviation→ closely coupled with variance (its square root)→ measures the spread of a group of numbers around the mean
  • Range→ difference between maximum and minimum values
  • Percentile→ a value such that p percent of the values take on this value or less
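A minimal sketch of computing the characteristics above with NumPy/SciPy (the data values are made up):

```python
import numpy as np
from scipy import stats

data = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3, 3.6, 9.9])  # 9.9 is a potential outlier
weights = np.array([1.0] * 8 + [0.2])  # e.g. lower trust in the last source

print("mean:           ", np.mean(data))
print("trimmed mean:   ", stats.trim_mean(data, proportiontocut=0.2))  # drops 20% on each side
print("weighted mean:  ", np.average(data, weights=weights))
print("median:         ", np.median(data))
print("variance:       ", np.var(data, ddof=1))   # sample variance
print("std deviation:  ", np.std(data, ddof=1))   # square root of the variance
print("range:          ", np.ptp(data))           # max - min
print("75th percentile:", np.percentile(data, 75))
```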

→ with the given characteristics, we can create

  • Box-Plots, visualizing median, quartile, and outlier information
  • Histograms, visualizing data distribution (bin count)

(based on numerical data, not possible with categorical data)
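A possible sketch of both visualizations with matplotlib (the data is simulated, not from the course):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.5, size=500)  # made-up numerical data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(data)            # median, quartiles, and outliers
ax1.set_title("Box plot")
ax2.hist(data, bins=20)      # distribution via bin counts
ax2.set_title("Histogram")
plt.show()
```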

Multivariate Statistics

→ characteristics we can get for the relation between variables

  • Covariance→ measure of the joint variability of two random variables (If one variable goes up, does the other also go up? Yes → Covariance is positive)→ Covariance Matrix
  • Correlation→ how a pair of variables are related (e.g. linear)→ is the normalized covariance
    • most famous: Pearson’s correlation coefficient
    • Correlation does not imply causality (e.g. rain → umbrella, but not umbrella → rain)
  • Simpson Paradox→ an effect in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.
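A small sketch of the covariance matrix and Pearson correlation on simulated data (the variable names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)   # y tends to go up when x goes up

df = pd.DataFrame({"x": x, "y": y})

print(df.cov())    # covariance matrix: positive off-diagonal -> joint variability in the same direction
print(df.corr())   # Pearson correlation = covariance normalized by the standard deviations
```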

Data Sampling

  • how to sample data for experiments?
  • how much data do I need?
  • But I have Big Data- do I still need to sample?

Terms:

  • Sample: a subset of a larger data set
  • Population: the large (full) data set – or at least the model for it
  • Random Sample: Drawing samples at random
  • Stratified sampling: dividing samples into strata (groups with shared properties) and sampling from them
  • Sample Bias: a sample that has different statistics from the population

We need to prevent sample bias!

→ e.g. with random sampling
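A possible sketch of random vs. stratified sampling with pandas (the `source` column and the sizes are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
population = pd.DataFrame({
    "source": rng.choice(["sensor_A", "sensor_B"], size=1000, p=[0.8, 0.2]),
    "value": rng.normal(size=1000),
})

# simple random sample: every row has the same chance of being drawn
random_sample = population.sample(n=100, random_state=0)

# stratified sample: draw 10% from each stratum (shared property: "source")
stratified_sample = population.groupby("source").sample(frac=0.1, random_state=0)

# compare basic statistics against the population to check for sample bias
print(population["value"].mean(),
      random_sample["value"].mean(),
      stratified_sample["value"].mean())
```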

Rules for random sampling:

  • define theoretically possible populations
    • what external effects could influence the population (e.g. different sources)
    • need context knowledge on the data source!
  • compare basic statistics of different samplings (mean, variance, …): are they stable?
    • but this will require more data …

Alternative: Bootstrap Sample

→ a sample which has been taken with replacement

  • replacement: a data point can be drawn more than once
  • algorithm:
    1. sample data points with replacement; save the sample as a bag
    2. compute statistics (e.g. mean, variance)
    3. repeat 1+2 and observe statistics of different bags
  • effect of bootstrap: allows robust estimation of unknown distributions
  • later for prediction: use bagging (= average over models trained on many bags)
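A minimal sketch of the bootstrap algorithm above with NumPy (the data is simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=100)   # made-up sample, not normally distributed

n_bags = 1000
bag_means = np.empty(n_bags)
for i in range(n_bags):
    # 1. sample data points with replacement -> one "bag"
    bag = rng.choice(data, size=len(data), replace=True)
    # 2. compute the statistic of interest for this bag
    bag_means[i] = bag.mean()

# 3. the spread of the statistic over all bags estimates its (unknown) distribution
print("bootstrap estimate of the mean:     ", bag_means.mean())
print("bootstrap estimate of its std error:", bag_means.std(ddof=1))
```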

Data Distributions

Normal Distribution

→ how much of the data lies within one, two, or three standard deviations of the mean (≈ 68 %, 95 %, 99.7 %)
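A quick check of these coverage fractions with SciPy (standard normal assumed):

```python
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)  # probability mass within k standard deviations
    print(f"within {k} standard deviation(s): {coverage:.1%}")
# -> roughly 68.3%, 95.4%, 99.7%
```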

The assumption that data follows a normal distribution is

  • a necessary condition for many analysis methods
  • but rarely guaranteed for real data

Testing for a Normal Distribution:

  • many different approaches
  • easy method: QQ-Plots→ a graphical method for comparing two probability distributions by plotting their quantiles against each other→ the more the points align themselves on the line, the better the data fits a normal distribution
(Figure: example QQ-plots showing a good and a bad fit to the line)
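One possible way to produce such QQ-plots with SciPy and matplotlib (simulated data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
normal_data = rng.normal(size=300)        # should align with the line ("good")
skewed_data = rng.exponential(size=300)   # should deviate from the line ("bad")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(normal_data, dist="norm", plot=ax1)
ax1.set_title("good")
stats.probplot(skewed_data, dist="norm", plot=ax2)
ax2.set_title("bad")
plt.show()
```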

Why we can often assume a normal distribution:

→ because of the central limit theorem (CLT), which says that when independent random variables are added, their properly normalized sum tends toward a normal distribution (even if the original variables themselves are not normally distributed)
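A small simulation sketch of the CLT (the number of variables and samples are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)

# each row: 50 independent uniform variables (clearly not normal individually)
samples = rng.uniform(size=(10_000, 50))

# their sums tend toward a normal distribution
sums = samples.sum(axis=1)

plt.hist(sums, bins=50)
plt.title("Sums of 50 uniform variables (approximately normal by the CLT)")
plt.show()
```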

Data Wrangling

  • What is Data Wrangling
    • Stages
  • Short Introduction to Pandas (Lab)
  • Wrangling by Use Cases (Lab)

What Wrangling is

  • in our Data Science Processing Pipeline, it’s the step of “pre-process and cleaning”
  • is the process of transforming and mapping data from one “raw” data form into another format which we can analyse

Stages of Data Wrangling:

(sometimes not all steps are needed)

  1. (Scrape) → get data from sensors, internet, databases,…
  2. Clean → remove “bad” data (missing values, wrong format, exceptions, special cases,….)
  3. Transform → change/correct data formats, recompute,…
  4. Merge → combine and connect data sources
  5. Reshape → Rectify, output: vectors, arrays, tables

How we do data wrangling:

  • mostly in pandas, a python library
  • central element of Pandas is DataFrame
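A compressed sketch of the stages (clean, transform, merge, reshape) on a hypothetical DataFrame; all column names and values are made up:

```python
import numpy as np
import pandas as pd

# (scrape) raw data, e.g. read from a sensor export
raw = pd.DataFrame({
    "sensor": ["A", "A", "B", "B"],
    "day": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
    "value": ["1.5", "bad", "2.0", "2.5"],
})

# clean: remove "bad" values and drop missing data
clean = raw.replace("bad", np.nan).dropna()

# transform: correct the data formats
clean["value"] = clean["value"].astype(float)
clean["day"] = pd.to_datetime(clean["day"])

# merge: combine with a second data source
meta = pd.DataFrame({"sensor": ["A", "B"], "location": ["hall", "roof"]})
merged = clean.merge(meta, on="sensor")

# reshape: one rectangular table, sensors as columns
table = merged.pivot(index="day", columns="sensor", values="value")
print(table)
```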

more about pandas:

  • Link to notebooks

Code Exercises

(Links to Github)

Statistics_Part1.ipynb

pandas

Data_Wrangling.ipynb

pandas_Intro_week3.ipynb

pandas_DataFrame_week3.ipynb

pandas_InputOutput_week3.ipynb

pandas_reshape_week3.ipynb

pandas_MissingData_week3.ipynb

pandas_MergeandJoin_week3.ipynb

GroupBy_Intro_week3.ipynb

Pandas_Group_by__Split_Apply_Combine.ipynb

Tasks

Week_3_Solution_task1.ipynb

Week3_Solution_Task2.ipynb

Week3_Solution_Task3.ipynb