- Noisy data is data that contains a large amount of additional, meaningless information, called noise.
- This includes data corruption and the term is often used as a synonym for corrupt data.
- It also includes any data that a user system cannot understand and interpret correctly.
Sources of Noise:
Noise has two main sources: errors introduced by measurement tools and random errors introduced by processing or by experts when the data is gathered.
Outlier data is data that appears not to belong in the data set. It can be caused by human error, such as transposing numerals or mislabeling, or by programming bugs. If actual outliers are not removed from the data set, they corrupt the results, to a degree that depends on how extreme and how numerous they are. If valid data is misidentified as an outlier and removed, that also corrupts the results.
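As a rough illustration of how a single outlier can skew a result, the sketch below uses made-up readings in which one value, 19, is assumed to have been entered as 91 (a transposed numeral). The data, the helper name `iqr_outliers`, and the choice of the interquartile-range rule with a multiplier of 1.5 are all assumptions made for this example. The text above does not prescribe any particular detection method.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    This is the common interquartile-range rule of thumb, used here
    only as one possible way to flag suspicious points.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings; the last one is assumed to be 19 mistyped as 91.
readings = [19, 18, 20, 19, 21, 20, 19, 91]

flagged = iqr_outliers(readings)
cleaned = [v for v in readings if v not in flagged]

print("flagged:", flagged)
print("mean with outlier:   ", statistics.mean(readings))
print("mean without outlier:", statistics.mean(cleaned))
print("median with outlier: ", statistics.median(readings))
```

The mean shifts noticeably once the flagged value is included, while the median barely moves. A flagged value is only a candidate for review. Removing it automatically risks the opposite error described above, where valid data is mistakenly discarded.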
Fraud: Individuals may deliberately skew data to push the results toward a desired conclusion. Data that looks good, with few outliers, reflects well on the individual collecting it, so there may be an incentive to remove more data as outliers or to make the data look smoother than it is.