Machine Learning Data
Up to 80% of a Machine Learning project is about Collecting Data:
- What data is Required?
- What data is Available?
- How to Select the data?
- How to Collect the data?
- How to Clean the data?
- How to Prepare the data?
- How to Use the data?
What is Data?
Data can be many things.
In Machine Learning, data is a collection of facts:
|Measurements|Size. Height. Weight.|
|Words|Names and Places.|
|Descriptions|It is cold.|
Intelligence Needs Data
Human intelligence needs data:
A real estate broker needs data about sold houses to estimate prices.
Artificial Intelligence also needs data:
A Machine Learning program needs data to estimate prices.
Data can help us to see and understand.
Data can help us to find new opportunities.
Data can help us to resolve misunderstandings.
Healthcare and life sciences collect public health data and patient data to learn how to improve patient care and save lives.
The most successful companies in many sectors are data-driven. They use sophisticated data analytics to learn how the company can perform better.
Banks and insurance companies collect and evaluate data about customers, loans and deposits to support strategic decision-making.
The most common data to collect are Numbers and Measurements.
Often data are stored in arrays representing the relationship between values.
This table contains house prices versus size:
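As a minimal sketch of storing related values in arrays, the Python example below keeps sizes and prices in two lists where the same position describes the same house. The numbers and the names `sizes_m2`, `prices` and `houses` are hypothetical, invented for illustration only, and are not taken from any real table.

```python
# Hypothetical values, invented for illustration only.
sizes_m2 = [50, 70, 90, 110, 130]                         # house size in square meters
prices = [150_000, 210_000, 270_000, 330_000, 390_000]    # price of each house

# Position i in both lists describes the same house,
# so zipping them gives one (size, price) pair per house.
houses = list(zip(sizes_m2, prices))

for size, price in houses:
    print(f"{size} m2 -> {price}")

# The relationship between the values: price per square meter.
price_per_m2 = [price / size for size, price in houses]
```

Keeping the two lists aligned by position is the simplest way to represent a relationship such as price versus size. A list of pairs (as in `houses`) makes that relationship explicit.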
Quantitative vs. Qualitative
Quantitative data are numerical:
- 55 cars
- 15 meters
- 35 children
Qualitative data are descriptive:
- It is cold
- It is long
- It was fun
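In a program, the difference often shows up as a difference in data type. The sketch below is one possible way to separate the two kinds of data, assuming quantitative values are stored as numbers and qualitative values as text. The dictionary keys are hypothetical labels. The values come from the lists above.

```python
# Keys are hypothetical labels; values are the examples from the lists above.
observations = {
    "cars": 55,
    "length_m": 15,
    "children": 35,
    "weather": "It is cold",
    "trip": "It was fun",
}

# Quantitative data: numbers we can calculate with.
quantitative = {k: v for k, v in observations.items() if isinstance(v, (int, float))}

# Qualitative data: descriptions stored as text.
qualitative = {k: v for k, v in observations.items() if isinstance(v, str)}

# Arithmetic only makes sense for the quantitative part.
average = sum(quantitative.values()) / len(quantitative)
```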
Census or Sampling
A Census is when we collect data for every member of a group.
A Sample is when we collect data for some members of a group.
If we wanted to know how many Americans smoke cigarettes, we could ask every person in the US (a census), or we could ask 10 000 people (a sample).
A census is accurate, but hard to do. A sample is less accurate, but easier to do.
A Population is the group of individuals (objects) we want to collect information about.
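The trade-off can be sketched in Python with a simulated population. This is only an illustration: the population size (100 000) and the 20% rate are made-up assumptions, not real smoking statistics, and `proportion` is a hypothetical helper.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: True means "smokes". The 20% rate is invented.
population = [i < 20_000 for i in range(100_000)]

def proportion(group):
    """Share of True values in a group."""
    return sum(group) / len(group)

# Census: ask every member of the population.
census_rate = proportion(population)

# Sample: ask only 10 000 randomly chosen members.
sample = rng.sample(population, 10_000)
sample_rate = proportion(sample)

print(f"census: {census_rate:.3f}, sample: {sample_rate:.3f}")
```

The census gives the exact rate but needs all 100 000 answers. The sample needs only a tenth of them and gives an estimate that is usually close, but not exact.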
A Census is information about every individual in a population.
A Sample is information about a part of the population (chosen to represent the whole).
In order for a sample to represent a population, it must be collected randomly.
A Random Sample is a sample where every member of the population has an equal chance of appearing in the sample.
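The "equal chance" property can be checked with a small experiment. The sketch below assumes a tiny hypothetical population of 10 members and draws many random samples of 3 with Python's `random.sample`. If every member has an equal chance, each should appear in roughly 3 out of 10 samples.

```python
import random
from collections import Counter

rng = random.Random(42)          # fixed seed so the sketch is repeatable

population = list(range(10))     # hypothetical population: members 0..9
sample_size = 3
trials = 10_000

# Count how often each member ends up in a sample.
counts = Counter()
for _ in range(trials):
    counts.update(rng.sample(population, sample_size))

# Share of samples that included each member (expected: about 0.3 each).
inclusion_rate = {member: counts[member] / trials for member in population}

for member, rate in inclusion_rate.items():
    print(member, round(rate, 3))
```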
A Sampling Bias (Error) occurs when samples are collected in such a way that some individuals are less (or more) likely to be included in the sample.
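A sampling bias can be illustrated the same way. In the sketch below, the population of ages and the rule that only people under 40 can be reached are both invented assumptions. The point is only that a sample drawn from part of the population gives a distorted estimate.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people of each age from 18 to 89.
population = [age for age in range(18, 90) for _ in range(100)]
true_mean = sum(population) / len(population)

def mean(values):
    return sum(values) / len(values)

# Random sample: every member has the same chance of being selected.
random_sample = rng.sample(population, 500)

# Biased sample: only people under 40 can be selected
# (for example, a survey that reaches only younger people).
reachable = [age for age in population if age < 40]
biased_sample = rng.sample(reachable, 500)

print("population mean:   ", true_mean)
print("random sample mean:", mean(random_sample))
print("biased sample mean:", mean(biased_sample))
```

Collecting more data does not remove this kind of error: a larger sample from `reachable` is still missing everyone aged 40 and over.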
Big data is data that is impossible for humans to process without the assistance of advanced machines.
Big data has no definition in terms of size, but datasets keep growing as we continuously collect more data and store it at a lower and lower cost.
With big data come complicated data structures.
A huge part of big data processing is refining data.
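What refining involves depends on the data. As one hedged illustration, the sketch below assumes that refining means removing empty, malformed and duplicate rows from raw text input. The rows and the `refine` function are hypothetical.

```python
# Hypothetical raw rows in the form "size,price"; some are dirty.
raw_rows = [" 50 ,150000", "70,210000", "", "90,n/a", "70,210000", "110, 330000 "]

def refine(rows):
    """Return clean (size, price) pairs, dropping bad and duplicate rows."""
    cleaned = []
    seen = set()
    for row in rows:
        parts = [part.strip() for part in row.split(",")]
        if len(parts) != 2:
            continue                      # empty or malformed row
        try:
            record = (int(parts[0]), int(parts[1]))
        except ValueError:
            continue                      # non-numeric value such as "n/a"
        if record in seen:
            continue                      # duplicate row
        seen.add(record)
        cleaned.append(record)
    return cleaned

refined = refine(raw_rows)
print(refined)
```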