
Machine Learning Data

Up to 80% of a Machine Learning project is about collecting and preparing data:

  • What data is Required?
  • What data is Available?
  • How to Select the data?
  • How to Collect the data?
  • How to Clean the data?
  • How to Prepare the data?
  • How to Use the data?

What is Data?

Data can be many things.

In Machine Learning, data is a collection of facts:

  • Numbers: Prices. Dates.
  • Measurements: Size. Height. Weight.
  • Words: Names and Places.
  • Observations: Counting Cars.
  • Descriptions: It is cold.

Intelligence Needs Data

Human intelligence needs data:

A real estate broker needs data about sold houses to estimate prices.

Artificial Intelligence also needs data:

A Machine Learning program needs data to estimate prices.

Data can help us to see and understand.

Data can help us to find new opportunities.

Data can help us to resolve misunderstandings.


Healthcare and life sciences collect public health data and patient data to learn how to improve patient care and save lives.


The most successful companies in many sectors are data driven. They use sophisticated data analytics to learn how the company can perform better.


Banks and insurance companies collect and evaluate data about customers, loans and deposits to support strategic decision-making.

Storing Data

The most common data to collect are Numbers and Measurements.

Often data are stored in arrays representing the relationship between values.

This table contains house prices versus size:

Size   50   60   70   80   90   100   110   120   130   140   150
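
As a minimal sketch of storing related values in arrays, the Python snippet below keeps the sizes in one list and pairs them with prices in a second list. The price values are hypothetical placeholders made up for illustration, and the names `houses` and `price_for` are likewise assumptions, not part of the original table.

```python
# Sizes as listed in the table row above.
sizes = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

# Hypothetical prices, invented for illustration only (one per size).
prices = [7, 8, 8, 9, 9, 9, 10, 11, 14, 14, 15]

# Pair each size with its price so the relationship stays explicit.
houses = list(zip(sizes, prices))


def price_for(size):
    """Return the stored price for an exact size, or None if it is absent."""
    return dict(houses).get(size)
```

Keeping the two lists in the same order is what encodes the relationship: the price at position i belongs to the size at position i.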

Quantitative vs. Qualitative

Quantitative data are numerical:

  • 55 cars
  • 15 meters
  • 35 children

Qualitative data are descriptive:

  • It is cold
  • It is long
  • It was fun
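
One rough way to picture the difference in code is to treat numeric values as quantitative and text values as qualitative. The sketch below assumes that simple rule; the function name `kind_of` and the dictionary keys are hypothetical.

```python
def kind_of(value):
    """Classify a value as 'quantitative' (numeric) or 'qualitative' (text)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "quantitative"
    return "qualitative"


# Example observations taken from the two lists above (key names are made up).
observations = {
    "cars": 55,
    "length_m": 15,
    "children": 35,
    "weather": "It is cold",
    "trip": "It was fun",
}

kinds = {name: kind_of(value) for name, value in observations.items()}
```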

Census or Sampling

A Census is when we collect data for every member of a group.

A Sample is when we collect data for some members of a group.

If we wanted to know how many Americans smoke cigarettes, we could ask every person in the US (a census), or we could ask 10 000 people (a sample).

A census is accurate, but hard to do. A sample is less accurate, but easier to do.
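
The trade-off can be sketched with a small simulation. The code below builds a synthetic population, then compares the exact rate from a census with the estimate from a sample of 10 000. The population size, the 12% rate, and all function names are arbitrary assumptions for illustration, not real statistics.

```python
import random


def make_population(size, smoker_rate, seed=0):
    """Build a synthetic population: True marks a smoker, False a non-smoker."""
    rng = random.Random(seed)
    return [rng.random() < smoker_rate for _ in range(size)]


def census_rate(population):
    """Ask everyone: the exact share of smokers in the population."""
    return sum(population) / len(population)


def sample_rate(population, n, seed=1):
    """Ask only n randomly chosen people: an estimate of the share."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return sum(sample) / n


# Arbitrary illustration values, not real-world figures.
population = make_population(100_000, smoker_rate=0.12)
exact = census_rate(population)
estimate = sample_rate(population, 10_000)
```

The census touches every record, while the sample touches only a tenth of them and still lands close to the exact rate.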

Sampling Terms

A Population is the group of individuals (objects) we want to collect information from.

A Census is information about every individual in a population.

A Sample is information about a part of the population (chosen to represent the whole).

Random Samples

In order for a sample to represent a population, it must be collected randomly.

A Random Sample is a sample where every member of the population has an equal chance to appear in the sample.
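
The "equal chance" property can be checked empirically. As a minimal sketch, the code below draws many random samples of 3 from a hypothetical population of 10 member ids using Python's `random.sample`, then measures how often each member appears. The population, sample size, and number of trials are arbitrary assumptions.

```python
import random
from collections import Counter

population = list(range(10))  # hypothetical member ids 0..9
sample_size = 3
trials = 10_000

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = Counter()
for _ in range(trials):
    # random.sample draws without replacement; every member is equally likely.
    counts.update(rng.sample(population, sample_size))

# Share of samples in which each member appeared.
frequencies = {member: counts[member] / trials for member in population}
```

Each member should appear in roughly 3 out of 10 samples, since 3 of the 10 members are drawn each time.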

Sampling Bias

A Sampling Bias (Error) occurs when samples are collected in such a way that some individuals are less (or more) likely to be included in the sample.
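
To see the effect of sampling bias, the sketch below uses a hypothetical population split into two groups with different smoking rates. Sampling from the whole population gives an estimate near the true rate, while sampling only from the easy-to-reach group does not. All group labels and counts are invented for illustration.

```python
import random

# Hypothetical population: group "A" is easy to reach, group "B" is not.
# Each person is a (group, is_smoker) pair.
population = (
    [("A", True)] * 600 + [("A", False)] * 1400
    + [("B", True)] * 800 + [("B", False)] * 7200
)


def rate(people):
    """Share of smokers among the given people."""
    return sum(is_smoker for _, is_smoker in people) / len(people)


rng = random.Random(7)  # fixed seed so the run is repeatable

true_rate = rate(population)  # 1400 smokers out of 10 000 people

# Unbiased: every person has the same chance of being picked.
random_sample = rng.sample(population, 1000)

# Biased: people in group "B" have no chance of being picked.
group_a = [person for person in population if person[0] == "A"]
biased_sample = rng.sample(group_a, 1000)
```

Here the biased sample overestimates the smoking rate because group "A" smokes more than the population as a whole, and a larger sample from group "A" alone would not fix that.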

Big Data

Big data is data that is impossible for humans to process without the assistance of advanced machines.

Big data has no definition in terms of size, but datasets keep growing as we continuously collect more data and store it at a lower and lower cost.

Data Mining

With big data come complicated data structures.

A huge part of big data processing is refining data.