TipsMake

Machine Learning Data

Up to 80% of a Machine Learning project involves data collection:

  1. What data is needed?
  2. What data is available?
  3. How do I select the data?
  4. How do we collect data?
  5. How do I clean up data?
  6. How do I prepare the data?
  7. How can we use the data?

What is data?

Data can be many things. In Machine Learning, data is a collection of facts:

 

Type           Examples
Numbers        Prices. Dates.
Sizes          Dimensions. Height. Weight.
Words          Names and places.
Observations   Counting cars.
Descriptions   It's cold.

Intelligence needs data.

Human intelligence needs data: A real estate agent needs data on homes that have been sold to estimate prices.

Artificial intelligence also needs data: A machine learning program needs data to estimate prices.

 

  1. Data can help us see and understand.
  2. Data can help us identify new opportunities.
  3. Data can help us resolve misunderstandings.

Healthcare

The healthcare and life sciences industries collect public health data and patient data to learn how to improve patient care and save lives.

Business

The most successful companies in many fields are data-driven. They use sophisticated data analytics to understand how the company can perform better.

Finance

Banks and insurance companies collect and evaluate data on customers, loans, and deposits to support strategic decision-making.

Data storage

The most common data collected are Numbers and Sizes. Typically, this data is stored in arrays that represent the relationships between the values.

This table shows house prices compared to area:

Price    7    8    8    9    9    9   10   11   14   14   15
Size    50   60   70   80   90  100  110  120  130  140  150
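
As a rough illustration, the table above could be stored in Python as two parallel lists. The variable names (`prices`, `sizes`, `houses`) and the helper `price_per_size` are assumptions made for this sketch; only the numbers come from the table, and the table gives no units.

```python
# House data from the table, stored as two parallel lists (arrays).
# The value at index i in `prices` belongs with the value at index i in `sizes`.
prices = [7, 8, 8, 9, 9, 9, 10, 11, 14, 14, 15]
sizes = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

# Pair each size with its price to make the relationship explicit.
houses = list(zip(sizes, prices))

def price_per_size(size, price):
    """Hypothetical helper: price per unit of size for one house."""
    return price / size

for size, price in houses:
    ratio = price_per_size(size, price)
    print(f"size={size:>3}  price={price:>2}  price/size={ratio:.3f}")
```

Keeping the two lists the same length and in the same order is what preserves the relationship between each price and its size.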

 

Quantitative data versus qualitative data

Quantitative data is numerical data:

  1. 55 cars
  2. 15 meters
  3. 35 children

Qualitative data is descriptive data:

  1. It's cold.
  2. It's long.
  3. That's fun!

Census or sampling

A census is when we collect data for every member of a group.

A sample is when we collect data for only some members of a group.

If we want to know how many Americans smoke, we can ask everyone in America (a census), or we can ask 10,000 people (a sample).

A census is accurate but difficult to carry out. A sample is less accurate but easier to carry out.
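
The trade-off between asking everyone and asking only some people can be illustrated with a small simulation. This is only a sketch: the population is synthetic, and the 20% smoking rate, the population size, and the function names are made-up assumptions, not real statistics.

```python
import random

def make_population(size, smoker_rate, seed=0):
    """Build a synthetic population: True = smoker, False = non-smoker."""
    rng = random.Random(seed)
    return [rng.random() < smoker_rate for _ in range(size)]

def census_rate(population):
    """Ask everyone: exact, but the effort grows with the population size."""
    return sum(population) / len(population)

def sample_rate(population, sample_size, seed=1):
    """Ask only `sample_size` randomly chosen people: cheaper, approximate."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

population = make_population(size=200_000, smoker_rate=0.20)
exact = census_rate(population)
estimate = sample_rate(population, 10_000)
print(f"every member asked: {exact:.4f}")
print(f"10,000 asked:       {estimate:.4f}")
```

The estimate from 10,000 people will usually be close to the exact figure but not identical, which is the sense in which a sample trades some accuracy for much less effort.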

Sampling terminology

A population is the group of individuals (subjects) that we want to collect information about.

A census is information about every individual in the population.

A sample is information about a part of the population (chosen to represent the whole).

Random sample

For a sample to represent the population, it must be collected randomly.

A random sample is a sample in which every member of the population has an equal chance of appearing in the sample.
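
One way to draw such a sample in Python is `random.sample` from the standard library, which selects without replacement and gives every member the same chance of being included. The sketch below uses a hypothetical population of ten member IDs and repeats the draw many times to check that the inclusion frequencies come out roughly equal; the sizes and the seed are arbitrary choices for illustration.

```python
import random
from collections import Counter

members = list(range(10))      # hypothetical member IDs 0..9
sample_size = 3
draws = 20_000

rng = random.Random(42)        # fixed seed so the run is repeatable
counts = Counter()
for _ in range(draws):
    # Each draw picks 3 distinct members, all equally likely to be chosen.
    counts.update(rng.sample(members, sample_size))

# Each member should appear in about sample_size / len(members) = 30% of draws.
inclusion = {m: counts[m] / draws for m in members}
for m, share in inclusion.items():
    print(f"member {m}: included in {share:.1%} of samples")
```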

Sampling bias

Sampling bias (a systematic error) occurs when samples are collected in a way that makes some individuals less (or more) likely than others to be included in the sample.
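
The effect can be sketched with a synthetic example. Everything here is assumed for illustration: a population in which about 20% of people are smokers, and a flawed collection method in which smokers are three times as likely to be reached as non-smokers.

```python
import random

rng = random.Random(7)

# Synthetic population: True = smoker (about 20%), False = non-smoker.
population = [rng.random() < 0.20 for _ in range(100_000)]
true_rate = sum(population) / len(population)

# Fair sample: every member has the same chance of being selected.
fair = rng.sample(population, 5_000)

# Biased sample: smokers get weight 3, non-smokers weight 1, so smokers
# are over-represented (drawn with replacement for simplicity).
weights = [3 if is_smoker else 1 for is_smoker in population]
biased = rng.choices(population, weights=weights, k=5_000)

fair_rate = sum(fair) / len(fair)
biased_rate = sum(biased) / len(biased)
print(f"true={true_rate:.3f}  fair={fair_rate:.3f}  biased={biased_rate:.3f}")
```

Under these assumed weights the biased estimate is expected to land near 3p / (3p + (1 - p)), which is about 0.43 for p = 0.2, well above the true rate. Collecting more data with the same flawed method does not remove the gap.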

Big data

Big data is data that humans cannot process without the assistance of advanced machines.

Big data doesn't have a specific size definition, but datasets are constantly getting larger as we continuously collect more data and store it at increasingly lower costs.

Data mining

Big data comes with complex data structures.

A large part of Big Data processing is refining the data.

Kareem Winters
Update 07 March 2026