Machine Learning Data

Table of Contents

What is data?
Intelligence needs data.
Healthcare
Business
Finance
Data storage
Quantitative data versus qualitative data
Inventory or sampling
Sampling terminology
Random sample
Sampling error
Big data
Data mining

Up to 80% of a Machine Learning project involves data collection:

What data is needed?
What data is available?
How do we select the data?
How do we collect the data?
How do we clean the data?
How do we prepare the data?
How do we use the data?

What is data?

Data can be many things. In Machine Learning, data is a collection of facts:

Type	Examples
Number	Price. Date.
Size	Dimensions. Height. Weight.
Vocabulary	Names and places.
Observation	Count the cars.
Description	It's cold.

Intelligence needs data.

Human intelligence needs data: A real estate agent needs data on homes that have been sold to estimate prices.

Artificial intelligence also needs data: A machine learning program needs data to estimate prices.

Data can help us see and understand.
Data can help us identify new opportunities.
Data can help us resolve misunderstandings.

Healthcare

The healthcare and life sciences industries collect public health data and patient data to learn how to improve patient care and save lives.

Business

The most successful companies in many fields are data-driven. They use sophisticated data analytics to understand how the company can perform better.

Finance

Banks and insurance companies collect and evaluate data on customers, loans, and deposits to support strategic decision-making.

Data storage

The most commonly collected data are numbers and sizes (measurements). This data is typically stored in arrays that represent the relationship between the values.

This table shows house prices against house sizes:

Price	7	8	8	9	9	9	10	11	14	14	15
Size	50	60	70	80	90	100	110	120	130	140	150
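
As a minimal sketch of the array storage described above, the table can be held in two parallel Python lists. The variable names (`prices`, `sizes`, `houses`, `price_per_size`) are chosen for this illustration only, and the source table gives no units, so none are assumed here.

```python
# House prices and sizes from the table, stored as two parallel lists.
prices = [7, 8, 8, 9, 9, 9, 10, 11, 14, 14, 15]
sizes = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

# Pair each size with its price so the relationship is explicit.
houses = list(zip(sizes, prices))

# Price per unit of size (the source table does not state units).
price_per_size = [price / size for size, price in houses]

for (size, price), ratio in zip(houses, price_per_size):
    print(f"size={size:>3}  price={price:>2}  price/size={ratio:.3f}")
```

Keeping the two lists the same length and in the same order is what preserves the relationship between each price and its size.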

Quantitative data versus qualitative data

Quantitative data is numerical data:

55 cars
15 meters
35 children

Qualitative data is descriptive data:

It's cold.
It's long.
That's fun!
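
As a rough illustration, the two kinds of data can be told apart by type in code. The sketch below assumes that quantitative values are stored as numbers and qualitative values as strings; the `observations` list is made up for the example.

```python
# Hypothetical mixed observations: counts and measurements next to descriptions.
observations = [55, "It's cold.", 15.0, "It's long.", 35, "That's fun!"]

# Quantitative data: numeric values.
quantitative = [x for x in observations if isinstance(x, (int, float))]

# Qualitative data: descriptive values stored as text.
qualitative = [x for x in observations if isinstance(x, str)]

print("quantitative:", quantitative)
print("qualitative:", qualitative)
```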

Inventory or sampling


An inventory (also called a census) is when we collect data for every member of a group.

Sampling is when we collect data for only some members of a group.

If we want to know how many Americans smoke, we can ask everyone in America (an inventory), or we can ask 10,000 people (sampling).

An inventory is accurate but difficult to carry out. Sampling is less accurate but easier to carry out.
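
The trade-off can be sketched with a simulated population. Everything in this example is an assumption made for illustration: the population size, the 20% smoking rate, and the helper `proportion` are invented, and only the sample size of 10,000 comes from the text.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: True means "smokes". The 20% rate is made up.
population = [rng.random() < 0.20 for _ in range(200_000)]

def proportion(values):
    """Fraction of True entries in a list of booleans."""
    return sum(values) / len(values)

# Inventory (census): look at every member of the population.
census_rate = proportion(population)

# Sampling: look at 10,000 members drawn at random, without replacement.
sample = rng.sample(population, 10_000)
sample_rate = proportion(sample)

print(f"inventory: {census_rate:.3f}")
print(f"sample:    {sample_rate:.3f}")
```

The sample examines far fewer members than the full count, and its estimate is close to, but usually not exactly equal to, the value obtained from the whole population.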

Sampling terminology

A population is the group of individuals (subjects) we want to collect information about.

An inventory is information about every individual in the population.

A sample is information about a part of the population (chosen to represent the whole).

Random sample

For a sample to represent the population, it must be collected randomly.

A random sample is a sample in which each member of the population has an equal chance of being included.
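
The "equal chance" property can be checked empirically. The sketch below uses Python's standard `random.sample` on ten hypothetical member IDs and counts how often each one is drawn across many samples; the population, sample size, and number of trials are arbitrary choices for this example.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the run is repeatable

population = list(range(10))  # ten hypothetical member IDs
sample_size = 3
trials = 10_000

# Draw many random samples and count how often each member appears.
counts = Counter()
for _ in range(trials):
    counts.update(rng.sample(population, sample_size))

# With equal chances, each member should appear in about 3/10 of the samples.
for member in population:
    print(member, round(counts[member] / trials, 3))
```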

Sampling error

Sampling bias (error) occurs when samples are collected in a way that makes some individuals less (or more) likely to be included in the sample than others are.
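
The effect can be sketched by comparing a random sample with a deliberately skewed one. All figures here are invented for illustration: two hypothetical groups with different smoking rates, and a biased collection method in which one group is five times as likely to be picked.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical population of (group, smokes) pairs. The rates are made up.
population = (
    [("young", rng.random() < 0.10) for _ in range(50_000)]
    + [("old", rng.random() < 0.30) for _ in range(50_000)]
)

def smoking_rate(people):
    """Fraction of people in the list who smoke."""
    return sum(smokes for _, smokes in people) / len(people)

true_rate = smoking_rate(population)

# Random sample: every person has the same chance of being picked.
random_rate = smoking_rate(rng.sample(population, 5_000))

# Biased sample: "old" people are five times as likely to be picked.
# random.choices draws with replacement, which is acceptable for a sketch.
weights = [5 if group == "old" else 1 for group, _ in population]
biased_rate = smoking_rate(rng.choices(population, weights=weights, k=5_000))

print(f"population:    {true_rate:.3f}")
print(f"random sample: {random_rate:.3f}")
print(f"biased sample: {biased_rate:.3f}")
```

Because the over-represented group smokes more in this made-up population, the biased sample overstates the overall rate, while the random sample stays close to it.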

Big data

Big data is data that humans cannot process without the assistance of advanced machines.

Big data doesn't have a specific size definition, but datasets are constantly getting larger as we continuously collect more data and store it at increasingly lower costs.

Data mining

Big data comes with complex data structures.

A large part of processing big data is refining the data.
