Machine Learning Data
Up to 80% of a Machine Learning project involves data collection:
- What data is needed?
- What data is available?
- How do we select the data?
- How do we collect the data?
- How do we clean the data?
- How do we prepare the data?
- How can we use the data?
What is data?
Data can be many things. In Machine Learning, data is a collection of facts:
| Type | Examples |
|---|---|
| Numbers | Price. Date. |
| Measurements | Size. Height. Weight. |
| Words | Names and places. |
| Observations | Counting cars. |
| Descriptions | It's cold. |
Intelligence needs data.
Human intelligence needs data: A real estate agent needs data on homes that have been sold to estimate prices.
Artificial intelligence also needs data: A machine learning program needs data to estimate prices.
- Data can help us see and understand.
- Data can help us identify new opportunities.
- Data can help us resolve misunderstandings.
Healthcare
The healthcare and life sciences industries collect public health data and patient data to learn how to improve patient care and save lives.
Business
The most successful companies in many fields are data-driven. They use sophisticated data analytics to understand how the company can perform better.
Finance
Banks and insurance companies collect and evaluate data on customers, loans, and deposits to support strategic decision-making.
Data storage
The most commonly collected data are numbers and measurements. This data is typically stored in arrays that represent the relationship between the values.
This table shows house prices against house size:
| Size | Price |
|---|---|
| 50 | 7 |
| 60 | 8 |
| 70 | 8 |
| 80 | 9 |
| 90 | 9 |
| 100 | 9 |
| 110 | 10 |
| 120 | 11 |
| 130 | 14 |
| 140 | 14 |
| 150 | 15 |
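As a minimal sketch of what "stored in arrays" can look like, the table values can be held in two parallel Python lists, where the same index in each list refers to the same house. The variable names (`prices`, `sizes`, `houses`) are illustrative assumptions, and the text gives no units for either column.

```python
# Values copied from the house price table; the source gives no units.
prices = [7, 8, 8, 9, 9, 9, 10, 11, 14, 14, 15]
sizes = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

# Pair each size with its price so the relationship between the
# two arrays is explicit: houses[i] == (sizes[i], prices[i]).
houses = list(zip(sizes, prices))

for size, price in houses:
    print(f"size={size:>3}  price={price:>2}")
```

Keeping the two lists the same length and in the same order is what preserves the relationship between the values.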
Quantitative data versus qualitative data
Quantitative data is numerical data:
- 55 cars
- 15 meters
- 35 children
Qualitative data is descriptive data:
- It's cold.
- It's long.
- That's fun!
Census or sampling
A census is when we collect data for every member of a group.
Sampling is when we collect data for only some members of a group.
If we want to know how many Americans smoke, we can ask everyone in America (a census), or we can ask 10,000 people (sampling).
A census is accurate but difficult to carry out. Sampling is less accurate but easier to carry out.
Sampling terminology
A population is the group of individuals (subjects) we want to collect information from.
A census is information about every individual in a population.
A sample is information about a part of the population, chosen to represent the whole.
Random sample
For a sample to be representative of the population, it must be collected randomly.
A random sample is a sample in which every member of the population has an equal chance of appearing in the sample.
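The difference between counting everyone and drawing a random sample can be sketched in a few lines of Python. Everything about the data here is an assumption made for illustration: the population size, the 20% smoking rate, and the function names are invented, not taken from any real survey. `random.sample` from the standard library picks items without replacement, giving each member the same chance of selection.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 people, True means "smokes".
# The 20% rate is an invented figure for illustration only.
POPULATION_SIZE = 100_000
population = [i < 20_000 for i in range(POPULATION_SIZE)]


def census_rate(people):
    """Look at every member of the population."""
    return sum(people) / len(people)


def sample_rate(people, sample_size):
    """Estimate the rate from a random sample of the population."""
    sample = random.sample(people, sample_size)
    return sum(sample) / sample_size


print(f"whole population: {census_rate(population):.3f}")
print(f"random sample:    {sample_rate(population, 10_000):.3f}")
```

The sample estimate will usually land close to the full count, but not exactly on it, while needing only a tenth of the answers.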
Sampling bias
Sampling bias (error) occurs when samples are collected in a way that makes some individuals less (or more) likely to be included in the sample than others.
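A small simulation can make the effect of this visible. The sketch below assumes a hypothetical population split into two age groups with different, invented smoking rates, then compares a random sample with a sample that can only reach one of the groups. The group labels, sizes, and rates are illustrative assumptions only.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of (age_group, smokes) records.
# Group sizes and smoking rates are invented for illustration.
population = (
    [("young", i < 1_000) for i in range(10_000)]    # 10% smoke
    + [("older", i < 3_000) for i in range(10_000)]  # 30% smoke
)


def rate(records):
    """Share of records where the person smokes."""
    return sum(smokes for _, smokes in records) / len(records)


# Random sample: every record has the same chance of selection.
random_sample = random.sample(population, 2_000)

# Biased sample: only the "young" group can ever be selected.
young_only = [record for record in population if record[0] == "young"]
biased_sample = random.sample(young_only, 2_000)

print(f"population:    {rate(population):.3f}")
print(f"random sample: {rate(random_sample):.3f}")
print(f"biased sample: {rate(biased_sample):.3f}")
```

Both samples are the same size, so the gap between them comes from who could be selected, not from how many people were asked.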
Big data
Big data is data that humans cannot process without the assistance of advanced machines.
Big data doesn't have a specific size definition, but datasets are constantly getting larger as we continuously collect more data and store it at increasingly lower costs.
Data mining
Big data comes with complex data structures.
A large part of big data processing is refining the data.