EDUCBA

Machine Learning Datasets

By Priya Pedamkar

Introduction to Machine Learning Datasets

This article provides an outline of machine learning datasets. A machine learning dataset is the collection of data needed to train a model and make predictions. Datasets are classified as structured or unstructured. A structured dataset is tabular: each row corresponds to a record and each column to a feature. Unstructured datasets consist of images, text, speech, audio, and similar data. A dataset is obtained through data acquisition, data wrangling, and data exploration. During the learning process, the dataset is divided into training, validation, and test sets, which are used to train the model and measure its accuracy.

Following are the three main steps needed in data analysis:

  • Data Acquisition
  • Data Wrangling or Data Pre-Processing
  • Data Exploration

The output of data analysis is a relevant dataset that can be used to train the model.

Types of Datasets

In machine learning, we often run into the problems of overfitting and underfitting while training a model.

To detect and limit these problems, we divide the dataset into three parts:

  • Training Dataset
  • Validation Dataset
  • Test Dataset

A common choice is to divide the dataset into these three parts in the ratio 60:20:20, although other ratios are also used.
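The 60:20:20 division can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `split_dataset`, its parameters, and the fixed seed are assumptions made for this example and do not come from any particular library.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and split them into train, validation, and test lists."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains becomes the test set
    return train, val, test

# 100 hypothetical records, identified here only by their index
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before splitting helps each part resemble the whole dataset. For time series data a random shuffle is usually not appropriate, because it mixes past and future observations.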

1. Training Dataset

  • This dataset is used to train the model, i.e. to update the weights of the model.

2. Validation Dataset

  • This dataset is used to reduce overfitting. It verifies that an increase in accuracy on the training dataset also holds when the model is tested on data that was not used in training.
  • If accuracy on the training dataset increases while accuracy on the validation dataset decreases, the model has high variance, i.e. it is overfitting.
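The overfitting signal described above (training accuracy rising while validation accuracy falls) can be checked with a small helper. This is a sketch only: the function name `first_overfitting_epoch` is hypothetical, and the accuracy values are made up purely for illustration.

```python
def first_overfitting_epoch(train_acc, val_acc):
    """Return the index of the first epoch where training accuracy rises
    while validation accuracy falls, or None if that never happens."""
    for epoch in range(1, len(train_acc)):
        train_up = train_acc[epoch] > train_acc[epoch - 1]
        val_down = val_acc[epoch] < val_acc[epoch - 1]
        if train_up and val_down:
            return epoch
    return None

# Made-up per-epoch accuracies, for illustration only
train_acc = [0.70, 0.80, 0.88, 0.93, 0.97]
val_acc = [0.68, 0.76, 0.81, 0.79, 0.75]
print(first_overfitting_epoch(train_acc, val_acc))  # 3
```

In practice validation accuracy is noisy, so a single drop is a weak signal; it is common to wait for several consecutive drops before stopping training.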

3. Test Dataset

  • When we repeatedly change the model based on results from the validation set, we unintentionally let the model peek into that set, and as a result the model may overfit the validation set as well.
  • To overcome this issue, we keep a test dataset that is used only to evaluate the final model and confirm its accuracy.

The structure and properties of a dataset are defined by its characteristics, such as its attributes or features. A dataset is generally created by manual observation, or is sometimes generated by an algorithm for application testing. The data in a dataset can be numerical, categorical, text, or time series. For example, in predicting car prices the target values are numerical. Each row of the dataset corresponds to an observation or sample.

Types of Data

Let’s look at the types of data found in datasets from the perspective of machine learning.

1. Numerical Data

Any data points that are numbers are termed numerical data. Numerical data can be discrete or continuous. Continuous data can take any value within a given range, while discrete data takes only distinct values. For example, the number of doors on a car is discrete (two, four, six, etc.), while the price of the car is continuous: it might be $1000 or $1250.50. In libraries such as pandas, numerical data typically has the data type int64 or float64.

2. Categorical Data

Categorical data represents characteristics, for example car color or date of manufacture. It can also be a numerical value, provided the number indicates a class: for example, 1 can denote a gas car and 0 a diesel car. Categorical data can be used to form groups, but arithmetic operations on it are not meaningful. Its data type is typically object (for example, in pandas).
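Because a model needs numbers, categorical values are usually encoded before training. The sketch below shows two simple encodings in plain Python: a class code, as in the gas/diesel example, and a one-hot vector for car color. The helper `one_hot_encode` and the sample values are assumptions made for this illustration.

```python
def one_hot_encode(values):
    """Map each distinct category to a one-hot vector.
    Categories are sorted so the column order is stable."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column of this value's category
        encoded.append(row)
    return categories, encoded

# Class codes: the number only names a class, it is not a quantity
fuel_codes = {"diesel": 0, "gas": 1}
fuel = [fuel_codes[f] for f in ["gas", "diesel", "gas"]]
print(fuel)  # [1, 0, 1]

# One-hot encoding of a hypothetical car color column
colors = ["red", "blue", "red", "white"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'red', 'white']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

One-hot encoding avoids implying an order between categories, which a single integer code would do if the model treated it as a quantity.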

3. Time Series Data

Time series data is a sequence of values collected at regular intervals over a period of time. It is important in fields such as the stock market, where we need the price of a stock at constant intervals. This type of data has a temporal field attached to it so that the timestamp of each data point can be tracked.
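As a small illustration of the temporal field, the sketch below pairs each value with a timestamp and checks that the observations are evenly spaced. The helper `is_regular`, the five-minute interval, and the prices are all hypothetical, chosen only for this example.

```python
from datetime import datetime, timedelta

def is_regular(timestamps):
    """Return True if consecutive timestamps are separated by one constant interval."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1

# Hypothetical stock prices sampled every five minutes
start = datetime(2022, 1, 3, 9, 30)
prices = [101.2, 101.5, 101.1, 101.8]
series = [(start + timedelta(minutes=5 * i), price) for i, price in enumerate(prices)]

timestamps = [t for t, _ in series]
print(is_regular(timestamps))                       # True
print(is_regular(timestamps[:2] + timestamps[3:]))  # False: one sample is missing
```

A check like this is useful before training, since gaps in the series usually need to be filled or handled explicitly.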

4. Text Data

Text data consists of words and characters. The first step in handling text data is to convert it into numbers, because our model is mathematical and needs its input in the form of numbers. To do so, we might use a technique such as the bag-of-words representation.
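A bag-of-words representation can be sketched with the Python standard library. Each document becomes a vector of word counts over a shared vocabulary. The function name `bag_of_words`, the simple tokenization rule (lowercase runs of letters and digits), and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["The car is red", "The red car is a fast car"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['a', 'car', 'fast', 'is', 'red', 'the']
print(vectors)  # [[0, 1, 0, 1, 1, 1], [1, 2, 1, 1, 1, 1]]
```

Note that this representation discards word order: only the counts survive, which is why it is called a "bag" of words.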

Various Sources of Dataset

It is often hard to find a dataset for a machine learning application.

The following is a list of dataset sources, along with their descriptions, that can be used for experimentation.

1. Google Dataset Search Engine

Link: https://datasetsearch.research.google.com/

Google has its own search engine for datasets. Its objective is to unify almost all the available dataset repositories and make them discoverable. One can easily search for a dataset based on the application of the learning model.

2. Microsoft Dataset

Link: https://msropendata.com/

Microsoft has Microsoft Research Open Data, a data repository that makes datasets created by researchers at Microsoft available to data scientists. Here one can find a collection of curated datasets.

3. Computer Vision Dataset

Link: https://www.visualdata.io/

This source provides image datasets. If you plan to work on image processing, deep learning, or computer vision, you can use this source. It offers visual datasets that can be used to build computer vision models.

4. Kaggle Dataset

Link: https://www.kaggle.com/datasets

It contains a large number of datasets of different shapes and sizes. Most of the available datasets have kernels associated with them, in which data scientists have shared the notebooks they used to analyze the dataset.

5. Amazon Dataset

Link: https://registry.opendata.aws/

It contains datasets from fields such as public transport and satellite imagery. These datasets are available on Amazon Web Services resources such as Amazon S3. It is handy if you plan to use AWS for machine learning experimentation and development.

Conclusion – Machine Learning Datasets

In this article, we looked at machine learning datasets and the importance of data analysis. We also saw the different types of datasets and data from the perspective of machine learning. Finally, we listed various sources from which datasets can be obtained for the experimentation and development of machine learning models.

© 2022 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.
