EDUCBA


Machine Learning Datasets

Article by Priya Pedamkar

Updated November 16, 2023


Introduction to Machine Learning Datasets

This article provides an outline of machine learning datasets. A machine learning dataset is a collection of data used to train a model and make predictions. Datasets are classified as structured or unstructured. A structured dataset is in tabular format: each row corresponds to a record and each column corresponds to a feature. An unstructured dataset consists of images, text, speech, audio, and so on. A dataset is prepared through data acquisition, data wrangling, and data exploration. During the learning process it is divided into training, validation, and test sets, which are used to train the model and measure its accuracy.


The following are the three main steps in data analysis:

  • Data Acquisition
  • Data Wrangling or Data Pre-Processing
  • Data Exploration

The output of data analysis is a relevant dataset that can be used to train the model.

Types of Datasets

In machine learning, we often encounter the problems of overfitting and underfitting while training a model.

To detect and limit these problems, we divide the dataset into three parts:

  • Training Dataset
  • Validation Dataset
  • Test Dataset

A common choice is to divide the dataset into these three parts in the ratio 60:20:20, although the most suitable ratio depends on the size of the dataset.
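The 60:20:20 division can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method. The function name split_dataset, its parameters, and the use of range(100) as stand-in data are assumptions made for this example.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # whatever remains becomes the test set
    return train, val, test

# Stand-in data: 100 integer "records" (hypothetical, for illustration only).
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))       # prints: 60 20 20
```

Shuffling before splitting avoids carrying any ordering of the original records into the three parts. For time series data, a chronological split is usually preferred over a shuffled one.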

1. Training Dataset

  • This dataset is used to train the model, i.e. to update the weights of the model.

2. Validation Dataset

  • This dataset is used to reduce overfitting. It verifies that an increase in accuracy on the training dataset also shows up as an increase in accuracy on data that was not used in training.
  • If the accuracy on the training dataset increases while the accuracy on the validation dataset decreases, the model has high variance, i.e. it is overfitting.
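The check described above, training accuracy rising while validation accuracy falls, can be sketched as follows. This is a simplified heuristic under stated assumptions, not a standard library routine. The function names, the window parameter, and the per-epoch accuracy values are all hypothetical.

```python
def shows_overfitting(train_acc, val_acc, window=2):
    """Return True if, over the last `window` epoch-to-epoch steps,
    training accuracy strictly rose while validation accuracy strictly fell."""
    if len(train_acc) <= window or len(val_acc) <= window:
        return False                          # not enough history to judge
    recent_train = train_acc[-(window + 1):]
    recent_val = val_acc[-(window + 1):]
    train_rising = all(b > a for a, b in zip(recent_train, recent_train[1:]))
    val_falling = all(b < a for a, b in zip(recent_val, recent_val[1:]))
    return train_rising and val_falling

def best_epoch(val_acc):
    """Return the index of the epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i])

# Made-up accuracies per epoch, for illustration only.
train_acc = [0.70, 0.80, 0.88, 0.93, 0.97]
val_acc = [0.68, 0.76, 0.79, 0.77, 0.74]

print(shows_overfitting(train_acc, val_acc))  # prints: True
print(best_epoch(val_acc))                    # prints: 2
```

In practice, one might keep the model from the epoch returned by best_epoch and stop training once the check fires.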

3. Test Dataset

  • When we repeatedly change the model based on its results on the validation set, we unintentionally let the model peek into the validation set, and as a result the model might overfit the validation set as well.
  • To overcome this issue, we keep a test dataset that is used only to evaluate the final model and confirm its accuracy.

The structure and properties of a dataset are defined by its characteristics, such as its attributes or features. A dataset is generally created by manual observation, or it may be generated by an algorithm for testing an application. The data in a dataset can be numerical, categorical, text, or time series. For example, when predicting the price of a car, the values to predict are numerical. Each row in the dataset corresponds to an observation or a sample.

Types of Data

Let’s look at the types of data found in datasets from the perspective of machine learning.

1. Numerical Data

Any data points that are numbers are termed numerical data. Numerical data can be discrete or continuous. Continuous data can take any value within a given range, while discrete data takes only distinct values. For example, the number of doors of a car is discrete, i.e. two, four, six, etc., and the price of a car is continuous, i.e. it might be $1000 or $1250.50. In pandas, numerical data is typically stored with the data type int64 or float64.

2. Categorical Data

Categorical data represents characteristics, for example car color, date of manufacture, etc. It can also be a numerical value, provided the number indicates a class. For example, 1 can be used to denote a gas car and 0 a diesel car. We can use categorical data to form groups, but arithmetic on the category labels is not meaningful. In pandas, categorical columns are commonly stored with the object data type.
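Because arithmetic on category labels is not meaningful, categorical values are often converted to indicator columns before training. The sketch below shows one such conversion, one-hot encoding, using only the standard library. One-hot encoding is not named in the article and is introduced here as an illustration. The function name one_hot_encode and the sample colors are assumptions.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators,
    one indicator per distinct category (categories sorted alphabetically)."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded

# Hypothetical car colors, for illustration only.
colors = ["red", "blue", "red", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # prints: ['blue', 'green', 'red']
print(encoded)     # prints: [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row contains exactly one 1, so no artificial ordering is imposed on the categories, unlike coding them as 0, 1, 2.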

3. Time Series Data

Time series data is a sequence of values collected at regular intervals over a period of time. It is very important in fields such as the stock market, where we need the price of a stock at constant intervals of time. This type of data has a temporal field attached to it, so that the timestamp of each value can be tracked.
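As a small illustration of values paired with a temporal field, the sketch below builds a series of timestamped prices and checks that the observations are evenly spaced. It uses only the standard library. The function name has_regular_interval, the 15-minute interval, the start time, and the prices are all made up for this example.

```python
from datetime import datetime, timedelta

def has_regular_interval(timestamps):
    """Return True if every pair of consecutive timestamps is the same distance apart."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1

# Hypothetical stock prices sampled every 15 minutes.
start = datetime(2023, 1, 2, 9, 30)
prices = [101.2, 101.9, 101.5, 102.3]
series = [(start + i * timedelta(minutes=15), price) for i, price in enumerate(prices)]

timestamps = [timestamp for timestamp, _ in series]
print(has_regular_interval(timestamps))  # prints: True
```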

4. Text Data

Text data consists of string literals. The first step in handling text data is to convert it into numbers, because our model is mathematical and needs its input in the form of numbers. To do so, we might use a representation such as bag of words.
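A bag-of-words conversion can be sketched with the standard library as shown below. This is a minimal version that lowercases the text and splits on whitespace. Real pipelines usually also handle punctuation and very common words. The function name bag_of_words and the sample sentences are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)              # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sentences, for illustration only.
docs = ["the car is red", "the red car is fast"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)  # prints: ['car', 'fast', 'is', 'red', 'the']
print(vectors)     # prints: [[1, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
```

Each document becomes a fixed-length list of numbers, which a model can consume directly. Word order is discarded, which is the defining trade-off of this representation.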

Various Sources of Dataset

It is often hard to find a dataset for a machine learning application.

The following is a list of dataset sources, along with their descriptions, that can be used for experimentation.

1. Google Dataset Search Engine

Link: https://datasetsearch.research.google.com/

Google has its own search engine for datasets. Its objective is to unify almost all the available dataset repositories and make them discoverable. One can easily search for a dataset based on the application of the learning model.

2. Microsoft Dataset

Link: https://msropendata.com/

Microsoft Research Open Data is a data repository that makes datasets created by researchers at Microsoft available to data scientists. It offers a collection of curated datasets.

3. Computer Vision Dataset

Link: https://visualdata.io/

This source provides image datasets. If you plan to work on image processing, deep learning, or computer vision, you can use this source to find visual datasets for building computer vision models.

4. Kaggle Dataset

Link: https://www.kaggle.com/datasets

It contains a large number of datasets of different shapes and sizes. Most of the available datasets have kernels associated with them, in which data scientists share the notebooks they used to analyze the dataset.

5. Amazon Dataset

Link: https://registry.opendata.aws/

It contains datasets from fields such as public transport and satellite imagery. These datasets are available on Amazon Web Services resources such as Amazon S3. It is handy if you plan to use AWS for machine learning experimentation and development.

Conclusion – Machine Learning Datasets

In this article, we looked at machine learning datasets and the importance of data analysis. We have also seen the different types of datasets and data from the perspective of machine learning. Finally, we listed various sources from which datasets can be obtained for the experimentation and development of machine learning models.
