Difference between Small Data and Big Data
Since ancient times, humans have been interested in collecting, categorizing and visually representing the data around them. The volume of data produced every day is increasing exponentially. About 90% of the data in the world has been created in the last two years, and data is coming from everywhere: social media, IoT devices, GPS signals and more.
When data volume grows beyond a certain limit, traditional systems and methodologies are no longer enough to process the data or transform it into a useful format. This is why data is generally categorized into two types: Small Data and Big Data.
Head to Head Comparison Between Small Data vs Big Data (Infographics)
Below are the top 10 differences between Small Data and Big Data
Key Differences between Small Data vs Big Data
- Data Collection – Small Data is usually part of OLTP systems and is collected in a controlled manner, then inserted into a caching layer or database. Databases may have read replicas to support immediate analytics queries if needed. A Big Data collection pipeline typically has queues such as AWS Kinesis or Google Pub/Sub to buffer high-velocity data. Downstream, streaming pipelines handle real-time analytics and batch jobs process cold data.
- Data Processing – Because most Small Data is generated by transaction systems, analytics on top of it is batch-oriented most of the time. In some rare cases, analytics queries run directly on the transaction systems. Big Data environments have both batch and stream processing pipelines. Stream processing is used for real-time analytics such as credit card fraud detection or stock price prediction. Batch processing is used to implement complex business logic and advanced algorithms on the data.
- Scalability – Small Data systems typically scale vertically. Vertical scaling means increasing system capacity by adding more resources to the same machine. It is costly but less complex to manage. Big Data systems mostly depend on horizontally scalable architectures, which add capacity by adding machines and give more agility at lower cost. Preemptible virtual machines available in the cloud make horizontally scalable systems even more affordable.
- Data Modeling – Small Data generated by transaction systems is in normalized form. ETL (Extract, Transform, Load) data pipelines convert it into a star or snowflake schema in a data warehouse. Here the schema is always enforced while writing data, which is relatively easy because the data is more structured. In Big Data, tabular data is only a fraction of the whole. Data is also replicated much more, for reasons such as failover or limitations of the underlying database engine (for example, some databases support only one secondary index per dataset). A schema is not enforced when writing. Instead, the schema is validated while reading data.
- Storage & Computation Coupling – In traditional databases, which mostly handle Small Data, storage and compute are tightly coupled. Data can be inserted into and retrieved from the database only through the given interface. Data cannot be put directly into the database filesystem, and existing data cannot be queried using other DB engines. This architecture greatly helps to ensure data integrity. Big Data systems have very loose coupling between storage and compute. Usually, data is stored in a distributed storage system such as HDFS, AWS S3 or Google GCS, and the compute engine used to query the data or run ETL is selected later. For example, interactive queries might be executed using Presto and ETL using Apache Hive on the same data.
- Data Science – Machine learning algorithms require input data in a well-structured and properly encoded format, and most of the time the input data comes from both structured stores such as a data warehouse and Big Data storage such as a data lake. Running machine learning algorithms solely on Small Data is easier because the data preparation stage is narrow. Preparing and enriching data in a Big Data environment takes much more time. On the other hand, Big Data gives many options for data science experimentation because of the high volume and variety of data.
- Data Security – Security practices for Small Data residing in an enterprise data warehouse or transaction system are provided by the corresponding database provider and might include user privileges, data encryption, hashing, etc. Securing Big Data systems is much more complicated and challenging. Security best practices include encrypting data at rest and in transit, isolating the cluster network, enforcing strong access control rules, etc.
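The Data Modeling point above contrasts enforcing a schema while writing with validating it while reading. The following is a minimal sketch of that contrast, not a production pattern. It assumes a hypothetical `orders` table with `order_id` and `amount` fields, uses Python's built-in `sqlite3` module to stand in for a traditional database, and uses a list of JSON strings to stand in for raw files in a data lake.

```python
import json
import sqlite3

# Schema on write: the database rejects rows that violate the declared schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " amount REAL NOT NULL CHECK (amount >= 0))"
)
conn.execute("INSERT INTO orders (order_id, amount) VALUES (?, ?)", (1, 19.99))

try:
    # A missing amount violates NOT NULL, so the write itself fails.
    conn.execute("INSERT INTO orders (order_id, amount) VALUES (?, ?)", (2, None))
    write_rejected = False
except sqlite3.IntegrityError:
    write_rejected = True

# Schema on read: raw records are stored as-is and checked only when read.
# These lines are made-up sample data for illustration.
raw_lines = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 2}',
    '{"order_id": "3", "amount": "7.50"}',
]


def read_orders(lines):
    """Apply the expected schema while reading; set aside records that do not fit."""
    valid, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            valid.append(
                {"order_id": int(record["order_id"]), "amount": float(record["amount"])}
            )
        except (KeyError, TypeError, ValueError):
            rejected.append(record)
    return valid, rejected


valid_orders, rejected_orders = read_orders(raw_lines)
```

In the first half the bad row never reaches storage. In the second half every record is stored, and the reader decides what to coerce and what to set aside, which is why validation effort moves to the consuming side in Big Data systems.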
Small Data vs Big Data Comparison Table
|Basis Of Comparison||Small Data||Big Data|
|Definition||Data that is "small" enough for human comprehension, in a volume and format that makes it accessible, informative and actionable||Data sets that are so large or complex that traditional data processing applications cannot deal with them|
|Data Source||Data from traditional enterprise systems such as enterprise resource planning (ERP) and customer relationship management (CRM); financial data such as general ledger data; payment transaction data from websites||Purchase data from point-of-sale systems; clickstream data from websites; GPS stream data (mobility data sent to a server); social media such as Facebook and Twitter|
|Volume||In most cases, tens or hundreds of GB; in some cases, a few TB (1 TB = 1000 GB)||More than a few terabytes (TB)|
|Velocity (rate at which data appears)||Controlled and steady data flow; data accumulates slowly||Data can arrive at very high speed; enormous amounts of data can accumulate within very short periods of time|
|Variety||Structured data in tabular format with a fixed schema, and semi-structured data in JSON or XML format||High-variety data sets that include tabular data, text files, images, video, audio, XML, JSON, logs, sensor data, etc.|
|Veracity (quality of data)||Contains less noise because the data is collected in a controlled manner||Data quality is usually not guaranteed. Rigorous data validation is required before processing.|
|Value||Business intelligence, analysis and reporting||Complex data mining for prediction, recommendation, pattern finding, etc.|
|Time Variance||Historical data remains equally valid because it represents solid business interactions||In some cases data becomes stale quickly (e.g., fraud detection)|
|Data Location||Databases within an enterprise, local servers, etc.||Mostly in distributed storage in the cloud or in external file systems|
|Infrastructure||Predictable resource allocation. Mostly vertically scalable hardware.||More agile infrastructure with horizontally scalable architecture. Load on the system varies a lot.|
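The Velocity and Time Variance rows suggest that in streaming use cases such as fraud detection, only recent events matter and each event has to be handled as it arrives. The following is a rough sketch of that idea using a sliding time window. The class name `VelocityCheck`, the 60-second window, the threshold of three transactions and the sample events are all assumptions made for illustration, and the sketch assumes events arrive in timestamp order. Real fraud detection relies on far richer signals.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # assumed window length
MAX_TXNS_IN_WINDOW = 3   # assumed threshold


class VelocityCheck:
    """Flag a card that makes too many transactions within a short window."""

    def __init__(self, window_seconds=WINDOW_SECONDS, max_txns=MAX_TXNS_IN_WINDOW):
        self.window_seconds = window_seconds
        self.max_txns = max_txns
        self.recent = defaultdict(deque)  # card_id -> timestamps inside the window

    def observe(self, card_id, timestamp):
        """Record one event; return True if the card now exceeds the threshold."""
        timestamps = self.recent[card_id]
        timestamps.append(timestamp)
        # Drop events that have aged out of the window: old data loses its value.
        while timestamps and timestamp - timestamps[0] > self.window_seconds:
            timestamps.popleft()
        return len(timestamps) > self.max_txns


check = VelocityCheck()
# Hypothetical (card_id, timestamp in seconds) events.
events = [
    ("card-A", 0), ("card-A", 10), ("card-B", 15),
    ("card-A", 20), ("card-A", 30), ("card-A", 200),
]
flags = [check.observe(card, ts) for card, ts in events]
```

Only the fifth event is flagged, because it is the fourth `card-A` transaction within 60 seconds. By the last event the earlier ones have expired, so the card is no longer flagged. A batch job, by contrast, would scan the full history at a later time.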
Conclusion – Small Data vs Big Data
The ultimate goal of data analysis is to get timely insights that support decision making. Categorizing data into Small and Big helps to tackle the challenges of analyzing each kind separately with the proper tools. The line between the two categories shifts as advanced data processing systems emerge that make even Big Data querying much faster and less complex.