Difference between Small Data and Big Data
Small Data is data that is small enough for human comprehension, in a volume and format that make it accessible, informative, and actionable. Data sets that are too large or complex for traditional data processing to handle are termed Big Data. When data volume grows beyond a certain limit, traditional systems and methodologies are no longer enough to process the data or transform it into a useful format. This is why data is generally categorized into two types: Small Data and Big Data.
Head to Head Comparison Between Small Data and Big Data (Infographics)
Below are the top 10 differences between Small Data and Big Data:
Key Differences between Small Data and Big Data
The differences between Small Data and Big Data are explained in the points presented below:
- Data Collection – Small Data is usually part of OLTP (online transaction processing) systems. It is collected in a controlled manner and then inserted into a caching layer or database. Databases have read replicas to support immediate analytics queries if needed. A Big Data collection pipeline uses queues such as AWS Kinesis or Google Pub/Sub to buffer high-velocity data. Downstream, streaming pipelines handle real-time analytics and batch jobs process cold data.
- Data Processing – Because most Small Data is generated by transaction systems, analytics on top of it is batch-oriented most of the time. In rare cases, analytics queries run directly on the transaction systems. Big Data environments have both batch and stream processing pipelines. Stream processing is used for real-time analytics such as credit card fraud detection or stock price prediction. Batch processing is used to implement complex business logic and advanced algorithms over the data.
- Scalability – Small Data systems typically scale vertically. Vertical scaling increases system capacity by adding more resources to the same machine. It is costly but less complex to manage. Big Data systems mostly depend on horizontally scalable architectures, which add capacity by adding more machines and give more agility at a lower cost. Preemptible virtual machines available in the cloud make horizontally scalable systems even more affordable.
- Data Modeling – Small Data generated by transaction systems is in normalized form. ETL (Extract, Transform, Load) data pipelines convert it into a star or snowflake schema in a data warehouse. Here the schema is always enforced when data is written, which is relatively easy because the data is more structured. Tabular data is only a fraction of Big Data. In Big Data systems, data is replicated much more, for reasons such as failover or limitations of the underlying database engine (for example, some databases support only one secondary index per dataset). A schema is not enforced when data is written. Instead, the schema is validated when the data is read.
- Storage & Computation Coupling – In traditional databases, which mostly handle Small Data, storage and compute are tightly coupled. Data can be inserted into and retrieved from the database only through the given interface. Data cannot be put directly into the database filesystem, and existing data cannot be queried using other DB engines. This architecture greatly helps to ensure data integrity. Big Data systems have very loose coupling between storage and compute. Usually, data is stored in a distributed storage system such as HDFS, AWS S3, or Google Cloud Storage (GCS), and the compute engine used to query the data or run ETL is selected later. For example, interactive queries might be executed using Presto and ETL using Apache Hive on the same data.
- Data Science – Machine learning algorithms require input data in a well-structured and properly encoded format, and most of the time the input data comes from both Small Data sources, such as transactional systems or a data warehouse, and Big Data storage, such as a data lake. Running machine learning solely on Small Data is easier because the data preparation stage is narrow. Preparing and enriching data in a Big Data environment takes much more time. In return, Big Data offers many options for data science experimentation due to its high volume and variety.
- Data Security – Security practices for Small Data residing in an enterprise data warehouse or transaction systems are provided by the corresponding database providers and might include user privileges, data encryption, hashing, etc. Securing Big Data systems is much more complicated and challenging. Security best practices include encrypting data at rest and in transit, isolating the cluster network, and enforcing strong access control rules.
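The schema-on-write versus schema-on-read distinction from the Data Modeling point can be illustrated with a minimal sketch. It is not taken from any particular product: the `payments` table, the `amount` field, and both function names are hypothetical, an in-memory SQLite database stands in for a traditional database, and a list of JSON strings stands in for raw files in a data lake.

```python
import json
import sqlite3


def write_with_schema(rows):
    """Schema-on-write: the database rejects bad rows at insert time.

    Returns (number of rows stored, list of rejected rows).
    The 'payments' table and its columns are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        " id INTEGER PRIMARY KEY,"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )
    rejected = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO payments (id, amount) VALUES (?, ?)",
                (row.get("id"), row.get("amount")),
            )
        except sqlite3.IntegrityError:
            # NOT NULL or CHECK constraint failed: the row never reaches storage.
            rejected.append(row)
    stored = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
    conn.close()
    return stored, rejected


def read_with_schema(raw_lines):
    """Schema-on-read: everything was stored as-is; validate only when reading.

    Returns (list of valid records, list of raw lines that failed validation).
    """
    valid, invalid = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            invalid.append(line)
            continue
        amount = record.get("amount") if isinstance(record, dict) else None
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if is_number and amount >= 0:
            valid.append(record)
        else:
            invalid.append(line)
    return valid, invalid
```

In the first function, invalid rows are turned away before they are stored, so every reader can trust the table. In the second, storage accepts anything, and each reader decides which records are usable at query time.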
Small Data and Big Data Comparison Table
The table below compares Small Data and Big Data point by point.
|Basis Of Comparison||Small Data||Big Data|
|Definition||Data that is ‘small’ enough for human comprehension, in a volume and format that make it accessible, informative, and actionable||Data sets that are so large or complex that traditional data processing applications cannot deal with them|
|Data Source||Data from traditional enterprise systems such as enterprise resource planning (ERP) and customer relationship management (CRM); financial data such as general ledger data; payment transaction data from websites||Purchase data from point-of-sale systems; clickstream data from websites; GPS stream data (mobility data sent to a server); social media such as Facebook and Twitter|
|Volume||In most cases, tens or hundreds of GB; in some cases, a few TB (1 TB = 1000 GB)||More than a few terabytes (TB)|
|Velocity (rate at which data appears)||Controlled and steady data flow; data accumulates slowly||Data can arrive at very high speeds; enormous amounts of data can accumulate within very short periods of time|
|Variety||Structured data in tabular format with a fixed schema, and semi-structured data in JSON or XML format||High-variety data sets that include tabular data, text files, images, video, audio, XML, JSON, logs, sensor data, etc.|
|Veracity (quality of data)||Contains less noise because the data is collected in a controlled manner.||Usually, the quality of the data is not guaranteed. Rigorous data validation is required before processing.|
|Value||Business Intelligence, Analysis, and Reporting||Complex data mining for prediction, recommendation, pattern finding, etc.|
|Time Variance||Historical data remains equally valid because it represents solid business interactions||In some cases, data becomes stale quickly (e.g., fraud detection).|
|Data Location||Databases within an enterprise, local servers, etc.||Mostly in distributed storage in the cloud or in external file systems.|
|Infrastructure||Predictable resource allocation. Mostly vertically scalable hardware||More agile infrastructure with a horizontally scalable architecture. Load on the system varies a lot.|
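The horizontally scalable architecture named in the Infrastructure row can be pictured with a small sketch. It is only an illustration under stated assumptions: the workers are simulated as in-process lists rather than separate machines, the function names `partition`, `aggregate`, and `run` are hypothetical, and CRC32 hashing is one of many possible ways to assign keys to workers.

```python
import zlib
from collections import Counter


def partition(records, num_workers):
    """Assign each (key, value) record to a worker by hashing its key."""
    shards = [[] for _ in range(num_workers)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        shard_id = zlib.crc32(key.encode("utf-8")) % num_workers
        shards[shard_id].append((key, value))
    return shards


def aggregate(shard):
    """Work a single worker does on its own shard: sum the values per key."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def run(records, num_workers):
    """Partition the records, aggregate each shard, then merge the partial results."""
    merged = Counter()
    for shard in partition(records, num_workers):
        merged.update(aggregate(shard))
    return dict(merged)
```

Because every record with the same key lands on the same worker, each shard can be aggregated independently, and adding workers changes how the load is spread but not the final result. A vertically scaled system would instead run `aggregate` once over all records on a single, larger machine.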
The ultimate goal of data analysis is to get timely insights that support decision making. Categorizing data into Small and Big helps to tackle the challenges of analyzing each kind of data separately, with the proper tools. The line between the two categories shifts as emerging data processing systems make even Big Data querying much faster and less complex.
This has been a guide to Small Data vs Big Data. Here we have discussed the head-to-head comparison of Small Data and Big Data and their key differences, along with infographics and a comparison table.