NoSQL Overview
NoSQL, short for “Not Only SQL,” refers to a family of databases that vary widely in style and technology. They share one common trait: they are non-relational, meaning they are not standard row-and-column relational database management systems (RDBMS). A more accurate name for these databases would therefore be “non-relational.”
- NoSQL databases don’t require fixed schemas, making them suitable for evolving use cases.
- NoSQL databases horizontally scale easily, allowing you to add more capacity for data and traffic as the demand grows.
- NoSQL databases are distributed systems that also provide native fault tolerance and availability.
In the late 2000s, several new databases emerged on the scene, many of them from open source communities. The following databases became more widely used in applications, particularly in ones that required larger scale than a relational database could manage:
- Apache Cassandra
- MongoDB
- Riak
- CouchDB
- HBase
- Redis
- Neo4j
In the last ten years or so, several NoSQL databases have adopted a fully managed service model, also called database as a service (DBaaS). Examples include IBM Cloudant and Amazon DynamoDB.
Capabilities
- NoSQL databases support a flexible data model, which means you can store unstructured or semi-structured data more easily.
- Some NoSQL databases provide native, built-in horizontal and vertical scaling capabilities.
- Developers can work faster and more productively with data structures that match the application’s needs for reads and writes.
- NoSQL databases work in distributed environments and provide high availability and fault tolerance.
Examples:
- Social media platforms use document databases to manage user profiles.
- Column databases store user activity feed information.
- Key-value databases help organizations manage user sessions and speed up access to frequently accessed data.
- Graph databases do the essential work of keeping people connected by storing information about friends and relationships.
Open Source
There is some overlap among these types, so the definition isn’t always clear. One commonality is that the majority of them have their roots in the open-source community and have been used under open-source models.
This has been fundamental to their growth in the industry. You’ll often see companies that provide a commercial version of the database, along with services and support for the technology, while also sponsoring the open-source counterpart. Examples of this include:
- IBM Cloudant for CouchDB
- Datastax for Apache Cassandra
- MongoDB, whose vendor also maintains its own open-source version of the MongoDB database
Commonalities
Technically speaking they all differ quite a bit, but a few commonalities do emerge.
- Most NoSQL databases are built to scale horizontally and share their data more easily than their relational counterparts. Doing so often requires a globally unique key across the whole database to simplify partitioning (or ‘sharding’).
- They’re also more specialized for certain use cases than RDBMSs, which have traditionally been the Swiss Army knives of datastores.
- Developers are drawn to NoSQL databases for their ease of data modeling and use.
- Many NoSQL databases allow more agile development through their flexible schemas as compared to the fixed schemas of relational databases.
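To make the partitioning point above concrete, here is a minimal sketch of hash-based sharding on a globally unique key. The key names, the shard count, and the `shard_for` helper are all hypothetical, and plain modulo hashing is only one possible scheme; real systems often use consistent hashing or range partitioning instead.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a globally unique key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer so the result is the same on
    # every machine and every run (unlike Python's built-in hash()).
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical keys; because each key is unique across the whole database,
# the key alone is enough to decide where a record lives.
keys = ["user:1001", "user:1002", "order:77", "order:78"]
placement = {key: shard_for(key, num_shards=4) for key in keys}

for key, shard in placement.items():
    print(f"{key} -> shard {shard}")
```

Any node can compute the same placement independently, so no central lookup table is needed. The trade-off in this simple scheme is that changing `num_shards` moves most keys to a different shard.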
Why Use NoSQL
NoSQL databases offer several benefits, although not every database exhibits all of them:
- First, the most common reason to employ a NoSQL database is for scalability, particularly the ability to horizontally scale across clusters of servers, racks, and possibly even data centers. The elasticity of scaling both up and down to meet the varying demands of applications is key.
- Second, performance goes hand in hand with scalability. Modern applications must deliver fast response times even with large data sets and high concurrency, and the ability of NoSQL databases to leverage the resources of large clusters of servers makes them well suited to these circumstances.
- High availability is an obvious requirement for a database, and having a database run on a cluster of servers with multiple copies of the data makes for a more resilient solution than a single server solution. Historically, large databases have run on expensive machines or mainframes. Modern enterprises are employing cloud architectures to support their applications, and the distributed data nature of NoSQL databases means that they can be deployed and operated on clusters of servers in cloud architectures, thereby massively reducing cost.
- Cost is important for any technology venture, and it is common to hear of NoSQL adopters cutting significant costs versus their existing databases while still getting the same or better performance and functionality.
- Flexible schema and intuitive data structures are key features that developers love when wanting to build applications efficiently. Most NoSQL databases allow for having flexible schemas, which means that one can build new features into applications quickly and without any database locking or downtime.
- NoSQL databases also have varied data structures, which are often a more elegant fit for development needs than the rows and columns of relational datastores. Examples include key-value stores for quick lookup, document stores for storing de-normalized intuitive information, and graph databases for associative data sets.
- There are also various specialized capabilities that certain NoSQL providers offer that attract end users.
Types
There is a general consensus that NoSQL databases fit into four types: key-value, document, column-based, and graph.
Document NoSQL
Document-store databases, also known as document-oriented databases, store data in a document format, typically JSON or BSON (binary JSON), where each document contains key-value pairs or key-document pairs. These databases are schema-less, allowing flexibility in data structures within a collection.
Characteristics
- Values are visible and can be queried
- Each piece of data is considered a document, typically JSON or XML
- Each document offers a flexible schema: No two documents are alike nor do they need to contain the same information
- Provides schema flexibility: Documents within collections can have varying structures, allowing for easy updates and accommodation of evolving data requirements.
- Performs efficient create, read, update, and delete (CRUD) operations: well-suited for read and write-intensive applications due to their ability to retrieve whole documents.
- Provides scalability: horizontal scalability by sharding data across clusters.
- Content can be indexed and queried using key and value range lookups and search, or analytical queries with MapReduce
- Horizontally scalable: allow sharding across multiple nodes
- Typically only guarantee atomic operations on single documents
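As a rough illustration of the characteristics above, the sketch below models a collection as a list of JSON-like documents with different fields and queries them by value. The `posts` data and the `find` helper are hypothetical and do not correspond to any particular database's API; a real document store would use indexes rather than a linear scan.

```python
import json

# Hypothetical collection: documents in the same collection need not
# share the same fields.
posts = [
    {"_id": 1, "type": "article", "title": "Intro to NoSQL", "tags": ["db", "nosql"]},
    {"_id": 2, "type": "image", "url": "/img/cat.png", "width": 800},
    {"_id": 3, "type": "article", "title": "Sharding basics", "author": {"name": "Ada"}},
]

def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Values are visible to the database, so they can be used in queries.
articles = find(posts, type="article")
print(json.dumps(articles, indent=2))
```

Here the two articles are returned even though one has `tags` and the other has a nested `author` document, which is the schema flexibility the list describes.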
Use cases
- Content management systems (CMS): CMS platforms can use document-store databases for fast storage and access to content types such as articles, images, and user data. (MongoDB)
- E-commerce: E-commerce platforms need effective management of product catalogs with diverse attributes and hierarchies, accommodating the dynamic nature of e-commerce product listings. (Couchbase, or Amazon DocumentDB with MongoDB compatibility)
- Event logging for apps and processes: each event instance is represented by a new document
- Online blogs: each user, post, comment, like, or action is represented by a document
- Operational datasets and metadata for web and mobile apps
Unsuitable
- When you require ACID transactions that operate over multiple documents: many document databases guarantee atomicity only for single-document operations, so an RDBMS may be better suited
- When you would have to force your data into an aggregate-oriented design: if the data naturally falls into a normalized tabular model, a relational database is a better fit
Vendors
- MongoDB
- Couchbase
- Amazon DocumentDB
Key-value NoSQL
Key-value stores are the simplest NoSQL databases, storing data as a collection of key-value pairs, where the key is unique and directly points to its associated value.
Characteristics
- Delivers high performance: efficient for read and write operations, optimized for speedy retrieval based on keys
- Provides scalability: easily scalable due to their simple structure and ability to distribute data across nodes
- Uses caching for fast access
- Provides session management
- Works with distributed systems
- Least complex
- Represented as a hashmap
- Ideal for CRUD operations
- Scales well
- Shards easily
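The "represented as a hashmap" idea can be sketched with a plain Python dictionary. The `KeyValueStore` class and the session key below are hypothetical names used only for illustration; a real key-value database adds persistence, replication, and networking on top of this basic interface.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a dict (a hashmap)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Create or update: the value is opaque to the store.
        self._data[key] = value

    def get(self, key, default=None):
        # Read: a single hash lookup by key.
        return self._data.get(key, default)

    def delete(self, key):
        # Delete: returns True if the key existed.
        if key in self._data:
            del self._data[key]
            return True
        return False

store = KeyValueStore()
store.put("session:abc123", {"user": "ada", "cart_items": 2})
print(store.get("session:abc123"))
store.delete("session:abc123")
print(store.get("session:abc123"))  # None once the session is removed
```

Every operation goes through the key, which is why this model is simple to distribute: a key can be hashed to pick the node that owns it.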
Use cases
- Improving web performance by caching frequently accessed data (using Redis or Memcached)
- Storing and retrieving session information for web applications
- E-commerce platforms, shopping carts, software applications, including gaming: Amazon DynamoDB provides a highly scalable key-value store, facilitating distributed systems’ seamless operation by handling high traffic and scaling dynamically.
- For quick basic CRUD operations on non-interconnected data
- Storing in-app user profiles and preferences
Unsuitable
- Interconnected data with many-to-many relationships: social networks, recommendation engines
- When a high level of consistency is required for multi-operation transactions that span multiple keys
- When apps run queries based on the value rather than the key
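The last point can be illustrated with a small sketch, assuming a plain dictionary stands in for the store. The session data and the `find_keys_by_value` helper are hypothetical. A lookup by key is a single hash lookup, while a query on the value has to inspect every entry.

```python
# Hypothetical session data keyed by session ID.
sessions = {
    "session:a1": {"user": "ada", "cart_items": 2},
    "session:b2": {"user": "grace", "cart_items": 0},
    "session:c3": {"user": "ken", "cart_items": 5},
}

# Lookup by key: direct access, independent of how many entries exist.
print(sessions["session:b2"])

def find_keys_by_value(store, predicate):
    """Query by value: must scan every entry because values are not indexed."""
    return [key for key, value in store.items() if predicate(value)]

# Which sessions have a non-empty cart? This touches all three entries.
non_empty = find_keys_by_value(sessions, lambda v: v["cart_items"] > 0)
print(non_empty)
```

With millions of keys spread over many nodes, that full scan is what makes value-based queries a poor fit for this model.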
Vendors
- Redis
- Memcached
- Amazon DynamoDB
- Oracle NoSQL Database
- Aerospike
- Riak KV
- MemcacheDB
Column-family NoSQL
Column-family store NoSQL databases, also referred to as columnar databases, organize data in columns rather than rows. These databases store columns of data together, making them efficient for handling large data sets with dynamic schemas.
Characteristics
- Uses column-oriented storage: Data is grouped by columns rather than rows, allowing for efficient retrieval of specific columns.
- Delivers scalability: Distributed architecture for high availability and scalability.
These databases are also commonly referred to as Bigtable clones, columnar databases, or wide-column databases. As you can tell from the name, these databases focus on columns and groups of columns when storing and accessing data.
- A column family consists of several rows.
- Each row has a unique key or identifier and holds one or more columns.
- These columns are grouped together in families because they are often accessed together.
- Rows in a column family are not required to share any of the same columns.
- Rows can share all columns, a subset of columns, or none of the columns, and columns can be added to some rows and not to others.
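A minimal sketch of the row and column rules above, using nested dictionaries as a stand-in for a column family. The `user_profiles` family, its row keys, and the `get_columns` helper are hypothetical and only illustrate the data model, not any database's storage engine.

```python
# One column family: row key -> {column name -> value}.
# Rows are sparse, so each row stores only the columns it actually has.
user_profiles = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Grace", "phone": "555-0100", "city": "Arlington"},
    "user:3": {"last_login": "2024-05-01"},
}

def get_columns(column_family, row_key, columns):
    """Read selected columns of one row, skipping columns the row lacks."""
    row = column_family.get(row_key, {})
    return {name: row[name] for name in columns if name in row}

# A column can be added to one row without touching any other row.
user_profiles["user:1"]["city"] = "London"

print(get_columns(user_profiles, "user:1", ["name", "city"]))
print(get_columns(user_profiles, "user:2", ["name", "email"]))  # no email stored
```

Note that `user:3` shares no columns with `user:1`, which a fixed relational schema would only allow by storing many NULL values.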
Use cases
- IoT applications: column-family databases manage massive amounts of sensor data efficiently because they can handle time-stamped (time-series) data at scale. (Apache Cassandra)
- Personalization: applications store and analyze user preferences and behaviors to deliver personalized experiences. (HBase, part of the Hadoop ecosystem)
- Large-scale data analysis when you’re dealing with large amounts of sparse data. When compared to row-oriented databases, column-based databases can better compress data and save storage space. In addition, these databases continue the trend of horizontal scalability.
- As with key value and document databases, column-based databases can handle being deployed across clusters of nodes.
- Like document databases, a column-based NoSQL database can be used for event logging and blogs, but the data would be stored in a different fashion.
- For enterprise event logging, every application can write to its own set of columns and have each row key formatted in such a way to promote easy lookup based on application and timestamp.
- Counters are a unique use case for column-based databases. You may come across applications that need an easy way to count or increment as events occur. Some column-based databases, like Cassandra, have special column types that allow for simple counters.
- In addition, columns can have a time-to-live parameter, making them useful for data with an expiration date or time like trial periods or ad timing.
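The event-logging idea above, where each row key encodes the application and a timestamp, can be sketched as follows. The `app#timestamp` key format, the `log_event` and `scan` helpers, and the sample events are all assumptions for illustration; the sketch also assumes application names do not contain `#` and that timestamps are given in UTC.

```python
import bisect
from datetime import datetime

def row_key(app: str, ts: datetime) -> str:
    """Build a row key whose string order matches (application, time) order."""
    # Fixed-width ISO 8601 timestamps sort lexicographically in time order.
    return f"{app}#{ts.strftime('%Y-%m-%dT%H:%M:%SZ')}"

events = {}  # row key -> {column name -> value}

def log_event(app, ts, **columns):
    events[row_key(app, ts)] = columns

def scan(app, start, end):
    """Return one application's events with start <= timestamp < end."""
    keys = sorted(events)  # a real store keeps rows sorted by key already
    lo = bisect.bisect_left(keys, row_key(app, start))
    hi = bisect.bisect_left(keys, row_key(app, end))
    return [(key, events[key]) for key in keys[lo:hi]]

log_event("billing", datetime(2024, 1, 1, 12, 0, 0), level="INFO", msg="invoice created")
log_event("auth", datetime(2024, 1, 1, 12, 1, 0), level="WARN", msg="failed login")
log_event("billing", datetime(2024, 1, 1, 12, 5, 0), level="ERROR", msg="payment declined")
log_event("billing", datetime(2024, 1, 1, 13, 0, 0), level="INFO", msg="invoice paid")

for key, columns in scan("billing", datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 0)):
    print(key, columns)
```

Because all of one application's rows sit next to each other in key order, a lookup by application and time window becomes a contiguous range read rather than a scan of the whole table.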
Unsuitable
- If you require the traditional ACID transactions provided by relational databases. Reads and writes are only atomic at the row level.
- Early in development, when query patterns may change and require numerous changes to the column-based design. This can be costly and slow down the development timeline.
In a typical data warehousing scenario, you need to collect and store data from different sources, such as scientific research, land and business assets, user behavior, and ecommerce data. Row-oriented databases in data warehouses store each record (row) in contiguous blocks, while column-oriented databases store the values of each column contiguously.
This data storage method facilitates faster access to the data, enhancing an organization’s business intelligence and reporting capabilities.
Consider how a company might store ecommerce data in a column-oriented database. In a simplified layout, each column (or key name) is stored together with its data, so a query such as “display the total price for all orders of product ID P101” only needs to read the product ID and total price columns.
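The query described above can be sketched in a few lines. The order data, column names, and the `total_price_for` helper are hypothetical; the point is only to contrast a row layout with a column layout, not to model a real storage engine.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "product_id": "P101", "quantity": 2, "total_price": 40.0},
    {"order_id": 2, "product_id": "P205", "quantity": 1, "total_price": 15.5},
    {"order_id": 3, "product_id": "P101", "quantity": 1, "total_price": 20.0},
]

# Column-oriented layout: the values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def total_price_for(columns, product_id):
    """Sum total_price for one product, reading only the two columns needed."""
    return sum(
        price
        for pid, price in zip(columns["product_id"], columns["total_price"])
        if pid == product_id
    )

print(total_price_for(columns, "P101"))  # 60.0
```

The `order_id` and `quantity` columns are never touched by this query. On disk that means fewer blocks are read, which is where the speed-up for analytical queries comes from.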
Data analysts can also use columnar databases for their work:
- Financial analysis and other online analytical processing (OLAP) workloads involve analyzing data that doesn’t change often.
- When performing OLAP, you work with a large number of records but analyze only a subset of the data stored within the columns, for example, to build a histogram of insurance premiums paid during the last financial year.
IoT devices are everywhere. You’ll find these devices in cars, trucks, cameras, smartphones, and even light bulbs and luggage, capturing data on demand, continuously, or periodically. Storing this data in a traditional database requires a lot of data storage space, and data access can suffer latency. Because IoT data is often used for real-time and near-real-time analysis, including visualizations, data analysts do not want to incur long wait times when querying data.
Columnar databases help resolve these problems, helping to provide efficient storage and timely access to IoT data.
Vendors
- Apache Cassandra
- HBase
Wide-column NoSQL
Wide-column store NoSQL databases organize data in tables, rows, and columns, like relational databases, but with a flexible schema.
Characteristics
- Use columnar storage: Data is stored in columns, allowing for efficient retrieval of specific columns rather than entire rows.
- Provide horizontal scalability and fault tolerance.
Use cases
- Analyzing big data: Efficiently handling large-scale data processing for real-time big data analytics. (Apache HBase used in conjunction with Hadoop)
- Managing enterprise content: Large organizations need to manage vast amounts of structured data, such as employee records or inventory. (Cassandra)
Vendors
- Apache HBase
- Apache Cassandra
Graph NoSQL
Graph NoSQL databases are designed to manage highly interconnected data, representing relationships as first-class citizens alongside nodes and properties.
Characteristics
- Store information in entities (or nodes) and relationships (or edges)
- Perform very well when your data set resembles a graph-like data structure
- Graph databases do NOT shard well: traversing a graph with nodes split across multiple servers can become difficult and hurt performance
- Graph databases are typically ACID-compliant (unlike many other NoSQL databases), which prevents dangling relationships that point to nodes that don’t exist
- Analyzes the data using a graph data model: relationships are as important as the data itself, enabling efficient traversal and querying of complex relationships.
- Fast performance for relationship queries: optimized for queries involving relationships, making them ideal for social networks, recommendation systems, and network analysis.
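To show what a relationship query looks like, here is a minimal sketch that stores friendships as an adjacency map and finds everyone within a given number of hops. The `friends` data and the `within_hops` helper are hypothetical; a real graph database would express this in its own query language and persist nodes and edges on disk.

```python
from collections import deque

# Hypothetical nodes (people) and undirected "friend" edges.
friends = {
    "ada": {"grace", "linus"},
    "grace": {"ada", "ken"},
    "linus": {"ada"},
    "ken": {"grace", "dennis"},
    "dennis": {"ken"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]  # the starting node is not part of the result
    return distance

# Friends and friends-of-friends of "ada".
print(within_hops(friends, "ada", 2))
```

Each step follows stored edges directly instead of joining tables, which is why this kind of traversal stays fast as long as the neighboring nodes are on the same server.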
Use cases
- Social networks require efficient data management of relationships between users, posts, comments, and likes. (Neo4j)
- Recommendation systems: Organizations need a database structure that can create sophisticated recommendation engines, analyzing complex relationships between users, products, and behaviors for precise recommendations. (Amazon Neptune)
- Highly connected and related data
- Routing, spatial, and map apps
- Recommendation engines
Unsuitable
- When an application needs to scale horizontally
- When trying to update all or a subset of nodes with a given parameter
Vendors
- Neo4j
- Amazon Neptune
- ArangoDB