This article gives a comprehensive overview of NoSQL databases for data scientists.
Data Science and NoSQL Databases
Being a data scientist is not only about building machine learning models but also about being able to process, analyze and better communicate your results from data in different formats.
For years, traditional SQL databases were the only type of database in common use. However, with the rapid rise of the internet from the mid-1990s onward and the digital transformation that followed, a new type of database became prominent: NoSQL databases. They were introduced in response to the weaknesses of traditional SQL databases.
NoSQL databases can be used, for instance, by data scientists and machine learning engineers to store data, model metadata, features, and operational parameters. Data engineers, in turn, can use them to store and retrieve cleaned data.
In this conceptual blog (no coding required), we'll first build your understanding of NoSQL databases before exploring the importance of NoSQL. We'll also compare SQL and NoSQL databases and look at the uses and categories of the latter. Finally, we'll examine the most popular NoSQL databases for data scientists.
What are NoSQL databases?
NoSQL stands for Not Only SQL. NoSQL databases are non-relational: rather than requiring rows in fixed tables, they can store data in semi-structured or unstructured formats. The following graphic highlights the five key features of NoSQL databases.
Why are NoSQL databases important?
NoSQL databases have become popular in the industry because of the following benefits:
- Multi-mode data: NoSQL databases offer more flexibility than traditional SQL databases because they can store structured (e.g. data captured from sensors), unstructured (images, videos, etc.), and semi-structured (XML, JSON, etc.) data.
- Easy scalability: many NoSQL databases use peer-to-peer architectures, so capacity can be increased by adding more machines to the cluster.
- Global availability: this makes it possible to access the same data simultaneously through different machines from different geographical zones because the database is shared globally.
- Flexibility: NoSQL databases can rapidly adapt to changing requirements with frequent updates and new features.
NoSQL Databases vs. SQL Databases
Query Language
SQL databases use Structured Query Language (SQL) to perform operations and require a predefined schema to interact with the data.
NoSQL databases, on the other hand, query data against a dynamic schema. Some NoSQL databases also offer a SQL-like syntax for document manipulation.
Data Schema
SQL databases have a predefined, fixed schema, which must be altered explicitly before data of a new shape can be stored.
NoSQL databases are more flexible: records can be created without a predefined structure, and each record can have its own structure.
Scalability
SQL databases are typically scaled vertically, meaning that a single machine needs more CPU, RAM, or SSD capacity to meet the demand.
NoSQL databases are horizontally scalable, meaning that additional machines are added to the existing infrastructure to satisfy the storage demand.
Big Data Support
Vertical scaling makes it difficult for SQL databases to store very large datasets (petabytes).
Horizontal scaling and a dynamic data schema make NoSQL suitable for big data. In addition, NoSQL databases were developed by top internet companies (Amazon, Google, Yahoo, etc.) to face the challenges of rapidly increasing amounts of data.
Properties
SQL databases provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees.
NoSQL databases, on the other hand, are usually characterized by the CAP theorem (Consistency, Availability, Partition Tolerance), under which a distributed database can guarantee at most two of the three properties at once.
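The horizontal scaling described above can be pictured with a small, optional sketch. It assumes a deliberately simple scheme (hash the key, then take the result modulo the number of machines) and uses placeholder node names; production systems commonly rely on more robust schemes such as consistent hashing, so treat this as an illustration of the idea rather than how any particular database does it.

```python
import hashlib

def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the node that stores `key` by hashing it (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Hypothetical cluster: the node names are placeholders, not real hosts.
cluster = ["node-a", "node-b", "node-c"]
keys = [f"user:{i}" for i in range(1000)]

# Group the keys by the node they are assigned to.
placement = {}
for key in keys:
    placement.setdefault(node_for_key(key, cluster), []).append(key)

# Show how many keys each machine ends up holding.
for node in cluster:
    print(node, len(placement.get(node, [])))
```

Each machine holds only a share of the keys, which is why adding machines adds capacity. With plain modulo hashing, adding a node moves most keys to a different machine, which is one reason real systems tend to use more careful partitioning.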
When should NoSQL databases be used?
In this fast-growing and competitive environment, industries need to collect as much data as possible to satisfy their business goals. Collecting data is one thing, but storing it in the right infrastructure is another challenge. The difficulty arises because data comes in different types, such as images, videos, text, and sounds. Using relational databases to store these different data types is not always a smart move. However, the question remains:
When to use NoSQL instead of SQL?
You should consider using NoSQL when you are in one of the following scenarios:
- Constantly changing data: when you do not know how your system or applications will grow in the future, meaning that you might want to add new data types, new functions, etc.
- A lot of data: when your business is dealing with huge volumes of data that might grow over time.
- No strict consistency: when immediate data consistency and 100% integrity are not your priority. For example, if you develop an internal social media platform for your business, it might not be an issue if not all employees see a new post at the same moment.
- Scalability and cost: NoSQL databases allow greater flexibility and can control costs as your data needs change.
4 Main Types of NoSQL Databases
NoSQL databases are divided into four main categories. Each one has its own characteristics, so you should choose the one that best fits your use case. This section covers each category by describing its role and giving a non-exhaustive list of its advantages, limitations, and use cases.
1. Document Databases
This type of database is designed to store and query documents in formats such as JSON, XML, and BSON. Each document corresponds to a row or record in the database and is made up of key-value pairs. A document stores information about one object and its related data. For instance, the following database contains three records, each of which gives information about a student. In the first document, firstname is a key, and Franck is its value.
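As an optional illustration, the three student records can be sketched as plain Python dictionaries. Only the pair firstname / Franck comes from the text; every other field, value, and the `find` helper are made up for this sketch and do not correspond to any specific database's API.

```python
# Three student documents. Only "firstname": "Franck" comes from the article;
# all other fields and values are invented for illustration.
students = [
    {"_id": 1, "firstname": "Franck", "lastname": "Doe", "major": "Statistics"},
    {"_id": 2, "firstname": "Amina", "email": "amina@example.com"},
    {"_id": 3, "firstname": "Li", "major": "Computer Science",
     "skills": ["Python", "SQL"]},
]

def find(collection, **criteria):
    """Return the documents whose fields match every criterion.

    A document that lacks one of the requested fields does not match.
    """
    return [
        doc for doc in collection
        if all(field in doc and doc[field] == value
               for field, value in criteria.items())
    ]

print(find(students, firstname="Franck"))
print(find(students, major="Computer Science"))
```

Note that the three documents do not share the same set of fields, which is the schemaless behavior discussed in this section.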
Document Database Advantages
- Schemaless: there are no limitations in terms of the format and structure of the data storage. This is beneficial, especially when there is a continuous transformation in the database.
- Easy to update: a piece of new information can be added or deleted without changing the rest of the existing fields of that specific document.
- Improved performance: all the information about an object can be found in a single document. There is no need to refer to external information, which might not be the case for a relational database, where the user might have to query other tables.
Document Database Limitations
- Consistency check issues: documents do not necessarily need to have a relationship with one another, and two documents can have different fields, which makes consistency harder to enforce.
- Atomicity issues: if we have to change two collections of documents, we typically need to run a separate query for each one, so the change is not applied as a single atomic operation.
When to Use Document Databases
- Recommended when your data schema is subject to constant changes in the future.
Document Database Applications
- Because of their flexibility, document databases can be practical for online user profiles, where different users can have different types of information. In this case, each user’s profile is stored only by using attributes that are specific to them.
- They can be used for content management, which requires effective storage of data from a variety of sources. That information can then be used to create and incorporate new types of content.
2. Key-value Databases
This is the simplest type of NoSQL database. Every item is stored in the database as a key-value pair. We can think of it as a table with exactly two columns: the first column contains a unique key, and the second column holds the value for that key. The values can be of simple data types, such as integer, string, and float, or more complex data types, such as images and documents.
The following example illustrates a key-value database containing information about customers where the key is their phone number, and the value is their monthly purchase.
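A Python dictionary behaves much like a tiny in-memory key-value store, so the customer example can be sketched as follows. The phone numbers and amounts are invented placeholders.

```python
# Keys are customer phone numbers, values are monthly purchase amounts.
# All numbers below are invented placeholders.
monthly_purchase = {
    "+1-555-0100": 120.50,
    "+1-555-0101": 89.99,
    "+1-555-0102": 240.00,
}

# Every operation goes through the key.
print(monthly_purchase["+1-555-0101"])      # read the value for one key
monthly_purchase["+1-555-0103"] = 15.25     # write a new key-value pair
monthly_purchase.pop("+1-555-0100")         # delete a pair by its key
print(monthly_purchase.get("+1-555-0100"))  # a missing key yields None
```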
Key-value Database Advantages
- Simplicity: the key-value structure is straightforward, and the absence of a fixed data type for values makes it simple to use.
- Speed: the simple data format makes read and write operations faster.
Key-value Database Limitations
- They cannot perform any filtering on the value column because the returned value is all the information stored in the value field.
- They are optimized for a single key mapped to a single value. Storing multiple values under one key requires the application to parse the stored value.
- The value is updated only as a whole, which requires getting the complete data, performing the required processing on that data, and finally storing back the whole data. This might create a performance issue when the processing requires a lot of time.
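The last two limitations can be sketched with a toy store whose values are opaque JSON strings. The key name, the record layout, and the `add_purchase` helper are assumptions made for this example, not the behavior of a specific product.

```python
import json

# A toy store: from the store's point of view each value is an opaque string.
store = {
    "customer:42": json.dumps({"name": "Ada", "purchases": [30.0, 12.5]}),
}

def add_purchase(store, key, amount):
    """Append one purchase by rewriting the entire value."""
    record = json.loads(store[key])     # 1. fetch and parse the whole value
    record["purchases"].append(amount)  # 2. change one small part of it
    store[key] = json.dumps(record)     # 3. write the whole value back

add_purchase(store, "customer:42", 7.25)
print(store["customer:42"])
```

Even though only one number is added, the full value is read, parsed, and written back, which is where the cost grows for large values.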
When to Use Key-value Databases
- They are well suited to applications based on simple key-based queries.
- They are a good fit for simple applications that need to temporarily store simple objects, such as a cache.
- They can also be used when there is a need for real-time data access.
3. Wide-column Databases
As the name suggests, column-oriented databases store data as a collection of columns, where each column is treated separately. Their implementation logic is based on Google's Bigtable paper. They are mostly used for analytical workloads such as business intelligence, data warehouse management, and customer relationship management.
For instance, we can quickly get the average age and average price respectively of customers and products with the aggregation function AVG on each column.
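A minimal sketch of that idea, assuming each column is kept as its own list: an average only needs to read the one column involved. The table contents are invented, and the standard-library `mean` stands in for the AVG aggregation.

```python
from statistics import mean

# Each column is stored as its own sequence. All values are invented.
customers = {
    "name": ["Ana", "Ben", "Chloe"],
    "age": [34, 27, 41],
}
products = {
    "name": ["Laptop", "Mouse", "Monitor"],
    "price": [1200.0, 25.0, 300.0],
}

# The aggregation touches a single column and ignores the others.
avg_age = mean(customers["age"])
avg_price = mean(products["price"])
print(avg_age, avg_price)
```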
4. Graph/node Databases
Graph databases are used to store, map, and search relationships between nodes through edges. A node represents a data element, also called an object or entity. Each node has incoming or outgoing edges. An edge represents the relationship between two nodes and can carry properties describing that relationship.
For example, consider the following sentence: “Zoumana studies at Texas Tech University. He likes to run at the Park inside the University.” Here, Zoumana, Texas Tech University, and the Park are nodes, and the relationships between them are edges.
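The example sentence can be sketched as a tiny graph stored as (source, relationship, target) triples. The relationship labels and the two helper functions are naming choices made for this illustration, not the syntax of a particular graph database.

```python
# Edges as (source, relationship, target) triples taken from the sentence.
# The relationship labels are illustrative names chosen for this sketch.
edges = [
    ("Zoumana", "STUDIES_AT", "Texas Tech University"),
    ("Zoumana", "LIKES_TO_RUN_AT", "Park"),
    ("Park", "INSIDE", "Texas Tech University"),
]

def outgoing(node):
    """Relationships that start at `node`."""
    return [(rel, dst) for src, rel, dst in edges if src == node]

def incoming(node):
    """Relationships that end at `node`."""
    return [(rel, src) for src, rel, dst in edges if dst == node]

print(outgoing("Zoumana"))
print(incoming("Texas Tech University"))
```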
Graph/node Database Advantages
- They have an agile and flexible structure.
- The relationship between nodes in the database is human readable and explicit, thus easy to understand.
Graph/node Database Limitations
- There is no single, widely adopted query language because each language tends to be platform-dependent.
- This fragmentation can make it difficult to find support online when facing an issue.
When to Use Graph/node Databases
- They can be used when you need to create relationships between data elements and be able to quickly retrieve those relationships.
- They can be used to perform sophisticated fraud detection in real-time financial transactions.
- They can be used for mining data from social media. For instance, LinkedIn uses a graph database to identify which users follow each other and the relationship between those users and their expertise (e.g., ML Engineer).
- Network mapping can be a great fit for representation as a graph since those networks map relationships between hardware and the services they support.
7 Best NoSQL Databases for Data Science
Now that you have a better knowledge of NoSQL databases, let’s look at a list of NoSQL databases that are popular for data science projects. This analysis is only focused on open-source NoSQL databases.
MongoDB is an open-source document-oriented database that stores data in a JSON-like format. It is the most commonly used NoSQL database and was designed for high availability and scalability, providing auto-sharding and built-in replication. Our Introduction to MongoDB course covers the use of MongoDB with Python and helps you acquire the skills to manipulate and analyze flexibly structured data. Uber, LaunchDarkly, Delivery Hero, and 4300 other companies use MongoDB in their tech stack.
Cassandra is an open-source wide-column database. It can distribute your data across multiple machines and automatically repartition it as you add new machines to your infrastructure. Uber, Facebook, Netflix, and 506 other companies use it in their tech stack.
Like MongoDB, Elasticsearch is an open-source document-oriented database. It is a world-leading search and analytics tool focusing on scalability and speed. Uber, Shopify, Udemy, and about 3760 other companies use it in their stack.
Neo4j is an open-source graph-oriented database. It is mainly used to handle growing volumes of highly connected data. Around 220 companies reportedly use it in their tech stack.
HBase is a distributed, column-oriented database. It provides capabilities similar to Google's Bigtable on top of Apache Hadoop. Reportedly, 81 companies use HBase in their tech stack.
CouchDB is also an open-source document-oriented database that collects and stores data in a JSON format. Around 84 companies use it in their tech stack.
Also open-source, OrientDB is a multi-model database supporting graph, document, key-value, and object models. Only 13 companies reportedly use it in their tech stack.
This blog has covered the main aspects of NoSQL databases and how they can benefit your data science projects in today's fast-growing environments. You now have what you need to choose the right database for your use case. If you are still hesitant about using them, now is the time for you and your teammates to leverage the power of these databases.
To learn more, our course covering NoSQL concepts will strengthen your knowledge about the four major databases we previously covered.