A key aspect of database technology is that data residing in databases is at continual risk from attackers. The landscape of database technology has changed rapidly over the past decade, and it has morphed into many variants that better express how data is managed in the organization. These rapid changes, and the growing number of ways that data is managed, create a swelling risk for database security. This brief article summarizes a few popular database types as a prelude to future discussions on database security. Also, my hat goes off to several students who continually inspire me in terms of cybersecurity research, digital forensics, and writing.
Much of what is done with databases today could not have been accomplished only a few years ago. We’ve witnessed database technology transition from hierarchical databases (e.g. IBM’s IMS) to network databases (e.g. IDMS) and then to relational database technologies. In today’s organization, we are witnessing growing trends in database technology and tools that include NoSQL (Not only SQL), Hadoop (as one variant), and in-memory databases (IMDB). This brief article introduces each of these technologies, giving the IT professional, the cybersecurity analyst, or the student a brief synopsis of each.
A good place to start is with a generic definition of what a database management system is. A database management system (DBMS) provides the tools and other facilities to create database files, each of which is a compilation of organized data (Stair & Reynolds, 2018). More specifically, a DBMS is the application software environment that allows you to create database files, each applied to a particular business or engineering need. To use a rough analogy, I liken the DBMS to an application like Microsoft Word and the database file to a Word document created in Word. One very common database type is the relational database (or relational database model), which has been around since the 1970s. As a side note, I’ll discuss some of the security issues that relational databases have in a subsequent article.
In general, relational databases rely on tables, columns, rows, and schemas to organize and retrieve data. I liken tables to Excel spreadsheets (though many purist database designers will come hunting for me for saying this). A database table (entity) contains columns (attributes), and conceptually a given table is like a spreadsheet whose column headings identify the column data. There’s one key difference. The database table also carries the logic, the data type restrictions, the relationships that associate tables (entities), and even code that helps to maintain robustness and provide the ‘intelligence’ behind the given relational database.
When using an RDBMS, the particular database that you will be creating must first be designed. To develop a database without a sufficient design (including a data normalization process, which is essential to reduce data redundancy) is anathema to good relational modeling and design. In the relational database model, data is organized using queries, tables, schemas, and various other methods which, when implemented, allow users to quickly access database records in order to update, delete, or modify the data stored within the database (Rouse, n.d.). Relational databases (e.g. SQL databases) use two-dimensional tabular data organized by an entity and its attributes (like Excel, but with ‘intelligence’). With a rough sketch of what a relational database management system and a database are, let’s discuss a few additional technologies used in today’s organizational environment.
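To make the spreadsheet-with-‘intelligence’ idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The customer and purchase tables, their columns, and the sample rows are hypothetical and chosen only for illustration; the point is that the table definitions themselves carry the data type restrictions and the relationship between entities.

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each table is an entity, each column an attribute.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL CHECK (amount > 0)
);
""")

conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 19.99)")
conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 5.01)")


def insert_is_rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A purchase for a customer that does not exist breaks the relationship.
orphan_rejected = insert_is_rejected(
    "INSERT INTO purchase (customer_id, amount) VALUES (99, 10.0)"
)
# A negative amount breaks the CHECK restriction on the column.
negative_rejected = insert_is_rejected(
    "INSERT INTO purchase (customer_id, amount) VALUES (1, -3.0)"
)

# A join follows the relationship between the two tables.
totals = conn.execute("""
    SELECT c.name, COUNT(*), ROUND(SUM(p.amount), 2)
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id
""").fetchall()
```

A spreadsheet would happily accept both of the rejected rows. Here the rules live in the table definitions, so every application that touches the data gets the same checks.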
NoSQL is one of the more modern approaches to database technology. It has been around for about a decade and has gained quite a following. NoSQL database design is an alternative to the more traditional relational database model (RDBMS), which expresses data in multiple tables (loosely similar to the two-dimensional Excel table). NoSQL, or “Not only SQL,” takes the perspective that database design can be accommodated by an array of data models. It is a non-relational database architecture that does not follow the traditional relational model of two-dimensional tabular relations in a row-and-column design (Stair & Reynolds, 2018). In contrast to the relational model, NoSQL is adapted to work with very large sets of highly distributed data by using more flexible data models. With NoSQL, data may be stored in a schema-less or free-form fashion. This allows for great flexibility, as the data types are not as restrictive as in earlier database types. As such, a record is not limited to a fixed set of fields.
NoSQL spreads files. A NoSQL database provides a slightly more complicated way to store and retrieve data, using a method that spreads files amongst multiple servers. These servers can exist anywhere (of course, there are design considerations with this type of deployment). Unlike relational databases, NoSQL databases do not require a fixed logical or physical structure (Stair & Reynolds, 2018). This approach to managing data allows for quicker search speeds; in the key-value form, for example, the database has only two columns, a “key” and its associated value (Stair & Reynolds, 2018).
A primary advantage of NoSQL, as noted before, is that it uses a distributed data system in which data is spread over multiple servers, so that each server contains only a subset of the total data. This is known as horizontal scaling, and it enables hundreds or thousands of servers to operate on the data, which provides faster response times for queries and updates (Stair & Reynolds, 2018). NoSQL databases do not require a predefined schema; data entities can have attributes added or edited at any time, so new entries can dynamically extend a database that has already been modeled. There are many different uses for NoSQL databases, and they are essentially divided into four main categories: (a) key-value, (b) document, (c) graph, and (d) column.
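The idea that each server holds only a subset of the data can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular NoSQL product works: the three server names are hypothetical, each "server" is just a dictionary, and keys are assigned by simple hash-modulo partitioning.

```python
import hashlib

# Hypothetical server names; a real cluster would hold network addresses.
SERVERS = ["server-a", "server-b", "server-c"]


def server_for(key, servers=SERVERS):
    """Pick a server by hashing the key (simple modulo partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


# Each "server" is modeled as its own dictionary of key-value pairs.
cluster = {name: {} for name in SERVERS}


def put(key, value):
    """Store the value on whichever server owns the key."""
    cluster[server_for(key)][key] = value


def get(key):
    """Ask only the owning server; the other servers are never touched."""
    return cluster[server_for(key)].get(key)


for user_id in range(1000):
    put(f"user:{user_id}", {"visits": user_id % 7})

# Number of records that ended up on each server.
sizes = {name: len(store) for name, store in cluster.items()}
```

Because a lookup hashes the key and goes straight to one server, adding servers spreads both the storage and the query load. One caveat of this simple scheme is that changing the number of servers reassigns most keys, which is why production systems tend to use more careful partitioning schemes such as consistent hashing.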
There are various commercial applications for NoSQL technology, and each of the four main categories (key-value data stores, document stores, wide-column stores, and graph stores) works a little differently. Key-value stores can hold any type of binary object, which is accessed via a key. This is the most flexible variant. Document stores use a value that can be a single document containing all the data related to a specific key. Documents can be indexed, and they can have the same or different structures. Wide-column stores use the more traditional rows and columns, but the names and formats of the columns can vary from row to row. Graph stores use a graph structure to store, map, and query data, which allows for index-free adjacency (Riak, n.d.).
NoSQL databases have flexible schemas that allow developers to create and manage modern applications. Because NoSQL does not use SQL to manipulate or analyze data and because it is not built on the concept of tables, it is more agile than traditional relational databases (IBM, 2019). NoSQL is particularly useful for storing unstructured data, which is growing far more rapidly than structured data and does not fit the relational schemas of an RDBMS. NoSQL databases are generally used by organizations that have massive amounts of data that they need to compile, drill down into, and/or graph. Common types of unstructured data include user and session data; chat, messaging, and log data; time-series data such as IoT and device data; and large objects such as video and images (Yegulalp, 2017).
Flavors. There are several implementations of NoSQL, including a variety of commercial products. Looking at just one, MongoDB, we see that it is used by Coinbase, SEGA, and the City of Chicago (this author’s hometown!). MongoDB is also used by Adobe, AstraZeneca, Chico’s, and Cisco (MongoDB, Inc., 2019).
There are different NoSQL database types, and among them are four common models for storing data. In document databases such as MongoDB, data is stored in the form of free-form JSON structures called “documents.” Data may be anything from integers to strings to freeform text. In key-value stores such as Redis, free-form values are accessed in the database by way of keys. In wide column stores such as Cassandra, data is stored in columns instead of rows; any number of columns may be aggregated for data views or queries. In graph databases such as Neo4j, data is represented as a graph or network of nodes and their relationships; each node is a free-form chunk of data (Yegulalp, 2017).
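As a rough illustration of how the four models differ in shape, the sketch below mimics each one with plain Python data structures. It does not use MongoDB, Redis, Cassandra, or Neo4j; every name and value here is made up, and real products add indexing, persistence, and query languages on top.

```python
# Key-value store: an opaque value looked up by its key.
key_value = {"session:42": b"\x00\x01opaque-bytes"}

# Document store: self-describing documents whose fields may differ.
documents = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "languages": ["Python", "Go"]},
]

# Wide-column store: every row has a key, but its columns may vary.
wide_column = {
    "sensor-1": {"temp": 21.5, "humidity": 40},
    "sensor-2": {"temp": 19.0, "battery": 0.87},
}

# Graph store: free-form nodes plus the relationships between them.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "acme": {"kind": "company"},
}
edges = {"alice": ["bob"], "bob": ["acme"], "acme": []}


def reachable(start):
    """Follow edges outward from start and return every node reached."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbor in edges[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Documents need not share fields, so queries check for a field's presence.
names_with_email = [d["name"] for d in documents if "email" in d]
```

Notice that none of these structures required a schema to be declared first, which is the flexibility the NoSQL models trade against the built-in checks of a relational table.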
The benefits of NoSQL as compared to other database approaches include scalability, enhanced performance with lower overhead costs, high availability, global availability, and flexible data modeling (Riak, n.d.). NoSQL can be advantageous if one needs fast access to the data and is more concerned with the speed and simplicity of access than with the consistency or reliability of transactions. It can also be useful if one is storing large amounts of data and does not want to be locked into a particular schema. As well, NoSQL is advantageous for storing data in a hierarchical structure that is described by the data itself, and not by an external schema (Guess, 2016). A variety of industries have found NoSQL to be useful. It is used in third-party data aggregation, such as collecting sales data from stores as well as customers’ purchasing histories. The Internet of Things uses NoSQL to expand concurrent access to data from billions of connected devices and systems. Social gaming uses NoSQL because the amount of data can be scaled to accommodate a growing number of users (Vaghani, 2019).
Some commonly known entities that use NoSQL databases are Apple, Facebook, Google, and even the National Security Agency (Stair & Reynolds, 2018). There are a variety of NoSQL databases, including cloud-based offerings, such as Amazon DynamoDB, Google Bigtable, Apache Cassandra, and MongoDB (Rouse, Vaughan & Beal, 2017). Amazon DynamoDB is a NoSQL database that supports both document and key-value store models. MLB Advanced Media uses DynamoDB to power its revolutionary Player Tracking System, which highlights player statistics and the nuances of the game. In the MIS environment, NoSQL can be used in many different ways: in stores and product bases, in sports tracking and statistics, and in businesses for inventory.
Big data is a significant influence on database management. We could go into a very long description of big data, new ways to look at large volumes of data, and the use of Artificial Intelligence (which is a growing area within big data), but we’ll hold on to this thought for subsequent articles.
In general, to respond to the burgeoning Big Data movement, Hadoop was created. Hadoop is an open-source framework that manages data processing and storage for clustered big data. It is used for advanced analytics which includes predictive analytics, data mining, and machine learning applications. Hadoop works with both structured and unstructured data which gives it flexibility over relational databases and data warehouses (Rouse & Rosencrance, 2019).
Hadoop was created by Doug Cutting and Mike Cafarella. Indeed, it is named for Cutting’s son’s stuffed elephant. Like NoSQL, Hadoop has numerous commercial applications. One of the significant aspects of Hadoop is that it is open-source technology (Rouse & Rosencrance, 2019). The Apache Hadoop project continues to develop this open-source software, and anyone can download it from the project website. Additionally, users are encouraged to enhance and develop it further (Apache Software Foundation, 2019).
Hadoop is an open-source software framework that provides a means for processing and storing extremely large datasets (Stair & Reynolds, 2018). Hadoop’s open-source software code, which includes various modules, is freely available for the public to view, edit, and redistribute (Esri, 2011). Hadoop is a collection of open-source libraries for distributed processing of large data sets across clusters of thousands of computers using simple programming models. Instead of relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application level. Hadoop is part of the Apache Software Foundation, which supports the development of open-source software projects (“What is Hadoop?” n.d.).
Hadoop basically divides the data into subsets and distributes them onto different servers for processing. There are potentially thousands of servers in a Hadoop cluster, and as a result, the system keeps functioning and running even if one of the servers were to crash. According to the Bernard Marr website, four of the most essential modules of Hadoop are: (a) the Hadoop Distributed File System (HDFS): allows storage of data across linked storage devices, (b) MapReduce: a Java-based system that allows data to be retrieved and placed in suitable formats for analysis, (c) Hadoop Common: allows commonly used software to read data in the Hadoop distributed file system, and (d) YARN: Hadoop’s cluster resource manager.
Hadoop works on the principle of horizontal scalability so that nodes can be added on the fly (Kapahi, 2019). Hadoop splits large files into blocks that are distributed across nodes in a cluster to be processed. The base framework is made up of Hadoop Common, which contains libraries and utilities for other Hadoop modules; HDFS, a distributed file system that stores data on commodity machines; YARN, which works as a resource management platform; and MapReduce, which is for large-scale data processing. Hadoop has a number of advantages. It is easy to use, scalable, and cost-effective. It accepts a variety of data in structured or unstructured form, including text files, XML files, images, CSV files, and more (Kapahi, 2019).
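The split-then-process idea behind HDFS blocks and MapReduce can be imitated on a single machine. The sketch below is plain Python, not Hadoop code: the function names are my own, the "blocks" are just slices of a list, and the classic word-count job stands in for a real analytics workload.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Cut the input into fixed-size blocks, loosely like HDFS splits a file."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks):
    """Group the emitted pairs by word so each word is reduced in one place."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


lines = [
    "big data is big",
    "hadoop stores big data",
    "data is split into blocks",
]
blocks = split_into_blocks(lines, block_size=2)
# In a real cluster each block would be mapped on a different node.
mapped = [map_block(block) for block in blocks]
word_counts = reduce_counts(shuffle(mapped))
```

Since each call to map_block depends only on its own block, the map work can run on as many nodes as there are blocks, which is what lets this style of processing scale horizontally.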
Open-source software is gaining in popularity, as it is entirely customizable to a company’s needs. Open-source software is non-proprietary, which translates into lower costs and more flexibility, due to its ability to be modified. Open-source software’s code is freely available to its users; anyone may use the source code, modify it, and distribute their own versions of the program (Apache Software Foundation, 2019). Hadoop is a modular software framework that carries out specific tasks for big data collection (Stair & Reynolds, 2018).
Hadoop works in batches when processing jobs on large data sets. For real-time data analysis, batch processing is not as time-efficient as a relational database. Hadoop cannot be used to deploy relational databases, due to its slow response. The same limitation makes Hadoop unfit as a general networked file system; HDFS lacks many of the standard features of a POSIX file system, and as such, a file that has been created and closed cannot be changed but can only be appended to (Woodie, 2018). Hadoop is limited in that it can only perform batch processing and cannot process real-time streaming data (Stair & Reynolds, 2018). For this reason, Hadoop is a poor fit for financial-industry workloads that require instantaneous changes, as it can’t keep up.
In spite of several drawbacks, Hadoop is used by many organizations. Big data is wild, complex, and voluminous. Most large organizations need this data to be tamed into an organized, structured format, which is where Hadoop plays its biggest role. Organizations all over the world, in industries like banking, health care, education, social media, and government, use this data compilation software to make heads or tails of all the data that is acquired on a daily basis (“What is Big Data and Why it Matters,” n.d.). Yahoo has used Hadoop for years to better personalize the ads and articles that visitors to its website see. Hadoop is also used by the likes of eBay, Yelp, and Twitter.
In-memory Database (IMDB)
I’ll start with one of the benefits of in-memory databases before we get into a description of what they are: IMDBs can reduce a data center’s footprint. Now let’s dig into what an IMDB is. An in-memory database (IMDB) is a system that depends on the main memory of a computer system for data storage, instead of the disk storage used by disk-optimized databases. An IMDB is a database management system that can store an entire database in random access memory (RAM) (Stair & Reynolds, 2018). Because IMDBs work directly with RAM, they perform best on multiple multi-core CPUs that can process parallel requests, which better supports the access requirements of processing large amounts of data (Stair & Reynolds, 2018). Since an IMDB manipulates data directly in a dedicated memory-based architecture, it is well suited to the rapidly changing demands of telecommunications and mobile ad networks, where response times are crucial (Techopedia, Inc., 2019).
IMDB can be used with Hadoop and NoSQL. This may be accomplished using an in-memory data grid (IMDG). The disk holds the data structured for NoSQL or Hadoop, and the IMDG copies that data from the disk into RAM so that processing can take place without the delays caused by disk reading and writing. This gives the combination some of the advantages of IMDB. IMDB without IMDG is best for new applications. Many organizations are shifting to IMDB or some hybrid. It is clear that MIS professionals will need to understand this technology and how to adopt it (Ivanov & IDG Contributor Network, 2018).
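The copy-from-disk-into-RAM step can be imitated with Python’s built-in sqlite3 module, which supports both file-backed and purely in-memory databases. To be clear, this is not an in-memory data grid and it involves neither Hadoop nor NoSQL; SQLite stands in only to show the pattern, and the reading table and its rows are hypothetical.

```python
import os
import sqlite3
import tempfile

# Build a small disk-based database to stand in for existing on-disk data.
disk_path = os.path.join(tempfile.mkdtemp(), "readings.db")
disk_conn = sqlite3.connect(disk_path)
disk_conn.execute("CREATE TABLE reading (device TEXT, value REAL)")
disk_conn.executemany(
    "INSERT INTO reading VALUES (?, ?)",
    [("d1", 1.0), ("d2", 2.5), ("d1", 4.0)],
)
disk_conn.commit()

# Copy the whole database into RAM using the standard backup API.
memory_conn = sqlite3.connect(":memory:")
disk_conn.backup(memory_conn)
disk_conn.close()

# From here on, queries are answered from memory, not from the disk file.
per_device = memory_conn.execute(
    "SELECT device, SUM(value) FROM reading GROUP BY device ORDER BY device"
).fetchall()
```

The trade-off is visible even in this toy: the in-memory copy disappears when the process ends, so anything that must survive has to be written back to durable storage.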
IMDBs can reduce costs. Because IMDBs work in RAM, they avoid the disk-bound processing found in NoSQL and Hadoop deployments, both of which employ a disk storage mechanism. The disadvantage of typical database management systems is that they are perpetually moving data from disk to memory and back again, which can cause performance issues (Mullins, 2017). With in-memory database management systems, however, everything is processed “in-house,” which allows for a reduction in costs and provides companies with better overall performance due to the processing of the data using parallel CPUs (Stair & Reynolds, 2018).
As a benefit of IMDBs, accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk-based frameworks. Applications where response time is critical, such as those running telecommunications network equipment and mobile advertising networks, may use in-memory databases (Anikin, 2016). In contrast, a disadvantage of in-memory database systems is that the cost of memory can limit the size of the IMDB on a single server (Anikin, 2016).
IMDBs are used by a variety of organizations, including AdJuggler, eBay, Colgate, Lockheed Martin, SAP, Oracle, and many others (Stair & Reynolds, 2018). One example is KDDI Corporation, a Japanese telecommunications company that provides cellular service for over 40 million customers. KDDI has consolidated its databases into roughly 40 servers, reducing the footprint by 83 percent and the power consumption by 70 percent. This is quite impressive. The result is that the costs are lower, and the service is better. Among the main providers is Altibase, which creates HDB; this is used by E-Trade and China Telecom. There are a number of in-memory database solutions. Here’s a quick rundown of a few IMDB solutions:
1. VoltDB is an in-memory, operational database created to help organizations build high-velocity applications. It is used by Sprint, Nokia, Mitsubishi Electric, and others. It promises enhanced speed and performance, real-time personalization and decision making, SQL familiarity, integrated Kafka export and import, and database replication for disaster recovery (Predictive Analysis Today, 2019).
2. MemSQL is an in-memory database that combines the horizontal scalability of distributed systems with the familiarity of SQL. It utilizes an in-memory row store and a disk-based column store in a single database. MemSQL is used by Uber, Hulu, Sony, Comcast, and others (MemSQL, 2019).
3. TimesTen by Oracle is an IMDB that is used by Lockheed Martin and Verizon Wireless.
4. High-Performance Analytic Appliance (HANA) was created by SAP and is used by eBay and Colgate.
5. Software AG has Terracotta BigMemory, which is used by AdJuggler (Stair & Reynolds, 2018, p. 228).
This article provided an overview of several database technologies in use today. Additional in-depth articles will follow. Please feel free to add to the conversation (e.g. send me a note about any additional detail that you may have to support this discussion or if you would like more information about any of these database types). Further, in subsequent articles, database security for the varied database technologies will be discussed.
Anikin, D. (2016, October 12). What an in-memory database is and how it persists data efficiently. Retrieved from Medium: https://medium.com/@denisanikin/what-an-inmemory-database-is-and-how-it-persists-data-efficiently-f43868cff4c1
Apache Software Foundation. (2019). Apache Hadoop. Retrieved from Apache Hadoop: https://hadoop.apache.org/
Bernard Marr. (n.d.). What is Hadoop? Retrieved from https://www.bernardmarr.com/default.asp?contentID=1080
Esri. (2011, April). Open Source Software. Retrieved from https://www.esri.com/news/arcnews/spring11articles/open-source-technology-and-esri.html
Guess, A. (2016, May 6). 3 Key Advantages of NoSQL Databases. Retrieved from Dataversity: dataversity.net/3-key-advantages-nosql-databases/
IBM. (2019). What is a NoSQL database? Retrieved from IBM: https://www.ibm.com/cloud/learn/nosql-databases
Ivanov, N., & IDG Contributor Network. (2018, August 29). In-memory data grids vs. in memory databases. Retrieved from https://www.infoworld.com/article/3300747/inmemory-data-grids-vs-in-memory-databases.html
Kapahi, S. (2019, July 4). Why Hadoop? Retrieved from Edureka: https://www.edureka.co/blog/why-hadoop/
Kognitio, Ltd. (2019). Which companies are using Hadoop for big data analytics? Retrieved from https://kognitio.com/big-data/companies-using-hadoop-big-data-analytics/
MemSQL. (2019). MemSQL. Retrieved from MemSQL: https://www.memsql.com/
MongoDB, Inc. (2019). Our Customers. Retrieved from https://www.mongodb.com/who-uses-mongodb
Mullins, C. S. (2017, July 5). What is an In-Memory Database System? Retrieved from http://www.dbta.com/Columns/DBA-Corner/What-is-an-In-Memory-DatabaseSystem-119241.aspx
Popescu, I. (2015, April 11). 6 Trends in Database Management. Retrieved from http://www.mbmsoftware.com/blog/technology/6-trends-in-database-management2985.html
Predictive Analysis Today. (2019). VoltDB. Retrieved from PAT Research: https://www.predictiveanalyticstoday.com/voltdb/
Riak. (n.d.). NoSQL Databases. Retrieved from https://riak.com/resources/nosql-databases/
Rouse, M., & Rosencrance, L. (2019, July). What is Hadoop? A definition from WhatIs.com. Retrieved from https://searchdatamanagement.techtarget.com/definition/Hadoop
Rouse, M. (n.d.). What is a Database Management System? - Definition from WhatIs.com. Retrieved from https://searchsqlserver.techtarget.com/definition/databasemanagement-system
Rouse, M., Vaughan, J., & Beal, B. (2017, March). What is NoSQL (Not Only SQL database)? - Definition from WhatIs.com. Retrieved from https://searchdatamanagement.techtarget.com/definition/NoSQL-Not-Only-SQL
StackShare, Inc. (2019). What are the best In-Memory Databases Tools? Retrieved from https://stackshare.io/in-memory-databases
Stair, R., & Reynolds, G. (2018). Principles of information systems (11th ed.). Boston, MA: Course Technology Cengage Learning.
Techopedia, Inc. (2019). What is an In-Memory Database? - Definition from Techopedia. Retrieved from https://www.techopedia.com/definition/28541/in-memory-database
Vaghani, R. (2019). Use of NoSQL in Industry. Retrieved from GeeksforGeeks: https://www.geeksforgeeks.org/use-of-nosql-in-industry/
Weldon, D. (2018, July 3). June top reader pick: 13 top products for in-memory databases. Retrieved from https://www.information-management.com/slideshow/13-top-products for-in-memory-databases
Wilson, C. (2017, April). Who is Using Hadoop? What are They Using It For? How is It Going? Retrieved from https://blog.syncsort.com/2015/06/big-data/whos-using-hadoop-and-what-are-they-using-it-for/
Woodie, A. (2018, October 18). Is Hadoop Officially Dead? Retrieved from Datanami: https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/
What is Big Data and Why it Matters. (n.d.). Retrieved from https://www.sas.com/en_us/insights/big-data/what-is-big-data.html
Yegulalp, S. (2017, December 1). What is NoSQL? Databases for a cloud-scale future. Retrieved from InfoWorld: https://www.infoworld.com/article/3240644/what-is-nosql-databasesfor-a-cloud-scale-future.html
About the Author
Dr. Ron McFarland, CISSP, PMP is a Cyber Security Analyst at CMTC. He received his doctorate from Nova Southeastern University’s School of Engineering and Computer Science and a post-doc graduate certificate in Cyber Security Technologies from the University of Maryland (Global Campus). He also holds multiple security certifications including the prestigious Certified Information Systems Security Professional (CISSP) certification and several Cisco certifications. He is a guest blogger at the Wrinkled Brain Network (https://wrinkledbrainnetwork.com/), a blog dedicated to Cyber Security and Computer Forensics. Dr. McFarland can be reached at his University of Maryland email: firstname.lastname@example.org
Publications in 2019
Title: “Overcoming Certification Rejection — A Recovering CCFP Computer Forensics Certification Survivor,” published in Peerlyst — Information Security (article link: https://www.peerlyst.com/posts/overcoming-certification-rejection-a-recovering-ccfp-computer-forensics-certification-survivor-highervista?trk=search_page_search_result)
Title: “A Primer to Function Point Analysis for the Software Project Manager,” published in Peerlyst — Information Security (article link: https://www.peerlyst.com/posts/a-primer-to-function-point-analysis-for-the-software-project-manager-highervista?trk=search_page_search_result)
Title: “Trends in Database Technology and Database Security (part 1),” published in Medium (article link: https://medium.com/@highervista/trends-in-database-technology-and-database-security-part-1-b1600b1d491e)
Title: “Protection from Open Source Code and COTS — the NIST 8183,” published in Medium (article link: https://medium.com/@highervista/protection-from-open-source-code-and-cots-the-nist-8183-e17c27aade7e)
Title: “Data Leakage & Application Programming Risk Mitigation,” published in Peerlyst — Information Security (link: https://www.peerlyst.com/posts/data-leakage-and-application-programming-risk-mitigation-highervista?trk=search_suggestion_query)