Databases vs. file systems
You might be wondering: why use a database instead of a file system? Pretty much everyone knows what files on a computer or phone are. People are very familiar with the concept of files already. So why would you want to store stuff in a database instead of a file system? Well, they really aren’t intended for the same purpose. File systems are for files. A file system is not really a relational data model in the same way that a database is. A file is just a file by itself, but the thing about data in a database is that it relates to other stuff within the database. MySQL is a management system for relational databases. You can search, sort, and so on. Files are just things that exist. When you have relational data, like users who have multiple separate columns that all comprise a record, that’s when a database comes into play.
You typically don’t store binary files in a database. Binary file formats just mean anything that isn’t text. A database is best suited for things like customer or user data, text comments, product names, SKUs, and things like that. If you use something like Instagram or Twitter, and you upload a file with your post (like a JPEG image), what might be happening is that the file is uploaded to a file system on a file server, and a database record is created for that post, and there is a field for an image URL. Rather than having the database store all your files like a file system would, it could merely have a link to the file.
File systems are good at being file systems. Databases are good at being databases. They both have their purpose, but they are not interchangeable. You will need both of them, as they accomplish different things.
If you have only a very small amount of data for your software, you could probably get away with using .json or .xml files instead of a database. But that won’t scale very well. Data exchange formats like XML and JSON are good for simple or short stuff, especially for interacting with remote APIs (examples: the Amazon S3 API uses XML, and WordPress has a JSON API). But when you’re dealing with a ton of data, it might be a better idea to use a database instead. In fact, some programs even use local databases, such as a local SQLite database, rather than storing data in regular old files. Anki, which is smart flash card software, is an example of a program that stores stuff in a local database rather than regular files. Databases are often associated with web development, but you can use them for other things as well.
- CRUD – Create, Read, Update, Delete. A concept that comes up in databases a lot. Think of it from the point of a blog – if you want to design blog software, the user has to be able to create a new blog post, read existing blog posts, update/edit blog posts, and delete them too. You can’t have just a couple of these things. You need all of them together.
- MongoDB – a document-based database. If you’re familiar with JSON, MongoDB will feel right at home for you. MongoDB is a non-relational database, or noSQL. It is more similar to JS and JSON. MongoDB is an alternative to a relational database (which is more traditional), like MySQL (actually it’s a DBMS). This book does not cover MongoDB or noSQL. Mongoose is a library that is commonly used with MongoDB and Node.js. I don’t really recommend noSQL databases, as most businesses are looking for developers who are familiar with relational databases and SQL.
- DBMS – a database management system. Software for dealing with databases. Photoshop is to an image file as a DBMS is to a database. MySQL is the DBMS that you will learn in this book. It’s the most popular free and open source DBMS solution. Oracle also makes database software, but it costs a lot of money, so this book will concentrate on the free MySQL instead. Other popular database options include SQLite or PostgresQL.
- SQL – Structured Query Language. SQL is a way of communicating with databases. When you search for a word on a website, like Amazon or Wikipedia, your search term is put into a query that is used on a database, and the database management system sends you back the results of the query. Of course, searches should be sanitized first, because they could be something malicious like SQL injection. SQL injection is when you try to inject unallowed SQL commands into your searches or logins. Maybe instead of entering in a username, you enter a SQL command to drop the database or select everything from it using a wildcard to give you private information. SQL is used with relational databases, as opposed to NoSQL/non-relational/document-based databases.
SQL vs. NoSQL – NoSQL is merely an alternative, with its own pros and cons. Some cons of noSQL include eventual consistency, lack of range queries, lack of support for ACID queries, and not being compatible with SQL. But one big advantage is that it’s easier to to horizontal scaling in NoSQL (meaning scaling by splitting the database up across multiple computers, rather than having one super decked out computer). NoSQL is more about flexibility, whereas SQL is about rigidity. The lack of relational models and rigid schemas can make certain things harder in NoSQL. In short, NoSQL is not a direct successor to SQL, because it’s not better in every way. NoSQL is an alternative which is not suitable for every situation.
- Atomicity – some database transactions need to happen either completely or not at all. A halfway-finished transaction can cause problems. A database transaction that can’t be done only partially is called atomic. Atoms blow up when you try to split them, so it’s best to leave them intact.
ACID – atomicity, consistency, isolation, and durability. A database should be ACID. Withdrawing cash from an ATM is an example of a transaction that needs ACID. It gives you the cash and your account is deducted by the amount you withdrew. It would be bad if you could withdraw money and then cancel the transaction before it was finished, having your account still say you had the full amount without any money subtracted. Atomicity is doing something entirely or not at all. Durability means not losing data due to glitches or hardware failure. If a database is stored on a hard drive and contains important customer records, it had better be backed up, because hard drives and SSDs do fail every now and then. No hardware lasts forever. Durability also means that changes have to be permanent. Isolation means transactions in a database happen independent of one another. Each transaction has to be isolated. Consistency means a transaction will result in a valid state.
- Rows, columns, records, tables, and databases
- A server can have multiple databases. A database can have multiple tables. A table should be designated for a single purpose. A table can have multiple records in it. Each record, or a row in a table, can contain multiple columns. A column could be a category of data, such as name, phone number, date of birth, username, email, password, and so on. A record should have a unique identifier for it, called a primary key. If you have an online shop database with a customer table in it (which is basically a list of all the company’s customers and their associated information), and there are two different customers with the same name, the way you can tell them apart is with the unique primary keys. For example, maybe there are two customers named Bob Smith, but one has a customerID of 1849 and the other has a customerID of 2673.
- A
database table looks kind of like a spreadsheet.
- Deduplication – have you ever been to a doctor’s office or other business, and they asked you to fill out a form with your name, phone number, address, etc. and then soon after required you to fill out the exact same information? That’s duplicate information. Many inefficient businesses use systems that require duplicated information. There are many downsides to this approach, such as having someone enter in the same stuff multiple times. It’s also a problem when updating information. Maybe you have a new phone number, but it’s stored in two separate places, and updating one doesn’t necessarily mean the other will be updated along with it. Duplication is bad for business. Duplication is the result of data being stored in separate places that don’t share data. These separate-yet-similar things are called data silos.
- Databases are not just a way to organize data into rows and columns in tables, but they’re also a way to share data between multiple users. Users could be employees in different departments within an organization. Old-school systems for storing data often involved each department storing things separately, with no sharing between them. This would result in things like lingering outdated customer information, or requiring someone to fill out two forms with the same info on each.
- Databases allow you to have multiple uses of the data without needing to maintain multiple copies of it separately. The process of getting rid of duplicated data is called deduplication.
- Views – there might be many different users of a database. Keep in mind that user doesn’t mean consumer or employee. It can also mean developers who write applications that interact with a database. One application that uses a shared database might have certain needs. Maybe there is a database table with columns A, B, C, D, and E. One application might only need columns A, B, and C. Another application might need A, C, and D. A third application that uses the database might need B, D, and E. Each of these applications uses a subset of the data. There can be many different requirements for applications that use the same data. They don’t necessarily need every single column. Each subset is called a view. There can be many views for a single database table.
- Normalization
– a way to get rid of repetitive columns and tables in databases.
It’s a way to make things simpler and more efficient.
- Program-data dependence – if making changes to your data means your program that uses the data no longer works, that’s called program-data dependence. Older file-based systems often had program-data dependence, where any change to data (or the structure of the data) required the application developer to change the program that uses the data. But with databases, program-data dependence is a thing of the past – at least for the most part. You want to design your database tables and queries in such a way that a database administrator or developer could add additional rows or columns to the table without it messing up your program that accesses the data.
- Entities – an entity is a real-world thing that you want to represent in a database. It can be something like a person, place, thing, or event. Examples of entities include employees, customers, orders, products, concerts, blog posts, comments, and branch offices.
- Relationships
– entities have relationships between them. This is a big part of
relational databases. A company has multiple branches, and a branch
has multiple employees. Those are examples of relations. Entities
have relationships with one another.
- Entity-relationship diagrams – just like how UML (Unified Modeling Language) is used for coming up with designs for classes within a program, ER diagrams are useful when you’re designing a database system. ER diagrams consist of three main parts: entities, relationships, and multiplicity. You can also add attributes to show what an entity has, though for basic diagrams, the entity name is good enough.
- Here’s an example of an ER diagram from a homework assignment I did for a database class:
- I n the above example, it’s what is referred to as a binary relationship, as there are only two entities involved. A single customer (minimum of 1, maximum of 1) has pet entities, with a minimum being zero (having no pets) and the maximum being unlimited, as designated by the asterisk, which means any number. Each entity has attributes, which would be columns in a database table.
- If you have an ER model (represented by an ER diagram) that has 3 entities instead of 2, it would be called a ternary relationship. 4 would be called quarternary. For an arbitrarily large number of entities, it would be called n-ary. However, most relationships will have a low number of entities involved.
- The following is an example of a ternary relationship:
- You can make an ER model and then visually map it out using an ER diagram to aid in the process of database design. However, just like UML diagrams, many people skip them and just get right into the coding.
- The above ER diagram means “a car contains 2 or more parts from 1 or more suppliers.”
- Joins – a general rule of thumb is that there is one entity per table. So you’d have a Customer table and a Pet table, rather than a single table that contains information about both Customer records and Pet records. Each entity can have multiple attributes, represented with columns. But they are still separate entities. But if you want to use a query that combines multiple entities, such as listing all Customer rows and their corresponding Pet entities, you could use something called a join. Joins allow you to combine stuff from multiple tables. However, joins only apply to the output of a query. Joining two tables will only change what the querier sees, but the tables themselves remain separate within the database.
- An easy way to understand joins is with Venn diagrams. There are many different types of joins, which would be like coloring in different parts of the circles in a Venn diagram. A very common type of join is a simple inner join. That means it only gets results where both things are true.
- Now without further ado, it’s time to set up software on your computer so that you can interact with a database and practice SQL queries. Learning by doing is more effective than just reading.