Organizing and Storing Data

Binary – base 2. We’re used to decimal, which is base 10, meaning each digit can be 0 through 9. In base 2, or binary, each digit is either 0 or 1. Computers use binary because telling two states apart reliably is much easier than telling ten apart, a problem that goes back to the vacuum tube days. With fluctuations in voltage, a 3 could look like a 4; an 8 could look like a 7, and so on. A binary digit can mean 0 or 1, false or true, no or yes, or it can be one piece of something larger. A single 0 or 1 in binary is called a bit, and 8 bits make a byte. ASCII defines 128 characters using 7 bits, and each character is usually stored in one 8-bit byte. Unicode covers basic alphanumeric text as well as many other kinds of characters, such as those from other languages and emojis. How many bytes a Unicode character takes depends on the encoding: UTF-8 uses 1 to 4 bytes per character, and UTF-16 uses 2 or 4.
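As a small illustration of moving between decimal and binary, here is a sketch in Python. The number 13 is just an arbitrary example value.

```python
# Convert between decimal integers and strings of binary digits.
number = 13

# format(..., "b") gives the binary digits without Python's "0b" prefix.
binary_text = format(number, "b")

# int(text, 2) parses a string of binary digits back into an integer.
round_trip = int(binary_text, 2)

# Eight bits make one byte, so zero-pad to show the value as a full byte.
as_byte = format(number, "08b")

print(binary_text)  # 1101
print(round_trip)   # 13
print(as_byte)      # 00001101
```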

Binaries – not to be confused with binary (the base 2 number system). When people refer to software as binaries, they are referring to compiled executables. In other words, finished programs you can run. Downloading a binary doesn’t mean downloading any old random 1s and 0s; it means you’re downloading code that has been compiled into something your computer can execute. A common shortening of binary is bin.

Octal – base 8, using the numbers 0-7.

Hexadecimal – base 16, often used as an easier way of showing binary data, such as a memory address. Hexadecimal is used much more often than octal for displaying things. Data is not stored as hexadecimal; we only view hexadecimal representations of binary data. 2^4 is 16, so a single hexadecimal character can be used to represent 4 binary bits. Hexadecimal uses the numbers 0-9 and the letters A-F. A is 10, B is 11, C is 12, and so on. An example of hexadecimal is 0x416C616E. The 0x is a hexadecimal prefix which tells you that the stuff following should be interpreted as hex.
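To make the example value above concrete, this sketch treats 0x416C616E as four bytes and decodes them as ASCII text. Reading the value as text is my own choice for illustration; the same bytes could just as well be a number or a memory address.

```python
# The same four bytes, viewed as a number and as text.
value = 0x416C616E                   # an integer written in hex notation
raw = bytes.fromhex("416C616E")      # the four bytes 41 6C 61 6E

# Each pair of hex digits is one byte; decoded as ASCII they spell a word.
text = raw.decode("ascii")

# One hex digit covers exactly 4 bits, so F (15) is 1111.
hex_digit_bits = format(0xF, "04b")

# Two hex digits cover one byte: 0x41 is 01000001.
first_byte_bits = format(raw[0], "08b")

print(text)             # Alan
print(hex_digit_bits)   # 1111
print(first_byte_bits)  # 01000001
```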

Character encoding – a way of representing text using combinations of 1s and 0s. Some examples of character encoding are UTF-8 and UTF-16.

ASCII – a way of encoding characters: just basic text, numbers, and simple punctuation. Standard ASCII is a 7-bit code with 128 characters, but each character is normally stored in one 8-bit byte. There are also 8-bit extensions, often loosely called extended ASCII, but I won’t concentrate on those.

Unicode – unlike ASCII, Unicode has support for well over a hundred thousand unique characters. It includes everything that’s in ASCII, in addition to characters from other languages, as well as emojis.
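Here is a short sketch of how the same text takes up different numbers of bytes depending on the character encoding. The three sample characters are my own picks: a plain ASCII letter, an accented letter, and an emoji.

```python
# "A", "e with acute accent" (U+00E9), and a grinning face emoji (U+1F600).
text = "A\u00e9\U0001F600"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

# UTF-8 uses between 1 and 4 bytes per character.
utf8_sizes = [len(ch.encode("utf-8")) for ch in text]

# UTF-16 uses 2 bytes for most characters and 4 for ones like emojis.
utf16_sizes = [len(ch.encode("utf-16-le")) for ch in text]

print(utf8_sizes)   # [1, 2, 4]
print(utf16_sizes)  # [2, 2, 4]

# Decoding with the matching encoding gets the original text back.
assert utf8_bytes.decode("utf-8") == text
```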

Base64 – using plain alphanumeric characters to represent binary data. Base64 consists of 64 different characters – A-Z, a-z, 0-9, /, and +. Any data that can be stored on a computer can be represented in base64, but it’s typically used for binary data (or steganography/hacking, but that’s another topic entirely). Base64 makes text or files look like gibberish, but don’t be fooled – base64 is not encryption. It is trivially easy to decode base64-encoded data.
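To show how easy base64 is to reverse, here is a minimal sketch using Python’s standard base64 module. The input b"hello" is just a sample value. Note the = at the end of the output, which is a padding character used in addition to the 64 listed above.

```python
import base64

original = b"hello"

# Encoding turns arbitrary bytes into plain ASCII characters.
encoded = base64.b64encode(original)

# Decoding needs no key or password, which is why this is not encryption.
decoded = base64.b64decode(encoded)

print(encoded)  # b'aGVsbG8='
print(decoded)  # b'hello'
```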

Text vs. binary data – there are two main classifications of data: text and binary. Text data is pretty straightforward. Binary data uses formats that are not human-readable but make sense to programs. An example of text data would be a .txt file, but something like a jpg image file is stored using a binary data format. If you tried to view a jpg in a text editor, it would look like gibberish, and a hex editor would only show you its raw bytes.

File IO – file IO means file input and output. Your first programs might not deal with opening or writing to files, but eventually, you will need to deal with them, because values only in RAM won’t do you much good. You want to save and load things now and then. You can make a file, delete a file, open a file, close a file, overwrite a file, or append to a file. When dealing with file IO, you will need to add exception handling to your code. Maybe your code will open a file called list_of_bands.txt. But what if that file doesn’t exist? You need to account for things like that.

Depending on the operating system and how a file was opened, an open file may be locked so that other programs can’t change it, and an open file always ties up system resources. So you want to make sure to close a file when you’re done with it. When a file is opened, you can do things like get the lines from it and then store them as strings, or whatever you want to do with them. Just beware of running into the EOF, or End Of File. Continuing to parse a file when there’s nothing more in it can lead to problems.

You can perform actions like find and replace, or parse the whole file in its entirety. You can also append to a file, which is to add stuff to the end without deleting the rest of the contents.

If you write to a file, make sure you know the difference between overwriting and appending, otherwise, you can delete all the existing contents, and that might be a bad thing. If you don’t have permission to edit a file, you can get an error with that too.

There are different modes for opening a file. You can open a file as read-only, for writing (which overwrites it), or for appending. If all you want to do is load contents from a file into variables in RAM, then open it as read-only.
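The ideas above (modes, overwriting vs. appending, closing files, and handling a missing file) can be sketched in Python like this. The helper name read_lines and the band names are made up for the example, and a temporary folder is used so nothing real gets touched.

```python
import os
import tempfile

def read_lines(path):
    """Return the lines of a text file, or an empty list if it doesn't exist."""
    try:
        # "r" is read-only mode; the with block closes the file for us.
        with open(path, "r", encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]
    except FileNotFoundError:
        return []

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "list_of_bands.txt")

    # The file doesn't exist yet, so the exception handler kicks in.
    missing = read_lines(path)

    # "w" creates the file, or wipes it if it already exists.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Rush\n")

    # "a" adds to the end and keeps the existing contents.
    with open(path, "a", encoding="utf-8") as f:
        f.write("Yes\n")
    after_append = read_lines(path)

    # Opening with "w" again throws away everything that was there.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Genesis\n")
    after_overwrite = read_lines(path)

print(missing)          # []
print(after_append)     # ['Rush', 'Yes']
print(after_overwrite)  # ['Genesis']
```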

Common types of files you will deal with for file IO include .txt, .csv, .xml, and .json. These are all text-based files.

If you want to read or write other kinds of files, such as GIFs, that is referred to as binary IO, because you are dealing with binary data formats.
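As a rough sketch of binary IO, the code below writes six raw bytes to a file and reads them back in binary mode. It does not create a real image. It only writes the signature that GIF files start with (GIF87a or GIF89a), and the file name sample.bin is made up.

```python
import os
import tempfile

# The ASCII codes for "GIF89a", which is how a GIF file begins.
payload = bytes([0x47, 0x49, 0x46, 0x38, 0x39, 0x61])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.bin")

    # "wb" means write in binary mode: bytes go to disk exactly as given.
    with open(path, "wb") as f:
        f.write(payload)

    # "rb" means read in binary mode: we get bytes back, not a string.
    with open(path, "rb") as f:
        header = f.read(6)

# Checking the first few bytes is a common way to guess a file's format.
looks_like_gif = header in (b"GIF87a", b"GIF89a")

print(header)          # b'GIF89a'
print(looks_like_gif)  # True
```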

If you study computer science, you might have to write a file parser from scratch, but in the real world, you can use a parser that is either built into a standard library or on GitHub or something.

File IO isn’t the only kind of IO though. The kind of IO you will start with in programming is outputting text to the window and getting input from the keyboard.

Absolute vs. relative path – cat.jpg is a relative path in the current directory.

../images/cat.jpg is a relative path that goes up one level and then down into the images folder.

A relative path shows its path in relation to the current directory. An absolute path is an entire path, such as /Users/bob/Documents/images/cat.jpg
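Here is a small sketch of how a relative path turns into an absolute one. The current directory /Users/bob/Documents/pages is an assumption made up for the example, and the POSIX-flavored path tools are used so the result is the same on any operating system.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical current directory for this example.
current_dir = PurePosixPath("/Users/bob/Documents/pages")
relative = PurePosixPath("../images/cat.jpg")

# Joining keeps the ".." in place...
joined = current_dir / relative

# ...and normpath collapses it: up one level, then down into images.
absolute = posixpath.normpath(str(joined))

print(relative.is_absolute())  # False
print(absolute)                # /Users/bob/Documents/images/cat.jpg
```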

When you are making a website, don’t use absolute paths. Use relative ones. You might also want to make folders for separate things to keep it all organized.

input() – the way to get user input in Python is with the input() function. You need to assign the return value of the function to a variable if you want to keep it.

user_name = input("Enter your name: ")

print("Your name is " + user_name)

Parser – a parser is a program that reads through some input, such as the contents of a file, and breaks it into structured pieces that the rest of your code can work with. I’ve written a CSV parser that parses through CSV-structured files. I used it for something that required sorting and searching through football players. Quite often, languages will have built-in parsers for numerous kinds of files, such as CSV or JSON. You don’t have to reinvent the wheel when instead you can just read your programming language’s documentation and see if it has what you need already.

Generally speaking, you might want to write or implement a parser to load the contents of a file into RAM or to access select parts of it rather than all of it. Parsers go through the contents of a file, and they need to stop at the EOF, or End Of File.

If someone makes a video game, and they want to add the ability to save and load games, they need to create their own file structure for a game save. They might implement this using XML or JSON, and then create a parser for it to load the save file into RAM so that the user can play where they left off the last time they saved.

EOF – end of file. When you’ve reached the EOF, it’s time to stop parsing a file.
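In Python, one way to notice the EOF is that readline() returns an empty string once there is nothing left to read. This sketch uses an in-memory io.StringIO object as a stand-in for a real file, with made-up contents.

```python
import io

# Stand-in for an opened text file.
fake_file = io.StringIO("first line\nsecond line\n")

lines = []
while True:
    line = fake_file.readline()
    if line == "":
        # An empty string (not even a newline) means we hit the EOF.
        break
    lines.append(line.rstrip("\n"))

print(lines)  # ['first line', 'second line']
```

In everyday code, looping with "for line in file" handles this check for you.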

CSV – comma-separated values. A straightforward file format. You can open a CSV file in a program such as Microsoft Excel or LibreOffice Calc, and it will look like a spreadsheet, with rows and columns.

Here’s an example of CSV:

joe,23,555 main street
alice,39,456 country road
bob,27,123 prairie lane

After a certain point, you might have too much data for a simple CSV file. At that point, you will want to move on and use databases instead, which are like spreadsheets on steroids. But to start, or for very simple programs, CSV is fine.
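The three CSV rows shown above can be read with the parser in Python’s standard csv module. The column names name, age, and address are my own labels, since the example has no header row.

```python
import csv
import io

csv_text = (
    "joe,23,555 main street\n"
    "alice,39,456 country road\n"
    "bob,27,123 prairie lane\n"
)

# csv.reader splits each line on commas and yields a list per row.
rows = list(csv.reader(io.StringIO(csv_text)))

# Give the columns names (assumed here) and convert age to a number.
people = [
    {"name": name, "age": int(age), "address": address}
    for name, age, address in rows
]

# With real data types, sorting and searching become easy.
oldest = max(people, key=lambda person: person["age"])["name"]
names_by_age = [p["name"] for p in sorted(people, key=lambda p: p["age"])]

print(rows[0])       # ['joe', '23', '555 main street']
print(oldest)        # alice
print(names_by_age)  # ['joe', 'bob', 'alice']
```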

Delimiter – a way of breaking things up, or defining how things are broken up. When you hear the word delimiter, think separator. Comma-separated values (a.k.a. CSV), for example, are values delimited by commas. Command line arguments are delimited by spaces, such as this:

adder_program 4 5 7

In the above example, the arguments are 4, 5, and 7, because there are spaces in between them. But what about in the following example?

adder_program 45 7

In the above example, the command line arguments being passed to adder_program are 45 and 7. Because there is no space between the 4 and the 5, it knows that they’re part of the same argument, as there is no delimiter between the digits.
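The two adder_program examples can be imitated with string splitting. This is only a sketch: in a real program the shell does the splitting and Python hands you the pieces in sys.argv. The function name parse_command is made up.

```python
def parse_command(command_line):
    """Split on spaces; return the program name and its integer arguments."""
    parts = command_line.split()
    return parts[0], [int(part) for part in parts[1:]]

name, first_args = parse_command("adder_program 4 5 7")
_, second_args = parse_command("adder_program 45 7")

print(first_args, sum(first_args))    # [4, 5, 7] 16
print(second_args, sum(second_args))  # [45, 7] 52

# With a comma as the delimiter, spaces are just part of the value.
fields = "joe,23,555 main street".split(",")
print(fields)  # ['joe', '23', '555 main street']
```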

XML – Extensible Markup Language. XML is a data exchange format, kind of like JSON. A lot of older stuff is XML, whereas a lot of newer things use JSON instead. XML is still widely used, but in my experience it’s losing ground, while JSON is on the up and up.

XML looks like HTML, but it’s not. You can make whatever tags you want. It’s just a way to structure your data with tags. Many languages have built-in parsers for XML. In Python, for example, you can import one from the standard library and easily extract the exact piece of the file that you want.

Let’s say you’re making blog software and it takes XML data and combines it with a template. You want to structure your data in a way that makes sense for the given use-case.

XML example:

<article>
  <title>Cats are great</title>
  <image>cat.jpg</image>
  <caption>Picture of a cool cat</caption>
  <body>
    Cats are awesome. This is XML.
  </body>
</article>

I could make any tag I want, even a <cat> tag or <whatever> tag if I really felt like it. Just like how you can name variables in a programming language, you can use whatever identifiers you want in XML. And please don’t get the wrong idea. The above XML example is not HTML, nor is it a web page. It is merely a structured representation of data. In web development, you want to separate content from layout, and XML is one way you can achieve that. XML says what the content of the page will be, but it says absolutely nothing about what the page will look like. There’s also something called XHTML, which is HTML rewritten to follow XML’s stricter syntax rules. However, XHTML, which was based on HTML 4, has largely been superseded by HTML5. Not only that, but JSON is becoming more popular than XML these days.

For the blog software I was talking about before, if you wanted to put a particular part of the XML into the template, you’d do a find and replace in the HTML template that would retrieve the title to put it into the title placeholder, and so on for all the other parts of it too.
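Here is a minimal sketch of that idea, using the xml.etree.ElementTree parser from Python’s standard library and the article XML from the example. The template string and its {{title}} and {{body}} placeholders are invented for the illustration. Real blog software would likely use a proper template engine and escape the text.

```python
import xml.etree.ElementTree as ET

xml_text = """<article>
  <title>Cats are great</title>
  <image>cat.jpg</image>
  <caption>Picture of a cool cat</caption>
  <body>
    Cats are awesome. This is XML.
  </body>
</article>"""

# Parse the XML text into a tree; root is the <article> element.
root = ET.fromstring(xml_text)

# Pull out just the pieces we want by tag name.
title = root.find("title").text
body = root.find("body").text.strip()  # strip() drops the layout whitespace

# Hypothetical HTML template with placeholders to find and replace.
template = "<h1>{{title}}</h1><p>{{body}}</p>"
page = template.replace("{{title}}", title).replace("{{body}}", body)

print(page)  # <h1>Cats are great</h1><p>Cats are awesome. This is XML.</p>
```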

XML is like a dictionary, but it’s a little repetitious because you say each tag twice – once for opening and once for closing. I prefer JSON.

JSON – JavaScript Object Notation. Born out of JS, but you can use it for things that have nothing to do with JavaScript. It’s a data exchange format, like XML, but you’ll see that it’s different in some ways, and in my opinion, it’s better.

Here is a JSON example of the same article stuff from the XML example:

{
  "title": "Cats are great",
  "image": "cat.jpg",
  "caption": "Picture of a cool cat",
  "body": "Cats are awesome. This is JSON."
}

Compared to XML, it seems a little less repetitious. It also looks a little more like JS instead of HTML. You can also perform nesting, or have arrays and whatnot too. JSON schemas don’t have to be strict. They can be flexible, and as long as your JSON has certain things, it doesn’t always have to be precisely the same as your schema (depending on how you set it up). But depending on what you’re doing, you might want to be more rigid.
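To show what working with JSON looks like, here is a sketch using Python’s standard json module on the article example. The "tags" array is my own addition, there only to demonstrate an array inside the object.

```python
import json

json_text = """
{
  "title": "Cats are great",
  "image": "cat.jpg",
  "caption": "Picture of a cool cat",
  "body": "Cats are awesome. This is JSON.",
  "tags": ["cats", "pets"]
}
"""

# json.loads turns JSON text into ordinary Python dicts and lists.
article = json.loads(json_text)

title = article["title"]
first_tag = article["tags"][0]

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.loads(json.dumps(article))

print(title)      # Cats are great
print(first_tag)  # cats
```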

For Python, you can use this command to install a JSON schema validation package from PyPI:

pip install jsonschema

My WordPress websites have JSON APIs built-in, thanks to the CMS. I didn’t need to lift a finger; it’s already there. Here is an example of a website I’ve made, and how it has an API: https://smartfinancialresearch.com/wp-json/

Schema – a defined structure for something, such as JSON. If I have a JSON file for an article for a blog, the schema might define that there will be a title, date, author, and body text for it. If a file doesn’t have the right structure, it can’t be validated with the schema. Databases also use schemas.

Validation – to validate something is to check it against a set of rules. In the context of schemas, to validate JSON against a schema is to check whether the JSON has the structure the schema says it should have.
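To give a feel for what validation does, here is a deliberately tiny hand-rolled check. It is not how the jsonschema package works and is no replacement for it. The "schema" here is just a Python dict of required field names and types that I made up, based on the blog article fields mentioned above.

```python
def validate_article(data, schema):
    """Return a list of problems; an empty list means the data passed."""
    errors = []
    for field, expected_type in schema.items():
        if field not in data:
            errors.append("missing field: " + field)
        elif not isinstance(data[field], expected_type):
            errors.append("wrong type for field: " + field)
    return errors

# Toy schema: every article needs these four text fields.
article_schema = {"title": str, "date": str, "author": str, "body": str}

good_article = {
    "title": "Cats are great",
    "date": "2020-01-01",
    "author": "bob",
    "body": "Cats are awesome.",
}
bad_article = {"title": "Cats are great", "body": 42}

good_errors = validate_article(good_article, article_schema)
bad_errors = validate_article(bad_article, article_schema)

print(good_errors)  # []
print(bad_errors)
```

A real schema validator does much more, such as checking nested objects, optional fields, and value ranges.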

Database – there are many different types of databases, but I’m going to concentrate on relational databases here, as they are the most common. A relational database is made up of multiple tables, each of which can be designated for a separate purpose. A table consists of rows and columns. You can make whatever categories you want for the columns, such as a user table with columns for username, password (which should be hashed instead of being stored in an insecure way), email, and so on. After making the table with columns, you can specify further info about what kind of data should be in each column, and whether it’s optional or not. A record, or row, is an entry in the table, such as the one created when a user makes a new account on your website. Each row should have a different primary key to uniquely identify it from other rows.

With database tables, you can use SQL queries to do things like drop a table, select information from a table, search, update, and delete. Some relational database systems include MySQL, PostgreSQL, and SQLite. Tools such as phpMyAdmin and MySQL Workbench help you manage databases, and bundles such as Bitnami’s LAMP VM or local WAMP/MAMP stacks install a database server along with a web server.
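As a sketch of tables, rows, primary keys, and basic SQL queries, here is an example using the sqlite3 module that comes with Python. The table layout, usernames, and email addresses are all made up, and the database lives only in memory. The password_hash values are placeholders, since real code would store the output of a proper password hashing function.

```python
import sqlite3

# ":memory:" creates a throwaway database that disappears when closed.
conn = sqlite3.connect(":memory:")

# Each column gets a type, and NOT NULL marks it as required.
conn.execute(
    "CREATE TABLE users ("
    " id INTEGER PRIMARY KEY,"
    " username TEXT NOT NULL UNIQUE,"
    " password_hash TEXT NOT NULL,"
    " email TEXT)"
)

# The ? placeholders keep data separate from the SQL itself.
insert_sql = "INSERT INTO users (username, password_hash, email) VALUES (?, ?, ?)"
conn.execute(insert_sql, ("alice", "placeholder-hash-1", "alice@example.com"))
conn.execute(insert_sql, ("bob", "placeholder-hash-2", None))  # email is optional
conn.commit()

# SELECT: search for one row; id is the automatically assigned primary key.
bob_row = conn.execute(
    "SELECT id, username FROM users WHERE username = ?", ("bob",)
).fetchone()

# UPDATE and DELETE change or remove existing rows.
conn.execute("UPDATE users SET email = ? WHERE username = ?", ("bob@example.com", "bob"))
conn.execute("DELETE FROM users WHERE username = ?", ("alice",))
conn.commit()

remaining = conn.execute("SELECT username, email FROM users").fetchall()
conn.close()

print(bob_row)    # (2, 'bob')
print(remaining)  # [('bob', 'bob@example.com')]
```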
