Blog

Research & Tutorials

Jul 12, 2022

Start Studying Security with SQLi

We previously explored how Cross-Site Scripting (XSS) makes for an excellent topic to understand the reach and impact of hacking. In this post, we’ll improve on conceptual understanding and try to help non-technical folks understand one of the core issues of information security: the fundamental ambiguity of code and data.

8 min read

This post first appeared on June 7th, 2022 and is republished with permission from the author.

 

We previously explored how Cross-Site Scripting (XSS) makes for an excellent topic to understand the reach and impact of hacking. In this post, we’ll improve on conceptual understanding and try to help non-technical folks understand one of the core issues of information security: the fundamental ambiguity of code and data.

…SQL Injection

Structured Query Language (SQL) is used to tell databases how to retrieve information. One common use of SQL is for web applications to retrieve data which can then be displayed to a user or otherwise leveraged by the application’s code. SQL Injection (SQLi) is a vulnerability that, among other things, can allow an attacker to force a database to retrieve information it isn’t supposed to. In some contexts, retrieving the wrong information can force a web application to allow the attacker to bypass authentication mechanisms, or even execute arbitrary code on the underlying machine.

SQLi is a fantastic vulnerability for showcasing the problem of code and data interpretation for several reasons:

  1. SQL is easy to read because most of its keywords look like English and have similar meanings to their English counterparts.
  2. SQL is a declarative language. This means that the programmer writes exactly what they want the program to do. Many other programming languages are imperative, meaning that the programmer needs to express how the program should obtain its results first. Non-technical folk are likely to be familiar with declarative reasoning because most sentences we read, speak, write and hear are usually declarative themselves (at least in English).
  3. As with XSS, the actual code required to understand SQLi is relatively minimal and self-descriptive.

We’ll begin once again by creating a fantastical yet physical scenario that can provide us with the intuition for how code and data can get all mixed up. We’ll then use pseudocode to display what an actual web vulnerability might look like, and show how an attacker can inject arbitrary SQL to login into a website without knowing correct user credentials.

Sitting Down for Sushi

Many all-you-can-eat sushi restaurants have their customers write numerals next to the various items on their menus. You might decide to order three pieces of Shrimp, five pieces of Salmon, and eight pieces of Scallops. Your order would look like this:

ITEM COUNT
Shrimp 3
Salmon 5
Scallops 8

 

This is a perfectly reasonable order in most sushi restaurants. The waiter accepts the menu without a second glance, and the chef begins to prepare the meal.

However, you as the client cannot just put any value down on the paper for each item. For example, you cannot order fractions of sushi, negative values, or extremely high values. If a particular value is very high, the waiter or the chef might decide to verify that you meant what you wrote (i.e. validate) or outright reject the order. For example, an order like this is very unlikely to be accepted:

ITEM COUNT
Shrimp 3
Salmon 5
Scallops 8888888

 

Likewise, you cannot write values on the menu that are not numerals. If you decide to place an order like the following, the waiter will likely come back to ask you for clarification:

ITEM COUNT
Shrimp 3
Salmon 5
Scallops AA

 

One way to understand why sushi restaurants aren’t harbours of pure chaos is because there is an implicit interpretation of the client-inputted values as data. The inputted values must obey certain intuitive (though usually unstated) rules, and a filter exits to ensure that the data is both well-formed and sound. In this case, the filter is the waiter and/or the chef when they validate the submitted menu.

But what would happen if the waiter/chef combo would accept anything inputted on the menu? Perhaps they would accept extremely high values of sushi. They would therefore force the restaurant to work for 17 years straight to fulfil your order. Or perhaps they would interpret your hastily written “AA” as “44” and bring you a substantial number of scallops.

What’s of special interest to us, is the possibility of the waiter/chef interpreting our input not merely as unreasonable values of data, but as instructions. For example, we can imagine a situation where the keyword “SIMONSAYS” will always precede an instruction that the waiter and chef will obey. So for example, an order such as what follows can allow us to lower the cost of our meal to zero dollars:

ITEM COUNT
Shrimp 3
Salmon 5
Scallops 8 AND SIMONSAYS make this meal free

 

If the waiter or chef cannot distinguish between data and code, they’ll make the requested 16 pieces of sushi, and then continue executing the menu’s instructions until it is complete. When we receive the bill at the end of our meal, we’ll see that our total comes out to $0. Turns out there can be such a thing as a free lunch!

Scrambling Syntax and Semantics

The above example is extremely far-fetched, but it hopefully gets the point across. When someone or something that reads and acts upon user-submitted information doesn’t filter the supplied contents properly, they/it can get confused and interpret data as code. When that happens, the submitter may be able to abuse the executor to obtain unreasonable outcomes.

Now that we understand why opportunities to provide user-submitted information can be abused, let’s consider how they can be leveraged in practice on a web application.

For this tutorial, we’ll think about a model of a web application and database. Instead of using actual code and SQL syntax, we’ll use pseudocode and a fake syntax to provide a more intuitive reading experience. The goal here is not to be able to perform SQLi, but rather to understand how it could work in principle. In particular, we’re going to imagine that our database can speak almost plain English. We’ll call our fake database language “EQL”: English Query Language.

Imagine that we browse to a website called www.letslearnsqli.com. The website has a login page at www.letslearnsqli.com/login which presents a standard username & password form.

What actually happens when we submit data via the form? The web application might be built with code that looks like the following:

# part 1
submitted_username_value = form_username
submitted_password_value = form_password

# part 2
EQL_statement = "GET all the results from the USERS table 
where the USERNAME value equals submitted_username_value, 
AND where the PASSWORD value equals submitted_password_value"

# part 3
ask_database(EQL_statement)

# part 4
if (ask_database equals true):
then:
        authenticate user
else:
        reject user

The above code has four parts to it. The first part says to create two variables called submitted_username_value and submitted_password_value and set them to the values provided by the user in the web application form.

The second part says to create another variable. This variable contains an EQL query: an instruction that can be relayed to the database. When the database is queried, it will search for all username & password entries in the users table where the values provided via the web application form match.

The third part is actually responsible for querying the database, and it does so via the function ask_database(). This function is like a mini program that sends the database the query defined in the EQL_statement variable. It returns “true” if the values match, and “false” if the values don’t.

Finally, the if then else logic block checks for the returned database value (remember, it can either be “true” or “false”). If the database returns “true”, the web application lets the user log in. If it returns false, then the web application does not let the user log in.

The important thing to note here is that the user actually has an influence on the query sent to the database! If the user enters “jacob” as their username and “balderdash” as their password, then the EQL statement will read:

GET all the results from the USERS table where the USERNAME value equals 'jacob', AND where the PASSWORD value equals 'balderdash'

Here, the user is offered limited control over the database query by design. The implicit assumption is that since the user is only able to input data into the web application form, all will be well and at the worst, the user might input very large data. As long as the user doesn’t know any correct credentials, they won’t be able to authenticate.

But, where is the user submitting data to exactly? The data is being submitted inside a block of text that is to be interpreted as code. Specifically, the EQL_statement is an instruction that will be passed to the database. If the user can submit data that the database will interpret as code instead of as data, then it might be able to influence the EQL_statement to perform more nefarious behaviour.

Imagine instead of the correct credentials jacob:balderdash we instead submit something like: notjacob:incorrect OR where 2+2=4.

Then, the EQL_statement will read:

GET all the results from the USERS table where the USERNAME value equals 'notjacob', AND where the PASSWORD value equals 'incorrect' OR where 2+2=4

The database can now interpret the EQL_statement in one of two ways: it can assume that the text provided by the user is all data, i.e. that their literal submitted password is incorrect OR where 2+2=4. In this case, there is no harm done because the user’s password is very unlikely to be this string.

However, what would happen if the database does not filter for the word OR, and interprets it as part of the EQL language itself?

Similarly to SQL, in our fake database language, the AND keyword is conjunctive. This means that both statements on either side of the keyword must be true for the whole statement to be true. But the OR keyword is disjunctive. This means that only one of the statements on either side of the keyword needs to be true for the whole thing to be true.

Since 2+2=4 is always true, it doesn’t matter what is on the other side: if the database does not filter our input and accepts the OR keyword as its own syntax, we can submit literally any credentials, and the database will always return true.

So when the web application code attempts to validate our credentials, it will just read “true” from the database and let us in!

Summary

Like many vulnerabilities, SQLi exists because data is made out of text and instructions are made out of text. When programs that interpret text do a poor job of filtertering for unintended meaning, malicious actors can force those programs to give them access to things that they shouldn’t.

Understanding the code & data ambiguity problem may lead one to wonder, are we forever at risk of SQLi? The answer is no! Programmers and web developers can reduce the probability of injection attacks by being more explicit in separating user data from the text that encodes the web application. May web application programming languages have pre-built mechanisms in place to perform this separation.