Learning Apache Cassandra
上QQ阅读APP看书,第一时间看更新

Selecting data

We now know how to retrieve data from the database, but that isn't much good unless we can get it back again. Let's say we now want to build an account settings page for MyStatus; we've got the user's username stored in a persistent session, but we will retrieve the other profile fields from the database to display in the settings form:

SELECT * FROM "users"
WHERE "username" = 'alice';

This query tells Cassandra we want to retrieve the rows where the value for username (the primary key) is alice. The * wildcard simply says we would like all the columns in that row, saving us from having to type them all out. You'll see the rows we requested nicely formatted in the CQL shell as follows:

Selecting data

In other scenarios, we don't need all the columns. When a user tries to log in to MyStatus, we want to retrieve their password and compare it to the one the user provided us with, but we don't care about the email. Avoiding unnecessary columns reduces the amount of data that needs to be transferred between Cassandra and our application, thus making the queries faster. Instead of using a wildcard (*), we can instead type a list of columns we are interested in:

SELECT "username", "encrypted_password" FROM "users"
WHERE "username" = 'alice';

You'll see that in the results, the email column no longer appears, since it wasn't in the list of columns that we specified:

Selecting data

Missing rows

What happens when we ask for a primary key that doesn't exist? Let's try it out:

SELECT * FROM "users"
WHERE "username" = 'bogus';

You'll see that Cassandra simply returns no results; it is not an error to try to retrieve a primary key that does not exist.

Selecting more than one row

Let's say we'd like to build an administrative interface that allows employees to access data for several users in one screen. We could, of course, simply perform a query for each username specified, but Cassandra gives us a more efficient way to do this:

SELECT * FROM "users"
WHERE "username" IN ('alice', 'bob');

This query will return two rows: one with the alice primary key and the other with the bob primary key:

Selecting more than one row

Note that while this will be faster than performing two queries, it does require Cassandra to perform two seeks for the two rows, so querying for the additional row comes at some cost.

Retrieving all the rows

Perhaps we'd like to expand our administrative interface to show a list of all the users who've signed up for MyStatus. To do this, we'll want to simply ask Cassandra for all the user records:

SELECT * FROM "users";

Since we omitted the WHERE portion of the query, Cassandra will simply return all rows in the users table:

Retrieving all the rows

Note

If you've been following along with the examples, you'll see that the latest query returns both rows that we've inserted into the database. However, had we inserted over 10,000 rows, we'd notice that only 10,000 rows would be returned. This is a limit built into cqlsh; using a language driver, you can retrieve result sets of arbitrary size.

You'll also notice that bob is returned before alice. Clearly, rows are not returned in lexical order of their primary key. As it happens, in a table with a single primary key column, row ordering is deterministic but not defined from a client standpoint. Deterministic ordering is enough, however, to allow us to page through very large result sets.

Note

You'll find a full reference for the SELECT statement in the DataStax CQL documentation at http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html.

Paginating through results

In most situations, we will want to avoid displaying an arbitrary number of results to a user in a single response. Instead, we will display a fixed number of results, and give the user an interface to load additional pages of the same size. In an SQL database, pages are typically specified using the OFFSET keyword, but CQL does not have this capability. Instead, we'll use the natural ordering of primary keys to put a lower bound on the key values displayed on a given page.

Let's insert a couple more rows into our table to get a sense of how pagination works using the following INSERT statements:

INSERT INTO "users"
("username", "email", "encrypted_password")
VALUES (
  'carol',
  'carol@gmail.com',
  0xed3d8299b191b59b7008759a104c10af3db6e63a
);

INSERT INTO "users"
("username", "email", "encrypted_password")
VALUES (
  'dave',
  'dave@gmail.com',
  0x6d1d90d92bbab0012270536f286d243729690a5b
);

Now that we've got four users in our system, we can paginate over them. To save ourselves the trouble of adding numerous additional entries by hand, we'll just use a page size of two to demonstrate the process. We'll start by retrieving the first page as follows:

SELECT * FROM users
LIMIT 2;

The LIMIT part of the query simply tells Cassandra to return no more than two results:

Paginating through results

Now that we have our first page, we want to get the second. It would be nice if we could simply ask for the next primary key in order after dave using the following SELECT statement:

SELECT * FROM "users"
WHERE "username" > 'dave'
LIMIT 2;

Unfortunately, this will give an error:

Paginating through results

The message is a bit cryptic, and we don't know what exactly a partition key is yet, but it does contain a helpful hint: we can use the token() function to do what we want. The reason that our attempted query doesn't work is that, as we noticed before, the primary keys in our users table are not stored in lexical order; Cassandra can only return rows in the order in which they are stored.

The actual ordering is determined by the token of the primary key; the way the token is calculated is opaque to us, but Cassandra lets us use the token() function to retrieve the token for a given value:

SELECT "username", token("username")
FROM "users";

Now we can see why the rows are returned in the order they are; they ascend by token:

Paginating through results

Armed with this function, we can retrieve the next page of results as follows:

SELECT * FROM "users"
WHERE token("username") > token('dave')
LIMIT 2;

And just as we'd hoped, the next two rows are returned as shown below:

Paginating through results

Using this technique, we can paginate over arbitrarily large tables using multiple queries.

Note

It bears emphasizing that retrieving large result sets from tables structured like users is a relatively expensive operation for Cassandra. In Chapter 3, Organizing Related Data, we'll begin to develop a more advanced table structure that will allow us to retrieve batches of rows very efficiently.