“billions of rows * millions of columns * thousands of versions = terabytes or petabytes of storage” (The HBase project)
Apache HBase is an open source implementation of Google’s BigTable. It is built atop Apache Hadoop and is tightly integrated with it. It is a good choice for applications requiring fast random access to very large amounts of data.
HBase stores data in the form of distributed, sorted, multidimensional persistent maps called tables. The table terminology makes it easier for people coming from the relational data management world to reason about data organization in HBase. HBase is designed to manage tables with billions of rows and millions of columns.
The HBase data model consists of tables containing rows. Data is organized into column families that group columns in each row. This is where the similarities between HBase and relational databases end. Next we will explain what lies under the hood of HBase tables, rows, column families, and columns.
To summarize an HBase table's mappings:
- a row key maps to a list of column families
- a column family maps to a list of column qualifiers (columns)
- a column qualifier maps to a list of timestamps (versions)
- a timestamp maps to a value (the cell itself)
Based on this, you get the following:
- if you retrieve the data that a row key maps to, you get data from all column families related to the row that the row key identifies
- if you retrieve the data that a particular column family maps to, you get all column qualifiers and their associated data (maps with timestamps as keys and corresponding values)
- if you retrieve the data that a particular column qualifier maps to, you get all timestamps (versions) for that column qualifier and all associated values.
Tables are declared up front at schema definition time. Row keys are arrays of bytes and they are lexicographically sorted with the lowest order appearing first.
HBase returns the latest version of data by default, but you can ask for multiple versions in your query. HBase returns data sorted first by the row key value, then by column family, column qualifier, and finally by the timestamp value, with the most recent data returned first.
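A quick sketch in Python (not HBase API code, just illustrating the ordering rules): row keys sort as raw bytes, which is why, for example, "row-10" sorts before "row-2", and versions come back newest-first.

```python
# Row keys are plain byte arrays, sorted lexicographically.
# Note the classic surprise: b"row-10" sorts before b"row-2" as bytes.
row_keys = [b"row-2", b"row-10", b"row-1"]
print(sorted(row_keys))  # [b'row-1', b'row-10', b'row-2']

# Versions (timestamps) are returned with the most recent first.
versions = {1000: b"old", 3000: b"newest", 2000: b"newer"}
ordered = sorted(versions.items(), reverse=True)  # newest-first
latest = ordered[0][1]
print(latest)  # b'newest'
```

If numeric ordering of keys matters, the usual fix is fixed-width, zero-padded keys (e.g. `row-02`, `row-10`) so byte order and numeric order agree.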
What is a multidimensional map?
A map is an abstract data type that represents a collection of key/value pairs. This is how it looks when presented in JSON:
{ "name": "Baruch Spinoza", "email": "baruch.spinoza@spinozahome.com" }
We can think about a multidimensional map as a map of maps. For example:
{
  "rowkey1": {
    "cf11": {
      "column111": {"version1111": value1111, "version1112": value1112},
      "column112": {"version1121": value1121, "version1122": value1122, "version1123": value1123, "version1124": value1124},
      "column113": {"version1131": value1131}
    },
    "cf12": {
      "column121": {"version1211": value1211},
      "column122": {"version1221": value1221, "version1222": value1222}
    }
  },
  "rowkey2": {
    "cf11": {
      "column111": {"version2111": value2111, "version2112": value2112},
      "column112": {"version2121": value2121, "version2122": value2122, "version2123": value2123, "version2124": value2124}
    },
    "cf12": {
      "column121": {"version2211": value2211},
      "column122": {"version2221": value2221}
    }
  }
}
In the example above, each key ("rowkey1", "rowkey2") points to a map with exactly two keys, "cf11" and "cf12".
We call these top-level key/value pairs rows.
The “cf11” and “cf12” mappings are Column Families.
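The map-of-maps structure can be modeled directly with nested Python dicts. This is a conceptual sketch, not the HBase API; the names `table`, `get_row`, and `get_latest` are illustrative:

```python
# A conceptual model of an HBase table:
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    "rowkey1": {
        "cf11": {"column111": {2: "v2", 1: "v1"}},
        "cf12": {"column121": {1: "x1"}},
    },
    "rowkey2": {
        "cf11": {"column111": {1: "y1"}},
    },
}

def get_row(t, row):
    """Fetching by row key returns data from all column families of that row."""
    return t[row]

def get_latest(t, row, cf, qualifier):
    """Without an explicit timestamp, the most recent version is returned."""
    versions = t[row][cf][qualifier]
    return versions[max(versions)]

print(sorted(get_row(table, "rowkey1")))                  # ['cf11', 'cf12']
print(get_latest(table, "rowkey1", "cf11", "column111"))  # v2
```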
How is it distributed?
HBase is built upon distributed filesystems, with file storage distributed across commodity machines. The distributed file systems HBase works with include
- Hadoop's Distributed File System (HDFS) and
- Amazon's Simple Storage Service (S3).
HDFS provides a scalable and replicated storage layer for HBase. It guarantees that data is never lost by writing the changes across a configurable number of physical servers.
The data is stored in HFiles, which are ordered, immutable key/value maps. Internally, an HFile is a sequence of blocks with a block index stored at the end. The block index is loaded when the HFile is opened and kept in memory. The default block size is 64 KB and is configurable. The HBase API can be used to access specific values and also to scan ranges of values given a start and end key.
Since every HFile has a block index, lookups can be performed with a single disk seek. First, HBase does a binary search in the in-memory block index to find a block containing the given key and then the block is read from disk.
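The lookup can be sketched with Python's `bisect` module (a toy model with a hypothetical block layout; real HFile internals differ):

```python
import bisect

# The in-memory block index: the start key of each block, kept sorted.
block_start_keys = [b"a", b"g", b"m", b"t"]

def find_block(key):
    """Binary-search the in-memory index for the block that may hold `key`.

    bisect_right - 1 gives the last block whose start key is <= key;
    only that one block then needs to be read from disk (one seek).
    """
    i = bisect.bisect_right(block_start_keys, key) - 1
    return max(i, 0)

print(find_block(b"h"))  # 1  (the block starting at b"g")
print(find_block(b"z"))  # 3  (the last block)
```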
When data is updated, it is first written to a commit log, called a write-ahead log (WAL), and then stored in the in-memory memstore.
When the data in memory exceeds a configured maximum size, it is flushed to disk as an HFile, and the commit logs are discarded up to the last unflushed modification. The system can continue to serve readers and writers without blocking while it flushes the memstore to disk. This is done by rolling the memstore: a new, empty one takes the incoming updates while the old, full one is converted into an HFile. No sorting or other special processing has to be performed, since the data in the memstore is already sorted by key, matching how HFiles are laid out on disk.
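The write path can be sketched as follows. This is a toy model with illustrative names (`put`, `flush`); real HBase triggers flushes on byte size (e.g. 128 MB), not entry count:

```python
import bisect

FLUSH_THRESHOLD = 3  # toy limit; HBase uses a configurable byte size

wal = []        # write-ahead log: appended to first, for recovery
memstore = []   # sorted in-memory list of (key, value) pairs
hfiles = []     # each flush produces one immutable, sorted "HFile"

def put(key, value):
    wal.append((key, value))               # 1. persist to the commit log
    bisect.insort(memstore, (key, value))  # 2. insert into the sorted memstore
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()

def flush():
    global memstore
    # The memstore is already sorted, so it becomes an HFile with no extra
    # work; WAL entries up to this point are no longer needed.
    hfiles.append(list(memstore))
    memstore = []
    wal.clear()

put(b"b", 1); put(b"a", 2); put(b"c", 3)
print(hfiles)  # [[(b'a', 2), (b'b', 1), (b'c', 3)]]
```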
The write-ahead log (WAL) is used for recovery purposes only. Since flushing memstores to disk creates more and more HFiles, HBase has a housekeeping job, called compaction, that merges HFiles into larger ones. Various compaction algorithms are supported.
Other HBase architectural components include the client library (API), at least one master server, and many region servers. The region servers can be added or removed while the system is up and running to accommodate increased workloads. The master is responsible for assigning regions to region servers. It uses Apache ZooKeeper, a distributed coordination service, to facilitate that task.
Data is partitioned and replicated across a number of regions located on region servers.
As mentioned above, assignment and distribution of regions to region servers is automatic; however, manual management of regions is also possible. When a region's size reaches a pre-defined threshold, the region automatically splits into two child regions. The split happens along a row key boundary. A single region always manages an entire row, which means that rows are never divided.
How is it sorted?
Key/value pairs in HBase maps are kept in lexicographical order. The amount of data you can store in HBase can be huge, so the data your queries retrieve should be stored near each other.
For example, if you run a query on an HBase table that returns thousands of rows distributed across many machines, network latency can be significant. This data distribution is determined by the row key of the HBase table. Because of that, row key design is one of the most important aspects of HBase data modeling (schema design). A poorly designed row key can create hot spotting, where a large amount of client traffic is directed at one or a few nodes of a cluster.
The row key should be defined in a way that allows related rows to be stored near each other. These related rows will be retrieved by queries and as long as they are stored near each other you should experience good performance. Otherwise the performance of your system will be impacted.
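One common way to avoid hot spotting with monotonically increasing keys (e.g. timestamps) is salting: prefix the key with a deterministic, hash-derived bucket so that sequential writes spread across regions. A sketch; the bucket count and helper name are illustrative:

```python
BUCKETS = 4  # illustrative; in practice tuned to the number of regions

def salted_key(key: bytes) -> bytes:
    """Prefix the row key with a deterministic bucket to spread writes.

    The prefix must be deterministic so readers can recompute it from
    the original key.
    """
    bucket = sum(key) % BUCKETS  # simple deterministic hash of the key bytes
    return b"%d-" % bucket + key

# Sequential keys now land in different buckets instead of one hot region.
for ts in (b"1000", b"1001", b"1002"):
    print(salted_key(ts))
```

The trade-off: range scans over the original key order now require one scan per bucket, merged on the client side.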
Table
Data is stored in tables that have rows and columns. Table names are strings composed of characters that are safe to use in file system paths. Tables are logically grouped into namespaces, for example by application, user, or access-control requirements. A namespace is analogous to a database in relational database management systems. Namespace membership is determined at table creation time, when the table is fully named.
As previously mentioned, HBase tables are multidimensional maps. A table has multiple rows. A row consists of a sortable row key and one or more columns with values associated with them. Rows are uniquely identified by their row key. A row key has no data type and is always treated as an array of bytes. It is the equivalent of a primary key in a relational database table, and it is the only way to access a row. The number of columns per row is arbitrary and can vary from row to row. Columns are organized into column families. At a conceptual level, a table may be viewed as a sparse set of rows that are physically stored by column family.
Column Family
Column families are specified when a table is created. They should be carefully designed before a table is created since it would be either impossible or difficult to change them later.
Column family names are strings composed of characters that are safe to use in file system paths.
All columns in a column family are stored and sorted together in the same HFile.
Column families group columns together physically and logically, and they are usually used for performance reasons. A column family has a set of parameters that specify its storage (e.g., caching, compression, etc.). All tuning and storage specifications are done at the column family level. It is important that all column family members have the same or similar access patterns and sizes.
Due to shortcomings in the current HBase implementation, a large number of column families in a single table is not properly supported; that number should stay in the low tens. Most of the time, up to three column families work fine without any significant performance drawback. Ideally, you should go with a single column family. Column family names should be as short as possible, preferably one character.
A column family can have an arbitrary number of columns denoted by a column qualifier which is like a column’s label. For example:
{
  "row1": {
    "1": {"color": "green", "size": 25},
    "2": {"weight": 52, "size": 18}
  },
  "row2": {
    "1": {"color": "blue"},
    "2": {"height": 192, "size": 43}
  }
}
As you can see in the example above, the same column family (e.g., "1") in two rows can have different columns. In row "row1" it has the columns "color" and "size", while in row "row2" it has only the "color" column; another row could have columns that are none of the above. Since rows can have different columns within their column families, there is no single way to query for a list of all columns in all column families. This means that you have to do a full table scan.
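The point can be made concrete with a small sketch (plain dicts, illustrative names, not the HBase API): because each row may define its own columns, discovering every qualifier in a column family means visiting every row.

```python
# Each row defines its own set of columns within a column family.
rows = {
    "row1": {"1": {"color": "green", "size": 25}},
    "row2": {"1": {"color": "blue", "shape": "round"}},
}

def all_qualifiers(table, cf):
    """Collect every column qualifier used in `cf` -- a full table scan."""
    qualifiers = set()
    for row in table.values():          # must visit every row
        qualifiers |= set(row.get(cf, {}))
    return qualifiers

print(sorted(all_qualifiers(rows, "1")))  # ['color', 'shape', 'size']
```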
There is no specific limit on the number of columns in a column family; you can actually have millions of columns in a single column family.
Column
Columns are physically co-located within column families. A column is identified by a column family and a column qualifier separated by a colon character (:), for example, courses:math. The column family prefix must be composed of printable characters. Column qualifiers (columns) do not have to be defined at schema definition time and can be added on the fly while the database is up and running.
A column qualifier is an index for a given piece of data within a column family; data within a column family is addressed via its column qualifier. Column qualifiers are mutable and may vary between rows. They have no data types and are always treated as arrays of bytes.
A row key, column family, and column qualifier together identify a cell, which holds a value and a timestamp representing the value's version. Values also have no data types and are always treated as arrays of bytes. A timestamp is recorded for each value; it is the time on the region server when the value was written.
A cell's values are stored in descending order by timestamp. When values are retrieved and no timestamp is provided, HBase returns the cell value with the latest (most recent) timestamp. If a timestamp is not specified during a write, the current timestamp is used.
The maximum number of versions (timestamps) to store for a given column is part of the column schema. It is specified at table creation and can also be changed via the alter table command. The default value is 1. A minimum number of versions can also be set per column family, and a maximum number of versions per column can be set globally.
HBase does not overwrite row values; it stores different values per row by timestamp and column qualifier. Extra versions above the configured maximum are removed during major compactions. Unless it is necessary, a very high maximum number of versions is not recommended, since it significantly increases HFile size.
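What a major compaction does to extra versions can be sketched like this (a toy model; `MAX_VERSIONS` and `compact` are illustrative names, and real compaction operates on whole HFiles, not single cells):

```python
MAX_VERSIONS = 3  # per-column-family setting; the default is 1

def compact(versions: dict) -> dict:
    """Keep only the newest MAX_VERSIONS entries of a {timestamp: value} cell."""
    newest = sorted(versions, reverse=True)[:MAX_VERSIONS]
    return {ts: versions[ts] for ts in newest}

cell = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
print(compact(cell))  # timestamps 5, 4, 3 survive; 1 and 2 are dropped
```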
It is worth mentioning that column metadata is only stored in the internal key/value instances of a column family. You have to keep track of column names yourself, since HBase supports a very high number of columns per row and columns can differ between rows. If you do not record the column names and you forget them, you will have to retrieve all rows from a column family in order to find them out.
Supported Data Types
Everything that can be converted to an array of bytes can be stored in HBase. The stored values could be
- Strings
- Numbers
- Counters (atomic increment of numbers)
- Complex objects
- Images
as long as they can be rendered as arrays of bytes.
The size of values should be reasonable. Storing millions of objects 10–50 MB in size might not make sense.
Namespace
A namespace is a logical grouping of tables, analogous to a database in relational database management systems. Namespaces are useful when you are dealing with many tables, and they are also used to apply security rules to all tables in a namespace. If you do not specify a namespace when you create a table, the table is automatically added to the "default" namespace.
Namespace membership is determined at table creation time, when the table is fully named as
<table namespace>:<table qualifier>
Joins
HBase does not support joins. Generally there are two options to support joins:
- denormalize the data upon writing to HBase
- have lookup tables and implement the join between HBase tables either in your application code or in MapReduce code.
The best join approach depends on how you will be using your data and running your application.
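The lookup-table approach can be sketched with plain dicts (a client-side toy model; the table and column names are illustrative):

```python
# Two "tables" modeled as dicts: orders reference customers by id.
orders = {
    b"order1": {b"cf:customer_id": b"c1", b"cf:total": b"100"},
    b"order2": {b"cf:customer_id": b"c2", b"cf:total": b"250"},
}
customers = {
    b"c1": {b"cf:name": b"Alice"},
    b"c2": {b"cf:name": b"Bob"},
}

def join_orders_with_customers():
    """Scan orders and, for each one, do a point lookup into customers."""
    result = []
    for order_key, row in orders.items():
        customer = customers[row[b"cf:customer_id"]]
        result.append((order_key, customer[b"cf:name"], row[b"cf:total"]))
    return result

print(join_orders_with_customers())
```

With denormalization, the customer name would instead be written into the order row at write time, trading storage for read-time lookups.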
HBase Schema Creation
An HBase schema can be created either with the HBase shell or through the Java API. When changes are made to a table, it has to be disabled until the changes are complete.
Changes to tables and column families take effect when the next major compaction runs and the HFiles are re-written.
Summary
HBase’s column-oriented architecture allows for huge, wide, sparse tables.
HBase is strongly consistent at the row level, since a single region always manages an entire row.
Multiversioning can help us avoid edit conflicts caused by concurrent data access and also retain data for as long as it is needed, provided enough storage is available.
When data schemas are properly designed, HBase provides excellent random read performance and near-optimal write operations in terms of I/O channel saturation.
HBase makes an efficient use of storage by supporting pluggable compression algorithms.
HBase extends the Bigtable model, which only considers a single index. In addition, it provides push-down predicates, that is, filters, reducing data transferred over the network.
Designing the schema so as to completely avoid explicit locking, combined with row-level atomicity, gives you the ability to scale your system without any notable effect on read or write performance.
There are many tuning techniques that can be used to improve the HBase experience, and they may be the topic of a future post.