The Hash Partitioning Pattern aims to improve the performance of parallel and batch queries.
Example
As part of the pre-processing done for Wikipedia Explorer, the HTML/XML representation of a Wikipedia page is converted into a set of POCO objects, and the incoming and outbound links are analysed. The core data is stored in a Windows Azure Table and is used when pre-processing a page, which is a massively CPU-intensive operation. As optimisations, this work is done in parallel and is performed on batches of records to reduce latency.
Challenge
The challenge is to select a partition key that is optimised for both parallel and batch processing.
| PartitionKey (PageId) | RowKey | Title | Status | RenderedUrl |
| --- | --- | --- | --- | --- |
| 123456 | (guid) | St. Laurence’s College | Extracted | … |
| 123457 | (guid) | David Hartman (TV personality) | Converted | … |
| 123458 | (guid) | Liz Parker Evans | Converted | … |
To maximise the performance of a single query, all the data would ideally be in a single partition. To maximise the performance of parallel queries, the data would be distributed across many partitions.
Solution
A solution is to define a finite number of partitions to be used as buckets. The number of buckets is a balance between the performance of data writes and reads and the performance of the parallel and batch queries being performed. The bucket names should not relate directly to the entities stored within them, to ensure a pseudo-random distribution of entities between partitions.
| PartitionKey | RowKey | PageId | Title | Status |
| --- | --- | --- | --- | --- |
| 00 | (guid) | 123456 | St. Laurence’s College | Extracted |
| 00 | (guid) | 123457 | David Hartman (TV personality) | Converted |
| 00 | (guid) | 123458 | Liz Parker Evans | Converted |
In this example 256 buckets (partitions) were used, and each page was pseudo-randomly assigned to a bucket. This provides an approximately uniform distribution of entities across the partitions.
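The article does not say how pages were assigned to buckets, so the following is only a minimal sketch of one way to do it. It assumes a stable SHA-256 digest of the PageId, reduced modulo 256 and written as two hexadecimal characters (in line with the "00" partition keys shown above). The names `bucket_key` and `bucket_histogram` are hypothetical and are not taken from Wikipedia Explorer.

```python
import hashlib
from collections import Counter

BUCKET_COUNT = 256  # the number of buckets mentioned in the example


def bucket_key(page_id: int, bucket_count: int = BUCKET_COUNT) -> str:
    """Map a page id to a two-character hexadecimal bucket name.

    Assumption: a SHA-256 digest is used so that the mapping is stable
    across processes and machines. Two characters are enough as long as
    bucket_count is at most 256.
    """
    digest = hashlib.sha256(str(page_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % bucket_count
    return f"{bucket:02x}"


def bucket_histogram(page_ids):
    """Count how many page ids land in each bucket."""
    return Counter(bucket_key(page_id) for page_id in page_ids)


if __name__ == "__main__":
    # Rough check of how evenly 100,000 consecutive page ids are spread.
    counts = bucket_histogram(range(123456, 123456 + 100_000))
    print(len(counts), min(counts.values()), max(counts.values()))
```

Because the bucket name is derived from a digest rather than from the PageId itself, consecutive page ids should end up scattered across partitions, which is the pseudo-random distribution the pattern relies on.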
Summary
Motivation:
To improve the performance of parallel and batch queries.
Implementation:
By creating a finite number of partitions and an approximately uniform distribution of entities across those partitions.
Uses:
- To improve the performance of parallel entity reads or writes.
- To provide support for random batch reads or writes.
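Both uses can be sketched against an in-memory stand-in for the table. The sketch assumes one single-partition query per bucket, run in parallel on a thread pool, and a batch read taken from one randomly chosen bucket. `TABLE`, `query_partition`, `parallel_query` and `random_batch` are hypothetical names, and the sample rows are spread over three buckets purely for illustration. A real implementation would call the table service instead of reading a dictionary.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the table: partition key -> list of entities.
# The bucket assignments here are invented sample data.
TABLE = {
    "00": [{"PageId": 123456, "Status": "Extracted"}],
    "01": [{"PageId": 123457, "Status": "Converted"}],
    "02": [{"PageId": 123458, "Status": "Converted"}],
}


def query_partition(partition_key, status):
    """Hypothetical single-partition query, filtered by Status."""
    return [e for e in TABLE.get(partition_key, []) if e["Status"] == status]


def parallel_query(status, partition_keys, max_workers=8):
    """Run one single-partition query per bucket and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as partition_keys.
        per_bucket = pool.map(lambda key: query_partition(key, status), partition_keys)
        return [entity for batch in per_bucket for entity in batch]


def random_batch(partition_keys, batch_size, rng=random):
    """Read up to batch_size entities from one randomly chosen bucket."""
    key = rng.choice(list(partition_keys))
    return key, TABLE.get(key, [])[:batch_size]


if __name__ == "__main__":
    print(parallel_query("Converted", sorted(TABLE)))
    print(random_batch(sorted(TABLE), batch_size=10))
```

Since every query targets exactly one partition, the degree of parallelism is bounded by the number of buckets, which is one reason the bucket count is a tuning decision.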
Reference
Azure Application Demonstrations of Wikipedia Explorer and ScrumWall
Windows Azure Tables and Queues Deep Dive
Hash partitioning
Also See
Table Name Key Pattern
Hash Partitioning Pattern
Transactional Master-Item Record Pattern
Chronological Query Pattern
Starts With Query Pattern
Author: Marcus Tillett
@drmarcustillett