Dan Ports's publications

[1] Dan R. K. Ports and Jacob Nelson. When should the network be the computer? In Proceedings of the 17th Workshop on Hot Topics in Operating Systems (HotOS '19), Bertinoro, Italy, May 2019. ACM. [ bib | slides (.pdf) | .pdf ]
Researchers have repurposed programmable network devices to place small amounts of application computation in the network, sometimes yielding orders-of-magnitude performance gains. At the same time, effectively using these devices requires careful use of limited resources and managing deployment challenges.

This paper provides a framework for principled use of in-network processing. We provide a set of guidelines for building robust and deployable in-network primitives, along with a taxonomy to help identify which applications can benefit from in-network processing and what types of devices they should use.

[2] Ellis Michael and Dan R. K. Ports. Towards causal datacenter networks. In Proceedings of the 2018 Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '18), Porto, Portugal, April 2018. ACM. [ bib | .pdf ]
Traditionally, distributed systems conservatively assume an asynchronous network. However, recent work on the co-design of networks and distributed systems has shown that stronger ordering properties are achievable in datacenter networks and yield performance improvements for the distributed systems they support. We build on that trend and ask whether it is possible for the datacenter network to order all messages in a protocol-agnostic way. This approach, which we call omnisequencing, would ensure causal delivery of all messages, making consistency a network-level guarantee.
[3] Helga Gudmundsdottir, Babak Salimi, Magdalena Balazinska, Dan R. K. Ports, and Dan Suciu. A demonstration of interactive analysis of performance measurements with Viska. In Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, May 2017. ACM. Demonstration. [ bib | .pdf ]
The ultimate goal of system performance analysis is to identify the underlying causes for performance differences between different systems and different workloads. We make it easier to achieve this goal with Viska, a new tool for generating and interpreting performance measurement results. Viska leverages cutting-edge techniques from big data analytics and data visualization to aid and automate this analysis, and helps users derive meaningful and statistically sound conclusions using state-of-the-art causal inference and hypothesis testing techniques.
[4] Jialin Li, Ellis Michael, and Dan R. K. Ports. Eris: Coordination-free consistent transactions using network multi-sequencing. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17), Shanghai, China, October 2017. ACM. [ bib | .pdf ]
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault-tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a very different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas. It provides atomicity, consistency, and fault-tolerance with less than 10% overhead -- achieving throughput 4.5--35x higher and latency 72--80% lower than a conventional design on standard benchmarks.
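The multi-sequencing idea can be sketched in a few lines of Python. This is a toy model under my own naming, not Eris's actual protocol or interfaces: a sequencer atomically stamps each transaction with the next sequence number of every shard it touches, so each shard sees a total order over the transactions affecting it and detects a dropped message as a gap in that order.

```python
# Toy sketch of multi-sequencing (hypothetical interfaces, not Eris's protocol).

class MultiSequencer:
    def __init__(self):
        self.counters = {}   # shard id -> last sequence number issued

    def stamp(self, txn_id, shards):
        # Atomically assign the next sequence number on every shard touched.
        stamp = {}
        for s in shards:
            self.counters[s] = self.counters.get(s, 0) + 1
            stamp[s] = self.counters[s]
        return (txn_id, stamp)

class Shard:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.expected = 1
        self.log = []

    def deliver(self, txn_id, stamp):
        seq = stamp[self.shard_id]
        if seq != self.expected:
            return "gap"     # the real protocol would recover the missing message
        self.expected += 1
        self.log.append(txn_id)
        return "ok"

seq = MultiSequencer()
a, b = Shard("A"), Shard("B")
t1 = seq.stamp("t1", ["A", "B"])
t2 = seq.stamp("t2", ["A"])
a.deliver(*t1); b.deliver(*t1)
a.deliver(*t2)
t3 = seq.stamp("t3", ["B"])   # suppose the network drops t3 on its way to B
t4 = seq.stamp("t4", ["B"])
status = b.deliver(*t4)       # B sees sequence 3 while expecting 2: a gap
```

The point of the per-shard stamps is that ordering and loss detection become local checks at each shard, with no cross-shard coordination on the commit path.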
[5] Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres. Recovering shared objects without stable storage. In Proceedings of the 31st International Symposium on Distributed Computing (DISC '17), Vienna, Austria, October 2017. [ bib | .pdf ]
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover.

To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations.

We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-reader multi-writer atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-reader, single-writer atomic set---a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery with Stable Storage model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.
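The crash-vector mechanism can be illustrated with a short sketch. This is hypothetical code, not the paper's pseudocode: every node keeps an incarnation number bumped on each recovery, a crash vector maps node to the highest incarnation seen, and vectors merge by pointwise maximum. A set of quorum responses is crash-consistent only if the merged vector shows no responder recovered after sending its response.

```python
# Rough sketch of crash-consistent quorum detection via crash vectors
# (illustrative names, not the paper's algorithm as stated).

def merge(vectors):
    out = {}
    for v in vectors:
        for node, inc in v.items():
            out[node] = max(out.get(node, 0), inc)
    return out

def crash_consistent(responses):
    """responses: list of (node_id, incarnation_at_response, crash_vector)."""
    merged = merge([cv for _, _, cv in responses])
    # Crash-consistent iff no responder is revealed, by the merged vector,
    # to have a higher incarnation than the one it responded in.
    return all(merged.get(node, inc) <= inc for node, inc, _ in responses)

# All three nodes respond in their current incarnations: crash-consistent.
ok = crash_consistent([(1, 1, {1: 1}), (2, 1, {2: 1}), (3, 1, {3: 1})])

# Node 3's vector reveals that node 2 (which responded in incarnation 1) has
# since crashed and recovered to incarnation 2, so node 2 may have lost the
# state backing its response: not crash-consistent.
bad = crash_consistent([(1, 1, {1: 1}), (2, 1, {2: 1}), (3, 1, {2: 2, 3: 1})])
```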

[6] Babak Salimi, Corey Cole, Dan R. K. Ports, and Dan Suciu. ZaliQL: Causal inference from observational data at scale. In Proceedings of the 43rd International Conference on Very Large Data Bases (VLDB '17), August 2017. Demonstration. [ bib | .pdf ]
Causal inference from observational data is a subject of active research and development in statistics and computer science. Many statistical software packages have been developed for this purpose. However, these toolkits do not scale to large datasets. We propose and demonstrate ZaliQL: a SQL-based framework for drawing causal inference from observational data. ZaliQL supports the state-of-the-art methods for causal inference and runs at scale within the PostgreSQL database system. In addition, we built a visual interface to wrap around ZaliQL. In our demonstration, we will use this GUI to show a live investigation of the causal effect of different weather conditions on flight delays.
[7] Helga Gudmundsdottir, Babak Salimi, Magdalena Balazinska, Dan R. K. Ports, and Dan Suciu. Viska: Enabling interactive analysis of performance measurements. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA, November 2016. USENIX. Poster. [ bib ]
Much of systems research consists of performance analysis -- to learn when one system outperforms another, to identify architectural choices responsible for the difference, or to identify performance anomalies in particular workloads, for example. However, despite recent advances in data analytics and interactive data visualization, the tools we use for performance analysis remain remarkably primitive.

The Viska project aims to close this gap by providing a new toolkit for systems researchers to generate and interpret performance measurement results, helping users derive meaningful and statistically sound conclusions. Viska leverages cutting-edge techniques from big data analytics and data visualization to aid and automate this analysis.

[8] Brandon Holt, James Bornholt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze. Disciplined inconsistency with consistency types. In Proceedings of the 7th Symposium on Cloud Computing (SOCC '16), Santa Clara, CA, USA, October 2016. ACM. [ bib | .pdf ]
Distributed applications and web services, such as online stores or social networks, are expected to be scalable, available, responsive, and fault-tolerant. To meet these steep requirements in the face of high round-trip latencies, network partitions, server failures, and load spikes, applications use eventually consistent datastores that allow them to weaken the consistency of some data. However, making this transition is highly error-prone because relaxed consistency models are notoriously difficult to understand and test.

In this work, we propose a new programming model for distributed data that makes consistency properties explicit and uses a type system to enforce consistency safety. With the Inconsistent, Performance-bound, Approximate (IPA) storage system, programmers specify performance targets and correctness requirements as constraints on persistent data structures and handle uncertainty about the result of datastore reads using new consistency types. We implement a prototype of this model in Scala on top of an existing datastore, Cassandra, and use it to make performance/correctness tradeoffs in two applications: a ticket sales service and a Twitter clone. Our evaluation shows that IPA prevents consistency-based programming errors and adapts consistency automatically in response to changing network conditions, performing comparably to weak consistency and 2-10x faster than strong consistency.
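The flavor of consistency types can be conveyed with a minimal Python sketch. The paper's system is built in Scala on Cassandra; the class names below are illustrative, not IPA's API. The key idea is that a weakly consistent read comes back wrapped, so the programmer must acknowledge the uncertainty explicitly instead of treating the value as ground truth.

```python
# Minimal sketch of IPA-style consistency types (hypothetical names).

class Consistent:
    """Result of a strongly consistent read: safe to use directly."""
    def __init__(self, value):
        self.value = value
    def get(self):
        return self.value

class Interval:
    """Error-bounded result: the true value is guaranteed to lie in [low, high]."""
    def __init__(self, low, high):
        self.low, self.high = low, high
    def contains(self, v):
        return self.low <= v <= self.high

class Inconsistent:
    """Possibly-stale result: must be endorsed before use."""
    def __init__(self, value):
        self._value = value
    def endorse(self):
        # The explicit call marks every place the application accepts staleness.
        return self._value

# A ticket-count read under a latency bound might return an interval rather
# than an exact number:
remaining = Interval(3, 7)
sold_out = remaining.high == 0   # only definitely sold out if the bound says so
```

Because the wrapper types differ, code that forgets to handle the weakly consistent case fails to type-check (in the Scala original) rather than silently misbehaving.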

[9] Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA, November 2016. USENIX. [ bib | .pdf ]
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery -- using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 us of an unreplicated system -- providing replication without the performance cost.
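The OUM contract can be sketched in a few lines. This is a toy model with invented names, not NOPaxos's implementation: a sequencer stamps every multicast with a global sequence number, so receivers get messages in order, and a lost message surfaces as an explicit gap rather than a silent omission.

```python
# Toy sketch of ordered unreliable multicast (illustrative, not NOPaxos code).
import itertools

class OumSequencer:
    def __init__(self):
        # In NOPaxos this stamping is done in the network, e.g. by a switch.
        self._seq = itertools.count(1)

    def multicast(self, msg):
        return (next(self._seq), msg)

class OumReceiver:
    def __init__(self):
        self.next_seq = 1
        self.delivered = []

    def receive(self, seq, msg):
        # Ordering is guaranteed by sequence numbers; reliability is not.
        # A missing number becomes a drop notification (None) that the
        # replication protocol resolves, instead of running full consensus.
        while self.next_seq < seq:
            self.delivered.append((self.next_seq, None))
            self.next_seq += 1
        self.delivered.append((seq, msg))
        self.next_seq = seq + 1

s = OumSequencer()
r = OumReceiver()
m1 = s.multicast("a")
m2 = s.multicast("b")   # lost in the network
m3 = s.multicast("c")
r.receive(*m1)
r.receive(*m3)          # the gap at sequence 2 is detected, not silently skipped
```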
[10] Brandon Holt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze. Claret: Using data types for highly concurrent distributed transactions. In Proceedings of the 2015 Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '15), Bordeaux, France, April 2015. ACM. [ bib | .pdf ]
Out of the many NoSQL databases in use today, some that provide simple data structures for records, such as Redis and MongoDB, are now becoming popular. Building applications out of these data types provides a way to communicate intent to the database system without sacrificing flexibility or committing to a fixed schema. Currently this capability is leveraged in limited ways, such as to ensure related values are co-located, or for atomic updates. There are many ways data types can be used to make databases more efficient that are not yet being exploited.

We explore several ways of leveraging abstract data type (ADT) semantics in databases, focusing primarily on commutativity. Using a Twitter clone as a case study, we show that using commutativity can reduce transaction abort rates for high-contention, update-heavy workloads that arise in real social networks. We conclude that ADTs are a good abstraction for database records, providing a safe and expressive programming model with ample opportunities for optimization, making databases more safe and scalable.
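A small sketch shows why commutativity helps. These names are hypothetical, not Claret's API: with an opaque get/put interface, any two concurrent writes to the same record conflict, but if the store knows the record is a counter ADT and both operations are commutative adds, the transactions can commit in either order with the same effect.

```python
# Illustrative commutativity table for a counter ADT (hypothetical names).

COMMUTES = {
    ("add", "add"): True,    # increments reorder freely
    ("read", "read"): True,  # reads never change state
    ("add", "read"): False,  # a read observes whether the add happened
    ("read", "add"): False,
}

def must_abort(op1, op2):
    """Conflict check for two concurrent operations on the same counter record."""
    return not COMMUTES[(op1, op2)]

both_increments = must_abort("add", "add")     # e.g. two retweet-count bumps
read_vs_increment = must_abort("read", "add")  # must serialize or abort
```

Under an update-heavy, high-contention workload, the `("add", "add")` entry is the common case, which is why exposing ADT semantics cuts abort rates.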

[11] Brandon Holt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze. Claret: Using data types for highly concurrent distributed transactions. In Proceedings of the 10th ACM SIGOPS European Conference on Computer Systems (EuroSys '15), Bordeaux, France, April 2015. ACM. Poster. [ bib ]
[12] Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy. Designing distributed systems using approximate synchrony in datacenter networks. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15), Oakland, CA, USA, May 2015. USENIX. [ bib | slides (.pdf) | .pdf ]
Distributed systems are traditionally designed independently from the underlying network, making worst-case assumptions (e.g., complete asynchrony) about its behavior. However, many of today's distributed applications are deployed in data centers, where the network is more reliable, predictable, and extensible. In these environments, it is possible to co-design distributed systems with their network layer, and doing so can offer substantial benefits.

This paper explores network-level mechanisms for providing Mostly-Ordered Multicast (MOM): a best-effort ordering property for concurrent multicast operations. Using this primitive, we design Speculative Paxos, a state machine replication protocol that relies on the network to order requests in the normal case. This approach leads to substantial performance benefits: under realistic data center conditions, Speculative Paxos can provide 40% lower latency and 2.6x higher throughput than the standard Paxos protocol. It offers lower latency than a latency-optimized protocol (Fast Paxos) with the same throughput as a throughput-optimized protocol (batching).

[13] Naveen Kr. Sharma, Brandon Holt, Irene Zhang, Dan R. K. Ports, and Marcos Aguilera. Transtorm: A benchmark suite for transactional key-value storage systems. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, October 2015. ACM. Poster. [ bib ]
[14] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Building consistent transactions with inconsistent replication. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, October 2015. ACM. [ bib | .pdf ]
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications.

We present TAPIR -- the Transactional Application Protocol for Inconsistent Replication -- the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.

[15] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. Tales of the tail: Hardware, OS, and application-level sources of tail latency. In Proceedings of the 5th Symposium on Cloud Computing (SOCC '14), Seattle, WA, USA, November 2014. ACM. [ bib | .pdf ]
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high throughput servers executing on multi-core machines.

We model these network services as a queuing system in order to establish the best-achievable latency distribution. Using fine-grained measurements of three different servers (a null RPC service, Memcached, and Nginx) on Linux, we then explore why these servers exhibit significantly worse tail latencies than queuing models alone predict. The underlying causes include interference from background processes, request re-ordering caused by poor scheduling or constrained concurrency models, suboptimal interrupt routing, CPU power saving mechanisms, and NUMA effects.

We systematically eliminate these factors and show that Memcached can achieve a median latency of 11 us and a 99.9th percentile latency of 32 us at 80% utilization on a four-core system. In comparison, a naive deployment of Memcached at the same utilization on a single-core system has a median latency of 100 us and a 99.9th percentile latency of 5 ms. Finally, we demonstrate that tradeoffs exist between throughput, energy, and tail latency.
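The role of the queuing model can be illustrated with a toy simulation. This is not the paper's exact model; it is a sketch of FCFS servers with exponential arrivals and service times, showing that at equal 80% utilization, multiple servers sharing one queue achieve far lower median and 99.9th-percentile latency than a single server.

```python
# Toy FCFS queuing simulation of response-time percentiles (illustrative only).
import heapq
import random

def simulate(servers, utilization, n=100_000, seed=1):
    rng = random.Random(seed)
    service_rate = 1.0                      # mean service time = 1 time unit
    arrival_rate = utilization * servers * service_rate
    free = [0.0] * servers                  # time each server next becomes free
    heapq.heapify(free)
    t = 0.0
    latencies = []
    for _ in range(n):
        t += rng.expovariate(arrival_rate)  # Poisson arrivals
        start = max(t, heapq.heappop(free)) # wait for the earliest-free server
        finish = start + rng.expovariate(service_rate)
        heapq.heappush(free, finish)
        latencies.append(finish - t)        # response time = wait + service
    latencies.sort()
    return latencies[n // 2], latencies[int(n * 0.999)]

med1, tail1 = simulate(servers=1, utilization=0.8)
med4, tail4 = simulate(servers=4, utilization=0.8)
```

Deviations between such a model and measured percentiles are exactly the "worse than queuing models alone predict" gap the paper attributes to scheduling, interrupts, power saving, and NUMA.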

[16] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), Broomfield, CO, USA, October 2014. USENIX. [ bib | .pdf ]
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2-5x in latency and 9x in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.
[17] Simon Peter, Jialin Li, Doug Woos, Irene Zhang, Dan R. K. Ports, Thomas Anderson, Arvind Krishnamurthy, and Mark Zbikowski. Towards high-performance application-level storage management. In Proceedings of the 5th Workshop on Hot Topics in Storage and File Systems (HotStorage '14), Philadelphia, PA, USA, June 2014. USENIX. [ bib | .pdf ]
We propose a radical re-architecture of the traditional operating system storage stack to move the kernel off the data path. Leveraging virtualized I/O hardware for disk and flash storage, most read and write I/O operations go directly to application code. The kernel dynamically allocates extents, manages the virtual to physical binding, and performs name translation. The benefit is to dramatically reduce the CPU overhead of storage operations while improving application flexibility.
[18] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Optimistic replicated two-phase commit. In Proceedings of the 5th Asia-Pacific Workshop on Systems (APSYS '14), Beijing, China, June 2014. Poster and extended abstract. [ bib ]
[19] Danyang Zhuo, Qiao Zhang, Dan R. K. Ports, Arvind Krishnamurthy, and Thomas Anderson. Machine fault tolerance for reliable datacenter systems. In Proceedings of the 5th Asia-Pacific Workshop on Systems (APSYS '14), Beijing, China, June 2014. [ bib | .pdf ]
Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at data center scale to significantly affect overall system reliability and availability. In this paper, we propose a new failure model, called Machine Fault Tolerance, and a new abstraction, a replicated write-once trusted table, to provide improved resilience to these types of failures. Since most machine failures manifest in application server and operating system code, we assume a Byzantine model for those parts of the system. However, by assuming that the hypervisor and network are trustworthy, we are able to reduce the overhead of machine-fault masking to be close to that of non-Byzantine Paxos.
[20] Winnie Cheng, Dan R. K. Ports, David Schultz, Victoria Popic, Aaron Blankstein, James Cowling, Dorothy Curtis, Liuba Shrira, and Barbara Liskov. Abstractions for usable information flow control in Aeolus. In Proceedings of the 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 2012. USENIX. [ bib | slides (.pdf) | .pdf ]
Despite the increasing importance of protecting confidential data, building secure software remains as challenging as ever. This paper describes Aeolus, a new platform for building secure distributed applications. Aeolus uses information flow control to provide confidentiality and data integrity. It differs from previous information flow control systems in a way that we believe makes it easier to understand and use. Aeolus uses a new, simpler security model, the first to combine a standard principal-based scheme for authority management with thread-granularity information flow tracking. The principal hierarchy matches the way developers already reason about authority and access control, and the coarse-grained information flow tracking eases the task of defining a program's security restrictions. In addition, Aeolus provides a number of new mechanisms (authority closures, compound tags, boxes, and shared volatile state) that support common design patterns in secure application design.
[21] Dan R. K. Ports, Austin T. Clements, Irene Zhang, Samuel Madden, and Barbara Liskov. Transactional consistency and automatic management in an application data cache. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada, October 2010. USENIX. [ bib | slides (.pdf) | .ps.gz | .pdf ]
Distributed in-memory application data caches like memcached are a popular solution for scaling database-driven web sites. These systems are easy to add to existing deployments, and increase performance significantly by reducing load on both the database and application servers. Unfortunately, such caches do not integrate well with the database or the application. They cannot maintain transactional consistency across the entire system, violating the isolation properties of the underlying database. They leave the application responsible for locating data in the cache and keeping it up to date, a frequent source of application complexity and programming errors.

Addressing both of these problems, we introduce a transactional cache, TxCache, with a simple programming model. TxCache ensures that any data seen within a transaction, whether it comes from the cache or the database, reflects a slightly stale but consistent snapshot of the database. TxCache makes it easy to add caching to an application by simply designating functions as cacheable; it automatically caches their results, and invalidates the cached data as the underlying database changes. Our experiments found that adding TxCache increased the throughput of a web application by up to 5.2x, only slightly less than a non-transactional cache, showing that consistency does not have to come at the price of performance.
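The cacheable-function idea can be sketched in Python. TxCache's real interface differs, and the dependency tracking here is deliberately simplified: the library, not the application, keys the cache by (function, arguments), records which data each call read, and invalidates entries when that data changes.

```python
# Illustrative sketch of library-managed cacheable functions (not TxCache's API).

class Cache:
    def __init__(self):
        self.entries = {}   # (function name, args) -> cached result
        self.deps = {}      # db key -> set of cache keys that read it

    def cacheable(self, fn):
        def wrapper(db, *args):
            key = (fn.__name__, args)
            if key in self.entries:
                return self.entries[key]            # hit: no recomputation
            reads = set()
            value = fn(TrackingDB(db, reads), *args)
            self.entries[key] = value
            for r in reads:                          # remember what was read
                self.deps.setdefault(r, set()).add(key)
            return value
        return wrapper

    def invalidate(self, db_key):
        # Drop every cached result that depended on this database key.
        for ckey in self.deps.pop(db_key, ()):
            self.entries.pop(ckey, None)

class TrackingDB:
    """Wraps a dict-like store and records which keys a cacheable call reads."""
    def __init__(self, data, reads):
        self.data, self.reads = data, reads
    def get(self, k):
        self.reads.add(k)
        return self.data[k]

cache = Cache()

@cache.cacheable
def item_price(db, item):
    return db.get(item) * 2      # some derived, cacheable computation

db = {"widget": 10}
p1 = item_price(db, "widget")    # miss: computes and caches 20
db["widget"] = 50
p2 = item_price(db, "widget")    # hit: still 20 until invalidated
cache.invalidate("widget")
p3 = item_price(db, "widget")    # recomputed against the new value: 100
```

In the real system, invalidations are driven by the database and tagged so that cached reads still reflect a consistent snapshot; this sketch only shows the programming-model side.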

[22] James Cowling, Dan R. K. Ports, Barbara Liskov, Raluca Ada Popa, and Abhijeet Gaikwad. Census: Location-aware membership management for large-scale distributed systems. In Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, CA, USA, June 2009. USENIX. [ bib | slides (.pdf) | .ps.gz | .pdf ]
We present Census, a platform for building large-scale distributed applications. Census provides a membership service and a multicast mechanism. The membership service provides every node with a consistent view of the system membership, which may be global or partitioned into location-based regions. Census distributes membership updates with low overhead, propagates changes promptly, and is resilient to both crashes and Byzantine failures. We believe that Census is the first system to provide a consistent membership abstraction at very large scale, greatly simplifying the design of applications built atop large deployments such as multi-site data centers.

Census builds on a novel multicast mechanism that is closely integrated with the membership service. It organizes nodes into a reliable overlay composed of multiple distribution trees, using network coordinates to minimize latency. Unlike other multicast systems, it avoids the cost of using distributed algorithms to construct and maintain trees. Instead, each node independently produces the same trees from the consistent membership view. Census uses this multicast mechanism to distribute membership updates, along with application-provided messages.

We evaluate the platform under simulation and on a real-world deployment on PlanetLab. We find that it imposes minimal bandwidth overhead, is able to react quickly to node failures and changes in the system membership, and can scale to substantial size.

[23] Dan R. K. Ports, Austin T. Clements, Irene Y. Zhang, Samuel Madden, and Barbara Liskov. Transactional caching of application data using recent snapshots. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09), Big Sky, MT, USA, October 2009. ACM. Work in Progress report. [ bib | slides (.pdf) | .ps.gz | .pdf ]
Many of today's well-known websites use application data caches to reduce the bottleneck load on the database, as well as the computational load on the application servers. Distributed in-memory shared caches, exemplified by memcached, are one popular approach. These caches typically provide a get/put interface, akin to a distributed hash table; the application chooses what data to keep in the cache and keeps it up to date. By storing the cache entirely in memory and horizontally partitioning among nodes, in-memory caches provide quick response times and ease of scaling.

However, existing caches have no notion of transactional consistency: there is no way to ensure that two accesses to the cache reflect a view of the database at the same point in time. While the backing database goes to great lengths to ensure this property (serializable isolation), the caching layer violates these guarantees. The resulting inconsistencies can have unpleasant consequences if exposed to the user (e.g., attributing the latest bid to the wrong user on an auction site), or add complexity to application code by forcing it to cope with temporarily violated invariants.

We argue that transactional semantics are not incompatible with cache performance and scalability. We introduce a transactional cache, TxCache, which guarantees that all values retrieved from the cache or database during a transaction reflect a consistent snapshot of the database.

TxCache also strives to simplify application design by helping manage the cache. Instead of requiring applications to manually insert and check for values in the cache, TxCache provides a library with which programmers simply designate functions as cacheable, and the library checks the cache for previous calls with the same arguments. In particular, and unlike memcached, TxCache does not require applications to explicitly invalidate cached values; correctly identifying the values to invalidate is difficult because it requires global reasoning about the application.

[24] Xiaoxin Chen, Tal Garfinkel, E. Christopher Lewis, Pratap Subrahmanyam, Carl A. Waldspurger, Dan Boneh, Jeffrey Dwoskin, and Dan R. K. Ports. Overshadow: A virtualization-based approach to retrofitting protection in commodity operating systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08), Seattle, WA, USA, March 2008. ACM. [ bib | .ps.gz | .pdf ]
Commodity operating systems entrusted with securing sensitive data are remarkably large and complex, and consequently, frequently prone to compromise. To address this limitation, we introduce a virtual-machine-based system called Overshadow that protects the privacy and integrity of application data, even in the event of a total OS compromise. Overshadow presents an application with a normal view of its resources, but the OS with an encrypted view. This allows the operating system to carry out the complex task of managing an application's resources, without allowing it to read or modify them. Thus, Overshadow offers a last line of defense for application data.

Overshadow builds on multi-shadowing, a novel mechanism that presents different views of “physical” memory, depending on the context performing the access. This primitive offers an additional dimension of protection beyond the hierarchical protection domains implemented by traditional operating systems and processor architectures.

We present the design and implementation of Overshadow and show how its new protection semantics can be integrated with existing systems. Our design has been fully implemented and used to protect a wide range of unmodified legacy applications running on an unmodified Linux operating system. We evaluate the performance of our implementation, demonstrating that this approach is practical.

[25] Dan R. K. Ports and Tal Garfinkel. Towards application security on untrusted operating systems. In Proceedings of the 3rd Workshop on Hot Topics in Security (HotSec '08), San Jose, CA, USA, July 2008. USENIX. [ bib | slides (.pdf) | .ps.gz | .pdf ]
Complexity in commodity operating systems makes compromises inevitable. Consequently, a great deal of work has examined how to protect security-critical portions of applications from the OS through mechanisms such as microkernels, virtual machine monitors, and new processor architectures. Unfortunately, most work has focused on CPU and memory isolation and neglected OS semantics. Thus, while much is known about how to prevent OS and application processes from modifying each other, far less is understood about how different OS components can undermine application security if they turn malicious.

We consider this problem in the context of our work on Overshadow, a virtual-machine-based system for retrofitting protection in commodity operating systems. We explore how malicious behavior in each major OS subsystem can undermine application security, and present potential mitigations. While our discussion is presented in terms of Overshadow and Linux, many of the problems and solutions are applicable to other systems where trusted applications rely on untrusted, potentially malicious OS components.

[26] Austin T. Clements, Dan R. K. Ports, and David R. Karger. Arpeggio: Metadata searching and content sharing with Chord. In Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS '05), volume 3640 of Lecture Notes in Computer Science, pages 58--68, Ithaca, NY, USA, February 2005. Springer. [ bib | slides (.pdf) | .ps.gz | .pdf ]
Arpeggio is a peer-to-peer file-sharing network based on the Chord lookup primitive. Queries for data whose metadata matches a certain criterion are performed efficiently by using a distributed keyword-set index, augmented with index-side filtering. We introduce index gateways, a technique for minimizing index maintenance overhead. Because file data is large, Arpeggio employs subrings to track live source peers without the cost of inserting the data itself into the network. Finally, we introduce postfetching, a technique that uses information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization, and a content distribution system tuned to the requirements and capabilities of a peer-to-peer network.
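The keyword-set index can be sketched briefly. This is a hypothetical simplification of Arpeggio's scheme: instead of indexing each keyword separately and intersecting result lists at the client, every subset of a file's keywords up to size K becomes an index key, so a multi-keyword query is answered by the single node responsible for that keyword set, which can also filter results index-side.

```python
# Illustrative keyword-set index (simplified; not Arpeggio's implementation).
from itertools import combinations

K = 2  # maximum subset size indexed

def index_keys(keywords):
    kws = sorted(set(keywords))
    keys = []
    for r in range(1, min(K, len(kws)) + 1):
        keys.extend(combinations(kws, r))   # every subset up to size K
    return keys

index = {}

def insert(filename, keywords):
    # In the real system each key hashes to a different Chord node.
    for key in index_keys(keywords):
        index.setdefault(key, set()).add(filename)

def query(keywords):
    # Simplification: use the first K keywords as the lookup key; the node
    # holding that key would filter on any remaining keywords index-side.
    key = tuple(sorted(set(keywords))[:K])
    return index.get(key, set())

insert("song.mp3", ["jazz", "piano", "live"])
insert("talk.mp3", ["jazz", "lecture"])
```

The trade-off is extra index maintenance (each file appears under many keys, bounded by K), which is what the index-gateway technique in the paper is designed to reduce.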
[27] Dan R. K. Ports, Austin T. Clements, and Erik D. Demaine. PersiFS: A versioned file system with an efficient representation. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP '05), Brighton, United Kingdom, October 2005. ACM. Poster and extended abstract. [ bib ]
[28] Austin T. Clements, Dan R. K. Ports, and David R. Karger. Arpeggio: Efficient metadata-based searching and file transfer with DHTs. In Proceedings of the 2nd Project IRIS Student Workshop (ISW '04), Cambridge, MA, USA, November 2004. Poster and extended abstract. [ bib ]
Arpeggio is a peer-to-peer file-sharing network based on the Chord distributed hash table. Queries for files whose metadata matches a certain criterion are performed efficiently by using a distributed keyword-set index, augmented with index-side filtering. We introduce metadata gateways, a technique for minimizing index maintenance overhead. Arpeggio also uses the DHT for indirect storage of file contents, maintaining pointers from content to the live peers that provide it. Finally, we introduce postfetching, a technique that uses information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization, and a content distribution system tuned to the requirements of a peer-to-peer file-sharing network.
[29] Dan R. K. Ports. Arpeggio: Metadata indexing in a structured peer-to-peer network. M.Eng. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, February 2007. [ bib | .ps.gz | .pdf ]
Peer-to-peer networks require an efficient means for performing searches for files by metadata keywords. Unfortunately, current methods usually sacrifice either scalability or recall. Arpeggio is a peer-to-peer file-sharing network that uses the Chord lookup primitive as the basis for constructing a distributed keyword-set index, augmented with index-side filtering, to address this problem. We introduce index gateways, a technique for minimizing index maintenance overhead. Arpeggio also includes a content distribution system for finding source peers for a file; we present a novel system that uses Chord subrings to track live source peers without the cost of inserting the data itself into the network, and supports postfetching: using information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization. We use analysis and simulation results to show that our indexing system has reasonable storage and bandwidth costs, and improves load distribution.
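The keyword-set indexing scheme described in the Arpeggio abstracts admits a compact illustration: instead of indexing each keyword separately and intersecting result lists at query time, a file is registered under every small subset of its keywords, so a multi-keyword query maps to a single index lookup. A minimal sketch, with illustrative names not taken from Arpeggio itself:

```python
from itertools import combinations

def index_keys(keywords, max_set=2):
    """Keyword-set indexing sketch: return every index key (keyword subset,
    up to size max_set) under which a file's metadata would be registered.
    A query for k <= max_set keywords then hits exactly one index node,
    avoiding a distributed intersection of per-keyword result lists."""
    kws = sorted(set(keywords))
    keys = []
    for r in range(1, max_set + 1):
        keys.extend(frozenset(c) for c in combinations(kws, r))
    return keys
```

The tradeoff this sketch makes visible: the number of index entries grows combinatorially with `max_set`, which is why the set size is capped small and why index-side filtering and index gateways matter for keeping maintenance overhead down.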
[30] Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter Richtarik. Scaling distributed machine learning with in-network aggregation. Technical report, KAUST, February 2019. [ bib | .pdf ]
Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and at least by 20% for a number of real-world benchmark models.
[31] Hang Zhu, Zhihao Bai, Jialin Li, Ellis Michael, Dan R. K. Ports, Ion Stoica, and Xin Jin. Harmonia: Near-linear scalability for replicated storage with in-network conflict detection. Proceedings of the VLDB Endowment, 2019. [ bib ]
Distributed storage employs replication to mask failures and improve availability. However, these systems typically exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write conflicts in the network, which enables any replica to serve reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of Harmonia on a cluster of commodity servers connected by a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a variety of replication protocols, including primary-backup, chain replication, Viewstamped Replication, and NOPaxos. Experimental results show that Harmonia improves the throughput of these protocols by up to 10x for a replication factor of 10, providing near-linear scalability up to the limit of our testbed.
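The routing decision at the heart of Harmonia's in-network conflict detection can be sketched in a few lines: the switch tracks the set of objects with writes in flight, sends reads for such objects to the primary, and lets any replica serve reads for clean objects. This toy model only illustrates that decision; the real system implements it in a programmable switch data plane at line rate, and all names here are illustrative:

```python
class ConflictDetector:
    """Toy model of in-network read-write conflict detection."""

    def __init__(self, replicas, primary):
        self.replicas = list(replicas)
        self.primary = primary
        self.dirty = set()   # objects with pending (in-flight) writes
        self.rr = 0          # round-robin counter for spreading clean reads

    def begin_write(self, key):
        self.dirty.add(key)
        return self.primary  # writes always go through the primary

    def end_write(self, key):
        self.dirty.discard(key)  # write fully replicated: key is clean again

    def route_read(self, key):
        if key in self.dirty:    # pending write: only the primary is safe
            return self.primary
        self.rr += 1             # no conflict: any replica may serve the read
        return self.replicas[self.rr % len(self.replicas)]
```

This is why throughput scales with the replica count: under read-heavy workloads most keys are clean at any instant, so most reads fan out across all replicas instead of serializing at one.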
[32] Hang Zhu, Zhihao Bai, Jialin Li, Ellis Michael, Dan R. K. Ports, Ion Stoica, and Xin Jin. Harmonia: Near-linear scalability for replicated storage with in-network conflict detection. arXiv preprint 1904.08964, arXiv, April 2019. [ bib | .pdf ]
Distributed storage employs replication to mask failures and improve availability. However, these systems typically exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write conflicts in the network, which enables any replica to serve reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of Harmonia on a cluster of commodity servers connected by a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a variety of replication protocols, including primary-backup, chain replication, Viewstamped Replication, and NOPaxos. Experimental results show that Harmonia improves the throughput of these protocols by up to 10x for a replication factor of 10, providing near-linear scalability up to the limit of our testbed.
[33] Jialin Li, Jacob Nelson, Xin Jin, and Dan R. K. Ports. Pegasus: Load-aware selective replication with an in-network coherence directory. Technical Report UW-CSE-18-12-01, University of Washington CSE, Seattle, WA, USA, December 2018. [ bib | .pdf ]
High performance distributed storage systems face the challenge of load imbalance caused by skewed and dynamic workloads. This paper introduces Pegasus, a new storage architecture that leverages new-generation programmable switch ASICs to balance load across storage servers. Pegasus uses selective replication of the most popular objects in the data store to distribute load. Using a novel in-network coherence directory, the Pegasus switch tracks and manages the location of replicated objects. This allows it to achieve load-aware forwarding and dynamic rebalancing for replicated keys, while still guaranteeing data coherence. The Pegasus design is practical to implement as it stores only forwarding metadata in the switch data plane. The resulting system improves the 99% tail latency of a distributed in-memory key-value store by more than 95%, and yields up to a 9x throughput improvement under a latency SLO -- results which hold across a large set of workloads with varying degrees of skewness, read/write ratio, and dynamism.
[34] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Building consistent transactions with inconsistent replication. ACM Transactions on Computer Systems, (4):12, December 2018. [ bib | .pdf ]
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications.

We present TAPIR -- the Transactional Application Protocol for Inconsistent Replication -- the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides both better latency and better throughput.

[35] Jialin Li, Ellis Michael, and Dan R. K. Ports. Eris: Coordination-free consistent transactions using network multi-sequencing (extended version). Technical Report UW-CSE-TR-17-10-01, University of Washington CSE, Seattle, WA, USA, October 2017. [ bib | .pdf ]
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault-tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a very different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas. It provides atomicity, consistency, and fault-tolerance with less than 10% overhead -- achieving throughput 4.5--35x higher and latency 72--80% lower than a conventional design on standard benchmarks.
[36] Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres. Recovering shared objects without stable storage (extended version). Technical Report UW-CSE-17-08-01, University of Washington CSE, Seattle, WA, USA, August 2017. [ bib | .pdf ]
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover.

To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations.

We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-reader multi-writer atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-reader, single-writer atomic set---a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery with Stable Storage model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.
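The crash-vector mechanism from the abstract above can be sketched concretely: each process carries an incarnation number (bumped on every recovery) and a vector recording the highest incarnation it has heard of for every process. A quorum of responses is crash-consistent exactly when no responder is known, by anyone in the quorum, to have recovered past the incarnation it responded in. A minimal sketch under those assumptions, with names that are illustrative rather than the paper's:

```python
def crash_consistent(responses):
    """Check whether a set of quorum responses is crash-consistent.

    Each response is (node_id, epoch, vector): `epoch` is the sender's
    incarnation number and `vector` maps node ids to the highest
    incarnation the sender has observed. The quorum is crash-consistent
    if no member recovered (bumped its incarnation) during the quorum
    responses, as witnessed by the combined vectors."""
    latest = {}
    for _, _, vector in responses:
        for node, inc in vector.items():
            latest[node] = max(latest.get(node, 0), inc)
    # Reject the quorum if any responder is known to have recovered
    # past the incarnation in which it responded.
    return all(latest.get(node, 0) <= epoch
               for node, epoch, _ in responses)
```

The point of the check is that an operation acknowledged only by such a quorum cannot silently vanish: a recovery during the responses would show up as a higher incarnation in some vector, and the quorum would be rejected.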

[37] Brandon Holt, James Bornholt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze. Disciplined inconsistency. Technical Report UW-CSE-TR-16-06-01, University of Washington CSE, Seattle, WA, USA, June 2016. [ bib | .pdf ]
Distributed applications and web services, such as online stores or social networks, are expected to be scalable, available, responsive, and fault-tolerant. To meet these steep requirements in the face of high round-trip latencies, network partitions, server failures, and load spikes, applications use eventually consistent datastores that allow them to weaken the consistency of some data. However, making this transition is highly error-prone because relaxed consistency models are notoriously difficult to understand and test.

In this work, we propose a new programming model for distributed data that makes consistency properties explicit and uses a type system to enforce consistency safety. With the Inconsistent, Performance-bound, Approximate (IPA) storage system, programmers specify performance targets and correctness requirements as constraints on persistent data structures and handle uncertainty about the result of datastore reads using new consistency types. We implement a prototype of this model in Scala on top of an existing datastore, Cassandra, and use it to make performance/correctness tradeoffs in two applications: a ticket sales service and a Twitter clone. Our evaluation shows that IPA prevents consistency-based programming errors and adapts consistency automatically in response to changing network conditions, performing comparably to weak consistency and 2-10x faster than strong consistency.

[38] Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering (extended version). Technical Report UW-CSE-TR-16-09-02, University of Washington CSE, Seattle, WA, USA, 2016. [ bib | .pdf ]
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery -- using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 us of an unreplicated system -- providing replication without the performance cost.
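The ordered unreliable multicast (OUM) split described above is simple enough to sketch: a sequencer stamps every message with a monotonically increasing sequence number, and receivers deliver messages in order, surfacing any gap as an explicit drop notification rather than a silent loss or reordering. A toy model, illustrative only (the real sequencer is a network-level component, e.g. a programmable switch):

```python
class Sequencer:
    """OUM sequencer sketch: stamp each multicast message with a
    monotonically increasing sequence number."""
    def __init__(self):
        self.seq = 0

    def stamp(self, msg):
        self.seq += 1
        return (self.seq, msg)

class Receiver:
    """OUM receiver sketch: deliver messages in sequence order; a gap in
    sequence numbers becomes an explicit DROP notification, which the
    replication protocol (not the network) is responsible for repairing."""
    def __init__(self):
        self.next_seq = 1
        self.log = []

    def deliver(self, stamped):
        seq, msg = stamped
        while self.next_seq < seq:            # missing message(s): record drops
            self.log.append(("DROP", self.next_seq))
            self.next_seq += 1
        self.log.append(("MSG", msg))
        self.next_seq = seq + 1
```

Because every receiver sees the same stamped order, replicas that deliver the same prefix without drops already agree, and coordination is needed only to resolve DROP notifications.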
[39] Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres. Providing stable storage for the diskless crash-recovery failure model. Technical Report UW-CSE-TR-16-08-02, University of Washington CSE, Seattle, WA, USA, August 2016. [ bib | .pdf ]
Many classic protocols in the fault tolerant distributed computing literature assume a Crash-Fail model in which processes either are up, or have crashed and are permanently down. While this model is useful, it does not fully capture the difficulties many real systems must contend with. In particular, real-world systems are long-lived and must have a recovery mechanism so that crashed processes can rejoin the system and restore its fault-tolerance. When processes are assumed to have access to stable storage that is persistent across failures, the Crash-Recovery model is trivial. However, because disk failures are common and because having a disk on a protocol's critical path is often a performance concern, diskless recovery protocols are needed. While such protocols do exist in the state machine replication literature, several well-known protocols have flawed recovery mechanisms. We examine these errors to elucidate the problem of diskless recovery and present our own protocol for providing virtual stable storage, transforming any protocol in the Crash-Recovery with stable storage model into a protocol in the Diskless Crash-Recovery model.
[40] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. When is operation ordering required in replicated transactional storage? IEEE Data Engineering Bulletin, 39(1):27--38, March 2016. [ bib | .pdf ]
Today's replicated transactional storage systems typically have a layered architecture, combining protocols for transaction coordination, consistent replication, and concurrency control. These systems generally require costly strongly-consistent replication protocols like Paxos, which assign a total order to all operations. To avoid this cost, we ask whether all replicated operations in these systems need to be strictly ordered. Recent research has yielded replication protocols that can avoid unnecessary ordering, e.g., by exploiting commutative operations, but it is not clear how to apply these to replicated transaction processing systems. We answer this question by analyzing existing transaction processing designs in terms of which replicated operations require ordering and which simply require fault tolerance. We describe how this analysis leads to our recent work on TAPIR, a transaction protocol that efficiently provides strict serializability by using a new replication protocol that provides fault tolerance but not ordering for most operations.
[41] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. ACM Transactions on Computer Systems, 33(4), November 2015. [ bib | .pdf ]
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2-5x in latency and 9x in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.
[42] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Building consistent transactions with inconsistent replication (extended version). Technical Report UW-CSE-2014-12-01 v2, University of Washington CSE, October 2015. [ bib | .pdf ]
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications.

We present TAPIR -- the Transactional Application Protocol for Inconsistent Replication -- the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.

[43] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. Tales of the tail: Hardware, OS, and application-level sources of tail latency. Technical Report UW-CSE-14-04-01, University of Washington CSE, Seattle, WA, USA, April 2014. [ bib | .pdf ]
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high throughput servers executing on multi-core machines.

We first review the basic queuing theory that governs service latency. Using fine-grained measurements of three different servers (a null RPC service, Memcached, and Nginx) on Linux, we then explore why these servers exhibit significantly worse tail latencies than queuing models alone predict. The underlying causes include interference from background processes, request re-ordering caused by poor scheduling or constrained concurrency models, suboptimal interrupt routing, CPU power saving mechanisms, and NUMA effects.

We systematically eliminate these factors and show that Memcached can achieve a median latency of 11 us and a 99.9th percentile latency of 32 us at 75% utilization. In comparison, a naive deployment of Memcached has a median latency of 33 us and a 99.9th percentile latency of 14 ms. Finally, we demonstrate that a tradeoff often exists between throughput and tail latency.
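The queuing-theory baseline the paper measures against can be made concrete with a worked example. For an M/M/1 queue, response time is exponentially distributed with rate mu*(1-rho), so the p-th percentile sojourn time is -ln(1-p)/(mu*(1-rho)). A small sketch (the function name and parameters are illustrative, and real servers deviate from this baseline for exactly the reasons the paper catalogs):

```python
import math

def mm1_latency_percentile(p, service_time_us, utilization):
    """p-th percentile response (sojourn) time of an M/M/1 queue, in
    microseconds: -ln(1-p) / (mu * (1 - rho)), where mu is the service
    rate and rho the utilization."""
    mu = 1.0 / service_time_us
    return -math.log(1.0 - p) / (mu * (1.0 - utilization))

# Example: 10 us mean service time at 75% utilization gives a median of
# ln(2) * 40 (about 27.7 us) and a 99.9th percentile of about 276 us --
# the tail is an order of magnitude above the median even in the ideal model.
```

Measured tails far beyond this baseline (e.g. the naive Memcached deployment's 14 ms 99.9th percentile) are therefore attributable to the non-queuing factors the paper identifies, not to queuing alone.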

[44] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. Technical Report UW-CSE-13-10-01, version 2.0, University of Washington CSE, Seattle, WA, USA, May 2014. [ bib | .pdf ]
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing 2-5x end-to-end latency and 9x throughput improvements for a popular persistent NoSQL store relative to a well-tuned Linux implementation.
[45] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Building consistent transactions with inconsistent replication. Technical Report UW-CSE-2014-12-01, University of Washington CSE, December 2014. [ bib | .pdf ]
[46] Peter Hornyack, Luis Ceze, Steven D. Gribble, Dan R. K. Ports, and Henry M. Levy. A study of virtual memory usage and implications for large memory. Technical report, University of Washington CSE, Seattle, WA, 2013. [ bib | .pdf ]
The mechanisms now used to implement virtual memory -- pages, page tables, and TLBs -- have worked remarkably well for over fifty years. However, they are beginning to show their age due to current trends, such as significant increases in physical memory size, emerging data-intensive applications, and imminent non-volatile main memory. These trends call into question whether page-based address-translation and protection mechanisms remain viable solutions in the future. In this paper, we present a detailed study of how modern applications use virtual memory. Among other topics, our study examines the footprint of mapped regions, the use of memory protection, and the overhead of TLBs. Our results suggest that a segment-based translation mechanism, together with a fine-grained protection mechanism, merit consideration for future systems.
[47] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. Technical Report UW-CSE-13-10-01, University of Washington CSE, Seattle, WA, USA, October 2013. [ bib | .pdf ]
Recent device hardware trends enable a new approach to the design of network servers. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely. The Arrakis kernel operates only in the control plane. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing significant latency and throughput improvements for network server applications relative to a well-tuned Linux implementation.
[48] Dan R. K. Ports. Application-Level Caching with Transactional Consistency. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, June 2012. [ bib | .pdf ]
Distributed in-memory application data caches like memcached are a popular solution for scaling database-driven web sites. These systems increase performance significantly by reducing load on both the database and application servers. Unfortunately, such caches present two challenges for application developers. First, they cannot ensure that the application sees a consistent view of the data within a transaction, violating the isolation properties of the underlying database. Second, they leave the application responsible for locating data in the cache and keeping it up to date, a frequent source of application complexity and programming errors.

This thesis addresses both of these problems in a new cache called TxCache. TxCache is a transactional cache: it ensures that any data seen within a transaction, whether from the cache or the database, reflects a slightly stale but consistent snapshot of the database. TxCache also offers a simple programming model. Application developers simply designate certain functions as cacheable, and the system automatically caches their results and invalidates the cached data as the underlying database changes.

Our experiments found that TxCache can substantially increase the performance of a web application: on the RUBiS benchmark, it increases throughput by up to 5.2x relative to a system without caching. More importantly, on this application, TxCache achieves performance comparable (within 5%) to that of a non-transactional cache, showing that consistency does not have to come at the price of performance.
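The transactional-cache guarantee described in this abstract -- every value a transaction sees, cached or not, reflects one consistent database snapshot -- can be illustrated with a toy validity-interval model. This is a sketch of the idea only, not TxCache's actual implementation, and all names are illustrative:

```python
class TxCache:
    """Toy transactional cache: each entry is tagged with the database
    version interval over which it is valid, and a transaction pinned to
    snapshot version v only sees entries valid at v."""

    def __init__(self):
        self.entries = {}  # key -> (value, valid_from, valid_until or None)

    def put(self, key, value, valid_from):
        self.entries[key] = (value, valid_from, None)

    def invalidate(self, key, version):
        # The underlying data changed at `version`: close the entry's
        # validity interval instead of deleting it, so older snapshots
        # can still use it.
        if key in self.entries:
            v, lo, _ = self.entries[key]
            self.entries[key] = (v, lo, version)

    def get(self, key, snapshot):
        if key not in self.entries:
            return None
        value, lo, hi = self.entries[key]
        if lo <= snapshot and (hi is None or snapshot < hi):
            return value   # consistent with this transaction's snapshot
        return None        # stale for this snapshot: fall back to the database
```

Keeping invalidated entries with a closed interval, rather than evicting them, is what lets transactions on a slightly stale snapshot keep hitting the cache while concurrent updaters proceed.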

[49] Dan R. K. Ports and Kevin Grittner. Serializable snapshot isolation in PostgreSQL. Proceedings of the VLDB Endowment, 5(12):1850--1861, August 2012. [ bib | slides (.pdf) | .pdf ]
This paper describes our experience implementing PostgreSQL's new serializable isolation level. It is based on the recently-developed Serializable Snapshot Isolation (SSI) technique. This is the first implementation of SSI in a production database release as well as the first in a database that did not previously have a lock-based serializable isolation level. We reflect on our experience and describe how we overcame some of the resulting challenges, including the implementation of a new lock manager, a technique for ensuring memory usage is bounded, and integration with other PostgreSQL features. We also introduce an extension to SSI that improves performance for read-only transactions. We evaluate PostgreSQL's serializable isolation level using several benchmarks and show that it achieves performance only slightly below that of snapshot isolation, and significantly outperforms the traditional two-phase locking approach on read-intensive workloads.
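The core runtime check behind SSI can be sketched compactly: a serialization anomaly under snapshot isolation requires a "pivot" transaction with both an incoming and an outgoing rw-antidependency, so the detector tracks those edges and aborts a transaction that acquires both. The sketch below keeps only that core rule; PostgreSQL's implementation tracks reads with SIREAD locks and applies additional conditions (e.g. on commit ordering) to reduce false positives, and the names here are illustrative:

```python
class SSITransaction:
    """Minimal per-transaction state for the dangerous-structure check."""
    def __init__(self, name):
        self.name = name
        self.has_in = False    # some concurrent txn read data this txn wrote
        self.has_out = False   # this txn read data a concurrent txn overwrote

def rw_antidependency(reader, writer):
    """Record a rw-antidependency (reader --rw--> writer): `reader` read a
    version that `writer` concurrently overwrote. Returns the name of the
    transaction chosen for abort if a dangerous structure forms, else None."""
    reader.has_out = True
    writer.has_in = True
    for t in (reader, writer):
        if t.has_in and t.has_out:   # pivot: both edge directions present
            return t.name
    return None
```

A single rw-edge is harmless on its own, which is why SSI aborts far fewer transactions than two-phase locking would block; only the two-edge pivot pattern triggers an abort.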

This file was generated by bibtex2html 1.99.