Irfan's blog covers the latest in high performance virtual machine technology, focusing on real issues in the day-to-day operations of virtual infrastructures.
Submitted by irfan on Sat, 2011-04-09 19:31.
Folks: just a heads up ... The HotStorage 2011 PC meeting is being held on Monday April 11.
As program chair this year, I'm hosting the meeting at VMware, which will bring together some of the brightest minds in storage research in the world. Building E, here we come. If there are brown-outs, it's probably due to the sheer amount of brainpower assembled there :) Here is the full list.
Several people are flying in, some are local, and yet others are dialing in. When assembling this PC, I had several objectives. One, of course, was to gather the top minds in this fast-moving area of research. Another was diversity, of every type. I'm super impressed with the group of people who agreed to serve on the committee.
As for the program, we had a record number of submissions (60% more than last year), which just goes to show how active this area is. The review rounds are done leading up to Monday's meeting. I've been spending time organizing the papers so we make the best use of the committee's time. There are so many good ones that I'm sure the selection process will not be easy.
As is appropriate for all academic venues of good repute, HotStorage has a very strict conflicts policy. So, even as chair, I'll sit out some paper discussions to avoid even the potential appearance of a conflict with papers from colleagues or ex-colleagues. The same applies to all PC members. Another thing I have done is require extra reviewing for PC member papers, which raises the quality bar for them.
I'll post interesting tidbits from the meeting later.
Submitted by irfan on Sat, 2011-03-05 17:44.
The most awesome thing I've heard in a while is effectively getting paid to run and share vscsiStats data. See Chad Sakac's blog post on this topic.
I should ask for royalties ;)
More seriously, this is very interesting and a win-win. Getting real customer data is always difficult, and Chad has it figured out. Customers, on the other hand, are assured that their data is anonymized (besides, vscsiStats doesn't capture any actual customer data anyway, just the workload characteristics) and get a cool, super-useful tool in return.
Look forward to more vendors trying this ... :)
Submitted by irfan on Sun, 2010-10-24 22:49.
Folks, is it just me or does vscsiStats seem to have gone viral? Here are a couple of the posts that are seeing a lot of retweets.
P.S. I haven't mentioned here that you can follow me on Twitter at @virtualirfan.
Submitted by irfan on Sun, 2010-10-24 22:31.
I'm deeply honored to have been asked by USENIX to serve as the Program Chair for the 3rd Workshop on Hot Topics in Storage and File Systems (HotStorage '11).
The workshop CfP should be out any day now. I just finished assembling the program committee and writing the workshop overview last week. HotStorage is an awesome place to send your cool ideas. The program committee is absolutely top-notch. How top-notch, you ask? Well, you can deal with a little suspense ... I don't want to jump the gun on the CfP yet.
In case you are interested, here are some links. HotStorage '09 program committee & CfP, papers. HotStorage '10 program committee, papers.
So, start working on those papers ... :)
Submitted by irfan on Thu, 2010-06-17 06:13.
I am honored to have been asked to chair a session at the HotStorage 2010 workshop in Boston. Take a look at the program. My session includes two very interesting papers:
Funnily enough, Jiri chose the session title to be "All Aboard HMS Beagle". Here's his explanation: "the session name refers to Charles Darwin's ship named Beagle. I chose the name because there isn't really much technical commonality other than the words Adaptive and Evolution (hence the reference)".
If folks are in the area, please consider registering and popping in. USENIX workshops are always very exciting mixers for industry and academia.
Submitted by irfan on Fri, 2010-06-11 15:02.
Just got my approval... the 2nd one for our team on storage features. One on Storage I/O Control and the other a tech preview. Yea!
Submitted by irfan on Thu, 2010-06-10 20:46.
Someone was discussing this topic with me, so I thought I'd blog about it. The issue was sequential-read-after-random-write and how badly LFS can do in those cases. Here are a bunch of ideas I suggested to my colleague.
If anyone knows about or finds such workloads, I'd love to learn more about them. Please let me know how common they are.
Given that WAFL is essentially LFS and that ZFS has gone all the way to LFS, I'm really curious how real the fear of that particular workload is. Also, most second-level caches do read-ahead, and those read-aheads often end up looking random anyway (apart from the initial padded read, the prefetches are issued asynchronously from the original IO). LFS implementations that want to protect themselves against sequential-read-after-random-write should be able to mitigate the problem by doing file-offset-based read-ahead (as opposed to LBN-based read-ahead).
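To make that last idea concrete, here's a minimal sketch of file-offset-based read-ahead (Python, with hypothetical names; a real LFS would hook this into its block map and IO path):

    PREFETCH_DEPTH = 8  # logical blocks to read ahead

    class Device:
        # stand-in for the real block device interface
        def read_block(self, pbn): return b"\0" * 4096
        def prefetch_async(self, pbn): pass

    class LogStructuredFile:
        def __init__(self, block_map, device):
            # block_map: logical block number -> physical block number;
            # after random writes, consecutive LBNs are physically scattered
            self.block_map = block_map
            self.device = device

        def read(self, lbn):
            data = self.device.read_block(self.block_map[lbn])
            # LBN-based read-ahead would fetch physically adjacent blocks,
            # which mostly belong to other data after log-structured writes.
            # Instead, prefetch the logically next blocks of this file:
            for next_lbn in range(lbn + 1, lbn + 1 + PREFETCH_DEPTH):
                pbn = self.block_map.get(next_lbn)
                if pbn is not None:
                    self.device.prefetch_async(pbn)  # async, off the IO path
            return data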
The next issue is extra metadata IO. ZFS has to do that as well, so I'm sure it's doable :)
That leaves garbage collection, which really is a problem. It has of course been studied in the literature, and I think we can do a pretty good job with some research. For instance, at one extreme where lossiness is allowed, see Peter Desnoyers' USENIX Annual Technical Conference paper. I've always wondered, if we dial the lossiness back to full fidelity (but only up to the last version, so older block history isn't needed), how close we can get to no interference from garbage collection using Peter's ideas. Anyway, garbage collection with uniform block sizes lends itself to many more tricks than the memory-objects version of that problem :)
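For readers new to LFS internals, here's a tiny greedy-cleaning sketch (my illustration only, not a specific proposal from above); the uniform block size is exactly what keeps the bookkeeping this simple:

    def pick_victim(segments):
        # Greedy policy: reclaim the segment with the fewest live blocks,
        # i.e. the one whose cleaning copies the least data
        return min(segments, key=lambda seg: len(seg["live"]))

    def clean_one(segments, log_head):
        victim = pick_victim(segments)
        for block in victim["live"]:
            log_head.append(block)   # rewrite live blocks at the log head
        victim["live"] = []          # the whole segment is now free space

    # e.g. three segments with 1, 7 and 3 live blocks: the first gets cleaned
    segs = [{"live": ["a"]}, {"live": list("bcdefgh")}, {"live": list("ijk")}]
    head = []
    clean_one(segs, head)
    print(head)  # ['a']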
Anyway, thoughts welcome.
Submitted by irfan on Fri, 2010-01-15 21:49.
Recently, I came across a great video of a talk by Keith Adams, a colleague of mine at VMware. The talk is entitled "They make computers out of software, now". In addition to being a great introduction to system virtualization, Keith's talk covers VProbes, which many see as an important building block of future computing platforms. Well worth your time.
Submitted by irfan on Mon, 2009-07-06 06:58.
One of the interesting papers presented at USENIX 2009 was "Black-Box Performance Control for High-Volume Non-Interactive Systems" [pdf] [html] [slides]. Since this is right up my alley, I paid close attention and took some notes. The paper was authored by several IBM Research folks: Chunqiang Tang, Sunjit Tara, Rong N. Chang and Chun Zhang.
First of all, this is interesting and thought-provoking work. However, the paper deals with a very constrained environment: throughput-centric systems with only a single pool of threads. I have reservations about the general applicability of the system to, say, disk scheduling. Nevertheless, their black-box treatment of the system (multiple unknown bottlenecks) is quite interesting, and it really made me wonder how else it could be extended. The main problem is that with multiple controls in the system (e.g., CPU, memory, disk), the online search they are performing gets really tricky. Still, good food for thought.
Submitted by irfan on Sat, 2009-03-07 19:48.
Several influential bloggers have now picked up on the PARDA research paper and its implications for the future of storage resource management. Here are a few of note:
Many thanks to these individuals for their favorable coverage.
Submitted by irfan on Fri, 2009-03-06 17:09.
There is an active Call for Papers for the Second International Workshop on Virtualization Performance: Analysis, Characterization, and Tools (VPACT’09). I feel very lucky to have been asked to serve on the program committee (PC) for this excellent workshop by Peter Varman, the general chair. Peter is a superb researcher, and I really like the work that he and his students have been doing in the area of QoS for storage systems. PC membership means that I'll be reviewing papers submitted to the workshop and selecting the best ones for presentation and for publication in the proceedings.
If you have interesting ideas that you'd like to run by the research community in the following areas, please do consider submitting your work.
The workshop is intended as a venue for researchers and practitioners in academia and industry to present their unpublished results in the area of virtualization research. Papers are solicited on topics including, but not limited to the following aspects of virtual machine (VM) execution:
• VM analytical performance modeling
• VM performance tools for tracing, profiling, and simulation
• VM benchmarking and performance metrics
• Workload characterization in a virtualized environment
• Evaluation of resource scheduling
• Models and metrics for new VM usages
• VM energy and power modeling
Take a look at the list of program committee members to see if you recognize any of those names.
Submitted by irfan on Wed, 2009-02-25 15:52.
Just listening to Alyssa Henry's keynote talk at FAST '09. She is the General Manager of S3 at Amazon. She used a great analogy to explain the difficult choice of which things to spend resources on to protect against failures in a highly distributed system. For some things we choose expensive redundancy, e.g., we use both seat belts and air bags; protecting one's life in a catastrophic situation is important enough to warrant the extra expense. But we tend not to wear both a belt and suspenders :-)
Alyssa also talked about "retry" as an important part of building resilient systems. To handle failures in distributed systems where messages may be lost or nodes may go down, just retry. But what about a message to charge a customer some amount of money? Do you really want to resend that request? The point was that they needed to think about making some operations idempotent by design.
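A minimal sketch of what idempotent-by-design can look like (Python; all names are hypothetical, not Amazon's actual API). The client attaches a unique request ID, so a retried charge is applied at most once:

    import uuid

    _processed = {}  # request_id -> result; durably persisted in a real system

    def _do_charge(customer, amount):
        # stand-in for the real billing operation
        return {"customer": customer, "amount": amount, "status": "charged"}

    def charge(request_id, customer, amount):
        if request_id in _processed:     # a retry of an already-applied charge
            return _processed[request_id]
        result = _do_charge(customer, amount)
        _processed[request_id] = result
        return result

    # Retries reuse the same ID, so the customer is charged exactly once:
    req = str(uuid.uuid4())
    charge(req, "alice", 42)
    charge(req, "alice", 42)  # safe to resend; no double charge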
According to Alyssa, the next failure mode after retry was solved was surge/overload: retries can overwhelm a system that is recovering from failure. So rate limiting might be used, e.g., exponential backoff. A related problem is cache time-to-live (TTL) leases expiring while the underlying system that is the source of the data is down; as that system comes back up, it would get overwhelmed. Alyssa suggested extending the TTL to keep the underlying system from breaking down when it comes back up. For example, there is a service at Amazon that checks whether a customer's account is live. In case that service is down, its client systems just continue to assume that the customer is still in good standing.
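Here's a minimal sketch of retry with exponential backoff plus jitter (my illustration of the general technique, nothing Amazon-specific):

    import random
    import time

    class TransientError(Exception):
        pass

    def retry_with_backoff(op, max_attempts=6, base=0.1, cap=10.0):
        for attempt in range(max_attempts):
            try:
                return op()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise  # out of attempts; surface the failure
                # Sleep up to base * 2^attempt seconds (capped); the random
                # jitter spreads clients out so a recovering service isn't
                # hit by a synchronized retry storm
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))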
She also talked about trading consistency for availability. When you write to S3, they send the data to multiple data centers, and they write pointers to more data centers than the data itself.
Submitted by irfan on Sun, 2009-02-15 15:12.
The USENIX Conference on File and Storage Technologies (FAST) is the premier place to send papers on all things storage. The program committee is usually the who's who of the field. For the last few years, VMware has been holding a birds of a feather (BoF) session on the intersection of virtualization and storage/filesystem technologies. The BoF chair this year is a good friend of mine, Ajay Gulati.
Ajay has set up a really cool program that I think will attract a large crowd. Take a look at the following and be sure to drop by if you are lucky enough to be attending the conference (or even if you are not but find yourself in the area, you are welcome to drop by our meeting room). I'm particularly excited about the demos!
Storage Technologies and Challenges in Virtualized Environments
VMware Vendor BoF
Thursday, February 26, 7:30 p.m.–8:30 p.m., San Francisco C
Do you wonder what VMware has to do with storage? Are you interested in learning about VMware technologies beyond core server virtualization? Do you want to get a glimpse of some of the future products and what storage applications they can enable?
Join engineers from VMware in a discussion about a number of novel storage-related technologies that VMware has been working on. We will also discuss some of the currently open problems and challenges related to better storage performance and management.
We will give two live demos:
1) Online storage migration (Storage VMotion)
2) Transparent and efficient workload characterization of VM workloads inside ESX Server
In addition, there will be a number of manned stations with posters and demos of technologies such as Distributed Storage IO Resource Management, VMware's Cluster File System (VMFS), ESX's Pluggable Storage Stack, VM aware storage (VMAS) and our dynamic Virtual Machine instrumentation tool called VProbes.
Submitted by irfan on Sun, 2009-02-15 14:28.
As part of our PARDA research, we examined how IO latency varies with increases in overall load (queue length) at the array, using one to five hosts accessing the same storage array. The attached image (Figure 6 from the paper) shows the aggregate throughput and average latency observed in the system with increasing contention at the array. The generated workload is uniform 16 KB IOs, 67% reads and 70% random, with 32 IOs outstanding from each host. It can be clearly seen that, for this experiment, throughput peaks at three hosts while overall latency continues to increase with load. In fact, in some cases, beyond a certain level of workload parallelism, throughput can even drop.
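As an aside (my gloss, not a claim from the paper): Little's law makes this shape unsurprising. With a fixed number of IOs outstanding per host, the total in flight grows with the host count, so once throughput saturates, latency must absorb the increase:

    # Little's law: outstanding IOs N = throughput X * latency R, so R = N / X
    per_host_oio = 32
    x_max_iops = 10000.0  # hypothetical saturated array throughput
    for hosts in (1, 3, 5):
        latency_ms = 1000.0 * hosts * per_host_oio / x_max_iops
        print(hosts, "hosts:", latency_ms, "ms")  # 3.2, 9.6, 16.0 ms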
An important question for application performance is whether bandwidth or latency matters more. If the former, then pushing the outstanding IOs higher might make sense, up to a point. For latency-sensitive workloads, however, it is better to set a target latency and to stop increasing the load (outstanding IOs) on the array beyond that point. The latter is the key observation that PARDA is built around. We use a control equation that takes as input a target latency beyond which the array is considered overloaded. Using this equation, we adjust the outstanding IO count across VMware ESX hosts in a distributed fashion to stay close to the target latency. In the paper, we also detail how the equation incorporates proportional sharing and fairness. Our experimental results show the technique to be effective.
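To make this concrete, here's a minimal sketch of a latency-targeted window controller in the spirit of PARDA (illustrative only; the constants and exact form are mine, so see the paper for the real control equation):

    GAMMA = 0.5           # smoothing between the old window and the proposal
    W_MIN, W_MAX = 4, 64  # bounds on outstanding IOs per host

    def adjust_window(window, observed_latency, target_latency, share=1.0):
        # Shrink when observed latency exceeds the target, grow when below;
        # 'share' scales the window by this host's proportional entitlement
        proposed = (target_latency / observed_latency) * share * window
        new_window = (1 - GAMMA) * window + GAMMA * proposed
        return max(W_MIN, min(W_MAX, new_window))

    # e.g. at 40 ms observed against a 30 ms target, a window of 32 shrinks:
    print(adjust_window(32, 0.040, 0.030))  # 28.0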
Submitted by irfan on Tue, 2009-01-13 00:56.
Ajay Gulati, Carl Waldspurger and I have just finished work on a distributed IO scheduling paper for the upcoming FAST 2009 conference, so I wanted to provide an update. PARDA is a research project to design a proportional-share resource scheduler that can provide service differentiation for IO, like VMware already provides for CPU and memory. In plain terms: how can we deliver better throughput and lower response times to the more important VMs, irrespective of which host in a cluster they run on?
This is a really interesting and challenging problem. A bunch of us first started brainstorming in this area two years ago, but despite several attempts over more than a year, we couldn't come up with a comprehensive solution. For one thing, IO scheduling is a very hard problem. Second, there aren't existing research papers that tackle our particular flavor of the problem (a cluster filesystem). To top it off, the problem sounds easy at first blush, encouraging a lot of well-intentioned but ultimately misleading attempts.
Ajay and I first published a paper on our idea of using flow control (think TCP-style) to solve this problem at the SPEED 2008 workshop in February 2008, and the feedback from the research community was encouraging (this later became the basis for an ACM SIGOPS Operating Systems Review article, October 2008). Since then, Ajay, Carl and I have worked out the major issues with this new technique, resulting in the FAST paper.
The paper is entitled "PARDA: Proportional Allocation of Resources for Distributed Storage Access".
Submitted by irfan on Thu, 2009-01-08 02:42.
I'm celebrating my one-year anniversary in a new team at VMware, so I thought I'd update everyone about it (albeit a year late). Having worked in the Performance group for 5 years, it was time to try something new. The new line of work is in the kernel group at VMware, where I'm part of the Resource Management team. I started on this team in the middle of January last year.
It's been a blast so far. I am working primarily on disk scheduling and memory management algorithms. I'll be posting more details on some of my projects here shortly.
Submitted by irfan on Wed, 2008-04-02 09:12.
I published an academic paper at the IEEE International Symposium on Workload Characterization (IISWC 2007) in September that I want to spend some time talking about. The paper was entitled "Easy and Efficient Disk I/O Workload Characterization in VMware ESX Server". Here's the abstract:
Collection of detailed characteristics of disk I/O for workloads is the first step in tuning disk subsystem performance. This paper presents an efficient implementation of disk I/O workload characterization using online histograms in a virtual machine hypervisor, VMware ESX Server. This technique allows transparent and online collection of essential workload characteristics for arbitrary, unmodified operating system instances running in virtual machines. For analysis that cannot be done efficiently online, we provide a virtual SCSI command tracing framework. Our online histograms encompass essential disk I/O performance metrics including I/O block size, latency, spatial locality, I/O interarrival period and active queue depth. We demonstrate our technique on workloads of Filebench, DBT-2 and large file copy running in virtual machines and provide an analysis of the differences between ZFS and UFS filesystems on Solaris. We show that our implementation introduces negligible overheads in CPU, memory and latency and yet is able to capture essential workload characteristics.
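To give a flavor of the technique, here's a simplified sketch of online histogram collection (mine, not the actual ESX implementation; the bucket boundaries are invented):

    import bisect

    class OnlineHistogram:
        def __init__(self, boundaries):
            self.boundaries = boundaries               # sorted bucket upper bounds
            self.counts = [0] * (len(boundaries) + 1)  # +1 overflow bucket

        def record(self, value):
            # O(log n) per IO with constant memory: cheap enough to sit in
            # a hypervisor's IO path and run online, all the time
            self.counts[bisect.bisect_left(self.boundaries, value)] += 1

    # e.g. an IO-size histogram with power-of-two buckets (bytes):
    io_size = OnlineHistogram([512, 1024, 2048, 4096, 8192, 16384, 65536])
    for size in (4096, 4096, 16384, 524288):
        io_size.record(size)
    print(io_size.counts)  # the last bucket catches everything above 64 KB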
Submitted by irfan on Tue, 2007-10-02 11:13.
It seems a lot of people agreed with my previous post on the security of virtual switches. These include the originator of the information that prompted my blog post. Chris Wolf himself posted comments recognizing his misunderstanding. I think Chris did a great job of quickly following up after my blog post and getting in touch with us to resolve the misunderstanding. Kudos to him. Read his comments for yourself.
Submitted by irfan on Mon, 2007-09-24 17:30.
In an article titled VMware dispels virtualization myths, Bridget Botelho wrote:
"One significant issue with virtual machine security is with virtual switch isolation," said Burton Group's Wolf. "The current all-or-nothing approach to making a virtual switch 'promiscuous' in order to connect it to an IDS/IPS is not favorable to security."
For example, "if you connect an IDS appliance to a virtual switch in promiscuous mode," Burton said, "not only can the IDS capture all of the traffic traversing the switch, but every other VM on the same virtual switch in promiscuous mode could capture each other's traffic as well. Users should be aware of this and work around it."
Submitted by irfan on Fri, 2007-09-21 05:47.
Typically, aggregate statistics like the mean, median and standard deviation are not detailed enough. Furthermore, they can be misleading! See the figure, where the mean of the data is 5.3, yet the histogram clearly shows a bimodal distribution; there isn't even a single data point close to the mean. Granted, this particular example is contrived, but it is not far from the typical situation in the real world. One example is if reads to a device were hitting the device cache only about half the time.
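A tiny illustration of that cache-hit example (the numbers are made up to echo the figure):

    from statistics import mean
    from collections import Counter

    # ~half the reads hit the device cache (~0.1 ms), the rest miss (~10 ms)
    latencies_ms = [0.1] * 47 + [10.0] * 53
    print(round(mean(latencies_ms), 2))  # 5.35: no data point anywhere near it
    print(Counter(latencies_ms))         # the histogram exposes the two modes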
Submitted by irfan on Fri, 2007-09-07 13:38.
Next week, I'll be attending VMworld 2007, the virtualization community's annual conference. Actually, I won't be the only one, given that more than 10,000 people are attending this year! Furthermore, famous people are keynoting:
- Diane Greene, VMware's CEO
- Mendel Rosenblum, VMware's Chief Scientist
- John T. Chambers, Cisco Systems, Inc.
- Patrick Gelsinger, Intel Corporation
- Hector de J. Ruiz, AMD
This year, I'll be giving two talks: "Fast and Easy Disk Workload Characterization on VMware ESX Server" and "ESX Storage Performance - A Scalability Study".
Submitted by irfan on Sun, 2007-07-22 19:31.
VMmark 1.0 is now live.
This release is the culmination of a lot of effort by the VMmark team. Big congrats to them for making this tool available to the virtualization community.
Having worked in the virtualization performance arena for about 5 years, I can attest to the need for a standard benchmark. There are so many variables, with layers of software running on layers and resource sharing on top, that it is very difficult to make sense of data presented by customers, partners, the press and the community at large. VMmark attempts to address this by creating a standard benchmark.
Submitted by irfan on Thu, 2007-07-12 07:27.
I found Microsoft's recent Live Migration-related announcement rather puzzling.
The Register published an article "Microsoft promises VMware beater despite reversals" that everyone should read. You'll get a laugh out of it. I sure did and it started my morning off just right!
Instead of providing details of this bogus idea of "quick migration", the Microsoft rep takes the press to task:
"The recent press has been inaccurate to say we don't do migration - we do migration: quick migration"
Submitted by irfan on Sun, 2007-04-08 08:34.
The anxiously awaited RHEL5 was released recently. The following is from the RHEL5 release notes. It looks like virtualization support (with Xen 3) is only half-baked, or perhaps only quarter-baked.
"Fully virtualized guests cannot be saved, restored or migrated."
So, does that mean no suspend/resume and no VMotion?
"Hardware-virtualized guests cannot have more than 2GB of virtual memory."
One really shouldn't try running enterprise or even SMB workloads with only a 2 GB upper limit.
"When you install a fully virtualized guest configured with vcpus=2, the fully virtualized guest may take an unreasonably long time to boot up. To work around this, destroy the slow-booting guest using the command xm destroy and then use xm create to start the same guest afterwards."