I poured sparkling water into a tall glass and continued to listen. The founders of a start-up I was about to join had been speaking for about an hour, offering me a general overview of their business and technology. Now the conversation had turned to current challenges.
The company's mobile apps for location-based real-estate search and discovery were innovative and compelling. Business was growing, and the company had signed deals with several major national real estate franchises. As more real estate agents installed and began using the apps, usage grew, but so did trouble reports, and it appeared that the backend system powering the apps was unstable. The business growth had outpaced the technology infrastructure.
Wanting to get a sense of the landscape, I began to ask a few questions. In addition to the mobile apps, developed natively for the Android and iOS platforms, the backend was a fairly rich J2EE application, supported by an Oracle 11g database. There was also a complex data ingestion system that pulled in data every day from real estate MLSs (multiple listing services) all around the country.
To grow the business the company needed to continue adding MLS data sources as well as enhance the core backend functionality while introducing new apps with more features. Searching for a home on the apps needed to be fast and reliable. And right now, it wasn't.
The current system was running out of gas. But why? Was there a capacity problem with the servers? Could we optimize the code to be more efficient? Should we re-design the application messaging protocols between the apps and the backend? Would I be able to help?
I leaned back and thought over all I had learned. I roughly estimated the system's traffic levels and considered the likely sources of performance and availability problems in this environment. "Yes, I can solve these problems."
At our next meeting more details emerged. The system was hosted at a conventional colocation facility. There were about a dozen physical servers for production, scattered storage arrays, a couple of cascaded switches, a single 100 Mbps network connection to the Internet, and a load balancer. I asked if they were familiar with "the cloud."
As with so many other verticals, the real estate industry had started to embrace technology in the 80s and 90s. By now the established players were fairly comfortable with their data centers and systems. The "cloud" was viewed with suspicion. "How does the cloud work?" "Will our data be secure?" "What about reliability?"
There was much work to do, not just technical, but also in building a comfort level with the cloud in general. I needed to show how the cloud could add real value for the business and its customers. The opportunity to do that was about to present itself.
New clients were being signed with more in the sales pipeline, and the business had committed to 99.9% availability, which we had to demonstrate to existing and prospective clients. By now my engineering partner, David Glaser, and I had begun to assess the sources of production systems instability, and to formulate plans of attack.
Our high-level assessment was that the current infrastructure was too limited for the required service levels, and that remediation would require a major capital investment. Worse, the cost to build the infrastructure out to a three-year growth projection would be prohibitive.
Our prior cloud experience gave us confidence in projecting that an end state of 100 percent cloud would be beneficial, with one minor exception: we had never before moved I/O-intensive workloads with Oracle databases into the cloud. However, a number of Amazon's moves led us to believe that this would work, if not now, then in the near future.
So we set out on a path of reducing the load on our current infrastructure through a Cloud Pilot, followed by a full Forklift. Here, because of the database risk, we would partner with a third-party vendor.
Under certain load conditions, poor management of resources within the Java application was causing pathological server behaviors. In time we would fix these defects. For now, we needed to reduce server load so that we could handle increased traffic.
One of the most frequently used backend transactions obtains a property photo, resizes it to meet exact app requirements, and then sends it to the mobile app. This particular transaction was prolific, fairly heavy in use of server CPU and network resources, and otherwise stateless relative to the rest of the backend system. I presented a proposal to build a new photo resizer subsystem in the Amazon Cloud and to move this transaction out of our physical infrastructure, all in a matter of weeks. This would be our Cloud Pilot project.
We designed and built a cloud-based resizer farm, running it on four EC2 compute-optimized instances, and using ELB for load balancing. Today it processes and delivers over one million real estate property photos per day, with peaks as high as 50 photos per second. Performance is excellent. Users looking for homes on their phones get sharp and properly sized images quickly.
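The heart of the resizer is a simple aspect-fit calculation: each device profile requests a maximum width and height, and the service scales the source photo down to fit without distortion. As a rough sketch (the function name and dimension rules here are illustrative, not our production code):

```python
def fit_dimensions(src_w, src_h, max_w, max_h):
    """Scale (src_w, src_h) to fit within (max_w, max_h),
    preserving aspect ratio and never upscaling."""
    scale = min(max_w / src_w, max_h / src_h, 1.0)
    return (round(src_w * scale), round(src_h * scale))

# A 2000x1000 listing photo fit to a 640x640 profile
# comes back as 640x320 -- same aspect ratio, no cropping.
print(fit_dimensions(2000, 1000, 640, 640))
```

Because each request is stateless, instances behind the load balancer need no coordination, which is what made this transaction such a clean candidate for the pilot.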
This pilot project delivered tangible business value while helping stabilize the legacy infrastructure even as we added more users to the system. For the first time the company could see the benefit of cloud computing. The stage was set.
By now our engineering analysis of the backend system had progressed substantially. We had identified and fixed a number of Java code defects that led to instability and performance degradation. Fixing these problems allowed us to focus on underlying infrastructure capacity issues. Our physical load balancer and application servers were underpowered, and the RAID arrays supporting our Oracle databases could not keep up with the I/O demands of our system.
I went through the exercise of sizing and costing a physical infrastructure upgrade that would meet our needs. It was not the approach that I wanted to take, but I knew it would be needed for comparison to a complete cloud proposal. To ensure that the project would be tightly focused and delivered on time, I chose to define it as a forklift: a complete, one-time migration of all backend functionality out of physical infrastructure and into the Amazon Cloud. This was a bold approach. The generally recommended architecture at the time was a "hybrid" design, in which the application servers would move to the cloud but the core databases would remain in physical infrastructure. The proper design for our core Oracle databases would be critical to our success. I reached out to Oracle.
Having partnered with Oracle before, I knew the value of engaging their engineering team early in the project. I presented an overview of the business and application, along with more detailed metrics of our database systems. We sized the databases for use with Amazon EC2 cluster-compute instances, and Amazon EBS volumes with provisioned IOPS.
Our vendor helped write the migration plan and proposal, and I presented it to the business. I also presented to the board of directors and made the case for "cloud." Our cloud proposal would avoid costly capital investment while positioning the company for significant growth. The elastic nature of Amazon cloud services would ensure the system could scale as needed. Approval in hand, we set out to build the new system.
We began implementation by building a prototype system consisting of one application server and one database. To create the database, we began copying our nightly RMAN database backups to S3 cloud storage. Once that was in place, our DBA team was able to restore the backups and stand up copies of our production database in the cloud. For the first time, we could point our smartphone apps at the cloud and run our applications.
Controlling application endpoints would be key to our launch plan, and we used Amazon Route 53 to provide DNS services for our EC2 servers. We power thousands of uniquely branded smartphone apps, each with its own domain name. We also used Route 53 to manage application endpoints internal to the system.
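With thousands of branded domains, repointing endpoints by hand is not practical, so record changes are best scripted. As a simplified sketch (the domain names and function are hypothetical; in practice a batch like this would be submitted through the Route 53 ChangeResourceRecordSets API, for example via boto3), building an UPSERT change batch that points every branded domain at a new target might look like:

```python
def upsert_cname_batch(domains, target, ttl=60):
    """Build a Route 53 change batch that UPSERTs a CNAME record
    for each branded domain, pointing it at the given target."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
            for name in domains
        ]
    }

# Example: repoint two (hypothetical) branded app domains at a new endpoint.
batch = upsert_cname_batch(
    ["app.brand-one.example.com", "app.brand-two.example.com"],
    "frontend.cloud.example.com",
)
```

A short TTL like the one above is what makes a fast cut-over (and fallback) possible, since clients pick up the new records within minutes.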
Many of our data source providers used IP whitelists as part of their mechanisms for authenticating us. We used Amazon Elastic IP (EIP) addresses to meet these requirements.
We built a cloud dashboard with key business metrics as well as technical metrics. The dashboard was powered with data collected from application logs as well as operating system statistics. We used Splunk to visualize and present this data in real time. The dashboard would be key to the live cut-over as well as the ongoing 24x7 operation of the system.
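To illustrate the kind of aggregation behind such a dashboard (a simplified sketch; the log format and field layout are assumptions, and our production pipeline ran on Splunk rather than hand-rolled code), a per-minute request count can be derived from timestamped application log lines like so:

```python
from collections import Counter

def requests_per_minute(log_lines):
    """Count requests per minute from log lines whose first field is an
    ISO-8601 timestamp, e.g. '2013-06-01T00:01:27 GET /photo/123'."""
    counts = Counter()
    for line in log_lines:
        ts = line.split(" ", 1)[0]
        counts[ts[:16]] += 1  # truncate to YYYY-MM-DDTHH:MM
    return dict(counts)

sample = [
    "2013-06-01T00:01:27 GET /photo/123",
    "2013-06-01T00:01:59 GET /photo/456",
    "2013-06-01T00:02:03 GET /search?q=condo",
]
print(requests_per_minute(sample))
```

Charting a rolling series like this next to CPU and I/O statistics is what let us watch live traffic shift over during the cut-over.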
Our launch plan, of course, included a "fallback" scenario (to go back to the legacy system in the event the launch failed). To minimize the chance of needing it, I asked the team to perform several launch "rehearsals." These allowed us to fine-tune the detailed step-by-step plan, and also to establish reasonably accurate timelines for each step.
Finally, we were ready to put our plan in motion. We provisioned our production grade servers in Amazon EC2. On launch day we restored the database from our RMAN backup. We then waited until after midnight, when our traffic levels were minimal, and shut the legacy servers down.
A short time after that, the last database updates from Oracle archive logs were applied to the cloud database. We were ready to turn the cloud system on. With the DNS changes in place, we watched our dashboard as live traffic began to flow into our new system. The total outage time for the migration was under two hours.
Our cloud system has performed extremely well from the time it went live. Because we had built rich and detailed monitoring, we did uncover a few unexpected issues that we were able to quickly address. We currently run a farm of three EC2 general-purpose instances for our application servers. (And we know we can easily add more if needed.) Our core Oracle database is stable and performs well under load, and Amazon recently doubled the number of provisioned IOPS available in EBS, so I will not be losing any sleep over future needs in that dimension either.
Since our journey began, the number of daily unique users has more than doubled, and we have added tens of thousands of real estate agents to the system. Instead of worrying about old, dust-covered servers in a rented cage, I can log in to the Amazon Web Services Management Console and control our entire infrastructure from a web browser. Life in the cloud is good.
Our technology and team are now well positioned for growth. With a stable and scalable infrastructure the team can focus on building new features and capabilities as we grow our client base. We continue our rapid expansion. Our infrastructure is no longer a bottleneck. In fact it has become a major selling point. This cloud journey has been a great success.
We hope you have found this story engaging and useful. Having worked for many years to resolve performance and scalability challenges in a variety of industries, we are now demonstrating that cloud technologies change the paradigm and introduce new opportunities for business growth. If you would like to discuss further, please contact us at the links below.
David C. Willen