SAP Conversational AI is almost 1 year old! And our infrastructure has changed a lot since we started. We began with Kimsufi, then moved to DigitalOcean, and we are now on Azure and AWS. The ever-growing need for availability, performance and security is the main reason for these evolutions. As a former student with no knowledge of DevOps and infrastructure, I had to learn everything, from a single-server setup to an infrastructure of tens of servers. I'll explain here how we evolved, made technology choices and built a scalable, High Availability infrastructure at SAP Conversational AI!
Our first infrastructure
At the beginning, we had all the applications, both development and production environments, on the same Kimsufi KS-4, which cost us $22 a month. We used it for 6 months, developing our product and testing a lot. After the first 3 months, we started using Docker with Rancher. This way, our production and development environments were split and not impacting each other. This was great for our first users. But when some containers started randomly crashing at the filesystem level in production, we had a problem. We dug in, searched for real use cases of big companies running Docker in production, but after some time without finding any solution, we concluded that Docker was not yet viable for our production needs.
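To illustrate the kind of split we ran (this is a hypothetical Compose sketch with made-up service and image names, not our actual Rancher setup), the same application definition can be brought up as two fully isolated stacks:

```yaml
# compose.yml — illustrative sketch only
services:
  api:
    image: myapp/api:latest   # hypothetical image name
    ports:
      - "3000"                # let Docker assign host ports so stacks don't collide
```

Running `docker compose -p production up -d` and `docker compose -p development up -d` creates separate containers and networks per project name, so one environment can't step on the other.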
That’s why we searched for a hosting company that would let us create instances in the cloud, to entirely separate the different elements of the platform.
Finding a new hosting provider
Even though we started with a single server, we always designed our code to be scalable. In fact, every single part of the platform is API-oriented and split into microservices. As a result, it was easy for us to move to a cloud architecture with many instances.
To find the hosting company we needed, we first designed the infrastructure. Here is what we needed:
- Handle 10 users at the same time on the platform
- Enable us to manually resize some instances to handle a temporary burst of 100 simultaneous users on the platform (ProductHunt was coming 😉 )
- Recover from a downtime within 15 minutes
- Production and development environments
- Continuous Integration and Continuous Deployment
- A centralized log system
Considering this information, we established the features our new hosting provider had to offer:
- Use cloud instances so that each part of the platform runs autonomously (Frontend, Backend API, Machine Learning, Natural Language Processing,…) and so we can pop a new instance when needed (scaling or failure recovery).
- Have resizable resources (CPU, RAM, SSD,…) to be able to scale when needed
- Be cost efficient. We were a young startup and didn’t have a lot of money!
After some research, DigitalOcean seemed to be the most interesting option. AWS and Azure were still too complicated for our needs, Heroku was too high-level, and we needed to build our infrastructure ourselves. OVH, 1and1 and other companies mainly provide bare-metal servers, which is not agile enough for us.
DigitalOcean provides a really simple way to move to a basic cloud infrastructure: public IP addresses, private networking, resizable instances (when shut down), snapshots, and a big community and blog to help!
Here is the new infrastructure made with DigitalOcean:
Everything sits in a single private network, except a reverse proxy which serves the platform and tools to the users and developers. We moved our Jira and Prerender to their PaaS services and our blog to a simple OVH web hosting, and added StatusCake and New Relic to monitor the platform. On top of the Production and Development environments, the log, backup and monitoring systems take care of the platform.
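The reverse-proxy pattern can be sketched with an nginx config (hypothetical addresses, domain and certificate paths — an illustration of the idea, not our actual configuration): the proxy is the only machine with a public face, and it forwards traffic to instances on the private network.

```nginx
# /etc/nginx/conf.d/platform.conf — illustrative sketch only
upstream backend_api {
    server 10.0.0.10:8080;   # private-network address of a backend instance
}

server {
    listen 443 ssl;
    server_name api.example.com;              # placeholder domain
    ssl_certificate     /etc/ssl/cert.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/key.pem;

    location / {
        proxy_pass http://backend_api;        # forward into the private network
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

Only the proxy needs a public IP; every other instance stays unreachable from the outside.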
The next steps
So we had a fresh new infrastructure able to handle our growth, and we kept it for almost 4 months. But now we are reaching its limits: there is no internal DNS server, so no service discovery, and we can’t pop new instances automatically. The infra is hard to update and maintain because the servers are managed manually. With new business contracts, we need to be able to instantiate the platform for companies easily. Bot hosting is coming, which means auto-deployment and auto-scaling. We also have a lot of traffic now, so we have to scale and provide a High Availability service, which means duplicating all instances and hosting them in physically different locations (different networks and power switches). Last but not least, we need more computational power to compile our datasets.
Hence, all of these needs bring us to the next level: a new infrastructure running on Azure, Google Cloud Platform or AWS.
We’ll describe our migration to a new IaaS provider in the next article, coming in a week! We’ll talk about our infrastructure choices, the design and deployment, the challenges we went through, and our use of Ansible. See you next week! 🙂