With the growth of SAP Conversational AI, the DevOps team has to find ways to scale our entire infrastructure, so we started redesigning it from the ground up to support our increasing traffic. Our experience at DigitalOcean reinforced our belief that micro-services are the best choice for us, but we need to go further. Right now, IaaS (Infrastructure-as-a-Service) makes sense for us, so we started comparing the major cloud providers.
Infrastructure choices and design
We’re taking a look at Google Cloud Platform, AWS (Amazon Web Services) and Microsoft Azure. Our infrastructure is designed to really take advantage of each cloud provider’s strengths.
And it’s official: our platform will be hosted on Microsoft Azure!
I mean, we’d love to. But first we have to learn how Microsoft Azure works… Before creating new server instances, we read lots of documentation in the Microsoft Azure Documentation Center. We need to understand how a virtual network, a network gateway, a network security group, a high-availability group, and every other component works on Azure. Here are our takeaways:
Each component of an Azure infrastructure is called a “resource”.
First, we decide to split our environments into different Resource Groups, which are logical containers used to organize resources. We create one for the production environment, one for the development environment, one for our tools (GitLab, GitLab runners), etc.
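Resource groups can be created straight from the Azure CLI. A minimal sketch, assuming the `az` CLI is installed and logged in — the group names and region below are our own invention, not the actual ones:

```shell
# Hypothetical group names and region; adjust to your own conventions.
az group create --name rg-production --location westeurope
az group create --name rg-development --location westeurope
az group create --name rg-tools --location westeurope
```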
Now our resources are in the right places, but they need to communicate with each other. To do this, we create a Virtual Network for each resource group. A virtual network connects resources under the same IP address block (for example, the IP addresses from 10.0.0.0 to 10.255.255.255, so the virtual network can hold 256 × 256 × 256 = 16 777 216 resources). Resources in the same virtual network can now communicate with each other.
But by default, external connections to resources inside a virtual network are not allowed. The only entry point into our infrastructure is our reverse proxy: each connection to *.cai.tools.sap is routed to the right resource group and to the right resource.
Then, we split each virtual network into subnets, one per kind of server: one for front-end servers, one for back-end servers, one for data servers, etc. A nice property of subnets is that we can allocate an IP address block to each one. For example, the front-end subnet has the IP addresses between 10.0.1.0 and 10.0.1.255 (256 resources), and the back-end subnet has the addresses between 10.0.2.0 and 10.0.2.255 (256 resources too). It’s the same for the other subnets.
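The address arithmetic above is easy to check with Python’s standard `ipaddress` module: the 10.0.0.0–10.255.255.255 block is the 10.0.0.0/8 network, and each subnet is a /24. The subnet split below is just our illustration of the scheme:

```python
import ipaddress

# The whole virtual network: 10.0.0.0 .. 10.255.255.255
vnet = ipaddress.ip_network("10.0.0.0/8")

# One /24 block per kind of server (hypothetical split)
frontend = ipaddress.ip_network("10.0.1.0/24")
backend = ipaddress.ip_network("10.0.2.0/24")

print(vnet.num_addresses)          # 16777216 (256 * 256 * 256)
print(frontend.num_addresses)      # 256
print(backend.subnet_of(vnet))     # True: the subnet fits inside the network
print(frontend.overlaps(backend))  # False: the address ranges don't collide
```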
To provide high availability for our services, we decide to create high-availability groups (HAGs) for all our servers, one per kind of server: one for front-end servers, one for API servers, one for machine-learning servers, one for database servers, etc. If one server of a group crashes, whether for software or hardware reasons, the HAG can add a new server to replace it. So, we constantly have the same number of active servers in our infrastructure.
Our goal is not only high availability for our services but also good response times under heavy traffic. To achieve that, we choose to run many instances of each kind of server (inside the same HAG). And, to divide the incoming traffic between servers, we put a private load balancer in front of each group. It splits traffic according to rules we define beforehand (such as CPU consumption or the number of connections already active on a server).
So with HAGs and private load balancers, we can handle more traffic than before with the lowest downtime ever!
Take care of your infrastructure… with Ansible
After thinking about and designing our new infrastructure, we looked for a tool that would help us manage the deployment process and set up new instances, while also keeping already-created instances under management. And we found Ansible!
This is one of the first sentences of the Ansible documentation:
“Ansible is an IT automation tool.”
By using Ansible, we can set up each of our servers on the fly. It works with *.yml files and Python-based commands. We version our Ansible files in a Git repository. We don’t use Ansible to create new resources on Azure yet, but it’s possible.
To begin, we make an inventory of our infrastructure: we list all our instances and save this inventory in the “/inventory” directory.
Next, we organize them into groups matching our needs (a production group, a development group, a front-end group, a back-end group, etc.). This is very powerful: if an action needs to be executed on all servers of the development group, we simply run it on that group. The same goes for the front-end group: regardless of the environment group they are in, all servers of the front-end group are affected when we execute an action on that group!
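An inventory with such groups might look like the sketch below. The hostnames and addresses are invented for the example; Ansible’s INI inventory format also supports “:children” parent groups, which is one way to get the cross-environment behavior described above:

```ini
# inventory/my_inventory.txt -- hypothetical hosts and groups
[front-production]
front-prod-01 ansible_host=10.0.1.10
front-prod-02 ansible_host=10.0.1.11

[back-production]
back-prod-01 ansible_host=10.0.2.10

[front-development]
front-dev-01 ansible_host=10.1.1.10

; Parent groups, so actions can target a whole environment or a whole tier
[production:children]
front-production
back-production

[front-end:children]
front-production
front-development
```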
A very useful case for groups is installing Node.js. We need Node.js on each front-end server, so we just execute the action that installs it against the right group (i.e. the front-end group)!
In Ansible, an action is called a role. A role is a set of tasks made to execute a defined action, like adding SSH authorized keys to a server, installing Node.js, or deploying the configuration files of each of our Git repositories to the right server. So, we write dozens of roles matching our needs. We save them in separate folders under the “/roles” directory.
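As a sketch, the tasks file of a “nodejs” role could be as small as this, assuming Debian-based servers and the distribution’s own packages (the exact package source is an assumption on our part):

```yaml
# roles/nodejs/tasks/main.yml -- minimal sketch, package names assumed
- name: Install Node.js and npm
  apt:
    name:
      - nodejs
      - npm
    state: present
    update_cache: yes
```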
In the end, we create some playbooks. A playbook executes predefined role(s) on predefined group(s): that’s the link between roles and groups! Playbooks are saved in the “/playbooks” directory.
For example, our front-end production playbook has to execute several roles to get a fully working server:
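A sketch of what such a playbook could look like — the file name is our invention, while the group, roles, and pm2 task come from our actual setup:

```yaml
# playbooks/front_production.yml -- a sketch, not our exact file
- hosts: front-production
  become: yes
  roles:
    - debian-base
    - nodejs
    - js
  tasks:
    - name: Install pm2 globally to manage our Node.js applications
      npm:
        name: pm2
        global: yes
```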
Our playbook executes roles on all servers of the ‘front-production’ group; the roles executed are ‘debian-base’, ‘nodejs’ and ‘js’. Beyond roles, a playbook can execute tasks too: here, we install pm2 to manage our Node.js applications.
Once we have all the tasks we need to set up servers, we can execute our playbooks. The command line is:
ansible-playbook -i inventory/my_inventory.txt playbooks/my_playbook.yml
Now that our infrastructure is on Microsoft Azure and our servers are set up with Ansible, it’s time to migrate our applications to the new servers.
Over the past several months, we added Continuous Integration (CI) to all of our repositories. It helps us deploy our work to the right place in the infrastructure. Thanks to CI and micro-services, the migration is a lot simpler.
We use Capistrano to manage our deployments. We decide to migrate services one by one, testing each of them and, if all is well, proceeding to the next service.
The migration begins with the development environment. First of all, we change the Capistrano development configuration of our back-end repositories. When everything is good, we do the same with the front-end repositories, then with the production configurations of all repositories. Finally, we migrate our tools environment (GitLab and its runners).
Step by step, our infrastructure switches to Microsoft Azure.
We take the time to clean up configuration files, Capistrano commands, and everything else that gives us better continuous integration.
After days of work, it’s time to migrate the DNS. We knew resolution could take an unpredictable amount of time if we didn’t transfer our DNS zone to Microsoft Azure beforehand, so we did it days in advance. After that, the last thing to do is change the public IP address our domain name points to.
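With the Azure CLI, hosting the zone and pointing a wildcard record at the new public IP could look like the sketch below. Zone delegation at the registrar still has to be done separately, and the resource group name and IP address here are placeholders:

```shell
# Create the DNS zone in Azure (placeholder resource group name)
az network dns zone create --resource-group rg-production --name cai.tools.sap

# Point the wildcard record at the new public IP (203.0.113.10 is a placeholder)
az network dns record-set a add-record --resource-group rg-production \
  --zone-name cai.tools.sap --record-set-name "*" --ipv4-address 203.0.113.10
```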
And that’s it! Our new infrastructure now runs on Microsoft Azure. It’s scalable, with good response times and high availability. And with Ansible and our CI, it’s easy to manage servers, repositories and deployments. What a beautiful life for a DevOps guy!