Paperdrip


Cabin Coder

October 06, 2019 by Ronnie Kwok in Technote

Introduction

I enjoy the feeling of being able to do geeky stuff in remote locations - be it on a mountain after a hike, at a small table on the train, or in a cabin up in the sky!

The constraints of the machine and the available bandwidth make the experience even more interesting, on the condition that I can still get my work done!

So here I am, in a cabin up in the sky, preparing this little experiment: building a blog and writing several entries over the 10+ hours of the flight.

Experiment

Thinking about the criteria - or, better put, what I am actually able to do: it has to be something I can develop on my iPad (I don't want to rely on remote development, firstly because of the cost and secondly because I want to test the limits of local development), and latency is certainly a factor.

As for the topic, what would be more interesting to cover in these 10 hours than what happens over this period of time? Blogging sounds like a good idea, as I can capture this journey and everything that happens along the way. Besides, maybe I can build a blogging workflow along with the blog entries, which seems achievable - not too technical, but complex enough to probe the limitations.

There are two options I am considering:

  • Use Pythonista and install a framework as a base to build my workflow on

  • Build a static website using Ghost / Jekyll

The static website feels more interesting to me, and this sets the stage for the experiment.

Constraints set a boundary, but they are also fuel for creativity. We have to think, and think hard, to get around, or even over, the constraints in achieving our goal.

Provision a server in the sky, NOT

Originally, I envisioned setting up a workflow on iOS to provision a new server through DigitalOcean and then automate the environment setup with an Ansible script or a Dockerfile. But I ended up going back to Netlify.com, leveraging their free hosting for this project.

The free plan is good enough for this experiment, and it already comes with:

  • SSL certificates through Let's Encrypt

  • Load-balanced DNS through the NS1 infrastructure, another company whose service I enjoy using for both its performance and its features (site monitoring and notifications in particular).

  • Integration with GitHub, which eases my publication and modification workflow

  • Global deployment with CDN acceleration

And since I have already registered my own website (https://ronniek.me) with Netlify.com, I can just add a new site and point it to my GitHub repo.

Domain Creation

I could stay with a domain provided by Netlify.com, but having a custom domain would make it more complete. And "Freenom" (which stands for Free Name) offers free domains under TLDs such as ".tk" for up to 12 months of usage. That's more than enough for me.

So the naming process begins. Someone coding in the cabin - so it's natural to use "cabin-coder" as the domain. Registering a domain on Freenom.com is also straightforward: sign up for an account, pick the name and the duration, and ta-da, all set. Since I am using Netlify's name servers, I need to update the entry on Freenom.com as well.

All set for the domain creation.

Finding a theme, setting up the repository and writing!

The paid themes on Jekyll Themes are nice (https://simples.jekyllthemes.io), but shelling out $49 for an experiment is a bit too much, so I am picking a free theme that seems to work for this experiment.

Update: The more I look at it, the more I like the "Simples" theme. So yes, I shelled out the money, downloaded the zip and saved it to Files. Next, I need to add "Working Copy" as a location to access within Files. From there, I can copy the theme into the repository I created and start working!

Update #2: The project has gone live - check it out from this link.


Landscape of Backend Development

May 01, 2018 by Ronnie Kwok in Devlosophy, Technote, BecomingDevOps

Background

It happened when I was having a regular 1:1 with our almighty frontend developer, and he shared the challenges of being a frontend developer nowadays, as well as the differences between now and the early 2000s.

Frameworks are released quickly, either as new versions or as entirely new frameworks. Developers are now expected not only to know the basics - HTML, JS and CSS - but also to specialize in using a framework to develop applications. And here’s the thing: every framework is different, and once you get past the basics and turn your focus to performance, you need to know the framework inside out. All of this takes a lot of time and effort.

And the last statement in the conversation was that the principles of backend development have not changed, and that they (the backend developers) do not need to deal with all these issues rooted in frontend frameworks.

I gave it a little thought. While his statement is somewhat true - new frameworks for “backend development” (I am mainly referring to the common languages: SQL, Java, .NET, Ruby, Python, PHP) are not popping up as frequently as frontend frameworks - the way backend applications are developed has indeed evolved quite a lot.

So let’s see what has changed.

History 

Before we start, let’s give the discussion some context. The landscape of development I am talking about spans from the time the first browser went mainstream (i.e. Netscape, around 1994) until now (i.e. 2018). Also, the applications I am talking about are “online applications”, i.e. applications that serve users on the public internet. A rough timeline is bound to each stage, but it is by no means a hard and definite split between stages; you should see it more from the perspective of progression.

From static to dynamic (1995-2000)

Since the original web server developed by Tim Berners-Lee, static pages started popping up. Soon after, people wanted to add dynamic elements to their pages, and an approach was worked out to execute programs on the server that generate web content dynamically. Perl was one of the popular languages for developing such a “program” - more precisely a “script” - through what is commonly known as the “Common Gateway Interface” (CGI). The script sits on the serving host, is triggered remotely through a web request, and returns the result of the operation as the response. This opens up a lot of flexibility but also exposes risks, since it allows remote execution and direct manipulation of the resources underneath. Access control is handled at the operating system level, by limiting the privileges of the user the script executes on behalf of, or through mechanisms like a “chroot jail”, which isolates a process from the rest of the system.
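
That era used Perl for these scripts, but the mechanism is the same in any language. As a rough sketch (in Python rather than Perl, purely for illustration; the file name and markup are made up), a CGI script is just a program whose standard output becomes the HTTP response:

    #!/usr/bin/env python3
    # hello.cgi - a minimal sketch of the CGI mechanism described above.
    # The web server (e.g. Apache with CGI enabled) executes this script for each
    # matching request and returns whatever it prints on stdout as the response.
    import datetime
    import os

    # HTTP headers come first, terminated by a blank line.
    print("Content-Type: text/html")
    print()

    # Everything after the blank line is the response body.
    query = os.environ.get("QUERY_STRING", "")  # CGI passes request data via environment variables
    print("<html><body>")
    print("<h1>Hello from a CGI script</h1>")
    print(f"<p>Generated at {datetime.datetime.now()}</p>")
    print(f"<p>Query string: {query}</p>")
    print("</body></html>")

This also makes the risk mentioned above concrete: the script runs as a real process on the serving host, with whatever privileges the web server user carries.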

In terms of web hosting, there were not many options - either owning the server and managing it yourself (or through services from a hosting company), or procuring web hosting services, which usually meant sharing web servers with other subscribers.

Sandboxing and virtualization (2000-2005)

Interest in the Internet grew at a rapid pace, and people saw the opportunity, which generated the need for new technology to cater for that growth.

Back in 1995, a new language targeted at internet programming was developed, with a focus on security, performance and architecture. The language also promoted the idea of “write once, run everywhere”, since it aimed to deploy programs on different devices, ranging from servers to set-top boxes. That language is Java, and it helped define a newer approach to developing and operating internet applications. Around 2000, it started to pick up momentum after its inception in 1995.

A Java application is first compiled to platform-independent bytecode (hence the support for write once, run everywhere) and executed within the Java Virtual Machine (the JVM itself is platform dependent). This approach allows portability, with the added benefit of tighter security.

The security model in Java consists of several layers; each layer serves as a gateway and is responsible for safeguarding a particular area. It can be summarised in the following order:

  • Source code schema verification
  • Class-level security (where compiled classes can be loaded from)
  • API-level security (which operations are allowed to execute)
  • Security management, which includes authentication mechanisms (against other systems) and the security protocols to be used (algorithms, hash/key formats, etc.).

Java is not only for developing applications running on the backend; it can also run as a component executed in the browser to interact with the user, in the form of a Java applet. The same security model applies.

Unlike the CGI approach, hosting a Java application involves the Java Virtual Machine and a “middleware” application server that provides the container environment for the Java application to execute in. What’s more, the MVC architecture pattern is widely implemented. This pattern promotes separating business logic, flow control and the business domain model into separate layers / components.

Since the application now involves more components, and in turn requires more computing resources, it is usually hosted on dedicated servers - either bare metal, or embracing the budding idea of virtualisation: partitioning a physical server into many “virtualised” machines. Instead of putting everything on a single machine, each component now sits within a dedicated environment. Managing the application becomes more complicated.

Centralized to distributed (2005-2010)

The Internet kept growing after the dot-com bust - smartphones and mobile internet gained popularity, and websites mashed up content from other websites through RESTful APIs. The demand for faster delivery, and the change in the channels and approaches for internet applications, resulted in developers looking for solutions that let them move faster and ease development for the current trends.

There are benefits in adopting the MVC architecture, but the Java approach is too “heavy” - in terms of the architectural requirements and the skill set involved (installing and configuring WebLogic middleware is not a task for everyone). On the other hand, application development using scripting languages has the benefit of increased agility (changes can be introduced without compilation, deployment and server restarts), and development frameworks for these languages started picking up momentum as well. A development framework provides not only a “framework” but also a methodology for how applications can (or should) be developed, following the suggested “best practices”. Separation of concerns, configuration management, a data access layer, error handling - these are some of the functionality a framework usually provides.

Unlike Java, executing these languages does not involve setting up application servers (e.g. Jetty, JBoss, Tomcat), which means applications are easier to set up and manage. On top of this, the landscape of “Cloud Computing” changed tremendously: more affordable (billed by usage rather than upfront investment), speedy provisioning, and the needed components consumed as a service. This was the first time a developer could easily get access to a production-grade environment without a heavy upfront investment (some would argue a VPS can do the same, but you would need to do all the hard wiring to stitch together the level of resilience a cloud provider offers up front). Besides, companies like Heroku pushed the envelope further, shielding the underlying infrastructure and allowing developers to focus on developing their apps.

Code management got a breath of fresh air as well: Git was the new player, and decentralized version control became the norm. Developers could create local branches, commit code locally and merge back later. This gave developers even bigger freedom to experiment, make changes and roll out releases.

Application scaffolding, Git and cloud platforms were the main drivers in increasing development agility and promoting application architecture designs that could leverage the elasticity these services offer. This development methodology is summarised as the “Twelve-Factor App”.

Concurrency and Serverless (2010-NOW)

The proliferation of smartphones and mobile internet increased the population of internet users. This brought a greater need for concurrent access than ever, and newer approaches to deal with the increasing demand surfaced. Programming languages that focus on concurrency gained popularity - Golang, Scala, etc. Backend developers used to stateful programming now need to design their applications to work statelessly. They need to change their mindset from code that works on one machine to code that runs across multiple machines.

Cloud platforms have become a commodity, and infrastructure can now be handled like code through APIs; the line between development and operations is getting blurry. Infrastructure can be provisioned and versioned through APIs and configuration management tools. This advancement eases deployment and streamlines the development/deployment pipeline further. If 10 deploys a day is not enough, it can now be done continuously (CI/CD).

On top of this, a newer approach started showing up that allows multiple applications to be partitioned on a single host, each running as if it were its own entity. This is the container approach, and each container has its own namespaces. Containers work without knowing about the other containers, and since no hypervisor is involved, resources are better utilised than with a virtual machine. Containers can also be moved across environments, which lets developers easily set up production-like infrastructure on their own machines.

As you can see, developers no longer just need to know the programming language, framework and design patterns, but are getting more and more involved with system-level components too.

Conclusion

Whether you are coding for the frontend or the backend, the magnitude of change over the years is just as enormous. It doesn’t seem to be slowing down; on the contrary, changes are happening even faster. Developers need to know even more than before to get the job done. But at least it is an interesting journey.


Looking back, a year of the DevOps practise

January 13, 2015 by Ronnie Kwok in Technote, BecomingDevOps

After sailing through the turbulence since launching the site, I can finally find the time to sit back and look at the situation. A lot has changed, and the key changes are:

  • The team is getting bigger
  • Development is getting even more rigorous
  • Higher expectations of the platform from both internal and external users

But the way we worked was still pretty much the same - manual and reactive (aka ad hoc). The process was not sustainable and could not keep pace with the project.

More than two hands can handle

Since launch, the backlog kept growing with bugs and features that we deferred to “the future”. We split the team into two: one focusing on bigger chunks of work, which we called “Road Map”, while work that could fit into our bi-weekly release was called “Fast Train”. Both needed to go through environment promotion once testing succeeded, and thus the number of deployments almost doubled.

Besides, we started engaging a new vendor who needed to run their code on a new platform. Naturally, they had their own branch in our version control, too. Making sure we were building from the right branch and deploying to the right environment purely by hand became a challenge we stood no chance of winning.

Communication is always the key to success

The site was gaining popularity, and issues were caught by users quickly. Ensuring uptime was no longer sufficient, as we needed to raise the SLA. We had collated procedures to deal with different scenarios, and many of them still required data generated from manual investigation, e.g. checking system performance, reading the logs, etc.

At the beginning it was quite panicky, as we didn’t know what went wrong and usually needed someone (me, most of the time) to give direction on what to look into. What’s more, there were no guidelines to determine whether the components were behaving normally.

And as the team grew, we could foster a separation of job duties. We now had discrete “application” and “operations” teams, and that meant people speaking different languages needed to work together.

Rhea - means “Flow” in Greek

So I started to think about how to enable us to work with our flow, like the stream of a river, and I came down to three principles:

  1. Automation
  2. Communication
  3. Culture

To free people from the never-ending list of tasks and give them breathing space to think, or to talk, getting things done automatically was crucial. It helped eliminate the dependency on individuals, which saved time and reduced manual errors.

We listed out the things we were doing repeatedly, and deployment stood out from the crowd. Naturally, we started by automating this process. You may have something else on your list, but the key of the exercise is to identify what’s stopping you.

The deployment automation was a huge success, as it turned around the way we manage the development environments. Before we had the process in place, the QA team needed help from Dev to prepare the build and from Ops to do the deployment; now they can do both by themselves.

Next is making communication more efficient. Communication is about the way we express ourselves, as well as the “language” we use. To be precise, it’s about enabling people on different teams to interpret the vocabulary. For instance, when we were told the site was slow, we usually needed rounds and rounds of discussion to understand what was meant by "slow" - is it the response time, or the time it takes to receive a system email? It also took several rounds of investigation to determine who was responsible.

With the information gathered from monitoring agents (please refer to “You know as much as you monitor”), we were able to break the system down into different components and build dashboards around them, showing the performance of individual components. Both teams can then look at the figures and know who can help look into an issue.

We had sorted out the vocabulary, and I was trying to promote the use of group discussion applications like Campfire / HipChat. I was amazed by the Hubot integration provided by both and implemented some custom commands for looking up certain information through the messaging client. I thought it would be adopted, but it turned out not to be the right thing to do.

The team was already using Skype in the office and WhatsApp on the go. Not everyone on the team needed the Hubot integration, and they had already created different groups on each channel. They had no issue talking!

The lesson I learnt is that when people are already talking, you shouldn’t change that unless something is broken. Don’t get carried away by technology; focus on the intent.

Culture is the end goal

We continue to automate more of our work and to provide a richer vocabulary. From there, I am beginning to see something beautiful.

We are not only getting more time out of automation and enabling the team to communicate effectively; we are bringing the team together through the process. The process enables people to understand one another’s work, especially their difficulties and needs. Decisions are now driven by data, and everyone can understand the bigger picture and how their work relates to others’. This increases the sense of autonomy, and we are now extending this culture to the business team.

The journey is not going to stop here, and the last thing I want to share is the manuscript I wrote 2 years ago. I hope this article gives you some thoughts on how to grow your own DevOps culture.


You know as much as you monitor

December 17, 2014 by Ronnie Kwok in BecomingDevOps, Technote

A dashboard, or HUD, is where you find operational metrics. Unsurprisingly, the dashboard was first introduced for the automobile. Values from the speedometer, fuel gauge and tachometer give the driver a sense of “what’s going on”. Imagine driving a car without any of these values: you would worry about driving too fast, or running out of gas at any time.

Let’s draw a line in the sand

The invention of the dashboard not only provides a means of displaying the important metrics, but also forces the engineer to quantify the operation into numbers. These numbers also become a standard language between the user and the engineer, reducing discussion over something abstract (fast versus slow, for instance). They also provide precision, instead of a rough guess based on experience.

As just mentioned, we want to understand “what is going on” by glancing at the dashboard. This cannot be done by the value alone; what I mean is, 100 km/h by itself does not indicate whether that is fast or slow. The value only starts showing its worth when compared against a boundary.

When we deliver a system, we need to understand whether it is performing as desired. As with a car, we should make sure metrics that indicate the operational status of the system are captured. The first metric I would capture is whether the system is up and running.

Do you know the site has been inaccessible for the past hour?

It’s so embarrassing when it’s a user who reports the outage to my boss before I’m aware of it. Being the first to notice any outage is the number one thing you should try to achieve. It gives users the impression that you are on top of things, an important building block of trust. It is also very easy to implement; you just need to be careful with the definition of “up and running”. Being able to ping the server is not enough: if it is a web site, you should make sure the site is actually accessible (accessible and operational can mean much more, but anyway).
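
As a rough illustration of how little such a probe needs (not my exact setup; the URL and timeout are placeholders), a minimal availability check in Python could be:

    # A minimal "is the site up?" probe: pinging the server is not enough,
    # so fetch the page itself and treat anything other than HTTP 200 as down.
    import urllib.error
    import urllib.request

    SITE_URL = "https://www.example.com/"  # placeholder URL

    def site_is_up(url: str = SITE_URL, timeout: int = 10) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        print("UP" if site_is_up() else "DOWN - raise the alarm")

Run it from a scheduler and notify yourself on failure, and you stand a chance of being the first to know.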

The New Arrival page is not showing any products - do you know that, in a business sense, it’s just as much an outage?

This is what I was told while I had a big smile on my face, feeling I was on top of things. Apparently, knowing the site is up is not enough if it cannot serve its intent. So what you should do next is identify ONE business performance metric to monitor. For a transaction site, it could be the last order placed. For a search site, it could be when the last successful query executed. For me, it is making sure there are products showing on the New Arrival page. Since the page content is driven by a search result, the way I do this is to look for a magic word that only appears when the query returns products.
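
Building on the availability probe above, the business-level check is only one more step: fetch the New Arrival page and look for the word that only shows up when products come back. The URL and the marker word below are made up for illustration:

    # Business-level probe: the page only counts as healthy when the "magic word"
    # is present, i.e. the search behind the New Arrival page returned products.
    import urllib.error
    import urllib.request

    NEW_ARRIVAL_URL = "https://www.example.com/new-arrivals"  # placeholder URL
    MAGIC_WORD = "product-tile"                               # placeholder marker word

    def new_arrivals_have_products(url: str = NEW_ARRIVAL_URL) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
                return resp.status == 200 and MAGIC_WORD in body
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        if not new_arrivals_have_products():
            print("New Arrival page shows no products - a business-level outage!")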

The many facets of a system

Even if your system is very simple, it is usually composed of several components: the network, the application and probably a database, too. So when an outage arises, it can be one, or several, of the components having issues. After you have probes monitoring the key metrics, you should start rolling out monitoring probes on these components, too.

Someone reported that the site is very slow

There was a time when I was greeted with this statement rather than “Good morning”. Knowing the application is up and serving its intent is, again, not enough. We should be aware of how efficient the application is and, again, bring it to our attention when performance falls below the “acceptable region”.

Defining the “acceptable region” is an art. You can’t just pick a value out of thin air without considering whether it is physically achievable. For instance, due to network latency, it is unfair to use a single threshold for all locations (edge acceleration can help with this, but that’s not the point of this discussion).
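
As a toy illustration of the idea (the regions and numbers are invented; yours should come from measuring what is physically achievable from each location), the acceptable region can be a per-location threshold rather than one global value:

    # Per-location "acceptable region" for page response time, in seconds.
    # The figures are invented for illustration only.
    ACCEPTABLE_RESPONSE_TIME = {
        "us-east": 2.0,
        "eu-west": 2.5,
        "ap-southeast": 3.5,  # farther from origin, so a looser bound
    }

    def breaches_threshold(region: str, measured_seconds: float) -> bool:
        """True when a measurement falls outside that location's acceptable region."""
        return measured_seconds > ACCEPTABLE_RESPONSE_TIME.get(region, 2.0)

    print(breaches_threshold("ap-southeast", 4.2))  # True -> raise an alert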

Notify yourself proactively before things turn sour.

We have talked a lot about notifications; let’s take a look at the importance of metrics for diagnosis. When I notice a slowdown in performance, the first thing I do is open my monitoring tools, go through every single component and look for abnormalities. Most of the time, the cause can be identified from the single metric that falls away from its average.

What you are seeing might only be a consequence, not a cause

One day, I got a degradation alert. Both the application and page load Apdex dropped sharply.

From the graph, the page load degradation is due to an application performance issue (the section in purple).

So I drilled further and noticed that the time spent on database operations had increased.

Naturally, I looked at the DB performance and noticed an increase in response time, but throughput remained the same.

So what is wrong?

Next, I moved over to check the infrastructure metrics, and I spotted the following:

One of the network segments was saturated, and this is where the databases sit! From there, I noticed an abnormality in one of the databases, which kept pumping data out to one of the backend applications.

A restart of this backend application resolved the issue, and the site performed normally again. It took quite a bit of time to investigate, and one of the reasons is that the issue arose across multiple components.

To make life easier, we have established a dashboard that summarizes issues all in one place. It flashes to draw attention, and from this screen we can see that things are not right in several places, so we know where to start digging.

This is just the beginning; with the probes in place, you can develop tools to streamline your work. Happy monitoring!


How much traffic can your site handle?

December 02, 2014 by Ronnie Kwok in Technote

Do you know?

Whenever you need to take care of a website, one of the questions that keeps coming back is how much the site can handle. So what you usually do is fire up a load testing tool and keep hitting the site until it stops working. Then you go back to the boss and say, "Our site can handle this many page requests".

Now you know the truth?

The problem with this approach is that it brings down the site. What's more, the value is only representative in certain scenarios. It is good for determining how well the site can deal with the "Slashdot effect", where one of your URLs is being shared widely. But in reality the visiting pattern is more random, and you may want to reproduce a load test based on the actual access pattern.

What is a good visiting pattern?

The first thing to be aware of is to focus on a flow instead of individual URLs. The user lands on the homepage and browses around the category pages before reaching the article of interest. Imagine you have hundreds of people browsing your site, each going through their own flow; the URL requests are all interleaved.

How can you get a sense of the browsing pattern? Google Analytics can give you some very good insight if you have the tracking code in place. In case you don't, you can visit your good old friend - the web server log. But mind you, up front: you will need a bit of patience to get through it. With a session id, or any unique identifier that lets you track an individual, you can parse your log files on the principle of:

URI(s) group by session_id order by time ascending

From there, you will see a multitude of aggregated URL visit patterns, and more likely than not you won't be able to spot anything meaningful at all. The number of combinations is just too large to be useful, so go back to the log and do some "ETL": tag URLs of the same nature under an alias and repeat the analysis, so that you can see some useful patterns. Splunk lets you tag patterns and perform this analysis easily.
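
I did this with Splunk, but the principle can be sketched in a few lines of Python: group requests by session id, order them by time, and collapse URLs of the same nature into an alias. The log format, regex and alias table below are assumptions for illustration and would need to match your own logs:

    # Sketch: turn an access log into per-session visit flows.
    import re
    from collections import defaultdict

    # Assumed line shape: [timestamp] ... sessionid=abc123 ... "GET /path ..."
    LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\].*sessionid=(?P<sid>\w+).*"GET (?P<path>\S+)')

    ALIASES = [  # tag URLs of the same nature under one alias
        (re.compile(r"^/product/\d+"), "PRODUCT_DETAIL"),
        (re.compile(r"^/category/"), "CATEGORY"),
        (re.compile(r"^/$"), "HOME"),
    ]

    def alias_for(path: str) -> str:
        for pattern, name in ALIASES:
            if pattern.search(path):
                return name
        return "OTHER"

    def session_flows(log_path: str) -> dict:
        flows = defaultdict(list)  # session id -> [(timestamp, alias), ...]
        with open(log_path) as log:
            for line in log:
                match = LINE_RE.search(line)
                if match:
                    flows[match["sid"]].append((match["time"], alias_for(match["path"])))
        # Sort each session by timestamp (assumed sortable) and keep only the aliased flow.
        return {sid: [alias for _, alias in sorted(hits)] for sid, hits in flows.items()}

    # e.g. {"abc123": ["HOME", "CATEGORY", "PRODUCT_DETAIL"], ...}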

One approach I was tempted to try but never implemented is to replay the Apache access log via JMeter. The upside is that it resembles production traffic; on the other hand, it reflects the pattern of that particular day only.

Load Time? Almost!

After identifying a pattern, or a distribution of site traffic, you may want to obtain one more piece of information - the user agent population.

My company’s web site has two versions - desktop and responsive. The page structure is very different, in terms of both presentation and code, and we determine which site to serve to the client by their user agent. I use JMeter for load testing; it is easy to set up and plenty of resources are available to help you. For my setup, I created a total of 3 thread pools to mimic the traffic: one for mimicking desktop page requests, another for mimicking responsive site page requests, and I will talk about the last one later in the article.

The test setup is indeed quite straightforward. But one technique I would like to share is the random effect. One of the common visiting patterns for my company’s web site is:

  • Go to the “New Arrival” page
  • Scroll and click on a product that brings them to the product details page

Thus, the first JMeter script I wrote mimics this behavior. The tricky part of the script is randomly picking an item programmatically. Our product ID is the value of an attribute on a CSS element, and there are multiple such elements on the New Arrival page, so the JMeter script randomly picks one of the elements and parses the attribute value out of it.

[Screenshot: extracting the attribute value from the CSS element]

We can then use the attribute value to substitute the variable defined in the JMeter script (a rough sketch of the same logic follows the screenshot below).

[Screenshot: variable replacement in the JMeter script]
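
The real implementation lives inside the JMeter test plan (an extractor feeding a variable, as in the screenshots above), but the logic it performs can be sketched in Python. The page URL, attribute name and product URL scheme here are assumptions for illustration:

    # Sketch of the "random product" logic the JMeter script performs: fetch the
    # New Arrival page, pull every product id out of a CSS element attribute,
    # pick one at random, and build the product details request from it.
    import random
    import re
    import urllib.request

    NEW_ARRIVAL_URL = "https://www.example.com/new-arrivals"  # placeholder URL
    # Assumed markup: <div class="product-tile" data-product-id="12345">
    PRODUCT_ID_RE = re.compile(r'data-product-id="(\d+)"')

    def random_product_url() -> str:
        with urllib.request.urlopen(NEW_ARRIVAL_URL, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        product_ids = PRODUCT_ID_RE.findall(html)
        if not product_ids:
            raise RuntimeError("no product ids found - check the markup assumption")
        chosen = random.choice(product_ids)  # the "random effect"
        return f"https://www.example.com/product/{chosen}"  # placeholder URL scheme

    print(random_product_url())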

With the JMeter test plan ready, I can now determine how much traffic my site can handle. Start off with a small number of concurrent requests and ramp up gradually. This lets us watch for any abnormality, and one factor I look at is GC behavior. Under load, is the JVM performing GC more frequently? Is the heap memory stacking up? The data is much easier to obtain with a monitoring agent like New Relic. If everything looks healthy, I then start pumping up the number of requests and conduct the real load test.

How do I know what the user experience is like under load?

The JMeter test results, together with server metrics (e.g. GC, as mentioned above), provide a view of how the server performs. But we do not know what the end user experience is like. So let’s bring in the 3rd thread pool I defined in the JMeter script!

The role of this pool is to invoke a browser and obtain a waterfall diagram by visiting pages of interest. Since I have a private WebPageTest instance, I can trigger a test of the page of interest through the WPT API. Apart from the page of interest, we can also gather performance information for different user agents. It's easier to see how the components work together with a diagram.
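
As a rough sketch of that step (assuming the classic WebPageTest runtest.php endpoint; the host, API key and page URL below are placeholders, and a private instance may differ), kicking off a test from the harness could look like this:

    # Sketch: trigger a WebPageTest run for a page of interest through the WPT API.
    import json
    import urllib.parse
    import urllib.request

    WPT_HOST = "https://wpt.example.com"  # placeholder private instance
    API_KEY = "YOUR_API_KEY"              # placeholder key, if your instance requires one

    def start_wpt_test(page_url: str) -> str:
        params = urllib.parse.urlencode({
            "url": page_url,
            "f": "json",  # ask for a JSON response instead of the HTML results page
            "k": API_KEY,
        })
        with urllib.request.urlopen(f"{WPT_HOST}/runtest.php?{params}", timeout=30) as resp:
            result = json.load(resp)
        # The JSON response carries a test id plus URLs for polling the results.
        return result["data"]["testId"]

    print("Submitted WebPageTest run:", start_wpt_test("https://www.example.com/new-arrivals"))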

Analysis

During the JMeter load test, we collected information from various systems. Below is a list of the metrics we collect:

  • JMeter - successful request count, error request count, average response time
  • New Relic - JVM performance Apdex, CPU usage
  • Database - AWR report
  • WebPageTest - average page load time

Instead of looking at the metrics individually, we developed a portal page that renders all these results in a single location. We are able to look at historical data, too.

[Chart: Metrics from JMeter]

[Chart: Metrics from WebPageTest]

Final Words

We began the project solely to implement JMeter scripts for conducting load tests. But it slowly evolved into an application that not only fires JMeter scripts but also looks at the website from both the server- and client-side perspectives. We also noticed that the same idea has been developed into a commercial offering by BlazeMeter. You may want to check it out if you don’t want to build your own infrastructure.

That's it! Head off to construct your tests and prepare for the holiday season!

