System Design: How to design a tiny URL

Source: Internet | Editor: 程序博客网 | Time: 2024/05/17 22:52

At first glance, this problem seems very easy. We can solve it with a HashMap.


However, if you think about it more deeply, you will find that the HashMap approach offers neither scalability nor persistence. Therefore, we need to design a system that provides both.


When you design a system, you should think about three areas: APIs, the application level, and the persistence level.


For this system design case, we use the following definitions.


APIs define the interface that users call to interact with the functions we provide.

The application level is how we implement the functions we defined.

The persistence level decides which data is saved where.


-------------------P.S. Other reference-----------------------------------------------


This is the same pattern that is used in the field of software architecture. Some statements below are copied from:


https://en.wikipedia.org/wiki/Multitier_architecture

https://www.safaribooksonline.com/library/view/software-architecture-patterns/9781491971437/ch01.html




The presentation layer is the UI. The business layer is like a factory class in the factory design pattern: it translates a specific UI input into a complete function call. The persistence layer is where those functions use SQL statements to gather the data that the business layer needs.


We can also use the MVC framework to understand this pattern. The model is the persistence layer, since it runs the SQL queries. The controller is the business layer, since it translates an HTTP request into a specific function call (a call to the model's functions). The view, that is the HTTP page, is the presentation layer.


-------------------P.S. Other reference-----------------------------------------------


APIs: 

String createTiny(String longURL);

String returnLong(String tinyURL);


Application Layer:


The interviewer will probably want to see at least the following graph:


(from https://www.youtube.com/watch?v=fMZMm_0ZhK4) 




createTiny is served by a POST request and returnLong by a GET request.


W1, W2 and W3 are worker threads or worker nodes.


The cache can be Memcached or Redis.


1) Analyse the data scale to decide how many characters we need


Let's say we use a-z (26), A-Z (26) and 0-9 (10), which gives 62 possible characters.


62^7 ≈ 3.5 trillion. Since 2^41 < 62^7 < 2^42, every value in this range fits in 42 bits; the sections below budget 43 bits for a key.


Is it enough? That depends on the number of requests per second (RPS).


If the RPS is in the thousands, it will take more than 100 years to exhaust the available tiny URLs. However, at a million requests per second it takes only about 40 days, and at a billion requests per second about an hour.
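
The arithmetic behind these estimates can be checked with a short sketch. The class and method names here are made up for illustration, and a constant request rate is assumed.

```java
final class KeySpace {
    // Number of distinct keys of the given length over a 62-character alphabet.
    static long capacity(int length) {
        long total = 1;
        for (int i = 0; i < length; i++) {
            total = Math.multiplyExact(total, 62L); // throws instead of silently overflowing
        }
        return total;
    }

    // Days until the key space is used up at a constant request rate.
    static double daysToExhaust(long capacity, long requestsPerSecond) {
        return (double) capacity / requestsPerSecond / 86_400.0;
    }

    public static void main(String[] args) {
        long keys = capacity(7);
        System.out.println(keys);                                  // 3521614606208
        System.out.println(64 - Long.numberOfLeadingZeros(keys)); // 42 (bits needed)
        System.out.println(daysToExhaust(keys, 1_000L) / 365.0);  // years at 1,000 RPS
        System.out.println(daysToExhaust(keys, 1_000_000L));      // days at 1,000,000 RPS
    }
}
```

At 1,000 requests per second the key space lasts a little over 111 years; at 1,000,000 requests per second it lasts roughly 40 days.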


2) Let's go with an RPS in the thousands. In that case 7 characters are enough.


3) Implementation of generating unique 7 characters


3.1 Simply generate random tiny URLs and use one of the following techniques to make them unique.




3.1.1 Get the tinyURL; if there is no such record, then put(tinyURL, longURL). The problem with this approach is the race condition: two threads may both see the key as absent and save the same tinyURL at the same time, so one mapping overwrites the other.


3.1.2 Use the put-if-absent style of statement offered by MySQL and Oracle. This implementation relies on the ACID characteristics of a relational database. Many NoSQL databases do not offer this guarantee, but they scale well in exchange.


3.1.3 This approach is the most reliable one. Put the key only if it is absent, then get it: if the put succeeded, the get returns the long URL you just stored. If it returns something else, generate another random key and try again. The drawback is that you may need several round trips, depending on the quality of the random generator.
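
The retry loop can be sketched as follows. This is a minimal illustration, not the article's implementation: a ConcurrentHashMap stands in for the persistent store, and the class name RandomKeyShortener is made up. The atomic putIfAbsent call plays the role of the put-then-get check, because it reports in a single step whether the key was already taken.

```java
import java.security.SecureRandom;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class RandomKeyShortener {
    private static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int KEY_LENGTH = 7;

    // Assumption: an in-memory map stands in for the real database.
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    String createTiny(String longURL) {
        while (true) {
            String key = randomKey();
            // putIfAbsent is atomic and returns null only if this call inserted the key.
            if (store.putIfAbsent(key, longURL) == null) {
                return key;
            }
            // The key was already taken: loop and try another random key.
        }
    }

    String returnLong(String tinyURL) {
        return store.get(tinyURL);
    }

    private String randomKey() {
        StringBuilder sb = new StringBuilder(KEY_LENGTH);
        for (int i = 0; i < KEY_LENGTH; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

With 62^7 possible keys, a retry is rare until the store holds a large fraction of the key space.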


3.2 Use an existing hash algorithm to reduce the probability of collisions.


For example, you can hash the long URL with MD5 and take the first 43 bits (7 characters) of the digest as your tiny key. Collisions are still possible with this approach, but they are much less likely. It can also save some space, because the same long URL always produces the same key and therefore needs only one record.
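
A minimal sketch of the hashing approach is shown below. The class name HashKey and the final modulo step are assumptions added for illustration: a 43-bit value can be larger than 62^7, so the sketch folds it into the 7-character range before encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashKey {
    private static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final long KEY_SPACE = 3_521_614_606_208L; // 62^7

    static String tinyKey(String longURL) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5")
                                  .digest(longURL.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        // Pack the first 6 bytes (48 bits) of the digest, then keep the top 43 bits.
        long bits = 0;
        for (int i = 0; i < 6; i++) {
            bits = (bits << 8) | (digest[i] & 0xFF);
        }
        bits >>>= 5;
        // Assumption: fold into [0, 62^7) so the key always fits in 7 characters.
        long value = bits % KEY_SPACE;
        char[] out = new char[7];
        for (int i = 6; i >= 0; i--) {
            out[i] = ALPHABET.charAt((int) (value % 62));
            value /= 62;
        }
        return new String(out);
    }
}
```

Because the key is derived only from the long URL, two different URLs that collide still need one of the uniqueness checks from 3.1.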


3.3 Counter method. 


The counter method is very common in distributed system design. The master-slave pattern is one example of it, and the ZooKeeper leader/follower pattern is another. Read requests can be distributed to any node: slave nodes, followers or observers, as well as the master or leader node. Write requests, however, must be forwarded ONLY to the master or leader node. The master or leader node therefore acts as the counter.




3.3.1 Single Node

Every createTiny request is sent to the counter, which maintains a record number (much like a leader node maintains the proposal number). Each request increments the number by 1, and the counter sends the number to the worker nodes as the long key from which a unique tiny key is generated.


Drawbacks:

1) Single point of failure.

2) High load on one node.
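
A single-node counter can be sketched as below. The class name CounterShortener and the character order of the alphabet are assumptions for illustration; the article does not fix either.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CounterShortener {
    private static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    private final AtomicLong counter = new AtomicLong(0);

    // Returns the next unique number; safe to call from many threads.
    long nextId() {
        return counter.getAndIncrement();
    }

    // Converts a non-negative number to its base-62 representation.
    static String toBase62(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return String.valueOf(ALPHABET.charAt(0));
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }
}
```

Every ID is unique by construction, so no collision check is needed, but the counter state lives in one process, which is exactly where the two drawbacks come from.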


3.3.2 Use worker node ID + timestamp + an incrementing or random ID to generate the 43 bits, which can be interpreted as a 7-character short key


How does it work? Let's assume we have 64 worker nodes. We need 6 bits to represent them. (6 bits)


Then every request has its timestamp, which changes once per second. (32 bits)


43 - 38 = 5. Every request can then take an ID from the remaining 5 bits, which gives 2^5 values in the range [0, 31].


This ID can be chosen randomly or incremented.


This approach breaks down when one node receives more than 32 requests in one second. If you generate the ID randomly, collisions become very likely even at about 20 requests per second. Note also that the largest 43-bit values exceed 62^7, so not every ID built this way fits in 7 characters.
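
The bit layout can be sketched as follows. The field order (node ID in the highest bits, then timestamp, then sequence) and the class name CompositeId are assumptions for illustration; the article only gives the field widths.

```java
final class CompositeId {
    static final int NODE_BITS = 6;   // 64 worker nodes
    static final int TIME_BITS = 32;  // timestamp in seconds
    static final int SEQ_BITS = 5;    // per-second ID in [0, 31]

    // Packs the three fields into one 43-bit number.
    static long compose(int nodeId, long epochSeconds, int sequence) {
        if (nodeId < 0 || nodeId >= (1 << NODE_BITS)) {
            throw new IllegalArgumentException("nodeId must be in [0, 63]");
        }
        if (sequence < 0 || sequence >= (1 << SEQ_BITS)) {
            throw new IllegalArgumentException("sequence must be in [0, 31]");
        }
        long time = epochSeconds & 0xFFFFFFFFL; // keep only the low 32 bits
        return ((long) nodeId << (TIME_BITS + SEQ_BITS))
             | (time << SEQ_BITS)
             | sequence;
    }
}
```

The range check on the sequence makes the limit visible: a node that needs a 33rd ID within the same second has nowhere to put it.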


3.3.3 Use a separate node running a ZooKeeper instance to maintain the ranges


This method can be described by the following graph:



 

Every worker node asks the range server for its own range of long key IDs at initialization. Once a worker has used up its range, it applies for another one. This is a good approach because it has no single point of failure and no possibility of collisions.
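
Range allocation can be sketched as below. This is only an in-memory illustration: the RangeServer class is a hypothetical stand-in for the ZooKeeper-backed range server, and the range size is an arbitrary parameter.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RangeAllocation {
    // Hypothetical stand-in for the range server; in the article ZooKeeper plays this role.
    static final class RangeServer {
        private final long rangeSize;
        private final AtomicLong nextStart = new AtomicLong(0);

        RangeServer(long rangeSize) {
            this.rangeSize = rangeSize;
        }

        // Hands out the start of a fresh range [start, start + rangeSize).
        long allocate() {
            return nextStart.getAndAdd(rangeSize);
        }

        long rangeSize() {
            return rangeSize;
        }
    }

    // A worker that serves IDs from its local range and asks for a new one when it runs out.
    static final class Worker {
        private final RangeServer server;
        private long next;
        private long end;

        Worker(RangeServer server) {
            this.server = server;
            refill();
        }

        private void refill() {
            next = server.allocate();
            end = next + server.rangeSize();
        }

        synchronized long nextId() {
            if (next == end) {
                refill();
            }
            return next++;
        }
    }
}
```

A worker only contacts the range server once per range, so the server sees far fewer requests than a per-request counter would.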


4) Let's talk about the cache. The cache here can be Memcached or Redis. Both are in-memory stores.


The main principle is that when you call createTiny, you save the new entry in both the database and the cache. When a GET request comes in, you search the cache first and fall back to the database only on a miss. The reason is that in an application such as Twitter, a newly created tiny URL is much more likely to be revisited than an older one.
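
The write-to-both, read-cache-first flow can be sketched as below. Two in-memory maps stand in for Memcached/Redis and for the database, and the class and method names (CachedShortener, save, evict) are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CachedShortener {
    // Assumption: plain maps stand in for the real cache and the real database.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> database = new ConcurrentHashMap<>();

    // Called when a tiny URL is created: write to the database and to the cache.
    void save(String tinyURL, String longURL) {
        database.put(tinyURL, longURL);
        cache.put(tinyURL, longURL);
    }

    // Called on a GET request: try the cache first, then the database.
    String returnLong(String tinyURL) {
        String cached = cache.get(tinyURL);
        if (cached != null) {
            return cached;
        }
        String stored = database.get(tinyURL);
        if (stored != null) {
            cache.put(tinyURL, stored); // repopulate the cache for later reads
        }
        return stored;
    }

    // Simulates the cache dropping an entry, for example under memory pressure.
    void evict(String tinyURL) {
        cache.remove(tinyURL);
    }
}
```

A real cache would also need an eviction policy and expiry times, which this sketch leaves out.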




