URL shortener

Source: Internet | Published: c语言中double | Editor: 程序博客网 | Time: 2024/06/05 23:07

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef". Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes about 56.8 billion (62^6) possible strings.

Edit: Due to the ongoing interest in this topic, I've uploaded the code that I used to GitHub, with implementations for Java, PHP and JavaScript. Add your solutions if you like :)

My approach:

I have a database table with three columns:

  1. id, integer, auto-increment
  2. long, string, the long URL the user entered
  3. short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create strings that are too long, so I don't think I should use them. A self-built algorithm would work, too.

My idea:

For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:

    short = '';
    if divisible by 2, add "a"+the result to short
    if divisible by 3, add "b"+the result to short
    ... until I have divisors for a-z and A-Z.

That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?

 
2 
@gudge The point of those functions is that they have an inverse function. This means you can have both encode() and decode() functions. The steps are, therefore: (1) Save URL in database (2) Get unique row ID for that URL from database (3) Convert integer ID to short string with encode(), e.g. 273984 to f5a4 (4) Use the short string (e.g. f4a4) in your sharable URLs (5) When receiving a request for a short string (e.g. 20a8), decode the string to an integer ID with decode() (6) Look up URL in database for given ID. For conversion, use: github.com/delight-im/ShortURL – Marco W. Feb 10 '15 at 10:31
 
@Marco, what's the point of storing the hash in the database? – Maksim Vi. Jul 11 '15 at 9:04
 
@MaksimVi. If you have an invertible function, there's none. If you had a one-way hash function, there would be one. – Marco W. Jul 14 '15 at 14:47
 
Would it be wrong if we used a simple CRC32 algorithm to shorten a URL? A collision is very unlikely (a CRC32 output is usually 8 hex characters long, and that gives us over 4 billion possibilities). If a generated CRC32 output was already used previously and was found in the database, we could salt the long URL with a random number until we find a CRC32 output which is unique in the database. How bad or different or ugly would this be for a simple solution? – syedrakib Mar 22 at 9:41
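A rough Java sketch of the CRC32 idea from the comment above, for illustration only. The class and method names are made up, an in-memory Set stands in for the database uniqueness check, and a counter is used as the salt rather than a random number. Because CRC32 is not invertible, the resulting code would have to be stored next to the long URL.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

// Illustrative sketch; class and method names are hypothetical.
final class Crc32Shortener {

    // CRC32 of the UTF-8 bytes of the input, as 8 lowercase hex characters.
    static String crc32Hex(String input) {
        CRC32 crc = new CRC32();
        crc.update(input.getBytes(StandardCharsets.UTF_8));
        return String.format("%08x", crc.getValue());
    }

    // Returns a code that was not yet in usedCodes and records it there.
    // usedCodes stands in for a uniqueness check against the database.
    static String shorten(String longUrl, Set<String> usedCodes) {
        String code = crc32Hex(longUrl);
        // Assumption: a counter serves as the salt instead of a random number.
        for (int salt = 1; !usedCodes.add(code); salt++) {
            code = crc32Hex(longUrl + "#" + salt);
        }
        return code;
    }

    public static void main(String[] args) {
        Set<String> usedCodes = new HashSet<>();
        System.out.println(shorten("http://www.google.de/", usedCodes));
        // Same URL again: the first code is taken, so a salted one is returned.
        System.out.println(shorten("http://www.google.de/", usedCodes));
    }
}
```

Unlike the ID-based conversion, this needs a lookup per attempt, and the number of attempts grows as the table fills up.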

19 Answers

Answer (493 votes, accepted)

I would continue your "convert number to string" approach. However you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.

Theoretical background

You need a bijective function f. This is necessary so that you can find an inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:

  • There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
  • and for every y you must be able to find an x so that f(x) = y.

How to convert the ID to a shortened URL

  1. Think of an alphabet we want to use. In your case that's [a-zA-Z0-9]. It contains 62 characters.
  2. Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).

    For this example I will use 125_10 (125 with a base of 10).

  3. Now you have to convert 125_10 to X_62 (base 62).

    125_10 = 2×62^1 + 1×62^0 = [2,1]

    This requires use of integer division and modulo. A pseudo-code example:

    digits = []
    while num > 0
      remainder = modulo(num, 62)
      digits.push(remainder)
      num = divide(num, 62)
    digits = digits.reverse

    Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:

    0  → a
    1  → b
    ...
    25 → z
    ...
    52 → 0
    61 → 9

    With 2 → c and 1 → b you will receive cb_62 as the shortened URL.

    http://shor.ty/cb
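The conversion steps above could look roughly like this in Java. This is only a sketch: the class name Base62Encode and its method names are invented for illustration, the alphabet order (a-z, A-Z, 0-9) is taken from the mapping above, and zero is handled separately because the division loop would otherwise produce no digits at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch; class and method names are hypothetical.
final class Base62Encode {
    static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Base-62 digits of a non-negative number, most significant digit first.
    static List<Integer> digits(long num) {
        List<Integer> out = new ArrayList<>();
        if (num == 0) {
            out.add(0); // the loop below would yield no digits for 0
            return out;
        }
        while (num > 0) {
            out.add((int) (num % 62)); // remainder = next digit, least significant first
            num /= 62;                 // integer division
        }
        Collections.reverse(out);
        return out;
    }

    // Maps each digit to its character in the alphabet.
    static String encode(long num) {
        StringBuilder sb = new StringBuilder();
        for (int digit : digits(num)) {
            sb.append(ALPHABET.charAt(digit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(digits(125)); // [2, 1]
        System.out.println(encode(125)); // cb
    }
}
```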

How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

  1. e9a_62 will be resolved to "4th, 61st, and 0th letter in alphabet".

    e9a_62 = [4,61,0] = 4×62^2 + 61×62^1 + 0×62^0 = 19158_10

  2. Now find your database-record with WHERE id = 19158 and do the redirect.
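The reverse lookup might be sketched in Java as below. The class name Base62Decode is hypothetical, and rejecting characters outside the alphabet with an exception is an assumption of this sketch, not something the answer prescribes.

```java
// Illustrative sketch; class and method names are hypothetical.
final class Base62Decode {
    static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Converts a base-62 string back to the numeric ID.
    static long decode(String shortCode) {
        long id = 0;
        for (char c : shortCode.toCharArray()) {
            int digit = ALPHABET.indexOf(c); // reverse lookup in the alphabet
            if (digit < 0) {
                throw new IllegalArgumentException("Invalid character: " + c);
            }
            id = id * 62 + digit; // shift previous digits up one place, add the new one
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(decode("e9a")); // 19158
        System.out.println(decode("cb"));  // 125
    }
}
```

Validating the characters before touching the database keeps malformed short codes from ever reaching the WHERE id = ... query.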

Some implementations (provided by commenters)

  • Ruby
  • Python
  • CoffeeScript
  • Haskell
  • Perl
  • C#
 
9 
Don't forget to sanitize the URLs for malicious JavaScript code! Remember that JavaScript can be base64 encoded in a URL, so just searching for 'javascript' isn't good enough. – Bjorn Tipling Apr 14 '09 at 8:05
2 
A function must be bijective (injective and surjective) to have an inverse. – Gumbo May 4 '10 at 20:28
21 
Food for thought, it might be useful to add a two character checksum to the url. That would prevent direct iteration of all the urls in your system. Something simple like f(checksum(id) % (62^2)) + f(id) = url_id – koblas Sep 4 '10 at 13:53
3 
As far as sanitizing the URLs goes, one of the problems you're going to face is spammers using your service to mask their URLs to avoid spam filters. You need to either limit the service to known good actors, or apply spam filtering to the long URLs. Otherwise you WILL be abused by spammers. – Edward Falk May 26 '13 at 15:34
26 
Base62 may be a bad choice because it has the potential to generate f* words (for example, 3792586=='F_ck' with u in the place of _). I would exclude some characters like u/U in order to minimize this. – Paulo Scardine Jun 28 '13 at 16:02
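One way to act on the last comment is to shrink the alphabet. The Java sketch below is an illustration under assumptions: it drops all vowels in both cases (a broader choice than the u/U mentioned in the comment), which leaves 52 characters, and the class name SafeAlphabet is invented. The conversion itself is the same as for base 62, only with a different base.

```java
// Illustrative sketch; class and method names are hypothetical.
final class SafeAlphabet {
    // Assumption: all vowels removed, leaving 21 + 21 letters and 10 digits.
    static final String ALPHABET =
        "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ0123456789";
    static final int BASE = ALPHABET.length(); // 52

    // Converts a non-negative ID to a string over the reduced alphabet.
    static String encode(long id) {
        if (id == 0) {
            return String.valueOf(ALPHABET.charAt(0));
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % BASE)));
            id /= BASE;
        }
        return sb.reverse().toString();
    }

    // Converts a string over the reduced alphabet back to the ID.
    static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Invalid character: " + c);
            }
            id = id * BASE + digit;
        }
        return id;
    }
}
```

Six characters over this alphabet still give 52^6 (about 19.8 billion) codes. Digits that read like letters (0, 1, 3) remain, so this reduces the problem rather than eliminating it.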
Answer (32 votes)

Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say your character space (A-Z, a-z, 0-9, etc.) has 40 characters: convert the id to a base-40 number and use the characters as the digits.

 
3 
Aside from the fact that A-Z, a-z and 0-9 = 62 chars, not 40, you are right on the mark. – Evan Teran Apr 12 '09 at 16:39
 
Thanks! Should I use the base-62 alphabet then?  en.wikipedia.org/wiki/Base_62 But how can I convert the ids to a base-62 number? – Marco W. Apr 12 '09 at 16:46
 
Using a base conversion algorithm, of course - en.wikipedia.org/wiki/Base_conversion#Change_of_radix – shoosh Apr 12 '09 at 16:48
 
Thank you! That's really simple. :) Do I have to do this until the dividend is 0? Will the dividend always be 0 at some point? – Marco W. Apr 12 '09 at 17:04
1 
With enough resources and time you can "browse" all the URLs of any URL shortening service. – shoosh Apr 12 '09 at 21:10
Answer (27 votes)

    public class UrlShortener {
        private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        private static final int    BASE     = ALPHABET.length();

        // Converts a non-negative ID to its base-62 string.
        public static String encode(int num) {
            if (num == 0) return ALPHABET.substring(0, 1); // the loop below would return "" for 0
            StringBuilder sb = new StringBuilder();
            while (num > 0) {
                sb.append(ALPHABET.charAt(num % BASE));
                num /= BASE;
            }
            return sb.reverse().toString();
        }

        // Converts a base-62 string back to the ID; assumes every character is in ALPHABET.
        public static int decode(String str) {
            int num = 0;
            for (int i = 0; i < str.length(); i++)
                num = num * BASE + ALPHABET.indexOf(str.charAt(i));
            return num;
        }
    }
 
Answer (21 votes)

Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, often hard to read (many fonts render 1 and l, 0 and O and other characters so similarly that they are nearly impossible to tell apart) and downright error-prone. Try to use lower or upper case only.

Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.
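To make the num-char-char-num-char-char idea concrete, here is a hedged Java sketch that treats the pattern as a mixed-radix number: digit positions have radix 10 and letter positions have radix 26. The class name PatternCode, the lowercase-only letters and the fixed six-character layout are assumptions made for this illustration.

```java
// Illustrative sketch; class and method names are hypothetical.
final class PatternCode {
    // true = digit position (radix 10), false = lowercase letter position (radix 26)
    static final boolean[] IS_DIGIT = {true, false, false, true, false, false};
    // Number of distinct codes: 10 * 26 * 26 * 10 * 26 * 26 = 45,697,600
    static final long CAPACITY = 10L * 26 * 26 * 10 * 26 * 26;

    // Converts an ID in [0, CAPACITY) to a fixed-length code such as "0aa1aa".
    static String encode(long id) {
        if (id < 0 || id >= CAPACITY) {
            throw new IllegalArgumentException("ID out of range: " + id);
        }
        char[] out = new char[IS_DIGIT.length];
        // Fill from the rightmost (least significant) position.
        for (int i = out.length - 1; i >= 0; i--) {
            int radix = IS_DIGIT[i] ? 10 : 26;
            int digit = (int) (id % radix);
            out[i] = (char) ((IS_DIGIT[i] ? '0' : 'a') + digit);
            id /= radix;
        }
        return new String(out);
    }

    // Converts a code in the fixed pattern back to the ID.
    static long decode(String code) {
        if (code.length() != IS_DIGIT.length) {
            throw new IllegalArgumentException("Wrong length: " + code);
        }
        long id = 0;
        for (int i = 0; i < code.length(); i++) {
            int radix = IS_DIGIT[i] ? 10 : 26;
            int digit = code.charAt(i) - (IS_DIGIT[i] ? '0' : 'a');
            if (digit < 0 || digit >= radix) {
                throw new IllegalArgumentException("Invalid character at position " + i);
            }
            id = id * radix + digit;
        }
        return id;
    }
}
```

With this layout, ID 0 becomes 0aa0aa and the largest ID, 45,697,599, becomes 9zz9zz. That is far fewer codes than the 62^6 of the original scheme, which is the trade-off the answer mentions.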

 
1 
Thank you, very good idea. I haven't thought about that yet. It's clear that it depends on the kind of use whether that makes sense or not. – Marco W. Apr 12 '09 at 18:22
11 
It won't be an issue if people are strictly copy-and-pasting the short urls. – Edward Falk May 26 '13 at 15:35
Answer (18 votes)

My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.
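In Java, base 36 needs no hand-written conversion, since the standard library supports radix 36 directly. A minimal sketch, assuming non-negative IDs and using a made-up class name:

```java
// Illustrative sketch; class and method names are hypothetical.
final class Base36Code {

    // Encodes a non-negative ID using the digits 0-9 and lowercase a-z.
    static String encode(long id) {
        return Long.toString(id, 36);
    }

    // Decodes a base-36 string; Long.parseLong accepts letters in either case.
    static long decode(String code) {
        return Long.parseLong(code, 36);
    }

    public static void main(String[] args) {
        System.out.println(encode(239472)); // 54s0
        System.out.println(decode("54S0")); // 239472
    }
}
```

Because decoding ignores case, a short URL that is read aloud or typed in capitals still resolves to the same ID.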

 
 
Thanks, you're right. Whether you have 2,176,782,336 possibilities or 56,800,235,584, it's the same: Both will be enough. So I will use base 36 encoding. – Marco W. Apr 14 '09 at 18:22
 
It may be obvious but here is some PHP code referenced in wikipedia to do base64 encode in php tonymarston.net/php-mysql/converter.html – Ryan White Jul 13 '10 at 15:33 