Marcin Okraszewski Tech Blog

Thursday, February 16, 2012

GZip will always compress text - no matter how random

I've recently been experimenting with the efficiency of gzip compression in various cases. One conclusion that might not be obvious: gzip will always compress text files, even if they are completely random.

The test that showed this was very simple: generate 1MB of random text and compress it. The generation looked exactly like this:

cat /dev/urandom | tr -cd [a-zA-Z0-9] | head -c $(( 1024 * 1024 ))

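The resulting file compressed to 769kB, which is 24.9% less. The range a-zA-Z0-9 covers 62 characters, which fit in 6 bits each, so 2 of the 8 bits in every byte are unused, and 2 / 8 is exactly 25%. (Strictly speaking, GNU tr treats the square brackets in the set above as literal characters, so the output uses 64 symbols, which is exactly 6 bits of information per character; for a pure 62-character alphabet the bound would be log2(62) ≈ 5.95 bits, or about 25.6%.) So with fully random characters about 25% is the maximum you can get, and gzip achieved 24.9%.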

For comparison, when I generated a random binary file, without filtering it through tr, it was not compressed at all. The output was actually 196 bytes larger.
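The experiment is easy to repeat. The following is a minimal sketch, not the exact commands used above: it assumes tr, head, gzip and mktemp are available, uses a strict 62-character alphabet (no brackets in the tr set), and writes to a temporary directory. The variable names are only illustrative.

```shell
size=$(( 1024 * 1024 ))
workdir=$(mktemp -d)

# 1MB of random text drawn from exactly 62 characters (assumption: no brackets).
LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c "$size" > "$workdir/random.txt"
# 1MB of unfiltered random bytes for comparison.
head -c "$size" /dev/urandom > "$workdir/random.bin"

# Compressed sizes in bytes, using gzip's default level.
text_gz=$(gzip -c "$workdir/random.txt" | wc -c)
bin_gz=$(gzip -c "$workdir/random.bin" | wc -c)

echo "random text:  $size -> $text_gz bytes ($(( text_gz * 100 / size ))% of original)"
echo "random bytes: $size -> $bin_gz bytes ($(( bin_gz * 100 / size ))% of original)"
```

On a typical run the text file should shrink to roughly three quarters of its size, while the binary file should come out slightly larger than the input.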

Wednesday, April 20, 2011

ADO and ADO.Net performance

It is said that ADO.Net performs much better than ADO, but I could not find any actual benchmarks. I did not run any either, but we recently migrated some of our code from ADO to ADO.Net. The code performs computations based on data stored in a database, with quite a lot of DB queries and updates. The migration cut the execution time of this part by half, so it was definitely worth doing. The program is written in VB.Net and compiled for .Net 4.0.

Monday, March 7, 2011

Embedding revision number in SVN link

I just needed to send a URL pointing to a specific revision of an item in Subversion that did not have a tag. After Googling a bit, it turns out that you can add a special path segment to the URL. Just add "!svn/bc/revision_number/" to the URL right after the repository root.

For instance, the URL of the current version of the Tomcat trunk is:

http://svn.apache.org/repos/asf/tomcat/trunk/

If you want to see how it looked in revision 1,000,000 then just use this link:

http://svn.apache.org/repos/asf/!svn/bc/1000000/tomcat/trunk/
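If you build such links often, the pattern is easy to script. This is only a sketch: the variable names are made up for illustration, and the values are the Tomcat example from above.

```shell
# Hypothetical variable names; the values come from the Tomcat example.
repo_root='http://svn.apache.org/repos/asf'
path_in_repo='tomcat/trunk/'
revision=1000000

# The single-quoted format string keeps "!" away from history expansion
# in an interactive bash session.
url=$(printf '%s/!svn/bc/%s/%s' "$repo_root" "$revision" "$path_in_repo")
echo "$url"
# prints http://svn.apache.org/repos/asf/!svn/bc/1000000/tomcat/trunk/
```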

Sunday, February 6, 2011

Backing up with rsync

Rsync is a very powerful copy / transfer / backup tool and probably one of the lesser-known *nix tools, but it is also available for Windows (with Cygwin). The main purpose of rsync is file synchronization between computers, and a backup is actually nothing more than that. Rsync is included by default in virtually every Linux distribution, but sadly not so often on low-end NAS devices.

What makes rsync so appealing is that, apart from a very efficient transfer mechanism (only modified files are transferred), it can make full snapshots of a backup while hard-linking unmodified files to the previous backup. A hard link is simply an additional reference to the same file on disk, so the space is allocated just once. You see the files as normal and can delete them separately, but the disk space is freed only once you delete the last hard link. Note that hard links are not available on FAT, but they are available on NTFS and on any Linux/Unix file system (ext2, etc.). As a result you have a set of full snapshots of the backed-up directories, but only new or modified files take up space. You can delete snapshots separately to free up space, even from the middle of the chain, without any harm to the other snapshots. This layout is also very convenient to browse when you need to recover something.
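The hard link behaviour described above can be tried out with a few commands. This is a small illustrative sketch in a temporary directory (the file names are arbitrary), assuming a file system that supports hard links:

```shell
workdir=$(mktemp -d)
echo "some data" > "$workdir/original.txt"

# Create a second directory entry for the same file (no data is copied).
ln "$workdir/original.txt" "$workdir/link.txt"

# -ef is true when both names refer to the same file on disk.
[ "$workdir/original.txt" -ef "$workdir/link.txt" ] && echo "same file"

# Removing one name does not free the data while another link exists.
rm "$workdir/original.txt"
cat "$workdir/link.txt"    # prints: some data
```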

Here is an example of how such a set of snapshots looks:

# du -sh *
130G    2010-02-10_19-30-01
 16K    2010-02-11_19-30-02
 16K    2010-02-12_19-30-01
 16K    2010-02-13_19-30-01
849M    2010-02-14_19-30-02
 16K    2010-02-15_19-30-01

As you can see, the initial backup here was 130GB. Then nothing changed for several days, which results in only 16kB per backup (virtually no space), and then there was an update on February 14th that took 849MB. Under each folder there is a full file structure.
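Snapshots like these can be produced with rsync's --link-dest option, which hard-links files that are unchanged relative to a reference directory. The function below is a minimal sketch of that idea under a few assumptions: the destination is a local directory, snapshot names sort chronologically, and the function name and arguments are hypothetical. It is not the script described in this post.

```shell
# Hypothetical helper illustrating --link-dest; not the author's script.
# usage: snapshot SRC_DIR DEST_ROOT [SNAPSHOT_NAME]   (DEST_ROOT must be local)
# The body runs in a subshell so its variables do not leak to the caller.
snapshot() (
    src=$1
    dest_root=$2
    name=${3:-$(date +%Y-%m-%d_%H-%M-%S)}

    mkdir -p "$dest_root"

    # Find the newest existing snapshot: globs expand in sorted order,
    # and timestamp names sort chronologically.
    prev=
    for dir in "$dest_root"/*/; do
        if [ -d "$dir" ]; then
            prev=$(basename "$dir")
        fi
    done

    if [ -n "$prev" ]; then
        # A relative --link-dest is resolved against the destination directory.
        rsync -a --link-dest="../$prev" "$src/" "$dest_root/$name/"
    else
        # First run: nothing to link against, so make a plain full copy.
        rsync -a "$src/" "$dest_root/$name/"
    fi
)
```

Calling it as, for example, `snapshot "$HOME" /mnt/backup/home` (a made-up path) once a day would build up one dated directory per run, with unchanged files shared between them.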

I've created a simple bash script which does that magic with rsync. On Windows again, you can use it with Cygwin. You can grab it here.

The first parameter is the source and the second is the destination, both in the format accepted by rsync. For local directories you just give the path; for remote ones you give [user@]server_host:dir_on_server if you connect over SSH, or [user@]server_host::dir_on_server (note the double colon) if you connect over the rsync protocol. If you use SSH, key-based authentication is very convenient here. The script accepts two more optional arguments: additional options to pass to rsync (quote them if they contain spaces) and the minimum interval between backups in hours.

Some examples.
1. To back up your home directory:

./backup_rsync.sh $HOME server:/path/to/backup

2. To back up directory d:\documents, run in Cygwin prompt:

./backup_rsync.sh /cygdrive/d/documents server:/path/to/backup

Note 1. If you back up more than one directory, you need a separate directory on the server for each of them.

Note 2. To run scripts in Cygwin from the scheduler or at startup, you need to pass the script as a parameter to bash, like this:

c:\cygwin\bin\bash.exe -l /cygdrive/c/path/to/script.sh

I'm using it to back up my family's computers, and at work to back up some servers and the source machines for linked clones on a VMware ESX server. Everything has been working perfectly for quite a long time.

Friday, February 4, 2011

Doing XSLT transformations with non-standard msxsl:script elements.

I just needed to transform XML using an XSLT stylesheet that included the non-standard msxsl:script element. Even though my application runs on .Net, the framework reported that it does not support it. Namely, it threw the following exception:

System.Xml.Xsl.XsltException: Scripting language 'VBScript' is not supported.

In my case I needed to transform a file, and the result was a file as well. So the simplest solution was to use the command-line msxsl tool and call it from my application. The syntax is straightforward:

msxsl in.xml transformation.xsl -o out.xml

Good enough for a prototype that I was doing. Unfortunately for Windows only :(

Saturday, January 29, 2011

How to set up a browser to automatically switch proxy

I'm a notebook user, and in one of my locations I need to use an HTTP proxy to access the Internet. It is obviously tedious to switch it manually every time you change networks, as you usually need to dig deep into the options to do it. For Firefox there are fortunately some plugins which simplify that process, but a manual step is still needed. I suppose many people suffer from this, and the existence of those Firefox plugins proves it.

But browsers (at least Firefox, IE and Opera) support proxy auto-configuration (PAC) scripts. A PAC script is simply a JavaScript function that lets you select proxy settings depending on, e.g., the host being accessed or your IP address. And here is the key: if your network with the proxy has a unique address space, you can use it to choose between the proxy and a direct connection.

To do that, just create a file like this (it typically has a .pac extension):

function FindProxyForURL(url, host) {
    if (isInNet(myIpAddress(), "proxy_network_address", "proxy_network_mask"))
        return "PROXY proxy_host:proxy_port;";
    else
        return "DIRECT";
}

Where:

proxy_network_address is the address of the network where you need to use the proxy
proxy_network_mask is the mask of that network
proxy_host is the address of the proxy to be used
proxy_port is the port of the proxy to be used

For instance, if you are in network 192.168.0.0 with mask 255.255.255.0, which means your IP address is in the range 192.168.0.1 - 192.168.0.254, and your proxy host is 192.168.0.10 on port 80, the file would look like this:

function FindProxyForURL(url, host) {
    if (isInNet(myIpAddress(), "192.168.0.0", "255.255.255.0"))
        return "PROXY 192.168.0.10:80;";
    else
        return "DIRECT";
}
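The isInNet check boils down to a bitwise AND of the address with the mask, compared against the network address. The shell functions below only illustrate that arithmetic outside the browser; they are not PAC code, and the names ip_to_int and is_in_net are made up for this sketch.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeeds when ADDRESS lies in the network given by NETWORK and MASK.
# usage: is_in_net ADDRESS NETWORK MASK
is_in_net() (
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( ip & mask )) -eq $(( net & mask )) ]
)

is_in_net 192.168.0.42 192.168.0.0 255.255.255.0 && echo "use proxy"
is_in_net 10.0.0.5 192.168.0.0 255.255.255.0 || echo "direct"
```

With the mask 255.255.255.0 only the first three octets are compared, which is why every address from 192.168.0.1 to 192.168.0.254 matches.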

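You can return more than one entry, separated by semicolons, as a fallback; for example "PROXY 192.168.0.10:80; DIRECT" tries the proxy first and falls back to a direct connection.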

Now you just need to point the proxy settings to that file. In Firefox go to Options -> Advanced -> Network -> Settings in the "Connection" group and set "URL address for automatic configuration". Note that it is expected to be a URL, so you need to use the "file://" format. The easiest way to get the address is to open the file in the browser (File -> Open or Ctrl-O) and copy the address.

And that is all. No more proxy switching, assuming you do not use another network with the same address space. There are also other checks available (e.g. whether some host name can be resolved), but they would delay loading when the name cannot be resolved (timeouts).

Note also that myIpAddress() returns just one address, which might be a problem if your machine has multiple addresses. Firefox also has some bugs that make it return a loopback address in some cases.

Obviously you may need more advanced rules, especially within the proxy zone (e.g. to provide direct access to intranet sites). There are a number of options to do that. Some additional information for PAC can be found here:

Let's start

Finally I am starting my blog. I mainly plan to archive here some tips and small how-tos that required a bit of knowledge acquisition. I hope I will not be the only one to benefit from that work.