Unrolled inner loops and used GCC vector arithmetic.
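
For illustration only, here is a minimal sketch of what this kind of change can look like. It is not the code from this change: the `dot` kernel, the `v4sf` type name, and the unroll factor of two are all assumptions. It uses the GCC vector extension (`__attribute__((vector_size))`), which allows ordinary `+` and `*` on whole vectors.

```c
#include <stddef.h>
#include <string.h>

/* GCC vector extension: four floats packed in one 16-byte vector.
   The name v4sf is an arbitrary choice for this sketch. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical kernel (not from the original change):
   dot product of a and b, each of length n. */
float dot(const float *a, const float *b, size_t n)
{
    v4sf acc0 = {0, 0, 0, 0};
    v4sf acc1 = {0, 0, 0, 0};
    size_t i = 0;

    /* Inner loop unrolled by two vectors (eight floats) per iteration,
       with two independent accumulators. */
    for (; i + 8 <= n; i += 8) {
        v4sf a0, a1, b0, b1;
        /* memcpy avoids alignment and aliasing assumptions. */
        memcpy(&a0, a + i, sizeof a0);
        memcpy(&a1, a + i + 4, sizeof a1);
        memcpy(&b0, b + i, sizeof b0);
        memcpy(&b1, b + i + 4, sizeof b1);
        acc0 += a0 * b0; /* element-wise vector arithmetic */
        acc1 += a1 * b1;
    }

    /* Combine the two accumulators, then reduce the four lanes. */
    v4sf acc = acc0 + acc1;
    float sum = acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Note that the vectorized version adds the products in a different order than a plain scalar loop, so floating-point results may differ slightly in the last bits.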