View on GitHub

Implementation of SA-IS Suffix Array Construction Algorithm

Download this project as a .zip file Download this project as a tar.gz file

Abstract

Suffix arrays are considered to be fundamental data structures in applications such as data indexing , retrieving , storing and processing. They are particularly used within the field of bioinformatics to index large strings of genome sequences and querying the same. Suffix arrays allow for improved space requirements , easier linear time construction algorithms and improved cache locality.

There are many SACA implementations like Difference Cover, Deep Shallowing, Ko Aluru, SA-IS. According to the paper Linear Suffix Array Construction by Almost Pure Induced-Sorting , SA-IS is found to be the most efficient in terms of time and space.In the KA (Ko Aluru) algorithm sorting the variable size S or L-substrings becomes the bottleneck. The SA-IS tries to solve this problem by using a new induced sorting method to sort the variable size LMS strings and thus being fast and light weight.

In this project , I will write an implementation for the SA-IS Algorithm described in the papers Linear Suffix Array Construction by Induced Sorting and Linear Suffix Array Construction by Almost Pure Induced-Sorting

I will also perform comparison of memory and time analysis of the implementation with other SACA (Suffix Array Construction Algorithms) implementations :: Difference Cover Algorithm, Deep Shallowing Algorithm , Ko Aluru Algorithm to confirm the experimental results mentioned in the paper.

Plan of Action

What will I implement ?

I will write an implementation for the SA-IS Algorithm described in the papers http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4976463 (Linear Suffix Array Construction by Induced Sorting) and http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5582081 The implementation will be done in Java.

What methods will I compare and how will I get them ?

I will compare the following methods with SA-IS method

Which datasets will I use ?

I will conduct tests on files obtained from the following dataset:

Manzinis Lightweight Corpus

What kind of experiments will I run and what will be measured?

I will run the above algorithms for long string sequences (datasets) and will measure the time and memory consumed for the different algorithms.

Papers to Read

Support or Contact

For additional information please contact Anirudh Subramanian (@anirudh2290) UFID(94453124).