Abstract
Suffix arrays are considered to be fundamental data structures in applications such as data indexing , retrieving , storing and processing. They are particularly used within the field of bioinformatics to index large strings of genome sequences and querying the same. Suffix arrays allow for improved space requirements , easier linear time construction algorithms and improved cache locality.
There are many SACA implementations like Difference Cover, Deep Shallowing, Ko Aluru, SA-IS. According to the paper Linear Suffix Array Construction by Almost Pure Induced-Sorting , SA-IS is found to be the most efficient in terms of time and space.In the KA (Ko Aluru) algorithm sorting the variable size S or L-substrings becomes the bottleneck. The SA-IS tries to solve this problem by using a new induced sorting method to sort the variable size LMS strings and thus being fast and light weight.
In this project , I will write an implementation for the SA-IS Algorithm described in the papers Linear Suffix Array Construction by Induced Sorting and Linear Suffix Array Construction by Almost Pure Induced-Sorting
I will also perform comparison of memory and time analysis of the implementation with other SACA (Suffix Array Construction Algorithms) implementations :: Difference Cover Algorithm, Deep Shallowing Algorithm , Ko Aluru Algorithm to confirm the experimental results mentioned in the paper.
Plan of Action
What will I implement ?
I will write an implementation for the SA-IS Algorithm described in the papers http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4976463 (Linear Suffix Array Construction by Induced Sorting) and http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5582081 The implementation will be done in Java.
What methods will I compare and how will I get them ?
I will compare the following methods with SA-IS method
Which datasets will I use ?
I will conduct tests on files obtained from the following dataset:
What kind of experiments will I run and what will be measured?
I will run the above algorithms for long string sequences (datasets) and will measure the time and memory consumed for the different algorithms.
Papers to Read
- Two Efficient Algorithms for Linear Time Suffix Construction
- Linear Suffix Array Construction by Almost Pure Induced Sorting
- Taxonomy of Suffix Array Construction Algorithms
- Space Efficient Linear Time Construction of Suffix Arrays
Support or Contact
For additional information please contact Anirudh Subramanian (@anirudh2290) UFID(94453124).