Welcome to the VirtuQ Forums.

basant

Art of Programming: Data Structures

Rating: 3 votes, 4.67 average.

Informally, data structure (DSTR) is the way in which data in the program is organized and manipulated. The latter happens through various operations which are supported by the data structure. Examples of these operations are making new objects in DSTR, removing existing objects, searching for a particular object, sorting based on some criteria etc.

As an example, suppose, we have to create and manipulate a student database. Naive approach to create this database could be to create a record or structure for each student and put them one after another through a link. This structure is called a linked list. We will have to support several operations on this DSTR such as insert a student entry, delete an entry, search details of a student based on his name, sort all the entries based on marks etc.

As another example, consider the problem of mapping software onto hardware i.e. convert the program written in a high level language such as C/C++ in the executable form which hardware can understand and run. This conversion is done by a software tool called compiler. When a C/C++ (or any other language) compiler first reads the program, it internally converts the program in the form of connected nodes. These nodes are program elements such as a loop, an assignment, an arithmetic expression etc. This data structure is called a parse-tree. This form is also known as Intermediate Format. Examples of such formats (or data structures) are SUIF (Standford University Intermediate Format), LLVM (Low Level Virtual Machine) etc. Compiler has to perform several basic operations such as going over various nodes in the parse-tree (also called traversal), search and convert high level nodes such as conditional statements into low level structures etc. Using these operations, the high level parse-tree can be systematically converted to machine or assembly format which eventually gets converted to executable format before it can be run on the hardware.

When we closely look at these programming problems from a real life perspective, it is not hard to see that it is nothing but first deciding how to organize the data and then working details of operations to access and manipulate the data.


Programming should start with designing data structure

A program is all about processing data. A program will get inputs in some form, process it once or many times on different data sets and output in some form. Inputs could be from a file, an external database, network stream, interactive session with the user etc. Processing could be some analysis, changes in the database, new entries etc. and output could be writing back to files, display in the GUI, an output stream in the network etc. Since all we are talking about is data representation and processing, working out details of the data structures must be the first thing in the program design. Writing code makes sense only after answering the following questions.
  1. In what form input is coming
  2. What manipulation and analysis is required on the inputs.
  3. How to store the inputs.
  4. What is the basic structure of the data.
  5. What are the operations needed on the data structure.
  6. Can I make my operations more efficient by refining my data structure. This refinement could be in the form of additional book-keeping, using structure A instead of B based on the information of frequency of various operations etc.
  7. Which language I should be using to implement the program. There could various factors affecting the choice of language such as prototype quality, production quality, performance, memory usage, platform (desktop, tablet, mobile etc.), portability, error handling, reusability, testability etc. Typically, in production environment, language may be pre-decided.
  8. Can I effectively test my data structure.
  9. Can I make use of any existing library.
Once the above questions are answered, writing code to implement the program is going to be straightforward because now we just have to implement the data structure piece by piece.

Choice of data structure can make or break the program

We discussed various questions around data structures which need to be answered before moving onto writing code. Let us now go through the process of refining the data structure of a problem to better appreciate it.

Student database problem
Let us consider the student database problem. We need to create the student database of a college. This college has around 5000 students. Each student entry has his name, address, class, expected graduation year and current marks. The database will be stored in a file. Further we should have the following operations on this database.

  • Insert a new entry
  • Delete entries of graduated students
  • Search a student details based on his name
  • Search details of all students at a particular marks
  • Search details of all students in a particular class
  • Sort students of a class based on marks
Solution1: Array based database
Suppose we are writing the program in C++. The naive approach to implement the above could be to create a structure/class for a student entry and store all the entries in a big array. Let us see what difficulties we might face.

  • Number of students is not fixed. Suppose we allocate an array assuming an upper bound of 10000 entries. If the number of students does not increase much, we might end up wasting lots of memory.
  • Insertion is straightforward as we just have to increase the count and insert the entry in the array. However, as we keep on deleting the entries, we will get many holes in the array. To counter this problem, we need to keep track of valid entries, otherwise we will end up overrunning the entire array.
Solution2: List based database
Because of the above problems, array will be pretty bad choice as the program data structure. So let us refine this and choose a data structure where memory is not wasted and it is allocated on demand basis. Further to support the search operations, the entire database should be properly linked. This naturally leads to linked list.

  • Insert can be done at the end of the list in a fixed number of steps. This kind of access is also called constant time access. List can grow/shrink naturally.
  • To delete entry of a particular student, we first have to find it. Suppose this entry is somewhere towards the end of the list, then we may have to go over almost all the entries before we find corresponding students details. This kind of access is also called linear access as in the worst case, number of entries touched could be almost same as total number of entries.
  • Further, each search is linear time.
  • Sorting can involve number of steps of the order of n2 or nlog2n where n is the total number of entries.
Solution3: Combining list with maps
Typically search is the most frequently done operation. Insertion and Deletion is going to be just one time. If we use a list, searches are going to be linear and quite slow. How can we make it faster? We can do so, by doing some additional book-keeping through maps and hash_maps. As of now, we need not worry about the full details of map or hash_map. For explanation, we can just assume the access properties of these data structures.

  • Maintain a name based map. Values in this map will be student entries. Now name based search will be of the order of log2n time. This is faster than linear access.
  • Maintain a hash_map for marks to student entries. This will make marks based search constant time.
  • Maintain a hash_map for class to student entries. This will again make class based searches in constant time.
There is some overhead for the above refinements. Insertion and Deletion will be slightly slower. More memory will be required for the maps. This will make memory needed almost two times more that of bare minimum linked list. If we can afford these overheads, we should choose the last data structure.

What we have seen through the above example is how data structures for a program can be systematically refined. As we go through this process, we will notice that some data structures will not work and some will not be efficient. Hence proper efforts need to be put to design good data structure for the program as it takes the program long way. Actual implementation should be done only after this stage.

How to program with data structures

We can take the following approach.
  1. Work out details of the data structure as discussed above
  2. Choose the language for implementation based on various factors as outlined earlier
  3. Work out details of how the input and output will happen
  4. Write code piece by piece. Also test these pieces in the incremental manner. These small tests are called unittests.
  5. Perform flow tests where the entire program works together.
How can VirtuQTM help

VirtuQTM offers courses on Practical Data Structures and Algorithms. The focus of these courses is to take you through programming experience of various commonly used data structures and algorithms. After going through these courses, in a short period of time, you will be able to make a decision about the right data structure for your programming problems and make needed trade-offs. Apart from helping you make decisions, our exercises will provide you much needed programming exposure and help you avoid common pitfalls.
So join us now for a software architecture voyage.

Comments