You will provide four implementations of the Dynamic Set ADT (see below, modified from Topic 4 Notes). You will then analyze their expected performance (this has been largely done in the lectures and textbook), and then test them empirically to see how your test data compares to expected performance. You will write a report on the results. The project is worth 60 points.
You may either use open source implementations of the following data structures (see the Assignments Page Section “Including Open Source Software” for requirements), or implement them yourself:
There are several advantages to implementing at least some of these yourself:
You will then write your own code, including:
We will provide you with data in the form of files of keywords (automatically generated names). There will be ten data files, three unsorted and three sorted of size 100, 1000, 10,000, 100,000 and 1,000,000. (See discussion below concerning the largest file.) We may use other data files in our testing, so make your code robust!
Your accompanying report will provide the minimum, average and
maximum times in nanoseconds required for: insert
, retrieve
,
successor
, and predecessor
, and the actual time for minimum
and
maximum
, for each of the input files. You will compare these results to the
theoretical analyses for the algorithms. While delete
is not included in the
timing tests, we will test that deletion works.
Details follow. Questions and requests for clarification are welcome and should be submitted early.
Note that this ADT has been changed: it is not the same as the one in the lecture notes, and it is not the same as the ones used in my 2012 or 2013 classes. Also, it was changed to use generics.
1 package ics311;
2 import java.util.Map.Entry;
3
4
5 /**
6 * This interface is to be used for all of the Dynamic Sets you create for this assignment.
7 *
8 * @author Robert Ward
9 * @version 1.0
10 * @since 2014-02-01
11 */
12 public interface DynamicSet<Key extends Comparable<Key> , Value> {
13
14 /**
15 * Returns the name of Data Structure this set uses.
16 *
17 * @return the name of Data Structure this set uses.
18 */
19 public String setDataStructure();
20
21 /**
22 * Returns the number of key-value mappings in this set.
23 * This method returns zero if the set is empty
24 *
25 * @return the number of key-value mappings in this set.
26 */
27 public int size();
28
29 /**
30 * Associates the specified value with the specified key in this map.
31 * If the map previously contained a mapping for the key, the old value is replaced.
32 * If there is no current mapping for this key return null,
33 * otherwise the previous value associated with key.
34 *
35 * @param key - key with which the specified value is to be associated
36 * @param value - value to be associated with the specified key
37 *
38 * @return the previous value associated with key, or null if there was no mapping for key.
39 * (A null return can also indicate that the map previously associated null with key.)
40 */
41 public Value insert(Key key, Value value);
42
43 /**
44 * Removes the mapping for this key from this set if present.
45 *
46 * @param key - key for which mapping should be removed
47 *
48 * @return the previous value associated with key, or null if there was no mapping for key.
49 * (A null return can also indicate that the map previously associated null with key.)
50 */
51 public Value delete(Key key);
52
53
54 /**
55 * Returns the value to which the specified key is mapped, or null if this set contains no
56 * mapping for the key.
57 *
58 * @param key The key under which this data is stored.
59 *
60 * @return the Value of element stored in the set under the Key key.
61 */
62 public Value retrieve(Key key);
63
64
65 /**
66 * Returns a key-value mapping associated with the least key in this map, or null if the set
67 * is empty.
68 * IMPORTANT: This operation only applies when there is a total ordering on the Key
69 * Returns null if the set is empty. If there is not total ordering on the Key returns null.
70 *
71 * @return an entry with the least key, or null if this map is empty
72 */
73 public Entry<Key, Value> minimum( );
74
75
76 /**
77 * Returns a key-value mapping associated with the greatest key in this map, or null if the
78 * map is empty.
79 * IMPORTANT: This operation only applies when there is a total ordering on the Key
80 * Returns null if the set is empty. If there is not total ordering on the key returns null.
81 *
82 * @return an entry with the greatest key, or null if this map is empty.
83 */
84 public Entry<Key, Value> maximum( );
85
86
87 /**
88 * Returns a key-value mapping associated with the least key strictly greater than the given
89 * key, or null if there is no such key.
90 * IMPORTANT: This operation only applies when there is a total ordering on the key
91 * Returns null if the set is empty or the key is not found.
92 * Returns null if the key is the maximum element.
93 * If there is not total ordering on the key for the set returns null.
94 * @param key - the key
95 *
96 * @return an entry with the greatest key less than key, or null if there is no such key
97 */
98 public Entry<Key, Value> successor(Key key);
99
100 /**
101 * Returns a key-value mapping associated with the greatest key strictly less than the given
102 * key, or null if there is no such key.
103 * IMPORTANT: This operation only applies when there is a total ordering on the key
104 * Returns null if the set is empty or the key is not found.
105 * Returns null if the key is the minimum element.
106 * If there is not total ordering on the key for the set returns null.
107 * @param key - the key
108 *
109 * @return an entry with the greatest key less than key, or null if there is no such key
110 */
111 public Entry<Key, Value> predecessor(Key key);
112
113 }
If you wish, to simplify things you may change KeyType to String. The data
Object d
stored under a key is not important for our testing: just give it
the key k
.
In a real implementation, returning a key in minimum
, maximum
, successor
and predecessor
would be inefficient if the client then has to use
retrieve
to get the data. This could be solved by using the position
abstraction, or by returning multiple values (key + data). But for our
purposes we only need the key to test runtime.
You will have four classes implementing this interface, named to reflect the implementation:
DLLDynamicSet
: Doubly Linked List implementation, using OSSSkipListDynamicSet
: Skip List implementation, using OSSBSTDynamicSet
: Binary Search Tree implementation, using OSSRedBlackDynamicSet
: Red Black Tree implementation, using Java TreeMapTo implement the ADT you need to write the methods that are specified by the
interface but map to the underlying implementation. As a simple example, in
RedBlackDynamicSet the interface method retrieve (KeyType k)
will be
implemented by calling TreeMap’s get(Object key)
. This is trivial code, but
it accomplishes an important objective: hiding the details of the
implementation behind the interface.
We suggest that you implement RedBlackDynamicSet
first and get the rest
of the program running on it, as you already have this implementation. Then
either write or look for OSS of the other three and add them to the testing.
In general, you are free to make your own implementation decisions and use the best practices you are capable of. This includes whether or not you use Comparable or generics, whether you use locator abstractions (such as Goodrich & Tamassia’s “position”) rather than returning pointers to data stored, how much error checking is done, etc. The important thing is that you have implemented the set operations in a general way, and in particular can run the tests.
We suggest that you organize your code in packages, but this is not required.
Files including from 100 to 1,000,000 keys, sorted and unsorted, are available in this directory. Each file has one name on each line. There are no quotations or other delimiters other than newline. For example:
Hugh Moreno
Traci Obrien
Doug Moore
Sammy West
Anne Reid
Deanna Zimmerman
Marcus Waters
Clyde Walton
Matthew Rios
Jacqueline Robertson
...
Your program will take the input file name as an argument at the command line (i.e., read into String args). It will read the file into an array until end of file is reached. We need to read into an array because we will be randomly selecting elements for test runs. You do not need to test the input for any properties: just read each line from the file as if it is a string.
Your program will read an argument from the command line: the name of the file to be loaded in this test run.
Read the string keys in the specified file into an array. Count the keys as you read them so you know what your n is. You will then use this array to drive a test of each of the Dynamic Set implementations, as discused in the next section.
Screencasts are available showing how to read in data from a file if you have forgotten.
Each runtest
run will use System.nanoTime()
(we have found that
System.currentTimeMillis()
is not precise enough for O(lg n) algorithms) to
time the following operations and report minimum, average and
maximum times required:
insert(k):
min, avg and max across all keys. (You have all the keys in the array. Insert all of them into the data structure, timing the time it takes for each insertion and keeping track of min, max and sum of times as you go so you can compute average at the end.)
retrieve(k):
Use the Random
class to produce 10 integers in the range of the data array. Then search for the 10 keys (name strings) indexed by these random numbers, timing each search. Report min, avg and max across these 10 searches.
successor(k):
run 10 trials on the same keys you used in the above retrieval tests. Report min, avg and max.
predecessor(k):
run 10 trials on the same keys you used in the above retrieval tests. Report min, avg and max.
minimum:
Report time to run one call (since there is only one minimum).
maximum:
Report time to run one call (since there is only one maximum).
delete(k):
Last, delete the 10 randomly chosen keys you used above, timing the deletion. Report min, avg and max.
(_Note: _ 10 has been specified as the number of trials because the smallest data set has only 100 elements. 10 trials will give you meaningful results, but in real world testing you’d run many more trials for more reliable and convincing averages. Compared to the cost of programming this test, computer trials are cheap! One approach is to use a percentage, e.g., 10% of the keys would give you 100 trials on the data set of 1000.)
If you run out of memory, increase the heap size available to Java using the -Xmx argument (or set the corresponding preference in your IDE). Try 1024M, or higher if you have the disk space to spare. (For one of my Java-based research projects, I set this value to 32G, which is twice the physical memory on my 16G machine: Java will use virtual memory.)
Also try loading the data directly into one Dynamic Set implementation at a time (skipping the array and the interactive interface), and then destroying all references to each dynamic set after you test it so it can be garbage collected before loading into the next one from the file.
If you are unable to run the test file of one million, turn in up to 100,000 along with an explanation.
Each runtest
run will generate an output table as follows (this layout is
designed to enable you to print the table out incrementally as you go).
Size: N
--------------------------------------------------------------------------------------
| Insert | Retrieve | Pred | Succ | Min | Max | Delete |
--------------------------------------------------------------------------------------
Linked List | | | | | | | |
--------------------------------------------------------------------------------------
Skip List | | | | | | | |
--------------------------------------------------------------------------------------
Binary Tree | | | | | | | |
--------------------------------------------------------------------------------------
RB Tree | | | | | | | |
--------------------------------------------------------------------------------------
The entry in each cell of this table will be of form
-------------------
| min / avg / max |
-------------------
(except minimum
and maximum
), each being in nanoseconds. (Resize the table
as needed to fit.)
You will need to run the program once for each data set: 100, 1000, 10,000, 100,0000, and, if possible, 1,000,000; each sorted and unsorted. Thus you will have at least 8 runs (8 tables) if you go up to 100,000, or 10 runs (10 tables) if you are able to include the 1,000,000. Your reports will be based on these tests, in comparison to the Θ, big-O, and Ω analyses for the algorithms.
The TA may use other data sets, which may include reverse sorted items, duplicates, etc.
Requirements stated on the Assignments Page are included in the requirements for this project. Read that first. Further comments are below.
The Readme file should be clear and enable the TA to compile and run your program.
For this assignment, the Operation manual is not required _ as long as you put the instructions on how to run the program in your Readme.txt._ The Reference manual can be brief, as the program as a whole is not intended to be distributed for general use. Describe the organization of your program and the source of the implemenatations: anything that you as a programmer maintaining the code would want to know.
The Testing Document will include a summary of your empirical test results and a discussion of how these results compare to the asymptotic analyses of the underlying data structures.
Put the tables output by your program in the appendix of your testing document. Then, in the beginning of your testing document discuss whether the times you got fit the theoretical analyses for the various data structures. Did implementations behave as expected as n gets bigger? Does the different between sorted and unsorted data make sense? How do they compare to each other?
This document is 30% of your grade: be thorough (showing you have thought carefully about the results in relation to the theoretical analyses and your implementation), yet concise (don’t put in a lot of blather to make it look bigger).
Dan Suthers Last modified: Tue Mar 11 03:27:03 HST 2014