Assignment 1: Querying Large Files

Learning Objectives

Overview

Suppose your laptop only has 4GB of memory and you would like to query (i.e. search) an 20GB Comma-Separated Values (CSV) file. How do you write such a program ?

The objective of this assignment is to help you understand the issues involved in querying large data sets that are too large to fit in memory in its entirety. To investigate those issues, you will write a java program that will read a table of data in the form of a CSV file and run queries on the table as efficiently as possible.

You will be given a template of the program and you should add your code mostly to the Assignment1.java file (even though you will need to make minor changes/additions to Driver.java as well). A driver program, Driver.java is provided so that you can test your program. The driver program takes as input a file which contains a list of commands to be interpreted and executed by the program. You will be implementing several versions of the program in a guided fashion. You must assume that the data may not fit in memory, i.e., you will not be able to read all the data into an in-memory java data structure.

In all versions, the basic sequence of commands begins by loading the data, followed by a series of queries which are either equality queries or range queries. You may assume that the input is correct and well-behaved, i.e., the goal of this assignment is not error-handling.

A template of the java program is given to you here: http://www2.hawaii.edu/~lipyeow/ics321/core/ics321a1.zip

A test data set is also given to you here: http://www2.hawaii.edu/~lipyeow/ics321/core/orders16mb.zip

The recommended Java IDE for this class is Netbeans (Java EE edition).

Task A (15 pts)

In the first version, you will implement the simplest and most naive solution. The list of commands supported by your java program must include the following:

  1. naiveLoad filename : tells the program that the following queries will be for the csv file with filename

  2. naiveSearchEq columnNum value : prints the rows of the table where the value in column number columnNum is equal to the given value. Column numbers start from one.

  3. naiveSearchGtr columnNum value : prints the rows of the table where the value in column number columnNum is greater than the given value.

The search commands MUST be implemented by reading the CSV file character by character using the java class FileReader. You should read the Java tutorial on character streams. You MUST use the FileReader class.

Task B (15 pts)

In the second version, you will improve upon the first version by using buffered IO. Write a second version of the search commands using the BufferedReader class. You should read the Java tutorial on buffered streams. Name the commands and corresponding methods as follows:

  1. naiveBufSearchEq columnNum value: prints the rows of the table where the value in column number columnNum is equal to the given value. Column numbers start from one.

  2. naiveBufSearchGtr columnNum value: prints the rows of the table where the value in column number columnNum is greater than the given value.

Task C (50 pts)

In the third version, you will take a different approach to the problem. You will first load the CSV data file and transform it into a BINARY file. You MUST name your binary file “data.bin”. Subsequent queries will then operate on the binary file. You could use the Java Data Streams and Object Streams classes, but you are free to design the format of the binary file (compression libraries are allowed!). Name the commands and corresponding methods as follows:

  1. binaryLoad filename : transform the CSV file with filename into a binary file. The filename of the binary file should be stored in your program.

  2. binarySearchEq columnNum value: prints the rows of the table where the value in column number columnNum is equal to the given value. Column numbers start from one.

  3. binarySearchGtr columnNum value: prints the rows of the table where the value in column number columnNum is greater than the given value.

If you do not know what the difference between a binary and an ASCII file, you might want to take a look at this article.

Task D (20 pts)

Take timings of version 1 (task A), 2 (task B), and 3 (Task C) of your program and compare the running times. You should average the timings over at least 10 runs.

In the inline submission on laulima, answer the following questions: 1. Tabulate the average running time of the three versions of your program. 2. Compare the running times of the three versions. 3. How are the timings of the different versions different? 4. Why are the timings of the different versions different ? 5. What did you learn in this assignment? What was most difficult/challenging (if any)?

Additional Instructions

You should use the -Xmx option of the java jvm to simulate a memory size smaller than the actual input data file size.

This is an individual assignment and you may NOT work in groups. General discussions with peers are permitted and encouraged, but you must design and write your own code.

You are NOT allowed to use any database software, database libraries for this assignment. Some third party libraries may be allowed subject to the instructor’s approval.

An extra credit of 10 points will be awarded to students who submit a fully functional code for Task A and Task B one week after the assignment is given. A fully functional program should compile successfully and run correctly without errors.

A small sample data file is provided with the java code template. A larger 16 MB data file is also provided for further testing.

Submission Procedure

Please zip up the directory and submit to Laulima->Assignments.