Assignment 1: Querying Large Files

Learning Objectives

The Introduction

What is a file? At the most basic level a file is just a collection of bits, a series of ones and zeros, but without structure those bits are meaningless. One common structure used to interpret bits is the ASCII text standard, which interprets each 8-bit byte as one character. For example
00110000001100010011001000110011
or the same thing with spaces
00110000 00110001 00110010 00110011
is a valid ASCII string that translates to “0123”. However, that’s not the only way to read that series of bits. If you count, there are 32 bits, which is exactly the size of a 32-bit integer, and when interpreted as an integer
00110000001100010011001000110011 = 808530483
We could also interpret these bits as a float, as machine code, or perhaps as part of a more complex data structure like a Huffman tree.
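
To see the point in code, here is a minimal Java sketch (the class and variable names are ours, not part of the assignment) that reads the same four bytes first as ASCII characters and then as a big-endian integer:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BitsDemo {
    public static void main(String[] args) {
        // The 32 bits above, written as four 8-bit bytes.
        byte[] bits = {0b00110000, 0b00110001, 0b00110010, 0b00110011};

        // Interpretation 1: four ASCII characters.
        System.out.println(new String(bits, StandardCharsets.US_ASCII)); // prints 0123

        // Interpretation 2: one big-endian 32-bit integer.
        System.out.println(ByteBuffer.wrap(bits).getInt()); // prints 808530483

        // Interpretation 3: the same bits as a 32-bit float.
        System.out.println(ByteBuffer.wrap(bits).getFloat()); // a very small float
    }
}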

What you should take away from this is that 32 bits can encode a variety of information. If you’re running a bank you might prefer to store large numbers and use the 32 bits as an integer. However, if you’re running a bookstore you might prefer to use those 32 bits as a 4-character string for a genre listing. 32 bits can be different things to different people.

So that’s what files are, but where do they live? Ideally we would get lucky and have our files live in Random Access Memory (RAM) because it’s fast, but unfortunately the contents of RAM are lost when power is lost. Instead we have storage (hard drives, USB sticks, optical disks, etc.), which keeps bits between power cycles, and as a result files end up living in storage. Storage is great because it provides the persistence needed to keep your totally legitimate music collection between reboots, but unfortunately it’s also very slow: 1,000+ times slower than RAM.
See https://gist.github.com/hellerbarde/2843375 for a reference on how slow the different parts of a computer are. We want to go fast, so an understanding of your storage is necessary to use your computer efficiently. Different storage media have different performance characteristics, but for this assignment we’ll focus on hard drives.

Hard drives work by spinning a magnetically active platter and floating a read/write head over it. The platter spins in only one direction and at a constant rate (which varies from model to model). So, the worst case access is when we need whatever is directly behind the current position of the head: that access takes a full disk rotation to finish, and this access time is what we call latency. Conversely, whatever data is directly in front of the read head is the quickest to access. In general we think about latency as an average over many operations, so we can just talk about average latency; for a hard drive, a 10ms average latency is roughly what you can expect.

However, latency is not the only component of disk performance. Throughput also needs to be considered and is usually given as a constant (throughput will also vary from model to model). For example, say you need to read 100MB of data, and say that to begin this transfer you first need to take a latency hit. How long does the transfer take to complete? It’s the latency plus the time to read 100MB. Say we have a run-of-the-mill hard drive with a throughput rating of 100MB/s and an average latency of 10ms. The transfer (on average) takes

10ms + 100MB / (100MB/s) = 10ms + 1s = 1.01s

What you’ll notice is that the 1s needed to read the 100MB dominates the latency, so you might be inclined to ignore the latency. But say you want to read something much smaller, say 32 bits (4 bytes). The calculation is the same

10ms + 4B / (100MB/s) = 10ms + 40ns ≈ 10ms

and now the latency accounts for essentially all of the time spent getting the data you want. Further, if you try to read 100MB 4 bytes at a time, you get 25,000,000 4-byte reads with a total runtime of

(10ms * 25,000,000) + (25,000,000 * 4B) / (100MB/s) = 250,000s + 1s = 250,001s

250,001 seconds is roughly three days. Be conscious of this.
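
If you want to replay the arithmetic, here is a minimal sketch (the drive specs are the assumed numbers above, not measurements):

public class DiskMath {
    public static void main(String[] args) {
        double latency = 0.010;      // 10ms average latency, in seconds
        double throughput = 100e6;   // 100MB/s, in bytes per second
        long total = 100_000_000L;   // 100MB to read

        // One big read: a single latency hit plus the transfer time.
        double oneRead = latency + total / throughput;             // ~1.01s

        // The same 100MB read 4 bytes at a time: one latency hit per read.
        long reads = total / 4;                                    // 25,000,000 reads
        double manyReads = reads * latency + total / throughput;   // ~250,001s

        System.out.printf("one read: %.2fs, 4B reads: %.0fs%n", oneRead, manyReads);
    }
}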

The Assignment

Suppose you’ve just been hired as a database administrator for a small company and are tasked with managing their “database”. Suppose further that their last database administrator was an idiot and simply stored customer records as lines in a text file with each field separated by a comma.
ex.
LEROY,Jenkins,M,1337,0.4233508,1973-12-24
The format is consistent throughout the file, but it’s a huge file and reporting by hand is a pain. Your task is to automate reading the file and to build a query framework to interface with it.

Since this is a learning exercise and not a job, you will be provided with a program to generate large files of the correct format. Make sure to test your code on a file which does not fit in the RAM on your machine, as that’s the size of file you will be graded on. You will also be provided with a program to run a series of queries for you. Skeleton code is provided, and you’re free to change anything you like so long as the method signatures do not change.

Task 1 - A naive approach

The file is a text file, so one simple way to read it is character by character. Use the Java FileReader class to do this. Implement the following methods (a minimal sketch follows the list):

  1. public void naiveLoad(String fileName):
tells the program that the following queries will be for the CSV file fileName.

  2. public ArrayList<String> naiveSearchEq(int columnNumber, String value):
    returns the rows of the table where the value in column number columnNumber is equal to the given value. Column numbers start from one.
    ex.
    naiveSearchEq(3, "F");
    should return all rows with the character ‘F’ in the third column.

  3. public ArrayList<String> naiveSearchGtr(int columnNumber, String value):
    returns the rows of the table where the value in column number columnNumber is greater than the given value.
    ex.
    naiveSearchGtr(4, "15");
    should return all rows with an integer or floating point value greater than 15 in the fourth column.
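
As a rough sketch of the character-by-character strategy (the line-assembly details and error handling here are ours; adapt them to the provided skeleton, and note the sketch assumes every row ends with a newline):

import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class NaiveSketch {
    private String fileName;

    public void naiveLoad(String fileName) {
        this.fileName = fileName;  // just remember the file; each query re-reads it
    }

    public ArrayList<String> naiveSearchEq(int columnNumber, String value) {
        ArrayList<String> matches = new ArrayList<>();
        try (FileReader in = new FileReader(fileName)) {
            StringBuilder line = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {       // one character per read() call
                if (c == '\n') {
                    String[] fields = line.toString().split(",");
                    // Column numbers start from one, array indexes from zero.
                    if (fields[columnNumber - 1].equals(value)) {
                        matches.add(line.toString());
                    }
                    line.setLength(0);            // reset for the next row
                } else {
                    line.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return matches;
    }
}

naiveSearchGtr follows the same pattern, with a numeric comparison (e.g. Double.parseDouble) in place of equals.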

Task 2 - A slightly more intelligent approach

You may have noticed that for very large files the naive approach is very slow. This is because you’re only reading one character per disk access and incurring a seek penalty for every character. So, what can be done? A very simple fix is to change the file access method being used. Use the Java BufferedReader class to implement the following methods, keeping the same character-by-character read strategy.

  1. public void bufferLoad(String fileName):
tells the program that the following queries will be for the CSV file fileName.

  2. public ArrayList<String> bufferSearchEq(int columnNumber, String value):
    returns the rows of the table where the value in column number columnNumber is equal to the given value. Column numbers start from one.
    ex.
    bufferSearchEq(3, "F");
    should return all rows with the character ‘F’ in the third column.

  3. public ArrayList<String> bufferSearchGtr(int columnNumber, String value):
    returns the rows of the table where the value in column number columnNumber is greater than the given value.
    ex.
    bufferSearchGtr(4, "15");
    should return all rows with an integer or floating point value greater than 15 in the fourth column.

The BufferedReader provides the same interface as the FileReader, but it buffers part of the file for you, and you should see a large speedup as a result.
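
For instance, a minimal sketch of the change (the class and method here are illustrative, not part of the skeleton):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferSketch {
    // Same per-character loop as the naive version; only the reader changes.
    // The BufferedReader fills an in-memory buffer (8192 chars by default),
    // so most read() calls are served from memory rather than the disk.
    public static long countRows(String fileName) {
        long rows = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n') rows++;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return rows;
    }
}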

Task 3 - Recode the file

The CSV format does not lend itself to being searched quickly, so let’s change the format of the file. Logically our file is String,String,Char,int,float,date, but all the information in the file is stored as ASCII text, which effectively inflates the int and float types. So, create a class to serve as a template for each line. Create an instance of this object, load it with the data from a row, and write it out to a new file using the Java ObjectOutputStream. Implement the following methods to read this new file, using the Java ObjectInputStream to interface with the file. A sketch of one possible template class follows the method list.

  1. public void objectLoad(String fileName):
tells the program that the following queries will be done on the file fileName.

  2. public ArrayList<String> objectSearchEq(int columnNumber, String value):
    returns the rows of the table where the value in column number columnNumber is equal to the given value. Column numbers start from one.
    ex.
    objectSearchEq(3, "F");
    should return all rows with the character ‘F’ in the third column.

  3. public ArrayList<String> objectSearchGtr(int columnNumber, String value):
    returns the rows of the table where the value in column number columnNumber is greater than the given value.
    ex.
    objectSearchGtr(4, "15");
    should return all rows with an integer or floating point value greater than 15 in the fourth column.
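
One possible shape for the template class, with a round-trip write and read (the field names and types are our guesses at the logical layout; a java.util.Date would also work for the last field):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Row implements Serializable {
    private static final long serialVersionUID = 1L;

    // Logical layout: String,String,Char,int,float,date
    String firstName;
    String lastName;
    char sex;
    int id;
    float score;
    String date;

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Row r = new Row();
        r.firstName = "LEROY"; r.lastName = "Jenkins"; r.sex = 'M';
        r.id = 1337; r.score = 0.4233508f; r.date = "1973-12-24";

        // Write the row out in its compact binary form...
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("rows.bin"))) {
            out.writeObject(r);
        }
        // ...and read it back. For a full file you would loop until
        // readObject() throws an EOFException.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("rows.bin"))) {
            Row back = (Row) in.readObject();
            System.out.println(back.firstName + "," + back.id);
        }
    }
}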

Task 4 - Runtime analysis

Run the three different versions of your program a few times (at least 3) and tabulate the results. Use a common data file so that the queries return the same rows. In your report, be clear about how many trials you ran and do the following (a simple timing sketch follows the list):
1. Tabulate the average running time of the three versions of your program.
2. Compare the running times of the three versions.
3. How are the timings of the different versions different?
4. Why are the timings of the different versions different?
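
One simple way to take a timing is to wrap the query call with System.nanoTime(); here is a sketch, assuming the provided Assignment1 skeleton and the data file name used in the Testing section:

import java.util.ArrayList;

public class TimingSketch {
    public static void main(String[] args) {
        Assignment1 db = new Assignment1();        // the provided skeleton class
        db.naiveLoad("ics321_random_data_file");

        long start = System.nanoTime();
        ArrayList<String> rows = db.naiveSearchEq(3, "F");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(rows.size() + " rows in " + elapsedMs + " ms");
    }
}

Repeat for the buffer and object versions on the same data file so the numbers are comparable.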

Testing

The AutoRun.java file provides a basic test case, but can be overridden by providing a command file as an argument.
ex. java AutoRun myCommands
The command file should be of the following form:
naiveLoad ics321_random_data_file
naiveSearchEq 3 F
naiveSearchGtr 4 20000
...
bufferLoad ics321_random_data_file
bufferSearchEq 3 F
bufferSearchGtr 4 20000
...
objectLoad ics321_random_data_file
objectSearchEq 1 LEROY
objectSearchGtr 4 15000
...
Please talk to the TA if this is not clear.

Downloads

Assignment1.java
AutoRun.java
FileGenerator.java

Additional Instructions

Submission Procedure

Submit your code via Laulima.