Skip to main content

Command Palette

Search for a command to run...

Techniques for Understanding File Decompression and Writing a Java Unzip Program

Have you ever downloaded a .zip file from the internet and wondered what magic happens when you click "Extract"? File compression and decompression are everyday miracles of computing that we often take for granted. From shrinking large folders in...

Published
16 min read
Techniques for Understanding File Decompression and Writing a Java Unzip Program
T

I am Tuanh.net. As of 2024, I have accumulated 8 years of experience in backend programming. I am delighted to connect and share my knowledge with everyone.

1. Understanding How File Decompression Works

A conceptual illustration of file extraction (decompression), where compressed data is expanded back into its original form. When you compress a file, a specialized algorithm encodes the original data using fewer bits, often by removing redundancy. Decompression is the reverse process: the compressed data is decoded to reconstruct the original file in its exact form. In other words, compression software acts as an encoder that compacts the data, and decompression software acts as a decoder that expands it back. Importantly, most common file archives (like ZIP) use lossless compression, meaning no information is lost in the round trip of compressing and decompressing – you get out the same bits you put in. To see how this is possible, let's examine the principles and algorithms that make it work.

1.1 Redundancy and Lossless Compression Basics

Think of a book where certain words or phrases repeat many times. Instead of writing each of those repeated phrases in full every time, you could create a small numbered code for each phrase and write the code instead. This is the basic idea of compression: find repeating patterns and replace them with shorter representations. Most files (text, images, etc.) contain a lot of repetitive or redundant data. For example, a simple sentence might have many repeated words or letters – data compression algorithms remove this redundancy by cataloging the repeats. During compression, the software builds a kind of "dictionary" of repeated sequences and replaces the actual data with references to that dictionary. During decompression, the software uses the same dictionary (which is stored in the compressed file) to translate those references back into the original data sequences, effectively reconstructing the file.

Let's consider a trivial example: the phrase "ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU — ASK WHAT YOU CAN DO FOR YOUR COUNTRY". This famous quote has many repeated words. Instead of storing the full phrase, a compression program could list each unique word once (e.g. "ask", "what", "your", "country", "can", "do", "for", "you") and then represent the quote by referring to the index of each word. The compressed data would contain the dictionary of unique words and a sequence of indexes. When decompressing, the program reads the dictionary and the sequence of indexes, then reconstructs the original sentence by replacing each index with the corresponding word. This way, the repeated words only take up space once (in the dictionary) during storage. In a real file with thousands of words or bytes, eliminating redundancy can greatly shrink file size. Crucially, lossless decompression will perfectly restore the original sequence of words or bytes using the stored dictionary and instructions.

1.2 Common Compression Algorithms Used in Archives

Image

Over the decades, computer scientists have developed various clever algorithms for compression. Some widely used lossless compression algorithms include:

  • Huffman Coding: A method of assigning shorter binary codes to more frequent bytes and longer codes to less frequent ones, yielding an optimal prefix code. Huffman coding is often used as a final step in compression (for example, ZIP files use Huffman coding in their DEFLATE algorithm) to encode data compactly.
  • Lempel-Ziv (LZ77/LZ78) and Derivatives: A family of dictionary-based algorithms that substitute repeating sequences with references. For instance, the DEFLATE algorithm (used in ZIP and gzip) combines LZ77 (which finds repeated substrings and replaces later occurrences with backward references to earlier data) and Huffman coding. Variants like LZW (Lempel-Ziv-Welch) are used in GIF images and other formats.
  • Run-Length Encoding (RLE): A simple scheme that compresses runs of identical bytes or pixels by storing the value and count rather than the raw repetition. For example, a sequence "AAAAAAA" (7 A's) can be stored as "A:7". This works well for data with long runs of repeats (like simple graphic images or whitespace in text).

Compression software (like zip tools) may use one or a combination of these algorithms. The ZIP file format, for example, allows different algorithms for each file entry, though DEFLATE is by far the most common. The key point is that decompression software must know which algorithm was used in order to decode the data correctly. This is why compressed files often have headers or metadata indicating the compression method. When you open a .zip archive, the extractor program reads the file's headers to determine, say, "this file was compressed with DEFLATE," and then it applies the DEFLATE decompression routine to recover the original data.

1.3 The Decompression Process Step by Step

Let's walk through what happens internally when you extract files from a ZIP archive using decompression software:

Reading the Archive Header: The software first reads metadata in the compressed file (e.g., the ZIP "central directory"). This tells it what files are inside, their original sizes, timestamps, and which compression algorithms were used for each file

Initializing the Decoder: For each compressed file entry, the program sets up the appropriate decoder. For example, if an entry is compressed with DEFLATE, the software will initialize the Huffman trees and sliding window needed to undo the LZ77 compression.

Expanding the Data: The compressed bytes are then streamed through the decoder. In a DEFLATE decompression, this means reading a series of Huffman-coded symbols that represent either literal bytes or references to earlier byte sequences. The algorithm carefully rebuilds the original sequence of bytes by following the instructions in the compressed data (like "copy the sequence you saw 20 bytes ago, 5 bytes long" for LZ77, or output a literal byte) until the full file is reconstructed in memory or on disk.

Writing Out the File: As the software decodes bytes, it writes them out to the output file (or memory buffer). Eventually, once all compressed data has been processed, we have the exact original file restored. The extractor then moves on to the next file in the archive, if any, repeating the process.

Throughout decompression, integrity checks like checksums or CRC values may be verified as well. For instance, ZIP files include a CRC-32 checksum for each file so the software can confirm that the decompressed data matches the original data's checksum, ensuring no corruption occurred.

In summary, decompression is about reading a highly efficient shorthand and expanding it back to the verbose form of the original data. It is the inverse of the compression algorithm. The design of formats like ZIP is such that the decompression (decoding) can be done quickly and with minimal information – all needed dictionaries or tables are either known by the format or embedded in the compressed file. As a result, any compatible decompression tool can extract files compressed by another tool, as long as they support the same algorithm and format.

1.4 Pitfalls and Considerations (Zip Bombs and Beyond)

While file decompression is normally a safe and routine operation, there are some interesting edge cases and pitfalls. One famous example is the "decompression bomb" or zip bomb. This is a maliciously crafted archive file that is very small in size but designed to expand into an enormous amount of data, consuming all your disk space or memory when decompressed. For example, a notorious file named 42.zip is only 42 kilobytes when compressed, but if you unwisely extract it fully, it would balloon to 4.5 petabytes of data! Such extremes are possible by layering compression (e.g., a zip file of zip files, many levels deep) or by exploiting high compression ratio data. When an antivirus or unsuspecting user tries to decompress it, the sheer expansion can overwhelm system resources. Modern extraction tools mitigate this by imposing limits (for instance, refusing to unpack archives that expand beyond a threshold, or scanning for nested compression).

Another consideration is time and memory usage. Decompression generally requires far less CPU than compression (since it's just reversing a deterministic process), but it still needs enough RAM to hold dictionaries or history buffers (e.g., a typical DEFLATE decoder uses a 32KB history buffer). If a file was compressed with a large dictionary or as a solid archive (multiple files compressed as one block), the decompressor needs corresponding resources to handle it. Extraction software must also handle folder structures (creating nested directories as needed when extracting) and be wary of path traversal attacks (ensuring that file paths in the archive don't write outside the target directory). By understanding these challenges, we appreciate that real-world decompression tools do more than just run algorithms – they also implement safety checks to protect users and their systems.

Now that we have a solid grasp of what happens under the hood when files are decompressed, let's put this knowledge into practice. In the next section, we will write our own Java program that takes a compressed .zip file as input and extracts its contents to a destination folder, illustrating the concepts we've discussed.

2. Writing a Java Program to Decompress Files (Unzip)

Building a decompression tool from scratch might sound daunting, but luckily we don't have to implement the compression algorithms ourselves – programming languages like Java come with built-in libraries to handle common formats. Java's standard library provides the java.util.zip package, which includes classes like ZipInputStream for reading ZIP files and ZipEntry for each item inside. We will use these to read a zip archive and write out its files. The approach is straightforward: open the zip file stream, loop through each entry (file/folder) inside, and write the decompressed data to disk. In fact, to unzip a file in Java, we simply read the zip file with ZipInputStream, get each ZipEntry one by one, and use a FileOutputStream to write the entry's data to the output location. We also need to create any directories that the archive entries require (for example, if the archive contains files in subfolders). The outline below summarizes the steps:

  • Open the ZIP file using a FileInputStream, and wrap it in a ZipInputStream to read ZIP entries.
  • For each entry returned by zis.getNextEntry(): check if it's a directory or file. If it's a directory, create the directory in the output. If it's a file, create any necessary parent directories, then read the entry's bytes from the ZipInputStream and write them to a new file on disk.
  • Close the entry and repeat until no more entries. Finally, close the ZIP stream.

2.1 Setting Up the Java Unzip Utility

Let's set up a simple Java class for our utility. We will make a method unzip(String zipFilePath, String destDir) that handles the extraction logic. This method will throw IOException on errors (so we can handle exceptions as needed). Inside, we'll use a byte buffer (e.g., 4KB buffer) to efficiently transfer data from the zip stream to file output. We'll also ensure the output directory exists or create it.

One thing to note is how the code handles directories: when ZipInputStream gives us a ZipEntry, that entry could represent a directory (a folder). We can detect this by entry.isDirectory(). If it's a directory, we call mkdirs() to create it and then skip writing file data. If it's a file, we create the parent directory (in case it doesn't exist) and then stream the data out. Java's java.util.zip classes make this easy by handling the decompression of data internally – we just read from the ZipInputStream like a normal input stream and get decompressed bytes out.

2.2 Java Code Example: Unzipping a File

Below is a complete Java example demonstrating a simple unzip tool. We will then explain how each part of the code works in detail:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipUtility {

public static void unzip(String zipFilePath, String destDir) throws IOException {
File dir = new File(destDir);
if (!dir.exists()) {
dir.mkdirs(); // create output directory if it doesn't exist
}
byte[] buffer = new byte[4096]; // 4KB buffer for data
// Open the ZIP file stream
try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath))) {
ZipEntry entry;
// Iterate over each entry (file/directory) in the zip
while ((entry = zis.getNextEntry()) != null) {
File outFile = new File(destDir, entry.getName());
if (entry.isDirectory()) {
// If the entry is a directory, create it
outFile.mkdirs();
} else {
// Ensure parent directories exist
outFile.getParentFile().mkdirs();
// Write file content
try (FileOutputStream fos = new FileOutputStream(outFile)) {
int len;
while ((len = zis.read(buffer)) > 0) {
fos.write(buffer, 0, len);
}
}
}
// Close this entry and move to the next
zis.closeEntry();
}
}
}

// Example usage
public static void main(String[] args) {
String zipPath = "C:\example\files.zip"; // input zip file
String extractTo = "C:\example\output_folder"; // output directory
try {
unzip(zipPath, extractTo);
System.out.println("Unzip completed successfully!");
} catch (IOException e) {
System.err.println("Error during unzipping: " + e.getMessage());
}
}
}

Let's break down what this code is doing:

We import the necessary classes from java.util.zip and Java I/O. The ZipInputStream class is a special input stream that knows how to read the ZIP file format and decompress the data on the fly. The ZipEntry represents each item in the archive (which could be a file or just a directory placeholder).

In the unzip method, we first ensure the destination directory exists by creating it (using mkdirs() which creates all necessary parent dirs). Then we set up a buffer of 4096 bytes – this is a chunk of memory that will be used to transfer bytes in batches for efficiency.

We use a try-with-resources block to create a ZipInputStream from a FileInputStream of the zip file. This ensures the stream is closed automatically at the end, even if an exception occurs.

We then enter a loop: while ((entry = zis.getNextEntry()) != null). The call getNextEntry() moves to the next file entry in the zip and returns a ZipEntry object for it, or null if we've reached the end of the archive. After calling this, the ZipInputStream is positioned at the beginning of that entry's data (ready to read the uncompressed bytes of that file).

Inside the loop, we construct an output file path by combining the destination directory with the entry's name: new File(destDir, entry.getName()). The entry.getName() might include subdirectory paths (e.g., "docs/readme.txt"), and Java will handle that appropriately in the File constructor.

We then check if (entry.isDirectory()). If true, it means this entry is a directory (its name will typically end in /). In that case, we call outFile.mkdirs() to create the directory (and any necessary parent dirs). We don't need to read any data for directories – they don't have data beyond the name.

If the entry is a file (not a directory), we ensure its parent directory exists: outFile.getParentFile().mkdirs();. This will create the folder structure in case the zip file had files inside a subfolder that wasn't explicitly listed as a separate entry. (Some zip files list directories explicitly, some just infer them from file paths.)

Next we create a FileOutputStream fos to the outFile. We then enter another loop to read from the ZipInputStream (zis.read(buffer)) into our buffer, and write that buffer out via fos.write(). The zis.read() call is actually decompressing the data as it reads – we don't have to call any special decompression function; that's handled inside ZipInputStream. We simply read until EOF for that entry (indicated by zis.read() returning -1, or in our loop, len <= 0 to break).

Once the inner loop finishes (meaning we've read all bytes for the current entry), we close the FileOutputStream (the try-with-resources does this automatically when its block ends). We then call zis.closeEntry() to finish the process for that entry. This isn't always strictly necessary (advancing to the next entry often closes the previous), but it's good practice to explicitly close each entry.

The outer loop then continues to the next entry in the zip, until none are left. At that point, the try-with-resources block ends, automatically closing the ZipInputStream (and the underlying FileInputStream).

If everything went well, the destination directory now contains all the files that were in the zip, in the correct folder structure. We print a success message in the main method. If an error occurred (for example, the zip file path was wrong or a disk issue), our catch block would print an error message.

A few things to note about this code and approach:

  • We assumed the input is a ZIP file. The ZipInputStream only understands ZIP format (which includes files often ending in .zip, .jar, .war, etc.). For other compression formats (tar, rar, 7z, gz, etc.), we would need different libraries or classes. For example, Java has GZIPInputStream for .gz files, but for .rar or .7z you’d need third-party libraries.
  • We chose a 4KB buffer, which is a common chunk size. You could use 1KB or 8KB; it generally won't make a huge difference for typical files. The idea is to balance not using too much memory with not calling the underlying read for every single byte (which would be slower).
  • The code handles nested directories by creating parent folders as needed. This ensures that if the ZIP entry name has subfolders (like "folder1/folder2/file.txt"), those folders will be created before writing the file.
  • We use try-with-resources to manage closing streams. This is the recommended pattern in Java because it ensures streams are closed even if an exception happens mid-way. If you don't use this, you'd have to manually close in a finally block.

This simple utility demonstrates the core of how decompression software operates for archives: iterating through entries and writing out data. Real-world tools add more bells and whistles (like a progress bar, error handling for corrupt archives, support for encryption/passwords, etc.), but the backbone is exactly as shown.

Output Example: If you run this program on a zip file (say, "files.zip" contains doc.txt and images/picture.png), it will create an output_folder with those contents. The console might print:

Unzip completed successfully!

And you'd find output_folder/doc.txt and output_folder/images/picture.png extracted on your drive. Each file's content will match the original exactly, proving the decompression worked.

An illustration of a "zip bomb" — a tiny compressed file that expands into a massive amount of data, overwhelming system resources. While our Java program is simple, it's powerful enough to handle most ZIP files you throw at it. However, be mindful of the earlier warning about zip bombs and resource limits. If you tried to unzip something like 42.zip with this code, you would quickly exhaust your disk space or memory, since our code (as written) doesn't defend against that. In a production-grade tool, you might want to guard against unusually large entries or total uncompressed size. For instance, you could count bytes as you write and abort if the size exceeds a safe limit, or refuse to unpack archives with deeply nested entries. Java's zip classes don't automatically prevent these scenarios – it's up to the application logic to impose limits (e.g., stop after extracting a certain number of bytes).

Despite these considerations, writing your own decompression logic gives you a lot of insight. Not only do we see how to use the Java API to unzip files, but it also cements the understanding of what decompression is doing: reading in compressed bytes and writing out the original data. Each ZipEntry we processed corresponds to the decoder taking a stream of compressed data and turning it back into the file's bytes, just as we discussed in theory.

We started by exploring how file compression reduces file sizes by eliminating redundancy and how decompression restores the original data. We examined common algorithms like LZ and Huffman coding that make this possible. Then we put theory into practice by coding a Java-based unzip utility, using Java's built-in libraries to handle the heavy lifting of actual decompression. With this knowledge, you should have both an appreciation for the sophistication of compression software and a practical understanding of how to manipulate ZIP files in your own programs.

If you have any questions about file decompression or the Java code example, feel free to leave a comment below!

Read more at : Techniques for Understanding File Decompression and Writing a Java Unzip Program

More from this blog

T

tuanh.net

540 posts

Are you ready to elevate your Java, OOP, Spring, and DevOps skills? Look no further!