dedup

Content-address object bodies by their hash so identical content is stored only once. Re-uploading the same bytes skips the upload, and copies share a single stored blob. No native dependencies; works on any adapter that supports metadata.

The built-in dedup() plugin stores each distinct body once. On upload it hashes the bytes (SHA-256), writes them a single time to a content-addressed blob under a store prefix (.dedup/ by default), and leaves a tiny pointer at your logical key. Upload the same content again — under any key — and the byte upload is skipped; only the pointer is written.

import { createFiles } from "files-sdk";
import { s3 } from "files-sdk/s3";
import { dedup } from "files-sdk/dedup";

const files = createFiles({
  adapter: s3({ bucket: "uploads" }),
  plugins: [dedup()],
});

await files.upload("a.png", bytes);
await files.upload("b.png", bytes); // same content — no second byte upload
await files.copy("a.png", "c.png"); // shares the one stored blob

How it works

The logical key (a.png) becomes an empty object whose metadata records the content hash; the bytes live at .dedup/<sha256>. Two keys with identical content point at the same blob, so the content is stored once no matter how many keys reference it.

  • upload hashes the body, checks with exists() whether a blob for that hash is already stored, writes the blob only if it isn't, then writes the pointer.
  • download follows the pointer to the blob and returns it under your key — ranges included, because blobs are stored verbatim (unlike compression()).
  • head / list report the logical content size with the internal fields stripped, without fetching the blob. list also hides the blob store from normal listings.
  • copy / move relocate the small pointer, so duplicating a de-duplicated file is near-free and the copy shares the original blob.

Bulk upload([...]) / download([...]) are de-duplicated per item. Objects without this plugin's marker — pre-existing, or written by another tool — pass straight through on read, so it's safe to enable on a bucket that already has data.
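The flow described above can be sketched with a small in-memory model. This is an illustration only, not the plugin's actual code: the Map-backed store, the blobWrites counter, and the metadata field names dedup-hash and dedup-size are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-in for an adapter; not the files-sdk API.
type StoredObject = { body: Uint8Array; metadata: Record<string, string> };

const PREFIX = ".dedup";
const store = new Map<string, StoredObject>();
let blobWrites = 0; // counts how many times real bytes were written

function sha256Hex(body: Uint8Array): string {
  return createHash("sha256").update(body).digest("hex");
}

function upload(key: string, body: Uint8Array): void {
  const hash = sha256Hex(body);
  const blobKey = `${PREFIX}/${hash}`;
  // Write the blob only if this content has not been stored yet.
  if (!store.has(blobKey)) {
    store.set(blobKey, { body, metadata: {} });
    blobWrites += 1;
  }
  // The logical key holds an empty body plus metadata pointing at the blob.
  // The field names here are assumed for illustration.
  store.set(key, {
    body: new Uint8Array(0),
    metadata: { "dedup-hash": hash, "dedup-size": String(body.byteLength) },
  });
}

function download(key: string): Uint8Array {
  const obj = store.get(key);
  if (!obj) throw new Error(`not found: ${key}`);
  const hash = obj.metadata["dedup-hash"];
  // Objects without the marker pass straight through.
  if (hash === undefined) return obj.body;
  const blob = store.get(`${PREFIX}/${hash}`);
  if (!blob) throw new Error(`missing blob for ${key}`);
  return blob.body;
}

function copy(from: string, to: string): void {
  const obj = store.get(from);
  if (!obj) throw new Error(`not found: ${from}`);
  // Only the small pointer is duplicated; the blob is shared.
  store.set(to, { body: obj.body, metadata: { ...obj.metadata } });
}

function list(): string[] {
  // Hide the blob store from normal listings.
  return Array.from(store.keys()).filter((k) => !k.startsWith(`${PREFIX}/`));
}

const bytes = Buffer.from("hello");
upload("a.png", bytes);
upload("b.png", bytes); // same content: blobWrites stays at 1
copy("a.png", "c.png"); // pointer copy only
```

In this model three logical keys end up referencing a single stored blob, which is the property the plugin is described as providing.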

Options

| Option | Default | What it does |
|--------|---------|--------------|
| prefix | ".dedup" | Where the content-addressed blobs live. Hidden from list(). |

Objects under prefix are never themselves de-duplicated and are hidden from list() (unless you list within the prefix). Don't store your own data there.

Ordering

Put dedup() first, before any body-transforming plugin. Encrypted bytes don't de-duplicate — a random per-object key makes identical inputs encrypt to different bytes — so de-duplication has to see the original content:

plugins: [dedup(), compression(), encryption(key)];

With this order the one stored blob is itself compressed and encrypted, and reads unwind the onion automatically (decrypt → decompress → follow the pointer).
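To see why the hash has to be taken before encryption, here is a minimal sketch using Node's built-in crypto module. The encryptWithRandomKey helper is hypothetical and uses AES-256-GCM with a fresh random key and IV per call; it stands in for "a random per-object key" and is not necessarily how encryption() works internally.

```typescript
import { createCipheriv, createHash, randomBytes } from "node:crypto";

function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hypothetical stand-in for an encrypting plugin: a new random key and IV
// for every object, so identical plaintexts yield different ciphertexts.
function encryptWithRandomKey(plain: Uint8Array): Uint8Array {
  const key = randomBytes(32);
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  return Buffer.concat([cipher.update(plain), cipher.final()]);
}

const plain = Buffer.from("same content");

// Hashing the original bytes: both uploads map to the same blob address.
const hashBeforeA = sha256Hex(plain);
const hashBeforeB = sha256Hex(plain);

// Hashing after encryption: the addresses differ, so nothing de-duplicates.
const hashAfterA = sha256Hex(encryptWithRandomKey(plain));
const hashAfterB = sha256Hex(encryptWithRandomKey(plain));
```

Here hashBeforeA equals hashBeforeB, while hashAfterA and hashAfterB differ, which is why dedup() needs to sit ahead of encryption in the plugin list.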

Things to keep in mind

  • It buffers the whole body to hash it, so — like compression() — it's unsuitable for unknown-length streaming uploads and resumable uploads.
  • Reads cost a second fetch. A download reads the pointer, then the blob; a ranged download does a head first. head and list add nothing — they only read the pointer.
  • It needs adapter metadata support. The content hash is stored in the pointer's metadata, so an adapter that rejects a direct upload with metadata will reject dedup() uploads too.
  • url() and signedUploadUrl() fail closed. A presigned GET would hand out the empty pointer, not the content, and a presigned PUT would write directly and bypass content-addressing — so both throw. Download through the Files instance instead.
  • Blobs aren't garbage-collected. delete (and an overwrite) drop the pointer but leave the content-addressed blob in place, so it's reused if the same bytes reappear. Reclaim unreferenced blobs with a storage lifecycle rule or a periodic sweep.
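A periodic sweep could follow a mark-and-sweep shape: collect every hash still referenced by a pointer, then report blobs under the prefix that nothing references. The sketch below runs over a plain Map rather than a real bucket, and the dedup-hash metadata field name is an assumption made for the example, not the plugin's documented format.

```typescript
// Hypothetical listing entry: only metadata matters for the sweep.
type StoredObject = { metadata: Record<string, string> };

function findUnreferencedBlobs(
  objects: Map<string, StoredObject>,
  prefix = ".dedup",
): string[] {
  const blobPrefix = `${prefix}/`;
  const referenced = new Set<string>();

  // Mark: every pointer outside the prefix names the blob it depends on.
  objects.forEach((obj, key) => {
    if (key.startsWith(blobPrefix)) return;
    const hash = obj.metadata["dedup-hash"]; // assumed field name
    if (hash !== undefined) referenced.add(blobPrefix + hash);
  });

  // Sweep: blobs under the prefix that no pointer references.
  return Array.from(objects.keys()).filter(
    (key) => key.startsWith(blobPrefix) && !referenced.has(key),
  );
}

const objects = new Map<string, StoredObject>([
  ["a.png", { metadata: { "dedup-hash": "aaa" } }],
  ["legacy.txt", { metadata: {} }], // no marker: ignored by the sweep
  [".dedup/aaa", { metadata: {} }],
  [".dedup/bbb", { metadata: {} }], // pointer was deleted earlier
]);

const orphans = findUnreferencedBlobs(objects);
```

A real sweep would likely also need a grace period before deleting, since an upload in flight may have found the blob present but not yet written its pointer.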
