# HyperLogLog

_References: Original algorithm: Flajolet, Fusy, Gandouet & Meunier: "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm".

With some of the modifications from: Heule, Nunkesser & Hall: "HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm"._

## What it is

An HyperLogLog is a very memory-efficient datastructure that keeps track of approximately how many distinct, hashable elements it's seen. A default HLL uses 16 KiB of memory and can return reliable estimates of up to some very large cardinality (on the order of 2^59 distinct elements).

This estimate of cardinality has a median error of 0.5 %, and a 99 % chance of having an error less than 2.5 %, when the cardinality estimate is >1024. To accurately keep track of datasets smaller than 1024, use another datastructure like a `Set`

.

This implementation is not optimally memory-efficient, but it is fast. More advanced tricks can be found in the Heule et al. paper linked above, which increases accuracy and lowers memory usage for small N.

The HLLs are not guaranteed to be threadsafe. To parallelize this implementation of HLL, each process/thread must operate on independent HLLs. These can then be efficiently merged using `union`

or `union!`

(or `∪`

). This is much faster than using atomic operations.

## Usage example

In this example, let's say I'm given 133 million fastq-sequences from a large sequencing project of Neanderthal bones. 133 million reads sounds like a lot, so, I'm worried that the lab folk went a little overboard on the PCR and the same reads are present in many copies. Hence, I want to know how many unique reads there are. I don't care that the HyperLogLog doesn't fit in the cache, so I'll crank the `P`

parameter up to 18 and spend the 256 KiB memory to maximize accuracy:

```
hll = HyperLogLog{18}()
reader = FASTQ.Reader(open("huge_file.fastq", "r"))
for record in reader
seq = sequence(record) # we want a hashable DNA sequence
push!(hll, seq)
end
println("Number of distinct FASTQ-sequences: ", length(hll))
```

## Interface

### Construction

The accuracy of a HLL depends on its `P`

parameter. You can construct a HLL with its `P`

parameter directly:

```
julia> hll = HyperLogLog{14}()
HyperLogLog{14}()
```

A P-value of 14 is considered default, so if you don't pass the parameter, 14 is assumed as default:

```
julia> HyperLogLog{14}() == HyperLogLog()
true
```

### Central functions

`Base.push!`

— Method`push!(hll::HyperLogLog, items...)`

Add each item to the HLL. This has no effect if the HLL has seen the items before.

**Examples**

```
julia> a = HyperLogLog{14}(); push!(a, 1,2,3,4,5,6,7,8,9); length(a)
9
```

`Base.length`

— Method`length(hll::HyperLogLog{Precision})`

Estimate the number of distinct elements the HLL has seen. The error depends on the Precision parameter. This has low absolute rror when the estimate is small, and low relative error when the estimate is high.

**Examples**

```
julia> a = HyperLogLog{14}(); push!(a, 1,2,3,4,5,6,7,8); length(a)
9
```

### Misc functions

HyperLogLog supports the following operations, which have no HyperLogLog-specific docstring because they behave as stated in the documentation in Base:

```
Base.copy!
Base.copy
Base.sizeof # This one includes the underlying array
```

`Base.isempty`

— Method`isempty(x::HyperLogLog)`

Return `true`

if the HLL has not seen any elements, `false`

otherwise. This is guaranteed to be correct, and so can be `true`

even when length(x) > 0.

**Examples**

```
julia> a = HyperLogLog{14}(); (length(a), isempty(a))
(1, true)
>julia push!(a, 1); (length(a), isempty(a))
(1, false)
```

`Base.empty!`

— Method`empty!(x::HyperLogLog)`

Reset the HLL to its beginning state (i.e. "deleting" all elements from the HLL), returning it.

**Examples**

```
julia> empty!(a); length(a) # should return approximately 0
1
```

`Base.union!`

— Method`union!(dest::HyperLogLog{P}, src::HyperLogLog{P})`

Overwrite `dest`

with the same result as `union(dest, src)`

, returning `dest`

.

**Examples**

```
julia> # length(c) ≥ length(b) is not guaranteed, but overwhelmingly likely
julia> c = union!(a, b); c === a && length(c) ≥ length(b)
true
```

`Base.union`

— Method`union(x::HyperLogLog{P}, y::HyperLogLog{P})`

Create a new HLL identical to an HLL which has seen the union of the elements `x`

and `y`

has seen.

**Examples**

```
julia> # That c is longer than a or b is not guaranteed, but overwhelmingly likely
julia> c = union(a, b); length(c) ≥ length(a) && length(c) ≥ length(b)
true
```