HyperLogLog

_References: Original algorithm: Flajolet, Fusy, Gandouet & Meunier: "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm".

With some of the modifications from: Heule, Nunkesser & Hall: "HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm"._


What it is

An HyperLogLog is a very memory-efficient datastructure that keeps track of approximately how many distinct, hashable elements it's seen. A default HLL uses 16 KiB of memory and can return reliable estimates of up to some very large cardinality (on the order of 2^59 distinct elements).

This estimate of cardinality has a median error of 0.5 %, and a 99 % chance of having an error less than 2.5 %, when the cardinality estimate is >1024. To accurately keep track of datasets smaller than 1024, use another datastructure like a Set.

This implementation is not optimally memory-efficient, but it is fast. More advanced tricks can be found in the Heule et al. paper linked above, which increases accuracy and lowers memory usage for small N.

The HLLs are not guaranteed to be threadsafe. To parallelize this implementation of HLL, each process/thread must operate on independent HLLs. These can then be efficiently merged using union or union! (or ). This is much faster than using atomic operations.

Usage example

In this example, let's say I'm given 133 million fastq-sequences from a large sequencing project of Neanderthal bones. 133 million reads sounds like a lot, so, I'm worried that the lab folk went a little overboard on the PCR and the same reads are present in many copies. Hence, I want to know how many unique reads there are. I don't care that the HyperLogLog doesn't fit in the cache, so I'll crank the P parameter up to 18 and spend the 256 KiB memory to maximize accuracy:

hll = HyperLogLog{18}()
reader = FASTQ.Reader(open("huge_file.fastq", "r"))
for record in reader
    seq = sequence(record) # we want a hashable DNA sequence    
    push!(hll, seq)
end
println("Number of distinct FASTQ-sequences: ", length(hll))

Interface

Construction

The accuracy of a HLL depends on its P parameter. You can construct a HLL with its P parameter directly:

julia> hll = HyperLogLog{14}()
HyperLogLog{14}()

A P-value of 14 is considered default, so if you don't pass the parameter, 14 is assumed as default:

julia> HyperLogLog{14}() == HyperLogLog()
true

Central functions

Base.push!Method
push!(hll::HyperLogLog, items...)

Add each item to the HLL. This has no effect if the HLL has seen the items before.

Examples

julia> a = HyperLogLog{14}(); push!(a, 1,2,3,4,5,6,7,8,9); length(a)
9
source
Base.lengthMethod
length(hll::HyperLogLog{Precision})

Estimate the number of distinct elements the HLL has seen. The error depends on the Precision parameter. This has low absolute rror when the estimate is small, and low relative error when the estimate is high.

Examples

julia> a = HyperLogLog{14}(); push!(a, 1,2,3,4,5,6,7,8); length(a)
9
source

Misc functions

Note

HyperLogLog supports the following operations, which have no HyperLogLog-specific docstring because they behave as stated in the documentation in Base:

Base.copy!
Base.copy
Base.sizeof # This one includes the underlying array
Base.isemptyMethod
isempty(x::HyperLogLog)

Return true if the HLL has not seen any elements, false otherwise. This is guaranteed to be correct, and so can be true even when length(x) > 0.

Examples

julia> a = HyperLogLog{14}(); (length(a), isempty(a))
(1, true)

>julia push!(a, 1); (length(a), isempty(a))
(1, false)
source
Base.empty!Method
empty!(x::HyperLogLog)

Reset the HLL to its beginning state (i.e. "deleting" all elements from the HLL), returning it.

Examples

julia> empty!(a); length(a) # should return approximately 0
1
source
Base.union!Method
union!(dest::HyperLogLog{P}, src::HyperLogLog{P})

Overwrite dest with the same result as union(dest, src), returning dest.

Examples

julia> # length(c) ≥ length(b) is not guaranteed, but overwhelmingly likely
julia> c = union!(a, b); c === a && length(c) ≥ length(b)
true
source
Base.unionMethod
union(x::HyperLogLog{P}, y::HyperLogLog{P})

Create a new HLL identical to an HLL which has seen the union of the elements x and y has seen.

Examples

julia> # That c is longer than a or b is not guaranteed, but overwhelmingly likely
julia> c = union(a, b); length(c) ≥ length(a) && length(c) ≥ length(b)
true
source