Count-min sketch
What it is
A count-min sketch is a probabilistic counter of hashable objects. Similar to bloom filters in structure, they exhibit false positives (may overcount), but no false negatives (never undercounts). Count-min sketches have two parameters: Length and width. Its accuracy is described by two numbers: ϵ describing its error and δ describing its probability of error. More rigorously, if N is the true count of an event, E is the estimate given by a sketch and T the total count of items in the sketch, E ≤ N + Tϵ with probability (1 - δ). The parameters of a sketch can be determined by length = 2/ϵ
and depth = -log(δ)/log(2)
.
Adding (add!
)
Count-min sketches has infinite capacity, but adding elements steadily increases the probability of a miscounting other values. Add time is proportional to depth
.
Querying (getindex
)
Querying time is proportional to depth
. It has a certain risk of reporting too high counts, but no risk of undercounting.
Subtracting/deletion.
Count-min sketches do not support subtracting or deleting.
Usage example
I don't know anything about astronomy, but indulge me: Say I want to count how many times different stars have been observed. Sightings are available in a public database. There's about 1.4 billion known stars, with about 5 billion sightings. I have 10 GB of available RAM to balance ϵ
and δ
.
First, I decide that a UInt8 (maximum count of 255) is fine - I don't care that Polaris have been observed 100,000 times.
Most stars have a low count, so I really care about miscounts not being too large, and I care less about them being frequent or infrequent, so I should minimize ϵ
rahter than δ
, which means maximizing length rather than depth. So if I pick width = 4 and length = 2.5e9. This means the probability of miscounting with 4*2/10e4 * 5e9 = 4
with a probability of exp(-4*log(2)) 6.25%
:
sketch = CountMinSketch(2.5e9, 4)
for star in catalogue
push!(sketch, star) # same as adding one using add!
end
# How many times have a particular star observed?
println(sketch["HD 143183"])
Interface
Construction
At the moment, count-min sketches can only be constructed from the parameters length
and width
:
sketch = CountMinSketch{UInt8}(10000, 5)
Although the default eltype of count-min sketches are UInt8
, so you can avoid specifying that:
sketch = CountMinSketch(10000, 5)
Central functions
Base.getindex
— Methodgetindex(sketch::CountMinSketch, item)
Get the estimated count of item
. This is never underestimated, but may be overestimated.
Probably.add!
— Methodadd!(sketch::CountMinSketch, val, count)
Add count
number of val
to the sketch. For increased speed, let count
be of the same type as eltype(sketch)
.
Examples
julia> sketch = CountMinSketch(1 << 24, 4);
julia> add!(sketch, "hello", eltype(sketch)(5));
julia> sketch["hello"]
5
Base.push!
— Methodpush!(sketch::CountMinSketch, val)
Add val
to the sketch once.
Examples
julia> sketch = CountMinSketch(1 << 24, 4);
julia> push!(sketch, "hello");
julia> sketch["hello"]
1
Misc functions
Count-min sketches support the following operations, which have no specific docstring because they behave as stated in the documentation in Base:
Base.eltype
Base.copy!
Base.copy
Base.sizeof # This one includes the underlying array
Base.:+
— Method+(x::CountMinSketch, y::CountMinSketch)
Add two count-min sketches together. Will not work if x
and y
do not share parameters T
, length
and width
. The result will be a sketch with the summed counts of the two input sketches.
Examples
julia> x, y = CountMinSketch(1000, 4), CountMinSketch(1000, 4);
julia> add!(x, "hello", 4); add!(y, "hello", 19);
julia> z = x + y; Int(z["hello"])
23
Probably.fprof
— Methodfprof(sketch::CountMinSketch)
Estimate the probability of miscounting an element in the sketch.
Base.haskey
— Methodhaskey(sketch::CountMinSketch)
Check if sketch[val] > 0.
Base.empty!
— Methodempty!(sketch::CountMinSketch)
Reset counts of all items to zero, returning the sketch to initial state.
Base.isempty
— Methodisempty(sketch::CountMinSketch)
Check if no items have been added to the sketch.