
Dash Labs - GPU Accelerator Design Specifications

eduffield222

New member
We want to build a fully functional prototype masternode, equipped with a GPU and software capable of offloading the elliptic curve (EC) cryptography checks on transactions from the dash-core daemon to a dash-core-gpu implementation.

This is going to consist of a few pieces:

- A GPU implementation of the EC cryptographic code
- A bridge implementation, lock-safe and multi-threaded, within the block-checking code of dash-core
- Configuration options added to dash-core to allow switching between stand-alone block checking and accelerated block checking

Please help to determine the best implementation strategy to add this to dash-core. We can have the conversation here and rework the implementation strategy to suit the requirements discovered below.

Thanks for the help!
 
We should start by profiling a single CPU core processing blocks of increasing size with various script types.
That will give us useful baseline measurements, and optimistic figures for multicore scaling.
I guess we're most interested in gauging what hardware is required for what tx throughput.
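
To make the baseline idea concrete, here is a rough sketch of a single-core timing harness. It is only an illustration: `fake_verify` is a hypothetical stand-in that burns CPU with repeated hashing, not real ECDSA verification, and the synthetic "transactions" are just byte strings. A real measurement would call into dash-core's actual block-checking path with real script types.

```python
import hashlib
import time

def fake_verify(tx: bytes) -> bool:
    # Stand-in workload (NOT real ECDSA): repeated hashing to burn CPU.
    digest = tx
    for _ in range(200):
        digest = hashlib.sha256(digest).digest()
    return len(digest) == 32

def profile_block(n_txs: int, verify=fake_verify) -> float:
    """Return verified transactions per second for one synthetic block."""
    txs = [i.to_bytes(4, "little") * 8 for i in range(n_txs)]
    start = time.perf_counter()
    ok = all(verify(tx) for tx in txs)
    elapsed = time.perf_counter() - start
    if not ok:
        raise ValueError("verification failed")
    return n_txs / elapsed

if __name__ == "__main__":
    # Blocks of increasing size, single core.
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} txs: {profile_block(n):,.0f} tx/s")
```

Multiplying the single-core tx/s figure by the core count gives the optimistic (perfectly linear) multicore projection.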

Like you say, the next step is to multi-thread the block-checking code, aiming for near-linear scaling with
the number of cores (maybe a bit more with hyper-threading, then likely less than linear once overheads kick in).
One complication here could be tx-chaining through UTXOs within the block, as those chains would need
to be processed serially - the txes may need to be filtered to extract independent work for parallel processing,
either on-the-fly or by preprocessing into sub-block batches.
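
As a rough illustration of the "preprocessing into sub-block batches" option, the sketch below groups transactions by their depth in the in-block dependency chain. The input format (a list of `(txid, parent_txids)` pairs) and the function name `batch_independent` are made up for this example, and it assumes a parent always appears earlier in the block than any tx spending it. Everything in one batch is independent of the rest of that batch, so a batch could be handed to a thread pool, with batches processed in order.

```python
from collections import defaultdict

def batch_independent(block):
    """block: list of (txid, [txids whose outputs this tx spends]), in block order.

    Returns a list of batches; txs inside one batch do not depend on each other.
    """
    level = {}                   # txid -> depth in the in-block dependency chain
    batches = defaultdict(list)  # depth -> txids at that depth
    for txid, parents in block:
        # Only parents that are themselves in this block impose an ordering.
        lvl = 1 + max((level[p] for p in parents if p in level), default=-1)
        level[txid] = lvl
        batches[lvl].append(txid)
    return [batches[i] for i in range(len(batches))]

if __name__ == "__main__":
    block = [
        ("a", ["x"]),       # spends an output from an earlier block
        ("b", ["a"]),       # chained on a
        ("c", ["y"]),       # independent
        ("d", ["b", "c"]),  # chained on b and c
    ]
    print(batch_independent(block))
```

A long chain degrades to one tx per batch, which is the serial worst case mentioned above.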

From what Mr Wright said, it sounds like nChain have taken this step and tested on Intel's Xeon Phi
(aka Knights Landing, 68-core x86). I can't recall the throughput figure he mentioned (it was impressive).
I don't know if that work is open source. Is anyone his friend?

Either way, we could replicate that work (with ASU's assistance?) and test on more modest multicore hardware.
That will give us multicore scaling figures and more realistic projections for throughput.

The next step, to offload the EC crypto to an accelerator, needs pretty large modifications.
We want to estimate the cost / benefit for projected loads. Maybe strong multicore is sufficient.
My feeling is that GPU acceleration should be a win. It ain't easy. But we don't do these things...

Let's say that 80% of the work is in `secp256k1_ecdsa_verify` and that it's suitable for GPU.
Then, say that it is possible to accelerate that part by 8x (including the cost of CPU <-> GPU comms).
The total throughput improvement is then around 3x: 1 / (0.2 + 0.8/8) = 1 / 0.3 ≈ 3.3.
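
That estimate is just Amdahl's law, and it is easy to play with the two guesses. A small sketch (the 80% and 8x figures are the assumptions from above, not measurements):

```python
def amdahl_speedup(fraction: float, accel: float) -> float:
    """Overall speedup when `fraction` of the work is sped up by `accel`."""
    return 1.0 / ((1.0 - fraction) + fraction / accel)

if __name__ == "__main__":
    # The guess from above: 80% of the time in signature checks, 8x faster on GPU.
    print(f"{amdahl_speedup(0.8, 8):.2f}x")
    # Upper bound if the GPU part became free: limited by the remaining 20%.
    print(f"{amdahl_speedup(0.8, float('inf')):.2f}x")
```

Note the ceiling: with 80% offloadable, even an infinitely fast GPU tops out at 5x, so the accuracy of that 80% figure matters more than the raw GPU speedup.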

I have an idea of what we'd need to do and can follow up here or elsewhere.
 
Thanks, bro!
 