simplify the IP benchmark #351
Ran it on my machine (Apple M1, clang 21). Also a simple_bench, which just measures throughput:
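For reference, a naive throughput loop of this shape looks roughly as follows (a sketch only: the test string, iteration count, and reporting are my assumptions, not the actual simple_bench code):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

#include "fast_float/fast_float.h"

// Sketch of a naive throughput benchmark: parse the same string N times
// and report bytes parsed per elapsed nanosecond, which equals GB/s.
int main() {
  const std::string str = "3.1416";
  const std::size_t N = 10000000;
  double value = 0;
  double sum = 0;
  auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < N; i++) {
    fast_float::from_chars(str.data(), str.data() + str.size(), value);
    sum += value; // keep the result live so the loop is not optimized away
  }
  auto end = std::chrono::steady_clock::now();
  double ns = static_cast<double>(
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
  std::printf("%.3f GB/s (checksum %f)\n", N * static_cast<double>(str.size()) / ns, sum);
  return 0;
}
```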
Actually, I inspected the assembly of the benchmark. It turns out it was not able to inline function calls. Compiling with …
@shikharish The libraries are header-only libraries (both counters and fast_float), so …
I have pushed a memcpy measurement. So if you run … You can independently measure it with an entirely different program: try this C++ file; just save it, compile it with `-O3`, and run it.

```cpp
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <memory>

int main(int argc, char* argv[]) {
  const size_t element_count = 15000;
  const size_t element_size = 16;
  const size_t buffer_size = element_count * element_size; // 240000 bytes
  unsigned int iterations = 1000;
  if (argc > 1) {
    iterations = std::atoi(argv[1]);
  }
  std::unique_ptr<char[]> src = std::make_unique<char[]>(buffer_size);
  std::unique_ptr<char[]> dst = std::make_unique<char[]>(buffer_size);
  // Initialize source buffer (arbitrary data)
  for (size_t i = 0; i < buffer_size; ++i) {
    src[i] = static_cast<char>(i);
  }
  // Warm-up: perform a few copies to fill caches
  for (unsigned int i = 0; i < 10; ++i) {
    std::memcpy(dst.get(), src.get(), buffer_size);
  }
  volatile char sink = 0;
  // Timed measurement
  auto start = std::chrono::high_resolution_clock::now();
  for (unsigned int i = 0; i < iterations; ++i) {
    std::memcpy(dst.get(), src.get(), buffer_size);
  }
  sink = sink + dst[0]; // Read the result so the copies are not optimized away
  auto end = std::chrono::high_resolution_clock::now();
  auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  std::cout << "duration " << duration_ns << " ns\n";
  double duration_sec = duration_ns / 1e9;
  double bytes_copied = static_cast<double>(buffer_size) * iterations;
  double speed_gbps = bytes_copied / duration_ns; // bytes per nanosecond == GB/s
  std::cout << "Buffer size: " << buffer_size << " bytes (" << buffer_size / 1024.0 << " KiB)\n";
  std::cout << "Iterations: " << iterations << "\n";
  std::cout << "Time: " << duration_sec << " seconds\n";
  std::cout << "Memcpy speed: " << speed_gbps << " GB/s\n";
  return EXIT_SUCCESS;
}
```
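For example, assuming the file is saved as `memcpy_bench.cpp` (the file name is mine):

```
c++ -O3 memcpy_bench.cpp -o memcpy_bench
./memcpy_bench 1000
```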
Now, if you are getting that the memcpy speed measured the naive way is faster than the approach using the new … Inlining might be a concern, but I have ensured in the latest push that pretty much everything can get inlined.
The memcpy test runs at similar speeds. But there is still a big difference in the actual benchmarks when I compile with Apple LLVM (clang 14).
The issue is simply that the `counters::bench` code is complex and heavily templated, and the compiler is not able to optimize it properly. … and got much faster results, matching my simple_bench:
The macro I defined here:
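For context, force-inline macros of this kind are usually defined along these lines (a sketch of the common pattern; the actual definition in the counters library may differ):

```cpp
// Portable force-inline hint (sketch; not the exact counters definition).
#if defined(_MSC_VER)
#define COUNTERS_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define COUNTERS_FORCE_INLINE inline __attribute__((always_inline))
#else
#define COUNTERS_FORCE_INLINE inline
#endif
```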
Even better. I just removed all the compile-time …

```cpp
template <typename Func>
COUNTERS_FORCE_INLINE void call_ntimes_runtime(Func &&func, size_t M) {
  for (size_t i = 0; i < M; i++) {
    func();
  }
}
```

Same results as … It works better for this case; I don't know if it will for other cases. I can open a PR in the counters repo if you want.
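For illustration, a call site for the runtime variant might look like this (hypothetical usage; `parse_all_ips` is a made-up function, not code from the PR):

```cpp
// Run the benchmarked function 1000 times with a plain runtime loop.
call_ntimes_runtime([&]() { parse_all_ips(); }, 1000);
```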
@shikharish Why are you using LLVM 14? The current Apple LLVM is 17.
If the issue is that the benchmark framework introduces overhead and is therefore likely to produce pessimistic measures, how do you account for the fact that the fastest function possible over this data size (a memcpy) does not run slower?
I do not understand this statement. What do you mean by "optimize properly"?
For short functions, this will include non-trivial loop overhead.
This benchmark code that you are proposing is not robust to short functions. In the case of the functions that we are benchmarking in bench_ip, the … Thus we call …

```cpp
// Compile-time specialized bench implementation for a fixed inner repeat M.
template <size_t M, class Function>
event_aggregate bench_impl(Function &&function, size_t min_repeat,
                           size_t min_time_ns, size_t max_repeat) {
  static thread_local event_collector collector;
  auto fn = std::forward<Function>(function);
  size_t N = min_repeat;
  if (N == 0)
    N = 1;
  // Warm-up
  event_aggregate warm_aggregate{};
  for (size_t i = 0; i < N; i++) {
    collector.start();
    call_ntimes<M>(fn);
    event_count allocate_count = collector.end();
    warm_aggregate << allocate_count;
    if ((i + 1 == N) && (warm_aggregate.total_elapsed_ns() < min_time_ns) &&
        (N < max_repeat)) {
      N *= 10;
    }
  }
  // Measurement
  event_aggregate aggregate{};
  for (size_t i = 0; i < N; i++) {
    collector.start();
    call_ntimes<M>(fn);
    event_count allocate_count = collector.end();
    aggregate << allocate_count;
  }
  aggregate /= M;
  aggregate.inner_count = M;
  return aggregate;
}
```

But with:

```cpp
template <std::size_t M, typename Func> void call_ntimes(Func &&func) {
  if constexpr (M == 1) {
    func();
    return;
  }
  ...
```

If you compile with optimizations, this will surely get inlined.
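The elided part is not shown above; a plausible completion (a sketch only, not the actual counters implementation) unrolls the calls recursively at compile time, so that a small M produces M back-to-back calls with no loop counter:

```cpp
// Sketch of compile-time unrolling (assumed completion; the real counters
// code may differ): emit M calls with no runtime loop.
template <std::size_t M, typename Func>
void call_ntimes_sketch(Func &&func) {
  if constexpr (M == 1) {
    func();
  } else {
    func();
    call_ntimes_sketch<M - 1>(func); // unrolled recursively at compile time
  }
}
```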
Statements like "it works better" or "compiler is not able to optimize properly" are not helpful to me. For example, if you say that the compiler does a poor job (which is entirely possible), then I expect an analysis of the produced ASM. The problem is that we end up with statements such that …

Now, these things can get tricky. In the 2 minutes I spent in lldb, I get the impression that in the benchmark here, neither std::from_chars nor fastfloat::from_chars is inlined. So we have loops with function calls. This is fair. Good.

Now, you are giving me quite different answers, but you are not explaining the difference. Why is fastfloat::from_chars so much faster in your small benchmarks than in this benchmark with std::from_chars? A possibility, of course, is that in your benchmark, fastfloat::from_chars is inlined, whereas std::from_chars is not. Then, of course, that would not be a direct comparison. In such a case, the performance difference might then be entirely due to the inlining, and so we would be measuring the benefits of inlining. That's fine, but we have to be clear about what we are measuring.
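One way to check this (an assumed workflow, not something stated in the thread): disassemble the optimized benchmark binary, for example with `objdump -d` (or `otool -tv` on macOS), and look for remaining call instructions targeting `from_chars`; if such calls survive, the function was not inlined at that call site.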
Let me be clearer: we have not optimized in any way the … So I am slightly slower than the standard library. You are getting that fastfloat is already nearly 3x faster than the standard library.
Thus I believe that your results are wrong. Now, I'd be happy to be proven wrong, but my stance is the reasonable one. It is very difficult to be 2x to 3x faster than the standard library. Possible, certainly, but rarely by accident and without tradeoff.
The benchmarks I posted above were after cherry-picking my commits from the "parsing uint8_t" PR. I apologize for not mentioning that before.
I am using macOS 13.x. I believe I would have to do a system update (which I am avoiding) to get LLVM 17.
Yes, what I meant was: the following benchmarks were run without cherry-picking any commits, that is, the current fast_float::from_chars implementation without any … So we get this: … And the simple_bench is inlining both. So inlined … And after I add my commits from the other PR, fast_float (inlined) throughput shoots to 0.82.
It is strange, but it is what I am seeing on my machine: … Inspecting the assembly, it turns out … I am not really sure why this happens :/
@lemire Please have a look: #1. This adds a stronger hint to inline all functions being called inside that function (as well as the function itself).
Please see #352 |
The IP benchmark was a bit complicated. I simplified the code a bit. I have added a control where you can see the speed without number parsing (so `just_seek_ip_end`).
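A control of this kind typically scans over the characters of the address without converting any digits. A minimal sketch (assumed shape; not necessarily the exact code in #352):

```cpp
// Advance past the digits and dots of an IPv4 literal such as "192.168.0.1"
// without parsing any numbers (sketch; the name and shape are assumptions).
const char *just_seek_ip_end_sketch(const char *p, const char *end) {
  while (p != end && ((*p >= '0' && *p <= '9') || *p == '.')) {
    p++;
  }
  return p;
}
```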