Hi all,

I'm working with a large dataset (around 20M rows) containing event logs with timestamps, and I need to analyze hourly activity trends — e.g., peak usage hours, hourly averages, and patterns over weekdays vs weekends.

I’m currently using Pandas for grouping and visualization, but it’s starting to feel sluggish. Are there better tools or techniques for handling this at scale (maybe Dask, DuckDB, or something else)? I’m on an HP ZBook with 64GB RAM and an NVIDIA GPU if that helps.
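
Roughly, the kind of grouping I'm doing now looks like this (simplified sketch; the event_time column name is just a placeholder):

import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["event_time"])  # ~20M rows

df["hour"] = df["event_time"].dt.hour
df["is_weekend"] = df["event_time"].dt.dayofweek >= 5

# Hourly counts split by weekday vs weekend, plus the peak hour of each
hourly = df.groupby(["is_weekend", "hour"]).size()
peak_hours = hourly.groupby(level="is_weekend").idxmax()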

Any suggestions for improving speed or best practices for hour-based grouping/visualization would be appreciated.

Thanks!
Jhonn Mick

1 REPLY

Hi @jhonnmarie,

For handling large datasets efficiently, especially with your hardware setup, there are several tools and techniques you could consider:

1. Dask
Dask is designed to parallelize operations across cores, which can help you handle larger-than-memory datasets. It integrates well with Pandas, allowing you to run your existing code with minimal changes while improving performance:

- Parallelizes Pandas operations across multiple cores or distributed systems.
- Scales calculations easily and makes effective use of your CPU.
- Works well with your HP ZBook by using the full potential of the machine's resources.
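
A minimal sketch of the hourly grouping in Dask (assuming the events are in CSV files and a timestamp column named event_time; both are illustrative):

import dask.dataframe as dd

# Read the event log in parallel partitions instead of one big in-memory frame
ddf = dd.read_csv("events-*.csv", parse_dates=["event_time"])

ddf["hour"] = ddf["event_time"].dt.hour
ddf["is_weekend"] = ddf["event_time"].dt.dayofweek >= 5

# Lazy until compute(); the result is a small Pandas Series you can plot directly
hourly_counts = ddf.groupby(["is_weekend", "hour"]).size().compute()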


2. DuckDB
DuckDB is a fast, in-process SQL OLAP database management system with excellent performance for analytical workloads:

- Efficiently handles large-scale data analysis tasks with a focus on speed.
- Supports Pandas DataFrame input/output, making integration straightforward.
- Optimized for analytical queries, which should benefit your event log analysis tasks.
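
For example, DuckDB can query a Pandas DataFrame (or Parquet/CSV files) in place; a sketch, again using an illustrative event_time column:

import duckdb
import pandas as pd

df = pd.read_parquet("events.parquet")  # or reuse your existing DataFrame

# DuckDB can scan the local DataFrame "df" directly from SQL
hourly = duckdb.query("""
    SELECT
        date_part('hour', event_time) AS hour_of_day,
        (date_part('dow', event_time) IN (0, 6)) AS is_weekend,
        count(*) AS events
    FROM df
    GROUP BY 1, 2
    ORDER BY 1, 2
""").to_df()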


3. RapidsAI (GPU-Accelerated Data Science)
With an NVIDIA GPU, consider using RapidsAI, which provides a suite of open-source software for executing end-to-end data science and analytics pipelines entirely on GPUs:

- Leverages the GPU to accelerate DataFrame and graph analytics.
- Built on Apache Arrow for interoperability with Pandas; its cuDF library offers a Pandas-like API that runs on the GPU.
- Drastically reduces computation times for data preparation, analytics, and machine learning tasks.
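
A sketch of the same aggregation with cuDF, whose API mirrors Pandas (column name illustrative):

import cudf

# Load straight into GPU memory
gdf = cudf.read_parquet("events.parquet")

gdf["hour"] = gdf["event_time"].dt.hour
gdf["is_weekend"] = gdf["event_time"].dt.dayofweek >= 5

# GroupBy runs on the GPU; bring the small result back to Pandas for plotting
hourly_counts = gdf.groupby(["is_weekend", "hour"]).size().to_pandas()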


4. PostgreSQL with TimescaleDB Extension
For more complex time-series queries, using a database like PostgreSQL with the TimescaleDB extension can be beneficial:

- Optimizes the storage and querying of time-series data, which fits your usage pattern well.
- Handles time-based aggregations (e.g., hourly rollups with time_bucket) more efficiently than plain PostgreSQL tables.
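
If the logs were loaded into a TimescaleDB hypertable, the hourly rollup could be expressed with time_bucket; a sketch from Python via psycopg2 (connection string, table, and column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=events user=analyst")  # placeholder DSN

query = """
    SELECT time_bucket('1 hour', event_time) AS bucket,
           count(*) AS events
    FROM event_log
    GROUP BY bucket
    ORDER BY bucket;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    rows = cur.fetchall()  # list of (bucket_timestamp, event_count) tuples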


Recommendations
Given your setup with 64GB RAM and an NVIDIA GPU, RapidsAI can be particularly valuable because of its GPU-accelerated performance. For ease of transition and visualization, stick with tools that integrate simply with Pandas, such as Dask, or bring in DuckDB for SQL-style queries.
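
Whichever engine you choose, a pattern that keeps visualization simple is to do the heavy aggregation there and hand the small hourly result back to Pandas/matplotlib. A sketch, using the hourly_counts Series from the Dask example above:

import matplotlib.pyplot as plt

# hourly_counts: small Pandas Series indexed by (is_weekend, hour),
# e.g. as produced by the Dask sketch above
pivot = hourly_counts.unstack("is_weekend")

pivot.plot(kind="bar", figsize=(12, 4))
plt.xlabel("Hour of day")
plt.ylabel("Events")
plt.title("Hourly activity: weekday vs weekend")
plt.tight_layout()
plt.show()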

I am an HP Employee.