“There are only two hard things in Computer Science: cache invalidation and naming things.” - Phil Karlton.
Cache eviction is the process of removing data from a cache memory system when it reaches capacity to make room for new data. It's essential because cache memory is limited and significantly more expensive than primary storage. Without eviction strategies, caches would fill up and become ineffective, defeating their purpose of improving system performance through faster data access.
Understanding Cache Eviction Through Trump's Trade War
Least Recently Used (LRU)
Removes the item that has not been used for the longest time. This strategy assumes that items used recently are more likely to be used again soon.
In LRU cache eviction, items that haven't been accessed for the longest time are removed first.
Imagine Trump maintaining trade disputes with multiple countries. When he needs resources for a new trade battle, he abandons negotiations with countries he hasn't engaged with recently (like Estonia) to focus on frequent adversaries (like China).
Provides excellent performance for workloads with temporal locality, optimizing for recently accessed data. This typically offers good hit rates for user session data and interactive applications but may perform poorly during sequential scans that can flush important data.
When to Use: Ideal for workloads with temporal locality where recently accessed items are likely to be needed again soon, such as web browsing, session management, and most interactive applications.
When Not to Use: Avoid with scanning workloads or when access patterns lack temporal locality, as these can flush important but infrequently accessed data from the cache.
from functools import lru_cache
from pyspark.sql.functions import col, sum as spark_sum

# Function that caches results using LRU strategy (memoized on the driver)
@lru_cache(maxsize=1000)
def calculate_metrics(product_id):
    return (df.filter(col("product_id") == product_id)
              .groupBy("region")
              .agg(spark_sum("revenue").alias("revenue"))
              .collect())

# Apply the function per distinct product on the driver
# (DataFrame operations cannot run inside executor-side lambdas)
regional_metrics = {row.product_id: calculate_metrics(row.product_id)
                    for row in df.select("product_id").distinct().collect()}
Real-World Use Cases:
- Customer Data Platform: Caching frequent customer profiles for real-time personalization, evicting profiles that haven't been viewed recently
- API Gateway: Caching responses for frequently accessed endpoints while evicting results for rarely accessed paths
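The functools.lru_cache decorator above hides the eviction mechanics. As a rough illustration of what happens under the hood, here is a minimal stand-alone LRU cache sketch built on OrderedDict; the class name and default capacity are illustrative, not tied to any particular library.

from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch: the most recently used keys sit at the end."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used key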
Least Frequently Used (LFU)
Evicts the item that has been accessed the least number of times. It prioritizes keeping frequently accessed items in the cache.
LFU removes items that have been accessed the least number of times, tracking item popularity.
Trump allocates most negotiation resources to countries with the highest trade volumes (China, EU) and abandons disputes with minimal trading partners (Liechtenstein) when capacity is needed.
Excels in scenarios with stable popularity distributions, like content delivery networks, by keeping frequently accessed items. This improves hit rates for consistently popular items but can struggle with evolving access patterns and introduces computational overhead for tracking frequencies.
When to Use: Best for workloads with stable popularity distributions where certain items are consistently accessed more than others, such as content delivery networks or reference data lookups.
When Not to Use: Avoid when popularity patterns change rapidly or when new items need time to build up frequency counts, as this creates a "cold start" problem.
from pyspark.sql.functions import col

# Track access frequency of data items
frequency_df = df.groupBy("item_id").count().withColumnRenamed("count", "frequency")

# Join with original data and keep the highest-frequency items in the cache
cached_df = (df.join(frequency_df, "item_id")
               .orderBy(col("frequency").desc())
               .limit(10000))  # cache size
Real-World Use Cases:
- Recommendation Engine: Caching popular product data that receives frequent lookups
- Data Warehouse Query: Caching frequently accessed dimension tables to speed up common joins
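For comparison with the Spark snippet above, here is a minimal, illustrative LFU cache in plain Python that tracks per-key access counts and evicts the least frequently used key; it is a sketch, not an optimized O(1) LFU implementation.

class LFUCache:
    """Minimal LFU cache sketch: evicts the key with the lowest access count."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.values = {}
        self.counts = {}

    def get(self, key):
        if key not in self.values:
            return None
        self.counts[key] += 1  # record another access
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            # Evict the least frequently used key (ties broken arbitrarily)
            victim = min(self.counts, key=self.counts.get)
            del self.values[victim], self.counts[victim]
        self.values[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1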
Most Recently Used (MRU)
Deletes the most recently accessed item first. This approach is useful in scenarios where older data is more valuable than newer data.
MRU evicts the most recently accessed items first, ideal when older items are more valuable.
After signing a trade deal with China, Trump immediately shifts focus away from China (the most recently "used" relationship) to pursue new disputes, assuming the China issue is temporarily resolved.
Offers simple implementation with low overhead, making it suitable for streaming workloads and sequential data access. However, it may result in lower hit rates for workloads with complex access patterns since it doesn't consider actual usage.
When to Use: Appropriate for access patterns where recently accessed items are unlikely to be needed again soon, such as sequential scans or cycling through large datasets.
When Not to Use: Avoid for most typical workloads that exhibit temporal locality, as MRU would counterintuitively remove the items most likely to be accessed again.
from pyspark.sql.functions import col, max as spark_max
# Track last access time for each item
access_df = df.groupBy("item_id").agg(spark_max("access_time").alias("last_access"))
# When the cache fills, keep the items with the oldest last access (evict the most recent)
cached_df = df.join(access_df, "item_id").orderBy(col("last_access").asc()).limit(10000)
Real-World Use Cases:
- Time-Series Analysis: Caching historical data while evicting recently processed data points
- Anomaly Detection: Maintaining baseline historical patterns while evicting recent data that may contain anomalies
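A minimal MRU sketch in plain Python (illustrative names): the only difference from the LRU sketch earlier is which end of the ordering gets evicted.

from collections import OrderedDict

class MRUCache:
    """Minimal MRU cache sketch: evicts the most recently used key."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)        # most recently used key sits at the end
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=True)  # evict the most recently used key
        self.items[key] = value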
Time to Live (TTL)
Each item in the cache has a predefined expiration time. Once the time elapses, the item is removed regardless of usage frequency or recency.
TTL assigns an expiration time to cached items, automatically removing them when that time elapses.
Trump implements tariffs with sunset provisions—25% tariffs on EU steel for 180 days. After the predetermined period, the policy expires automatically regardless of attention it's receiving.
When to Use: Perfect for time-sensitive data that becomes stale after a certain period, such as market prices, weather data, or any information that requires regular refreshing.
When Not to Use: Avoid when data validity isn't time-dependent or when access patterns are more important than age for determining cache value.
from pyspark.sql.functions import current_timestamp, col
import time

# Define a cache with TTL functionality
class TTLCache:
    def __init__(self, spark_session, ttl_seconds=3600):
        self.spark = spark_session
        self.cache_data = {}
        self.ttl_seconds = ttl_seconds

    def get(self, key):
        if key in self.cache_data:
            entry = self.cache_data[key]
            # Check if entry has expired
            if time.time() - entry['timestamp'] < self.ttl_seconds:
                return entry['data']
            # Remove expired entry
            del self.cache_data[key]
        return None

    def put(self, key, data):
        self.cache_data[key] = {
            'data': data,
            'timestamp': time.time()
        }

# Usage example
ttl_cache = TTLCache(spark, ttl_seconds=3600)

def get_data_with_ttl(query_key):
    cached_data = ttl_cache.get(query_key)
    if cached_data is not None:
        return cached_data
    # If not in cache or expired, execute the query and cache the result
    result_df = spark.sql(f"SELECT * FROM large_table WHERE key = '{query_key}'")
    ttl_cache.put(query_key, result_df)
    return result_df
# Implementing TTL with DataFrame caching:
# filter cached data to include only non-expired items based on TTL
def get_valid_cache(cached_df, ttl_seconds=3600):
    if cached_df is None:
        return None
    # Add the current unix timestamp (in seconds) for comparison
    df_with_current = cached_df.withColumn("current_time", current_timestamp().cast("long"))
    # Keep only records younger than the TTL (cache_time is assumed to be a timestamp column)
    valid_cache = df_with_current.filter(
        col("current_time") - col("cache_time").cast("long") <= ttl_seconds
    ).drop("current_time")
    return valid_cache
Per-Row TTL Values
For more granular control where different data requires different expiration times:
from pyspark.sql.functions import when, col

# Create a DataFrame with a TTL column based on data characteristics
data_with_ttl = df.withColumn(
    "ttl_value",
    when(col("data_type") == "sensitive", 1800)    # 30 minutes for sensitive data
    .when(col("data_type") == "standard", 86400)   # 24 hours for standard data
    .otherwise(604800)                             # 1 week for other data
)
Real-World Use Cases:
- Financial Data Pipeline: Caching market quotes with short TTLs based on volatility
- IoT Monitoring: Caching sensor readings with TTLs aligned to data freshness requirements
First In, First Out (FIFO)
Removes items in the order they were added to the cache, without considering usage frequency or recency.
FIFO evicts items in the exact order they were added to the cache, regardless of usage.
Trump addresses trade disputes chronologically. When negotiating bandwidth fills up, he concludes the oldest dispute first (Canadian lumber) before moving to newer ones (Chinese technology), simply because it started first.
When to Use: Suitable for sequential access patterns or when implementation simplicity is prioritized over hit rate optimization, such as streaming data or log processing.
When Not to Use: Avoid when access patterns aren't sequential or when maximizing cache hit rates is critical to application performance.
from pyspark.sql.functions import current_timestamp, col

# Add an ingestion timestamp to track insertion order
df_with_timestamp = df.withColumn("ingestion_time", current_timestamp())

# When the cache fills, keep only the most recently added items (evict the oldest first)
cached_df = df_with_timestamp.orderBy(col("ingestion_time").desc()).limit(10000)
Real-World Use Cases:
- Streaming Log Analysis: Caching recent log entries while systematically removing oldest entries
- Social Media Pipeline: Caching recent posts while evicting older content as new posts arrive
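The FIFO mechanism itself is just a queue. Here is a minimal plain-Python sketch (illustrative names) in which reads never reorder entries, so the oldest insertion is always evicted first.

from collections import OrderedDict

class FIFOCache:
    """Minimal FIFO cache sketch: evicts in insertion order, ignoring accesses."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        # No reordering on access: insertion order alone decides eviction
        return self.items.get(key)

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the oldest inserted key
        self.items[key] = value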
Random Replacement (RR)
Evicts a randomly chosen item from the cache. This method is simple but does not consider any usage patterns.
RR evicts randomly selected items when the cache needs space, without considering usage patterns.
When Trump needs to free up diplomatic capacity, he randomly abandons certain trade disputes regardless of their importance or recency, perhaps suddenly declaring victory with Malaysia even though nothing changed.
When to Use: Appropriate when computational overhead must be minimized or when access patterns are truly unpredictable, such as in high-throughput, low-latency systems where eviction decision time is critical.
When Not to Use: Avoid when access patterns have clear locality or when maximizing cache hit rates is more important than minimizing eviction overhead.
from pyspark.sql.functions import rand

# Add a random value to each record
df_with_random = df.withColumn("random_val", rand())

# Randomly select items to keep in the cache
cached_df = df_with_random.orderBy("random_val").limit(10000).drop("random_val")
Real-World Use Cases:
- Distributed Cache System: Using random eviction when computational overhead of tracking access patterns is too high
- High-Throughput Data Ingestion: Implementing lightweight caching where eviction strategy complexity would create bottlenecks
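As a plain-Python counterpart to the Spark snippet, a minimal random-replacement sketch (illustrative names) simply evicts an arbitrary key chosen with random.choice when the cache is full.

import random

class RandomCache:
    """Minimal random-replacement sketch: evicts an arbitrary key when full."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = {}

    def get(self, key):
        return self.items.get(key)

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            victim = random.choice(list(self.items))  # no usage tracking at all
            del self.items[victim]
        self.items[key] = value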
Two-Tiered Caching
This approach divides the cache into a small "hot" tier for frequently accessed data and a larger "cold" tier for less frequently accessed data.
Trump maintains two trade priority lists: a "critical" list (China, EU, Canada) receiving constant attention and a "secondary" list (Vietnam, Chile) addressed only when capacity allows. When focusing on new issues, he first drops something from the secondary list.
When to Use: Ideal when data access patterns show clear separation between hot and cold data, or when different data types require different retention policies, such as mixed workloads combining transactional and analytical processing.
When Not to Use: Avoid when access patterns are uniform or when the overhead of managing and coordinating multiple cache tiers outweighs potential benefits.
from pyspark.sql.functions import when, col

# Divide data into hot and cold tiers based on access frequency
tiered_df = df.withColumn("tier", when(col("access_count") > 100, "hot").otherwise("cold"))

# Apply different retention policies to each tier
hot_cache = tiered_df.filter(col("tier") == "hot").limit(1000)    # smaller, premium cache
cold_cache = tiered_df.filter(col("tier") == "cold").limit(9000)  # larger secondary cache

# Union the two tiers for the complete cache
complete_cache = hot_cache.union(cold_cache)
Real-World Use Cases:
- Business Intelligence Platform: Keeping common dashboard data in hot tier while less frequent reports use cold tier
- E-commerce Catalog: Caching bestselling products in hot tier with longer retention while seasonal items use cold tier with aggressive eviction
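An application-level two-tier sketch is shown below (plain Python, illustrative names, reusing the LRUCache class from the LRU section): a small hot tier sits in front of a larger cold tier, and items are promoted to the hot tier when accessed. Evicted hot entries are simply dropped in this sketch rather than demoted.

class TwoTierCache:
    """Minimal two-tier sketch: small hot tier in front of a larger cold tier."""
    def __init__(self, hot_capacity=1000, cold_capacity=9000):
        self.hot = LRUCache(capacity=hot_capacity)    # frequently accessed data
        self.cold = LRUCache(capacity=cold_capacity)  # everything else

    def get(self, key):
        value = self.hot.get(key)
        if value is not None:
            return value
        value = self.cold.get(key)
        if value is not None:
            self.hot.put(key, value)  # promote to the hot tier on access
        return value

    def put(self, key, value):
        self.cold.put(key, value)     # new items start in the cold tier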
Cache eviction strategies must be carefully selected based on your data access patterns, system constraints, and performance requirements to optimize resource utilization while maintaining high throughput and low latency.
Points to ponder:
How does cache eviction impact the performance of a data engineering pipeline?
Cache eviction significantly impacts data engineering pipeline performance by determining which data remains readily available for processing and which gets removed when capacity limits are reached. A well-implemented cache eviction strategy can:
Reduce latency: Proper cache eviction ensures that frequently accessed data remains in high-speed memory, minimizing the time required to retrieve that data during pipeline execution. When the right data stays cached, processors can access it quickly without waiting for retrieval from slower storage.
Increase throughput: With faster access to data, the pipeline can process more records in less time. This directly translates to higher overall system throughput as computational resources spend less time waiting for data.
Minimize pipeline stalls: Cache misses (when requested data isn't found in cache) can cause processing stalls while the system fetches data from slower main storage. Effective eviction policies reduce these stalls by keeping the most relevant data cached.
Optimize resource utilization: Cache memory is limited and significantly more expensive than primary storage. Eviction strategies ensure optimal use of this limited resource by balancing between retaining valuable data and making space for new entries.
Prevent bottlenecks: In data pipelines where certain transformations rely on lookup tables or reference data, having this information cached can prevent bottlenecks. The eviction policy determines which reference data remains accessible.
When cache eviction is misconfigured, it can lead to increased latency, decreased throughput, and suboptimal resource usage that degrades overall pipeline performance.
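One way to quantify these effects is to instrument the cache itself. The sketch below (illustrative names; it wraps any cache object whose get returns None on a miss) tracks hits and misses so the hit ratio, which drives latency and throughput, can be monitored as eviction settings change.

class InstrumentedCache:
    """Wraps any cache exposing get/put and tracks hit/miss statistics."""
    def __init__(self, cache):
        self.cache = cache
        self.hits = 0
        self.misses = 0

    def get(self, key):
        value = self.cache.get(key)
        if value is None:
            self.misses += 1  # cache miss: caller must fetch from slower storage
        else:
            self.hits += 1    # cache hit: data served from fast memory
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0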
What are the advantages of using LFU over LRU in caching?
LFU (Least Frequently Used) offers several distinct advantages over LRU (Least Recently Used) caching in specific scenarios:
Better handling of frequency-based access patterns: LFU excels when some items are accessed much more frequently than others, irrespective of recency. Because it prioritizes items with higher access counts, it can deliver noticeably fewer cache misses than LRU in such scenarios.
Resistance to cache pollution: LFU is less susceptible to "cache pollution" where one-off or infrequent access patterns temporarily displace frequently used items. While LRU would immediately prioritize a newly accessed item, LFU requires consistent access patterns to elevate an item's priority.
Performance optimization for popular content: For applications like recommendation systems, news sites, or analytics platforms where certain data items are consistently popular over time, LFU keeps these high-demand items in cache regardless of recent access patterns.
Balanced long-term performance: For workloads with stable popularity distributions, LFU's hit rates tend to hold up better over time than recency-based policies, because short-lived access spikes cannot displace consistently popular items.
Ideal for specific use cases: LFU is particularly well-suited for:
- Recommendation engines and content delivery networks serving popular items
- Analytics systems where certain dimensions or metrics are queried frequently
- Database query result caching where some queries are consistently popular
Lower backend load and cost: Keeping consistently popular items cached reduces repeated retrievals from backend systems, which can translate into lower infrastructure costs for read-heavy workloads.
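To make the cache-pollution point concrete, the sketch below replays a synthetic trace of hot keys interrupted by a one-off scan against the illustrative LRUCache and LFUCache classes from the earlier sections; the trace and capacities are arbitrary, but they show LFU holding on to the hot keys while LRU lets the scan flush them.

def replay(cache, accesses):
    """Replay an access trace against a cache, counting hits."""
    hits = 0
    for key in accesses:
        if cache.get(key) is not None:
            hits += 1
        else:
            cache.put(key, f"value-{key}")
    return hits

# Hot keys accessed repeatedly, a one-off scan of 100 cold keys, then the hot keys again
trace = ["a", "b", "c"] * 20 + [f"scan-{i}" for i in range(100)] + ["a", "b", "c"] * 20

lru_hits = replay(LRUCache(capacity=10), trace)
lfu_hits = replay(LFUCache(capacity=10), trace)
print(f"LRU hits: {lru_hits}, LFU hits: {lfu_hits}")  # LFU keeps a, b, c cached through the scan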
What role does cache eviction play in managing resource constraints?
Cache eviction is crucial for managing resource constraints in data systems through several mechanisms:
Memory optimization: By systematically removing less valuable data, eviction policies ensure the limited, expensive cache memory is used for the most performance-critical data. This prevents memory exhaustion while maintaining performance benefits for important workloads.
Resource allocation efficiency: Eviction strategies help prioritize which data deserves to occupy limited resources based on access patterns, importance, or freshness. This intelligent allocation ensures critical data remains cached while less important data is sacrificed when resources are constrained.
Preventing performance degradation: Without proper eviction, caches would fill up completely, leading to thrashing where the system constantly swaps data in and out. Effective eviction policies ensure smooth operation even under memory pressure.
Balancing competing workloads: In multi-tenant systems, eviction policies help ensure fair resource distribution among different applications or users, preventing one workload from monopolizing the cache.
Scaling cost management: Cache eviction extends the usefulness of existing hardware by maximizing hit rates without requiring hardware upgrades. This helps manage infrastructure costs by getting more performance from existing resources.
Reducing backend load: By maintaining an optimally filled cache with the right data, eviction policies reduce the load on backend databases and storage systems, which might have their own resource constraints.
Proper eviction strategies transform cache from a simple speed optimization into a sophisticated resource management tool that helps systems perform well under various constraint scenarios.
How can automatic cache eviction triggers be implemented?
Automatic cache eviction triggers can be implemented through several approaches:
Memory-pressure based triggers:
- Monitor memory usage thresholds (e.g., 80% capacity)
- Implement watermark systems with low/high boundaries
- Trigger batch evictions when thresholds are exceeded
- Use memory monitors that integrate with system memory management
Time-based mechanisms:
- Implement background jobs that periodically scan for expired items
- Use scheduled eviction processes running at configured intervals
- Leverage decay functions that gradually decrease item priority over time
- Set TTL (Time To Live) metadata on cached objects at insertion time
Access-pattern analysis:
- Track real-time hit/miss ratios and adjust eviction aggressiveness
- Implement adaptive algorithms that modify policies based on workload changes
- Deploy machine learning models to predict future access patterns
- Use counters and timestamps to implement LRU/LFU logic
Event-driven triggers:
- Trigger evictions on specific application events (e.g., user logout)
- Set cache invalidation hooks on data updates in the source system
- Implement publish/subscribe mechanisms for coordinated eviction
- Use distributed event buses for cluster-wide cache coordination
Implementation strategies:
- Embed triggers in data access layers and cache clients
- Use cache-native controls such as Redis's maxmemory and maxmemory-policy settings
- Implement custom cache wrappers with eviction logic (a minimal sketch follows this list)
- Leverage databases' built-in caching mechanisms with their eviction settings
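As a minimal, application-level sketch of the first two trigger types above (hypothetical names; memory pressure is approximated by entry count rather than actual bytes), the class below batch-evicts down to a low watermark when a high watermark is crossed, and runs a background thread that periodically sweeps out expired entries.

import threading
import time

class TriggeredCache:
    """Sketch of watermark-based and time-based eviction triggers."""
    def __init__(self, capacity=1000, high_watermark=0.8, low_watermark=0.6,
                 ttl_seconds=3600, sweep_interval=60):
        self.high = int(capacity * high_watermark)
        self.low = int(capacity * low_watermark)
        self.ttl_seconds = ttl_seconds
        self.items = {}  # key -> (value, insert_time)
        self.lock = threading.Lock()
        # Time-based trigger: background sweep for expired entries
        sweeper = threading.Thread(target=self._sweep_loop, args=(sweep_interval,), daemon=True)
        sweeper.start()

    def put(self, key, value):
        with self.lock:
            self.items[key] = (value, time.time())
            # Pressure-based trigger: batch-evict oldest entries down to the low watermark
            if len(self.items) > self.high:
                oldest = sorted(self.items, key=lambda k: self.items[k][1])
                for victim in oldest[:len(self.items) - self.low]:
                    del self.items[victim]

    def get(self, key):
        with self.lock:
            entry = self.items.get(key)
            return entry[0] if entry else None

    def _sweep_loop(self, interval):
        while True:
            time.sleep(interval)
            now = time.time()
            with self.lock:
                expired = [k for k, (_, ts) in self.items.items() if now - ts > self.ttl_seconds]
                for k in expired:
                    del self.items[k]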
Effective implementation requires balancing the overhead of the eviction mechanism itself against the performance benefits of intelligent cache management.