<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Praghadeesh]]></title><description><![CDATA[Praghadeesh]]></description><link>https://blog.praghadeesh.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 28 Apr 2026 17:09:47 GMT</lastBuildDate><atom:link href="https://blog.praghadeesh.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding parquet a bit more]]></title><description><![CDATA[While experimenting with a Spark job to read the same dataset in both CSV and Parquet formats, I observed that queries on Parquet were significantly faster. I was aware that Parquet, being a columnar storage format, is designed to be performant, but ...]]></description><link>https://blog.praghadeesh.com/understanding-parquet-a-bit-more</link><guid isPermaLink="true">https://blog.praghadeesh.com/understanding-parquet-a-bit-more</guid><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Sat, 02 Aug 2025 09:31:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hpjSkU2UYSU/upload/bd48efbe6bd2990b5ff1f5c9c1046df4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While experimenting with a Spark job to read the same dataset in both <strong>CSV</strong> and <strong>Parquet</strong> formats, I observed that queries on Parquet were significantly faster. I was aware that Parquet, being a <strong>columnar storage format</strong>, is designed to be performant, but I wanted to better understand <strong>why</strong> it consistently outperforms CSV - especially in <strong>OLAP workloads</strong>.</p>
<p>Let’s explore the reasons behind Parquet’s performance advantage, and examine the features that make it particularly efficient for analytical queries.</p>
<p>The following PySpark code snippet demonstrates generating a synthetic dataset of <strong>500,000 records</strong> using the <strong>Faker</strong> library. The dataset is then persisted in both <strong>CSV</strong> and <strong>Parquet</strong> formats for comparison.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> StructType, StructField, IntegerType, StringType, BooleanType
<span class="hljs-keyword">from</span> faker <span class="hljs-keyword">import</span> Faker
<span class="hljs-keyword">import</span> time

spark = SparkSession.builder \
    .appName(<span class="hljs-string">"CSV vs Parquet Performance"</span>) \
    .config(<span class="hljs-string">"spark.driver.bindAddress"</span>, <span class="hljs-string">"127.0.0.1"</span>) \
    .getOrCreate()

fake = Faker()

<span class="hljs-comment"># ---------------------------</span>
<span class="hljs-comment"># Generate synthetic dataset</span>
<span class="hljs-comment"># ---------------------------</span>
num_records = <span class="hljs-number">500</span>_000  

data = []
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(num_records):
    data.append((
        i,                                      <span class="hljs-comment"># id</span>
        fake.name(),                            <span class="hljs-comment"># name</span>
        fake.email(),                           <span class="hljs-comment"># email</span>
        fake.city(),                            <span class="hljs-comment"># city</span>
        fake.random_int(min=<span class="hljs-number">18</span>, max=<span class="hljs-number">80</span>),        <span class="hljs-comment"># age</span>
        fake.boolean()                          <span class="hljs-comment"># is_active</span>
    ))

columns = [<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"email"</span>, <span class="hljs-string">"city"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"is_active"</span>]

df = spark.createDataFrame(data, columns)

csv_path = <span class="hljs-string">"faker_csv_dataset"</span>
parquet_path = <span class="hljs-string">"faker_parquet_dataset"</span>

df.write.mode(<span class="hljs-string">"overwrite"</span>).csv(csv_path, header=<span class="hljs-literal">True</span>)
df.write.mode(<span class="hljs-string">"overwrite"</span>).parquet(parquet_path)

<span class="hljs-comment"># Define schema explicitly</span>
custom_schema = StructType([
    StructField(<span class="hljs-string">"id"</span>, IntegerType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"name"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"email"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"city"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"age"</span>, IntegerType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"is_active"</span>, BooleanType(), <span class="hljs-literal">True</span>)
])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">measure_time</span>(<span class="hljs-params">description, file_format, path, read_func</span>):</span>
    start = time.time()
    <span class="hljs-keyword">if</span> file_format == <span class="hljs-string">"csv"</span>:
        df = spark.read.format(file_format).schema(custom_schema).option(<span class="hljs-string">"header"</span>, <span class="hljs-literal">True</span>).load(path)
    <span class="hljs-keyword">else</span>:
        df = spark.read.format(file_format).load(path)
    result = read_func(df)
    elapsed = time.time() - start
    print(<span class="hljs-string">f"<span class="hljs-subst">{description}</span> | <span class="hljs-subst">{file_format.upper()}</span> Time: <span class="hljs-subst">{elapsed:<span class="hljs-number">.2</span>f}</span> sec"</span>)
    <span class="hljs-keyword">return</span> elapsed

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_all_columns</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.count()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_selected_columns</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.select(<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>).count()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_with_filter</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.filter(<span class="hljs-string">"age &gt; 40 AND is_active = true"</span>).count()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_count_only</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.count()

scenarios = [
    (<span class="hljs-string">"Full Scan"</span>, read_all_columns),
    (<span class="hljs-string">"Column Projection"</span>, read_selected_columns),
    (<span class="hljs-string">"Filtered Read"</span>, read_with_filter),
    (<span class="hljs-string">"Count Only"</span>, read_count_only)
]

results = []
<span class="hljs-keyword">for</span> desc, func <span class="hljs-keyword">in</span> scenarios:
    csv_time = measure_time(desc, <span class="hljs-string">"csv"</span>, csv_path, func)
    parquet_time = measure_time(desc, <span class="hljs-string">"parquet"</span>, parquet_path, func)
    results.append((desc, csv_time, parquet_time, csv_time/parquet_time))

print(<span class="hljs-string">"\n=== Performance Summary ==="</span>)
print(<span class="hljs-string">f"<span class="hljs-subst">{<span class="hljs-string">'Scenario'</span>:&lt;<span class="hljs-number">20</span>}</span><span class="hljs-subst">{<span class="hljs-string">'CSV (sec)'</span>:&lt;<span class="hljs-number">15</span>}</span><span class="hljs-subst">{<span class="hljs-string">'Parquet (sec)'</span>:&lt;<span class="hljs-number">15</span>}</span><span class="hljs-subst">{<span class="hljs-string">'Speedup'</span>}</span>"</span>)
<span class="hljs-keyword">for</span> desc, csv_t, parq_t, speedup <span class="hljs-keyword">in</span> results:
    print(<span class="hljs-string">f"<span class="hljs-subst">{desc:&lt;<span class="hljs-number">20</span>}</span><span class="hljs-subst">{csv_t:&lt;<span class="hljs-number">15.2</span>f}</span><span class="hljs-subst">{parq_t:&lt;<span class="hljs-number">15.2</span>f}</span><span class="hljs-subst">{speedup:<span class="hljs-number">.2</span>f}</span>x"</span>)
</code></pre>
<p>The time taken to read both the CSV and Parquet files under different scenarios is summarized below.</p>
<pre><code class="lang-python">=== Performance Summary ===
Scenario            CSV (sec)      Parquet (sec)  Speedup
Full Scan           <span class="hljs-number">1.69</span>           <span class="hljs-number">0.35</span>           <span class="hljs-number">4.90</span>x
Column Projection   <span class="hljs-number">0.30</span>           <span class="hljs-number">0.11</span>           <span class="hljs-number">2.58</span>x
Filtered Read       <span class="hljs-number">0.38</span>           <span class="hljs-number">0.19</span>           <span class="hljs-number">2.02</span>x
Count Only          <span class="hljs-number">0.10</span>           <span class="hljs-number">0.09</span>           <span class="hljs-number">1.07</span>x
</code></pre>
<p>Although both files contain the same dataset, the CSV file size is approximately <strong>31 MB</strong>, whereas the Parquet file size is only <strong>15 MB</strong> - roughly a <strong>50% reduction</strong>. This size difference stems from Parquet’s <strong>columnar storage layout, dictionary encoding, and compression</strong>. As dataset volume scales, this difference becomes more pronounced, with Parquet offering significant <strong>storage efficiency</strong> and <strong>I/O performance improvements</strong> compared to CSV.</p>
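<p>If you’d like to verify these numbers yourself, here is a minimal sketch (assuming the <code>faker_csv_dataset</code> and <code>faker_parquet_dataset</code> directories written by the earlier snippet) that sums the part-file sizes of each dataset directory:</p>
<pre><code class="lang-python">import os

def dir_size_mb(path):
    """Sum the sizes of all part files under a dataset directory."""
    total = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )
    return total / (1024 * 1024)

print(f"CSV size:     {dir_size_mb('faker_csv_dataset'):.2f} MB")
print(f"Parquet size: {dir_size_mb('faker_parquet_dataset'):.2f} MB")
</code></pre>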
<h3 id="heading-what-is-parquet">What is Parquet?</h3>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.</div>
</div>

<h3 id="heading-storage-layout-models">Storage layout models</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754120234076/d9afbf5b-670d-4f6a-ba1d-c7c7a206a9be.png" alt="Storage Layout Models" class="image--center mx-auto" /></p>
<p>As illustrated in the above diagram, <strong>row-oriented storage</strong> organizes data using <em>horizontal partitioning</em>, where all values of a row are stored contiguously. In contrast, <strong>column-oriented storage</strong> applies <em>vertical partitioning</em>, storing all values of a column contiguously. Additionally, modern formats such as Parquet adopt a <strong>hybrid approach</strong>, combining the advantages of both - organizing data in <strong>row groups</strong> (horizontal segmentation) while storing each column within those groups in a <strong>columnar layout</strong> for optimized compression and query performance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754120890221/eb16f747-43da-4e0b-aaf3-60f6eda0356d.png" alt class="image--center mx-auto" /></p>
<p><strong>Row-oriented storage</strong> is typically better suited for <strong>OLTP (Online Transaction Processing) workloads</strong>, where the system handles numerous small, transactional operations across different rows. In such systems, <strong>insert operations</strong> can simply append new rows to the end of the dataset, while <strong>update operations</strong> can directly locate the target row and modify the corresponding column values in place. This design optimizes for <strong>fast row-level writes and updates</strong> common in transactional systems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754121002668/d7ee189e-988c-445c-91a9-81d56324ac6a.png" alt class="image--center mx-auto" /></p>
<p><strong>Column-oriented storage</strong> is generally better suited for <strong>OLAP (Online Analytical Processing) workloads</strong>, which involve large-scale operations on a subset of columns. Unlike row stores, columnar formats are not ideal for OLTP scenarios, because inserting a new row requires updating multiple column segments stored in different locations - resulting in a <strong>fragmented memory access pattern</strong> and higher write overhead.</p>
<p>In OLAP workloads, however, this columnar design provides significant advantages:</p>
<ul>
<li><p><strong>Projection pushdown</strong> allows the query engine to read only the relevant columns instead of scanning the entire dataset (see the snippet after this list).</p>
</li>
<li><p><strong>Compression efficiency</strong> is improved because similar data values are stored adjacently within each column.</p>
</li>
</ul>
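<p>A quick way to observe projection pushdown on the dataset generated earlier is to inspect Spark’s physical plan; the sketch below assumes the same Spark session (plan text abridged and version-dependent). The <code>ReadSchema</code> entry shows that only the selected columns are read from disk:</p>
<pre><code class="lang-python">df = spark.read.parquet("faker_parquet_dataset")

# Only id and name should appear in the scan's ReadSchema
df.select("id", "name").explain()
# Plan excerpt: ... FileScan parquet ... ReadSchema: struct&lt;id:bigint,name:string&gt;
</code></pre>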
<h3 id="heading-fragmented-memory-access">Fragmented Memory access</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754121350411/2fcd1908-74b7-40b7-81f9-3d0497696f12.png" alt class="image--center mx-auto" /></p>
<p>As illustrated in the diagram above, when a query requires only <strong>columns A and C</strong>, a <strong>row-oriented format</strong> still necessitates reading entire rows, leading to a <strong>fragmented memory access pattern</strong> and introducing potential overhead. In contrast, a <strong>columnar format</strong> stores each column’s values contiguously, enabling the query engine to retrieve the required columns in a <strong>linear, sequential access pattern</strong>. This results in more efficient <strong>I/O utilization</strong> and faster query performance for column-specific analytical workloads.</p>
<p>That said, a <strong>pure columnar model</strong> is not always optimal for <strong>row reconstruction</strong>. Since values for each column are stored separately, reconstructing full rows often requires <strong>scanning multiple column segments</strong> and merging them during query execution. This can introduce additional <strong>CPU and memory overhead</strong>, particularly in workloads that frequently require complete row materialization.</p>
<p>This is where the <strong>hybrid storage model</strong> comes into play, combining the advantages of both <strong>row-oriented</strong> and <strong>column-oriented</strong> approaches.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754121949407/17b4aea2-d9d2-4201-87e0-f7fef3dc46f9.png" alt class="image--center mx-auto" /></p>
<p>A Parquet dataset on disk is not necessarily represented as a single physical file. Instead, it is typically organized as a <strong>directory structure</strong>, where the <strong>logical dataset</strong> is defined by the root directory. This root directory contains multiple <strong>Parquet part files</strong>.</p>
<pre><code class="lang-markdown">dataset<span class="hljs-emphasis">_root/
  ├── part-00000-uuid.parquet
  ├── part-00001-uuid.parquet
  └── part-00002-uuid.parquet</span>
</code></pre>
<h3 id="heading-parquet-data-organization">Parquet data organization</h3>
<p><img src="https://media.datacamp.com/cms/ad_4nxcuuincavq5rqwc42rsxrqtf_hrepxa5zaohmvbkyjdivivu2p79s8pkbiov5ws85byacezrthjzpkg_uk-b1gybmog8fszuf_edkdle1j36eixnmhqb7unprq4emw4phm__zrp.png" alt="Parque file internal structure." /></p>
<p><strong>Directory → Part Files → Row Groups → Column Chunks → Pages</strong></p>
<p>A single Parquet file can contain <strong>multiple Row Groups</strong>, each holding a horizontal slice of the dataset. The number and size of Row Groups within a file are determined by the <strong>configured Row Group size</strong> and the amount of data written.</p>
<h3 id="heading-row-groups"><strong>Row Groups</strong></h3>
<ul>
<li><p><strong>Horizontal partition</strong> of data inside a file.</p>
</li>
<li><p>Each Row Group contains <strong>all columns</strong> for a subset of rows.</p>
</li>
<li><p>Default size ≈ <strong>128 MB</strong> (configurable - see the sketch after this list).</p>
</li>
<li><p>Optimized for <strong>parallel reads</strong>: Each Row Group can be read independently.</p>
</li>
</ul>
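<p>As an illustrative sketch of controlling Row Group size at write time - shown here with <strong>PyArrow</strong>, assuming it is installed (Spark exposes a similar knob via the Hadoop configuration) - smaller Row Groups yield more, finer-grained skippable units at the cost of some per-group overhead:</p>
<pre><code class="lang-python">import pyarrow as pa
import pyarrow.parquet as pq

# A toy table: 500,000 rows, two columns
table = pa.table({"id": list(range(500_000)), "age": [25] * 500_000})

# Ask the writer for ~50,000 rows per Row Group instead of the default
pq.write_table(table, "example.parquet", row_group_size=50_000)

print(pq.ParquetFile("example.parquet").num_row_groups)  # 10
</code></pre>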
<h3 id="heading-column-chunks"><strong>Column Chunks</strong></h3>
<ul>
<li><p><strong>Vertical partition</strong> of data inside a Row Group.</p>
</li>
<li><p>Each Column Chunk contains the values of <strong>a single column</strong> for all rows in that Row Group.</p>
</li>
<li><p><strong>Statistics for each column chunk</strong>:</p>
<ul>
<li><p>Minimum value</p>
</li>
<li><p>Maximum value</p>
</li>
<li><p>Null count</p>
</li>
</ul>
</li>
<li><p>Enables:</p>
<ul>
<li><p><strong>Projection pushdown</strong> (read only the required columns); combined with the statistics above, the reader can also skip Row Groups that cannot match the filter conditions</p>
</li>
<li><p><strong>Better compression</strong> (similar values are adjacent)</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-pages"><strong>Pages</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754125750300/c93896cb-431f-43a4-901e-a8eb5c428c34.png" alt class="image--center mx-auto" /></p>
<p>The output above demonstrates that the inspected column chunk contains <strong>three pages</strong>, utilizing <strong>dictionary encoding</strong> wherever applicable.</p>
<ul>
<li><p><strong>Smallest unit of storage</strong> in a Column Chunk.</p>
</li>
<li><p>Types of pages:</p>
<ul>
<li><p><strong>Data Pages</strong>: Contain actual column values (can use encodings like Dictionary, RLE, Delta).</p>
</li>
<li><p><strong>Dictionary Pages</strong>: Contain unique values for dictionary encoding.</p>
</li>
<li><p><strong>Index Pages</strong>: Optional, for faster lookups.</p>
</li>
</ul>
</li>
<li><p>Default page size ≈ <strong>1 MB</strong>.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754124561357/45a19ddf-7704-4c6a-bd71-1212e7f64bb4.png" alt class="image--center mx-auto" /></p>
<p>Upon inspecting one of the Parquet part files using the <strong>parquet-cli</strong> tool, we observe that the file contains a <strong>single Row Group</strong> with approximately <strong>49,000 records</strong>. The Row Group metadata also includes <strong>column-level statistics</strong>, such as the <strong>minimum</strong> and <strong>maximum</strong> values for each column.</p>
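<p>The same footer metadata can be inspected programmatically as well. A minimal sketch using <strong>PyArrow</strong> (assuming it is installed; the part file name below is hypothetical):</p>
<pre><code class="lang-python">import pyarrow.parquet as pq

pf = pq.ParquetFile("faker_parquet_dataset/part-00000.parquet")  # hypothetical part file name
print(pf.metadata.num_row_groups, pf.metadata.num_rows)

rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    # min/max/null_count are the statistics the reader uses for skipping
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, stats.min, stats.max, stats.null_count)
</code></pre>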
<p>For example, if the <strong>Age</strong> column in this file has a recorded maximum value of <strong>70</strong>, and a query requests only records where <code>Age &gt; 70</code>, the query engine can determine—based on these statistics—that no matching rows exist in this file. As a result, the engine can <strong>skip reading this file entirely</strong>, avoiding unnecessary I/O. This optimization, known as <strong>predicate pushdown</strong>, is one of the key reasons why Parquet performs so well for analytical workloads.</p>
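<p>Predicate pushdown is also visible in Spark’s physical plan. As an illustrative sketch against the dataset generated earlier (exact plan text varies by Spark version):</p>
<pre><code class="lang-python">df = spark.read.parquet("faker_parquet_dataset")

# The age predicate is handed down to the Parquet reader, which can use
# Row Group statistics to skip Row Groups that cannot possibly match
df.filter("age &gt; 70").explain()
# Plan excerpt: ... PushedFilters: [IsNotNull(age), GreaterThan(age,70)] ...
</code></pre>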
<h3 id="heading-encoding-schemes">Encoding Schemes</h3>
<h3 id="heading-dictionary-encoding"><strong>Dictionary Encoding</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754126077173/81d2cad5-dff9-4756-aed7-382b8b961252.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>In the example above, the <strong>Column Chunk for</strong> <code>city</code> includes a <strong>Dictionary Page</strong> that stores the unique city names present in the Row Group. All subsequent Data Pages within this column reference these values through <strong>dictionary indexes</strong>, rather than storing the full strings repeatedly. This approach significantly reduces storage size and improves read performance.</p>
</li>
<li><p><strong>How it works</strong>:</p>
<ul>
<li><p>Stores a <strong>dictionary</strong> of unique values once</p>
</li>
<li><p>Replaces actual values with <strong>integer indexes</strong> into the dictionary</p>
</li>
</ul>
</li>
<li><p><strong>When used</strong>: Columns with <strong>low cardinality</strong> (few distinct values)</p>
</li>
<li><p><strong>Pros</strong>: Significant space saving</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Original Values: [London, Paris, London, New York, Paris]
  Dictionary: {0: London, 1: Paris, 2: New York}
  Encoded Indexes: [0, 1, 0, 2, 1]
</code></pre>
</li>
</ul>
<h3 id="heading-run-length-encoding-rle"><strong>Run-Length Encoding (RLE)</strong></h3>
<ul>
<li><p><strong>How it works</strong>: Compresses <strong>consecutive repeated values</strong> as a single value + count</p>
</li>
<li><p><strong>When used</strong>: Columns with <strong>runs of repeated values</strong></p>
</li>
<li><p><strong>Pros</strong>: Excellent for sorted / repeated data</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Original Values: [A, A, A, A, B, B]
  Encoded: (A,4), (B,2)
</code></pre>
</li>
</ul>
<h3 id="heading-delta-encoding-delta-binary-packed"><strong>Delta Encoding (Delta Binary Packed)</strong></h3>
<ul>
<li><p><strong>How it works</strong>:</p>
<ul>
<li><p>Stores the <strong>difference (delta)</strong> between consecutive values</p>
</li>
<li><p>Often combined with bit packing for integers</p>
</li>
</ul>
</li>
<li><p><strong>When used</strong>: Numeric columns where values are sorted or increment gradually (see the example after this list)</p>
</li>
</ul>
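<p>For illustration, delta encoding stores the first value and then only the (small) differences between consecutive values, which can subsequently be bit packed tightly:</p>
<pre><code class="lang-markdown">Original Values: [100, 103, 105, 110]
Stored: first = 100, deltas = [+3, +2, +5]
</code></pre>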
<h3 id="heading-bit-packing-boolean-packing"><strong>Bit Packing / Boolean Packing</strong></h3>
<ul>
<li><p><strong>How it works</strong>: Stores multiple boolean values or small integers in a <strong>single byte or word</strong> by packing bits tightly.</p>
</li>
<li><p><strong>When used</strong>: Boolean or low-cardinality integer fields</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Boolean Values: [T, F, T, T, F, F, T, F]
  Encoded as: 10110010 (binary)
</code></pre>
</li>
</ul>
<h3 id="heading-plain-encoding"><strong>Plain Encoding</strong></h3>
<ul>
<li><p><strong>How it works</strong>: Stores values in their raw form (uncompressed binary representation).</p>
</li>
<li><p><strong>When used</strong>:</p>
<ul>
<li>As a fallback when no encoding benefits are possible</li>
</ul>
</li>
<li><p><strong>Pros</strong>: Fast decode speed</p>
</li>
<li><p><strong>Cons</strong>: Larger size if values are repetitive</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Values: [25, 30, 35, 40]
  Encoded: 25 30 35 40 (no transformation)
</code></pre>
</li>
</ul>
<p>Parquet’s <strong>hybrid columnar format</strong> makes it compact, efficient, and fast for <strong>analytical workloads</strong>. With features like <strong>projection pushdown, predicate pushdown, and advanced encodings</strong>, it minimizes storage and speeds up queries.</p>
<p>Thus, Parquet has become the <strong>default storage format for modern open table formats</strong> such as <strong>Apache Iceberg</strong>, <strong>Delta Lake</strong>, and <strong>Apache Hudi</strong>, thanks to its efficiency, scalability, and compatibility with analytical workloads.</p>
<p>I hope this article added value to your understanding of Parquet internals. <strong>Thanks for reading!</strong> 🚀</p>
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://www.youtube.com/watch?v=1j8SdS7s_NY">The Parquet Format and Performance Optimization Opportunities Boudewijn Braams (Databricks)</a></p>
</li>
<li><p><a target="_blank" href="https://dev.to/alexmercedcoder/all-about-parquet-part-06-encoding-in-parquet-optimizing-for-storage-4hh3">Encoding in Parquet | Optimizing for Storage</a></p>
</li>
<li><p>LLM Models - GPT / Claude</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Bored of Prompt Engineering? Try DSPy]]></title><description><![CDATA[With the advent of LLMs in the recent few years, we stumbled upon something called Prompt Engineering - let’s speak a few words about it to set the context right and then let’s venture into DSPy

What is Prompt Engineering?
Prompt Engineering is the ...]]></description><link>https://blog.praghadeesh.com/bored-of-prompt-engineering-try-dspy</link><guid isPermaLink="true">https://blog.praghadeesh.com/bored-of-prompt-engineering-try-dspy</guid><category><![CDATA[Programming Blogs]]></category><category><![CDATA[dspy]]></category><category><![CDATA[#PromptEngineering]]></category><category><![CDATA[llm]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Sun, 13 Jul 2025 08:37:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/tikhtH3QRSQ/upload/473d6bcc9b6dc19574dcaeac0f647064.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the advent of LLMs in recent years, we stumbled upon something called Prompt Engineering - let’s spend a few words on it to set the context right, and then venture into DSPy.</p>
<p><img src="https://apipie.ai/docs/img/Integrations/DSPy/dspy_logo.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-what-is-prompt-engineering">What is Prompt Engineering?</h2>
<p>Prompt Engineering is the art of designing the instructions we give to a language model to get the desired result. Language models don’t really understand the user’s intentions; they just predict the next word based on the provided prompt, so prompting them right becomes crucial.</p>
<p>Though there are different types of prompts and engineering associated with them, it usually comprises the following steps:</p>
<ul>
<li><p>Choosing the right instruction</p>
</li>
<li><p>Providing examples</p>
</li>
<li><p>Specifying the format</p>
</li>
<li><p>Testing and refining</p>
</li>
</ul>
<p>Well, if you’ve worked with prompts for some time, you’ll have quickly become acquainted with their limitations and the tedious trial-and-error process of crafting the right prompt.</p>
<h2 id="heading-challenges-with-prompt-engineering">Challenges with Prompt Engineering</h2>
<ul>
<li><p>Writing the prompts, tweaking them with trial and error approaches</p>
</li>
<li><p>Prompts can grow big and heavy, spanning thousands of words (expensive)</p>
</li>
<li><p>Heavily coupled to specific language models, often not LM agnostic (a prompt may work with GPT, but will it work with LLaMA?)</p>
</li>
</ul>
<p>What if a tool or framework could solve some or all of the above problems? That is DSPy!</p>
<h2 id="heading-programming-not-prompting-whats-dspy">Programming not Prompting, What’s DSPy?</h2>
<p>DSPy is an open source declarative framework in Python that lets you define the steps of an LLM workflow as modular components, compose them into pipelines, and optimize prompts with feedback loops. This shifts the narrative from working with LLMs being mostly about prompting to thinking of it more like a program.</p>
<p>Just define the inputs and outputs, and the rest will be taken care of by DSPy:</p>
<ul>
<li><p>Declarative approach</p>
</li>
<li><p>Optimizers</p>
</li>
<li><p>Composable chains</p>
</li>
<li><p>Open source and Flexible</p>
</li>
</ul>
<h2 id="heading-core-components-of-dspy">Core components of DSPy</h2>
<p>The core components of DSPy are</p>
<ul>
<li><p>Signature</p>
</li>
<li><p>Modules</p>
</li>
<li><p>Adapters</p>
</li>
<li><p>Optimizer</p>
</li>
</ul>
<p>We can look at these in more detail as we walk through the example.</p>
<h3 id="heading-configuring-the-llm">Configuring the LLM</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dspy
dspy.settings.configure(lm=dspy.LM(<span class="hljs-string">"openai/gpt-4o-mini"</span>))
</code></pre>
<p>In the above piece of code, we are just referring to the language model we want to use. This assumes that you have the dspy package installed on your machine and that your OpenAI API key is already set as an environment variable for the code to pick up.</p>
<h3 id="heading-lets-define-signature">Let’s define signature</h3>
<p>A <strong>Signature</strong> in DSPy is like a <strong>contract</strong> that clearly describes:</p>
<p>✅ <strong>What inputs your module expects</strong><br />✅ <strong>What outputs it should produce</strong></p>
<p>Think of it as a <em>blueprint</em> that tells DSPy and the language model:</p>
<ul>
<li><p><em>Here are the fields you need to fill in.</em></p>
</li>
<li><p><em>Here’s what the output should look like.</em></p>
</li>
</ul>
<p>Signatures can be class-based or inline:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Class based Signature</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SentimentClassifier</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""Classify the sentiment of a text."""</span>

    text: str = dspy.InputField(desc=<span class="hljs-string">"input text to classify sentiment"</span>)
    sentiment: int = dspy.OutputField(
        desc=<span class="hljs-string">"sentiment, the higher the more positive"</span>, ge=<span class="hljs-number">0</span>, le=<span class="hljs-number">10</span>
    )

<span class="hljs-comment"># Inline Signature</span>
str_signature = dspy.make_signature(<span class="hljs-string">"text -&gt; sentiment"</span>)
</code></pre>
<p>As we can see in the above code snippet, we have created a Signature with input and output fields and constraints on what they can be; this enforces the behaviour of the LLM.</p>
<h3 id="heading-modules">Modules</h3>
<p>A Module encapsulates the logic for interacting with LMs; the simplest module that DSPy provides is <code>dspy.Predict</code>. Multiple modules can be chained together and used as a bigger module.</p>
<p>There are different types of modules that DSPy provides us out of the box</p>
<ul>
<li><p><code>dspy.Predict</code> - Simple LLM interaction, often the building block for other complex modules</p>
</li>
<li><p><code>dspy.ChainOfThought</code> - Asks the LM to add reasoning as part of the final answer; this modifies the signature</p>
</li>
<li><p><code>dspy.ProgramOfThought</code> - Asks the LM to output code, whose execution results will dictate the response.</p>
</li>
<li><p><code>dspy.ReAct</code> - An agent that can use tools to implement the given signature.</p>
</li>
</ul>
<pre><code class="lang-python">predict = dspy.Predict(SentimentClassifier) 

output = predict(text=<span class="hljs-string">"I am feeling pretty happy!"</span>)
print(<span class="hljs-string">f"The sentiment is: <span class="hljs-subst">{output.sentiment}</span>"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752388540878/98941db3-c436-425b-975c-a838a5ac17e1.png" alt class="image--center mx-auto" /></p>
<p>In the above code snippet, we have used the most fundamental module <code>dspy.Predict</code>. Let’s look at the prompt it has generated</p>
<pre><code class="lang-python">dspy.inspect_history(n=<span class="hljs-number">1</span>)
</code></pre>
<p>The above snippet prints the prompt that was generated by DSPy.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752394800268/59d6819e-55fc-46e6-b3dd-908fae1a9046.png" alt class="image--center mx-auto" /></p>
<p>Let’s try the same with <code>dspy.ChainOfThought</code> Module and see what it does to the signature</p>
<pre><code class="lang-python">cot = dspy.ChainOfThought(SentimentClassifier)

output = cot(text=<span class="hljs-string">"I am feeling pretty happy!"</span>)
print(output)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752395459965/acae1bbd-86e3-4534-9598-b87602bd5f0f.png" alt class="image--center mx-auto" /></p>
<p>Let’s look at the prompt generated</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752395512364/d590e369-2f37-423e-ab30-121226d37b8c.png" alt class="image--center mx-auto" /></p>
<p>As we can see in the above prompt, when used with the <code>dspy.ChainOfThought</code> module, the signature has been extended with another output field called reasoning, which was not user-defined. The reasoning field provides a justification of how the model arrived at a specific output for the user query.</p>
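<p>Since modules are composable, they can also be combined into a bigger module. Below is a minimal illustrative sketch (the two-step pipeline and the inline <code>"text -&gt; summary"</code> signature are hypothetical, reusing the <code>SentimentClassifier</code> signature from above):</p>
<pre><code class="lang-python">import dspy

class SummarizeThenClassify(dspy.Module):
    """Hypothetical composite module: summarize first, then classify sentiment."""

    def __init__(self):
        super().__init__()
        self.summarize = dspy.Predict("text -&gt; summary")
        self.classify = dspy.ChainOfThought(SentimentClassifier)

    def forward(self, text: str):
        summary = self.summarize(text=text).summary
        return self.classify(text=summary)

pipeline = SummarizeThenClassify()
output = pipeline(text="The delivery was late, but the support team was fantastic!")
print(output.sentiment)
</code></pre>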
<p>I hope the above gives a primer on the DSPy framework; let’s discuss Adapters and Optimizers with a full-fledged working example in an upcoming blog.</p>
<h3 id="heading-references">References:</h3>
<p><a target="_blank" href="https://www.deeplearning.ai/short-courses/dspy-build-optimize-agentic-apps/">DSPy: Build and Optimize Agentic Apps</a></p>
<p><a target="_blank" href="https://dspy.ai/learn/">DSPy: Learn</a></p>
]]></content:encoded></item><item><title><![CDATA[Understanding Single Tenant and Multi Tenant architecture ✨]]></title><description><![CDATA[Have you ever wondered about the difference between having a blog hosted in Ghost Instance and a blogging platform like Hashnode? 🤔
Have you ever wondered about the difference in living in a villa and an apartment? 🤔

The difference between the abo...]]></description><link>https://blog.praghadeesh.com/understanding-single-tenant-and-multi-tenant-architecture</link><guid isPermaLink="true">https://blog.praghadeesh.com/understanding-single-tenant-and-multi-tenant-architecture</guid><category><![CDATA[architecture]]></category><category><![CDATA[software development]]></category><category><![CDATA[SaaS]]></category><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Tue, 24 Sep 2024 17:50:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/RfY1OQqlT3U/upload/3714ca2cf880a0e9b068db32933a639d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever wondered about the difference between having a blog hosted in <strong>Ghost Instance</strong> and a blogging platform like <strong>Hashnode</strong>? 🤔</p>
<p>Have you ever wondered about the difference between living in a <strong>villa</strong> and an <strong>apartment</strong>? 🤔</p>
<p><img src="https://www.greentechbuilders.in/uploads/media/untitled-design-1-610ce60b0d658.jpg" alt="Villas Vs Apartments : Things You Need To Know Before Choosing One" /></p>
<p>The difference between the above pretty much sums up <em>Single Tenant and Multi Tenant architecture, commonly stumbled upon in the software ecosystem</em>.</p>
<p>When comparing the architecture of a <strong>villa</strong> and an <strong>apartment</strong>, think of it in terms of space, independence, and shared resources.</p>
<ul>
<li><p>🏘️ <strong>Villa Architecture</strong>: A villa is like a <strong>single-tenant</strong> setup in software architecture. It offers <strong>complete independence</strong> in terms of space, design, and customization. You have your own private land, no shared walls, and complete control over modifications, similar to how a single-tenant system allows a user full control over their environment. Villas also provide more space, privacy, and autonomy but come with higher maintenance responsibilities.</p>
</li>
<li><p>🏢 <strong>Apartment Architecture</strong>: An apartment, on the other hand, is akin to <strong>multi-tenant</strong> architecture. While you have your own private living space (your unit), you share walls, common areas, infrastructure, and amenities with other tenants. It’s more cost-effective, easier to maintain, and designed to accommodate multiple residents in a shared environment, just like how multi-tenant systems are designed to serve multiple users on shared infrastructure efficiently.</p>
</li>
</ul>
<p><img src="https://media.graphassets.com/sGAxyuKWToGN5MfgQdlO" alt="What is Multi-Tenancy and Why Do You Need a Multi-Tenant Architecture? |  Hygraph" /></p>
<h3 id="heading-single-tenant-architecture"><strong>Single-Tenant Architecture</strong></h3>
<p>In a <strong>single-tenant architecture</strong>, each customer or tenant has their own dedicated software instance, with a separate application instance and database environment. Technically, the software is replicated for each customer, ensuring full data isolation and customization at the cost of increased resource consumption.</p>
<p>For instance, a media company running a popular blog might self-host Ghost to ensure <strong>complete ownership</strong> of their content, customize the blog’s performance, and manage sensitive data directly. This setup is preferred if they have <strong>specific privacy requirements</strong> or need to make <strong>extensive customizations</strong> that wouldn’t be possible in a shared multi-tenant environment.</p>
<p><img src="https://149842033.v2.pressablecdn.com/wp-content/uploads/2024/02/ghost-logo.png" alt="Ghost Blogging Platform: My Experience and Review - uiCookies" /></p>
<h4 id="heading-key-technical-characteristics"><strong>Key Technical Characteristics:</strong></h4>
<ul>
<li><p><strong>Dedicated Infrastructure:</strong> Each tenant has an isolated application and database stack, typically running on separate virtual machines or containers.</p>
</li>
<li><p><strong>Data Isolation:</strong> Complete separation of tenant data with no shared storage or compute resources.</p>
</li>
<li><p><strong>Customizability:</strong> High degree of flexibility in terms of software configurations, security rules, and application logic.</p>
</li>
<li><p><strong>Security:</strong> Since each tenant operates in an isolated environment, security risks are minimized due to lack of shared resources.</p>
</li>
</ul>
<hr />
<h3 id="heading-pros-of-single-tenant-architecture"><strong>Pros of Single-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Enhanced Security:</strong></p>
<ul>
<li><p>Full isolation ensures that each tenant's data is siloed from others. This prevents cross-tenant breaches that can occur in shared environments.</p>
</li>
<li><p>Attack vectors like shared database vulnerabilities are eliminated. Customers can configure their own security policies, firewalls, and encryption standards.</p>
</li>
</ul>
</li>
<li><p><strong>Full Customization:</strong></p>
<ul>
<li><p>Each instance can be highly tailored to the needs of the tenant, allowing custom features, unique workflows, and specific data structures.</p>
</li>
<li><p>Flexibility to modify backend configurations, database schemas, and even run custom plugins without affecting other customers.</p>
</li>
</ul>
</li>
<li><p><strong>Performance Stability:</strong></p>
<ul>
<li><p>No sharing of computational resources means consistent performance. High availability can be assured through isolated scaling and resource allocation.</p>
</li>
<li><p>Performance tuning for each tenant becomes easier. Tenants can deploy performance-enhancing services like caching, autoscaling, or load balancing without worrying about other tenants overloading the system.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-cons-of-single-tenant-architecture"><strong>Cons of Single-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Higher Costs:</strong></p>
<ul>
<li><p>The dedicated resources for each tenant make this architecture more expensive, as providers need to maintain separate instances for each customer.</p>
</li>
<li><p>Infrastructure costs can rise exponentially as each instance requires its own VM, storage, and network configurations. The provider also incurs higher operational overhead for managing, monitoring, and scaling individual instances.</p>
</li>
</ul>
</li>
<li><p><strong>Complex Maintenance and Upgrades:</strong></p>
<ul>
<li><p>Every tenant’s instance must be updated separately, which increases operational complexity, especially for bug fixes, security patches, and version upgrades.</p>
</li>
<li><p>Providers have to carefully manage version control and deployment pipelines for each instance. Automation tools like Ansible or Terraform become critical for managing infrastructure at scale.</p>
</li>
</ul>
</li>
<li><p><strong>Inefficient Resource Utilization:</strong></p>
<ul>
<li><p>Resources are allocated on a per-tenant basis, which often leads to underutilization.</p>
</li>
<li><p>CPU, memory, and storage resources might be wasted if a tenant doesn’t use their full capacity. This becomes inefficient compared to pooling resources across multiple tenants.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-multi-tenant-architecture"><strong>Multi-Tenant Architecture</strong></h3>
<p>In a <strong>multi-tenant architecture</strong>, multiple tenants share a single instance of the software and database. Although the resources are shared, each tenant’s data is segregated logically, usually at the database or application layer. I would consider <strong>SaaS applications</strong> an ideal candidate for multi-tenant architecture.</p>
<p>A typical example is <strong>Shopify</strong>, where thousands of e-commerce stores share the same application instance, but each store has its own segregated data and custom configurations. Shopify scales efficiently by pooling resources across all tenants.</p>
<p><img src="https://img.intertoons.com/wp-content/uploads/2024/08/Shopify-ecommerce-platform.png.webp" alt="Top 10 Fashion Websites on Shopify - Intertoons Internet Services Pvt.Ltd." /></p>
<h4 id="heading-key-technical-characteristics-1"><strong>Key Technical Characteristics:</strong></h4>
<ul>
<li><p><strong>Shared Infrastructure:</strong> All tenants share a single instance of the software, often including the application server, database, and compute resources.</p>
</li>
<li><p><strong>Data Segregation:</strong> Data is logically separated either at the database level (e.g., separate tables per tenant) or within a single database using tenant-specific tags (e.g., a <code>tenant_id</code> field in each record), often combined with techniques such as RLS (Row Level Security); see the sketch after this list.</p>
</li>
<li><p><strong>Elastic Resource Utilization:</strong> Resources (CPU, memory, storage) are pooled and dynamically allocated based on tenant needs.</p>
</li>
<li><p><strong>Scalability:</strong> It’s easier for the provider to scale the system because they only need to scale one instance rather than multiple isolated instances.</p>
</li>
</ul>
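<p>As a minimal sketch of logical data segregation by <code>tenant_id</code> (using an in-memory SQLite table purely for illustration), all tenants share one table, yet every query is scoped to the requesting tenant:</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "globex", 25.0), (3, "acme", 5.5)],
)

def orders_for_tenant(tenant_id: str):
    # Shared table, logically separated rows: every query is scoped by tenant_id
    return conn.execute(
        "SELECT id, amount FROM orders WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()

print(orders_for_tenant("acme"))    # [(1, 10.0), (3, 5.5)]
print(orders_for_tenant("globex"))  # [(2, 25.0)]
</code></pre>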
<hr />
<h3 id="heading-pros-of-multi-tenant-architecture"><strong>Pros of Multi-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Cost Efficiency:</strong></p>
<ul>
<li><p>By sharing infrastructure, the cost per tenant is significantly reduced. Providers can optimize hardware and software resources across a larger number of customers.</p>
</li>
<li><p>Tenants share common resources like load balancers, databases, and application servers, leading to reduced operational costs and more efficient scaling.</p>
</li>
</ul>
</li>
<li><p><strong>Simplified Maintenance:</strong></p>
<ul>
<li><p>Updates, bug fixes, and patches can be rolled out to all tenants at once, reducing the complexity of version management.</p>
</li>
<li><p>CI/CD pipelines are streamlined as the provider needs to manage only one instance. DevOps tools like Kubernetes can be used to automate rolling updates and deployments.</p>
</li>
</ul>
</li>
<li><p><strong>Scalability:</strong></p>
<ul>
<li><p>Multi-tenant architectures scale horizontally by adding more tenants to the same instance. Resources can be dynamically allocated based on load, allowing better handling of peak usage.</p>
</li>
<li><p>With autoscaling features in cloud platforms (e.g., AWS, Google Cloud), the provider can elastically scale the system to handle increased traffic without requiring separate infrastructure for each tenant.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-cons-of-multi-tenant-architecture"><strong>Cons of Multi-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Limited Customization:</strong></p>
<ul>
<li><p>Since multiple tenants share the same application and database, there is less flexibility in customization. Changes made to the software affect all tenants.</p>
</li>
<li><p>Tenant-specific configurations are typically limited to front-end settings or user-level preferences, with minimal ability to alter core application logic or database schemas.</p>
</li>
</ul>
</li>
<li><p><strong>Security Risks:</strong></p>
<ul>
<li><p>Although tenants’ data is logically separated, a security vulnerability in the shared infrastructure could expose data across tenants.</p>
</li>
<li><p>A poorly configured shared database or a misconfigured security policy can lead to data leakage across tenants.</p>
</li>
</ul>
</li>
<li><p><strong>Performance Fluctuations:</strong></p>
<ul>
<li><p>Resource usage spikes from one tenant can affect the performance of other tenants sharing the same infrastructure.</p>
</li>
<li><p>Even with resource limits, a heavy-load tenant can degrade performance for others. Providers might need to consider performance tuning and load balancing to handle such cases.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-key-differences"><strong>Key differences</strong></h3>
<p>The decision between single-tenant and multi-tenant architecture depends on your business and technical requirements, particularly in the areas of cost, security, and scalability.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Single-Tenant</td><td>Multi-Tenant</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Customization</strong></td><td>Customization completely in control</td><td>Limited Customization</td></tr>
<tr>
<td><strong>Cost</strong></td><td>Higher - dedicated and isolated infrastructure</td><td>Lower - common and shared resources</td></tr>
<tr>
<td><strong>Security</strong></td><td>Complete data isolation</td><td>Shared infrastructure risk</td></tr>
<tr>
<td><strong>Scalability</strong></td><td>Low - Vertical scaling per instance</td><td>High - Horizontal scaling across tenants</td></tr>
<tr>
<td><strong>Resource Utilization</strong></td><td>Isolated instances</td><td>Pooled resources, efficient use</td></tr>
<tr>
<td><strong>Maintenance</strong></td><td>Pretty complex as the number of instances increases</td><td>Simple - Centralized updates and management</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-summary">☺️ <strong>Summary</strong></h3>
<p><strong>Single-tenant architecture</strong> provides maximum control, customization, and security, making it ideal for industries with strict compliance requirements (e.g., healthcare, banking). However, it comes at a higher cost and greater complexity in terms of maintenance and scaling.</p>
<p><strong>Multi-tenant architecture</strong>, on the other hand, is highly scalable, cost-effective, and easier to maintain, making it the go-to choice for most SaaS providers serving a broad range of customers with standard needs.</p>
]]></content:encoded></item><item><title><![CDATA[Advancing RAG with unstructured.io]]></title><description><![CDATA[Hello All, This is Praghadeesh back to writing blogs after a while (I lost my previously hosted Ghost Instance with no backups and had to start from scratch 😕). In this blog, let's explore a bit more on RAG by trying to work on some complex PDFs lev...]]></description><link>https://blog.praghadeesh.com/advancing-rag-with-unstructuredio</link><guid isPermaLink="true">https://blog.praghadeesh.com/advancing-rag-with-unstructuredio</guid><category><![CDATA[Unstructured.io]]></category><category><![CDATA[multivectorretriever]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[langchain]]></category><category><![CDATA[openai]]></category><category><![CDATA[#GoogleGemini]]></category><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Sat, 06 Jul 2024 16:56:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1720272381721/39527dd1-1e8d-40eb-95f0-673eb1482134.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello All, This is Praghadeesh back to writing blogs after a while (I lost my previously hosted Ghost Instance with no backups and had to start from scratch 😕). In this blog, let's explore a bit more on RAG by trying to work on some complex PDFs leveraging the capabilities of unstructured.io and Langchain's MultiVectorRetriever this time.</p>
<h3 id="heading-what-is-unstructuredio">What is unstructured.io?</h3>
<p>Unstructured.io is an open source project that provides tools to work on diverse sources of documents such as PDF, HTML and so on, and helps us streamline the data processing workflow for LLMs. It's more of an ETL tool for Gen AI use cases. It comes in three different offerings:</p>
<ul>
<li><p>Serverless API</p>
</li>
<li><p>Azure/AWS Marketplace offering</p>
</li>
<li><p>Self hostable solution</p>
</li>
</ul>
<h3 id="heading-what-is-rag-and-why-to-use-unstructuredio-with-rag">What is RAG and why to use unstructured.io with RAG?</h3>
<p>If the title of the blog interested you and you are already here reading it, you probably know what RAG is all about. In oversimplified terms, it's just the art of injecting context into LLMs, where the goal is to help them answer questions that are beyond the LLM's training data. I believe this might be a perfect analogy: it's like an open book exam, where you try to find the relevant content from the book and make sense out of it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720270575165/c542acc3-4b99-44fa-b54e-686d04ca1c01.png" alt="Open Book Scene from the Tamil Movie Nanban" class="image--center mx-auto" /></p>
<p>Cool, but how does unstructured help here? The process of RAG becomes complex when we try to deal with diverse contents such as Tables, Images, Vector Diagrams, Formulae and so on. Unstructured.io helps us work with some of these data and makes our job a bit easier; the scope of this blog is limited to handling data in tabular format in complex PDFs.</p>
<h3 id="heading-working-with-complex-pdfs">Working with Complex PDFs</h3>
<p>Complex PDFs may involve Financial Reports, Scientific Research papers, Technical Reference Documents, Engineering Datasheets and so on. In this blog, let's try dealing with a datasheet for an electrical component called the LM317, a linear voltage regulator.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720271544155/655bd0e9-b5a2-4088-9aa2-665cd119d683.png" alt="LM317 Datasheet" class="image--center mx-auto" /></p>
<p>The above is an example of how the content of the <a target="_blank" href="https://www.ti.com/lit/ds/symlink/lm317.pdf?ts=1720240009145&amp;ref_url=https%253A%252F%252Fwww.google.com%252F">datasheet</a> looks; it has multiple pages with such tables and vector diagrams, where extracting data without losing quality might not be possible with traditional RAG.</p>
<p><strong>Semi Structured RAG with Multi Vector Retriever</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720282218262/e898b4e5-8347-43ac-8062-13284ceb427b.png" alt="Semi Structured RAG" class="image--center mx-auto" /></p>
<ul>
<li><p>The idea here is to extract text and table chunks separately as shown above using unstructured</p>
</li>
<li><p>Create a summarization chain and generate summary for texts and tables</p>
</li>
<li><p>Ingest the text and table summary with corresponding embeddings into the vector store</p>
</li>
<li><p>Ingest the Raw chunks into the docstore or memorystore</p>
</li>
<li><p>Query against the summary embeddings, retrieve the corresponding raw chunks from the docstore associated with the matched summaries in the vectorstore, and pass the chunks to the LLM to make sense out of them</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Note: This blog covers a high level implementation of the code to get the context right. The details of unstructured capabilities and code breakdown will be coverted in the upcoming blogs.</div>
</div>

<p><strong>Partitioning the PDF document using unstructured</strong></p>
<pre><code class="lang-python">unstruct_client = UnstructuredClient(
    api_key_auth=os.getenv(<span class="hljs-string">"UNSTRUCTURED_API_AUTH_KEY"</span>)
)

filename = <span class="hljs-string">"lm317.pdf"</span>

<span class="hljs-keyword">with</span> open(filename, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy=<span class="hljs-string">"hi_res"</span>,
    hi_res_model_name=<span class="hljs-string">"yolox"</span>,
    skip_infer_table_types=[],
    pdf_infer_table_structure=<span class="hljs-literal">True</span>
)

<span class="hljs-keyword">try</span>:
    resp = unstruct_client.general.partition(req)
    pdf_elements = dict_to_elements(resp.elements)
<span class="hljs-keyword">except</span> SDKError <span class="hljs-keyword">as</span> e:
    print(e)
</code></pre>
<p>In the above part of the code, the partitioning of the PDF is executed. Unstructured simplifies the preprocessing of structured and unstructured documents for downstream tasks, irrespective of what type of file content is provided as the source. When partitioned, the result is a list of Element objects.</p>
<p>The below is an example of what the partition output looks like; the Elements can be of type <code>Title, NarrativeText, Image, Table, ListItem, Header, Footer</code> and so on.</p>
<pre><code class="lang-json">{
       <span class="hljs-attr">"type"</span>:<span class="hljs-string">"Title"</span>,
       <span class="hljs-attr">"element_id"</span>:<span class="hljs-string">"d8ecdee23702fdb35f98390141100d13"</span>,
       <span class="hljs-attr">"text"</span>:<span class="hljs-string">"from 1.25 V to 37 V"</span>,
       <span class="hljs-attr">"metadata"</span>:{
          <span class="hljs-attr">"filetype"</span>:<span class="hljs-string">"application/pdf"</span>,
          <span class="hljs-attr">"languages"</span>:[
             <span class="hljs-string">"eng"</span>
          ],
          <span class="hljs-attr">"page_number"</span>:<span class="hljs-number">1</span>,
          <span class="hljs-attr">"filename"</span>:<span class="hljs-string">"lm317.pdf"</span>
       }
    },
    {
       <span class="hljs-attr">"type"</span>:<span class="hljs-string">"ListItem"</span>,
       <span class="hljs-attr">"element_id"</span>:<span class="hljs-string">"eb105b9f3e577473acac7ba394cea3c7"</span>,
       <span class="hljs-attr">"text"</span>:<span class="hljs-string">"Output current greater than 1.5 A • • Thermal overload protection • Output safe-area compensation"</span>,
       <span class="hljs-attr">"metadata"</span>:{
          <span class="hljs-attr">"filetype"</span>:<span class="hljs-string">"application/pdf"</span>,
          <span class="hljs-attr">"languages"</span>:[
             <span class="hljs-string">"eng"</span>
          ],
          <span class="hljs-attr">"page_number"</span>:<span class="hljs-number">1</span>,
          <span class="hljs-attr">"parent_id"</span>:<span class="hljs-string">"d8ecdee23702fdb35f98390141100d13"</span>,
          <span class="hljs-attr">"filename"</span>:<span class="hljs-string">"lm317.pdf"</span>
       }
    },
</code></pre>
<p><strong>Chunking the elements obtained after partitioning</strong></p>
<p>The partitions created are then chunked using the chunking strategy - chunk_by_title.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The <code>by_title</code> chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.</div>
</div>

<p>The chunks are categorized as table chunks and text chunks respectively, and a summary chain is created using the Google Gemini Pro model, which helps us create a list of table summaries and text summaries.</p>
<pre><code class="lang-python">chunks = chunk_by_title(pdf_elements,max_characters=<span class="hljs-number">4000</span>,new_after_n_chars=<span class="hljs-number">3800</span>, combine_text_under_n_chars=<span class="hljs-number">2000</span>)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Element</span>(<span class="hljs-params">BaseModel</span>):</span>
    type: str
    text: Any

<span class="hljs-comment"># Categorize by type</span>
categorized_elements = []
<span class="hljs-keyword">for</span> element <span class="hljs-keyword">in</span> chunks:
    <span class="hljs-keyword">if</span> <span class="hljs-string">"unstructured.documents.elements.Table"</span> <span class="hljs-keyword">in</span> str(type(element)):
        categorized_elements.append(Element(type=<span class="hljs-string">"table"</span>, text=str(element)))
    <span class="hljs-keyword">elif</span> <span class="hljs-string">"unstructured.documents.elements.CompositeElement"</span> <span class="hljs-keyword">in</span> str(type(element)):
        categorized_elements.append(Element(type=<span class="hljs-string">"text"</span>, text=str(element)))

<span class="hljs-comment"># Tables</span>
table_elements = [e <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> categorized_elements <span class="hljs-keyword">if</span> e.type == <span class="hljs-string">"table"</span>]
print(len(table_elements))

<span class="hljs-comment"># Text</span>
text_elements = [e <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> categorized_elements <span class="hljs-keyword">if</span> e.type == <span class="hljs-string">"text"</span>]
print(len(text_elements))

<span class="hljs-comment"># Prompt</span>
prompt_text = <span class="hljs-string">"""You are an assistant tasked with summarizing tables and text. \ 
Give a concise summary of the table or text. Table or text chunk: {element} """</span>
prompt = ChatPromptTemplate.from_template(prompt_text)

<span class="hljs-comment"># Summary chain</span>
model = ChatGoogleGenerativeAI(model=<span class="hljs-string">"gemini-pro"</span>)
summarize_chain = {<span class="hljs-string">"element"</span>: <span class="hljs-keyword">lambda</span> x: x} | prompt | model | StrOutputParser()

<span class="hljs-comment"># Apply to tables</span>
tables = [i.text <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> table_elements]
table_summaries = summarize_chain.batch(tables, {<span class="hljs-string">"max_concurrency"</span>: <span class="hljs-number">5</span>})

<span class="hljs-comment"># Apply to texts</span>
texts = [i.text <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> text_elements]
text_summaries = summarize_chain.batch(texts, {<span class="hljs-string">"max_concurrency"</span>: <span class="hljs-number">1</span>})
</code></pre>
<p><strong>Adding the Summaries and Documents to Vector Store and Doc Store</strong></p>
<p>The summaries are added to the vector store (ChromaDB in this case) and the raw chunks are added to the docstore both mapped with a uid.</p>
<pre><code class="lang-python"><span class="hljs-comment"># The vectorstore to use to index the child chunks</span>
vectorstore = Chroma(collection_name=<span class="hljs-string">"summaries"</span>, embedding_function=FastEmbedEmbeddings())

<span class="hljs-comment"># The storage layer for the parent documents</span>
store = InMemoryStore()
id_key = <span class="hljs-string">"doc_id"</span>

<span class="hljs-comment"># The retriever (empty to start)</span>
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

<span class="hljs-comment"># Add texts</span>
doc_ids = [str(uuid.uuid4()) <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    <span class="hljs-keyword">for</span> i, s <span class="hljs-keyword">in</span> enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

<span class="hljs-comment"># Add tables</span>
table_ids = [str(uuid.uuid4()) <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    <span class="hljs-keyword">for</span> i, s <span class="hljs-keyword">in</span> enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
</code></pre>
<p><strong>Creating the answer chain</strong></p>
<p>As a final process, the RAG chain is created and the query is passed as an input to the RAG chain.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Prompt template</span>
template = <span class="hljs-string">"""Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""</span>
prompt = ChatPromptTemplate.from_template(template)

<span class="hljs-comment"># LLM</span>
model = ChatGoogleGenerativeAI(model=<span class="hljs-string">"gemini-pro"</span>)

<span class="hljs-comment"># RAG pipeline</span>
chain = (
    {<span class="hljs-string">"context"</span>: retriever, <span class="hljs-string">"question"</span>: RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
</code></pre>
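<p>The chain can then be invoked with a plain question, for example (an illustrative invocation):</p>
<pre><code class="lang-python">response = chain.invoke("What is the output voltage range of the LM317?")
print(response)
</code></pre>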
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720281281257/c7b8f2ab-26de-4ff0-bb12-d84f02e8b541.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720284824135/a248e398-cb1c-4aa0-b89f-39cbe9164140.png" alt class="image--center mx-auto" /></p>
<p>As we can see above, the <strong><em>LLM Chain is able to provide us with accurate results from the tables present in the datasheet of LM317 Linear Voltage Regulator.</em></strong></p>
<p>References<br />1. <a target="_blank" href="https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb">Semi Structured RAG Cookbook</a><br />2. <a target="_blank" href="https://docs.unstructured.io/welcome">Unstructured IO Documentation</a></p>
]]></content:encoded></item></channel></rss>