<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Praghadeesh]]></title><description><![CDATA[Praghadeesh]]></description><link>https://blog.praghadeesh.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 28 Apr 2026 17:09:47 GMT</lastBuildDate><atom:link href="https://blog.praghadeesh.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding parquet a bit more]]></title><description><![CDATA[While experimenting with a Spark job to read the same dataset in both CSV and Parquet formats, I observed that queries on Parquet were significantly faster. I was aware that Parquet, being a columnar storage format, is designed to be performant, but ...]]></description><link>https://blog.praghadeesh.com/understanding-parquet-a-bit-more</link><guid isPermaLink="true">https://blog.praghadeesh.com/understanding-parquet-a-bit-more</guid><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Sat, 02 Aug 2025 09:31:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hpjSkU2UYSU/upload/bd48efbe6bd2990b5ff1f5c9c1046df4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While experimenting with a Spark job to read the same dataset in both <strong>CSV</strong> and <strong>Parquet</strong> formats, I observed that queries on Parquet were significantly faster. I was aware that Parquet, being a <strong>columnar storage format</strong>, is designed to be performant, but I wanted to better understand <strong>why</strong> it consistently outperforms CSV - especially in <strong>OLAP workloads</strong>.</p>
<p>Let’s explore the reasons behind Parquet’s performance advantage, and examine the features that make it particularly efficient for analytical queries.</p>
<p>The following PySpark code snippet demonstrates generating a synthetic dataset of <strong>500,000 records</strong> using the <strong>Faker</strong> library. The dataset is then persisted in both <strong>CSV</strong> and <strong>Parquet</strong> formats for comparison.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> StructType, StructField, IntegerType, StringType, BooleanType
<span class="hljs-keyword">from</span> faker <span class="hljs-keyword">import</span> Faker
<span class="hljs-keyword">import</span> time

spark = SparkSession.builder \
    .appName(<span class="hljs-string">"CSV vs Parquet Performance"</span>) \
    .config(<span class="hljs-string">"spark.driver.bindAddress"</span>, <span class="hljs-string">"127.0.0.1"</span>) \
    .getOrCreate()

fake = Faker()

<span class="hljs-comment"># ---------------------------</span>
<span class="hljs-comment"># Generate synthetic dataset</span>
<span class="hljs-comment"># ---------------------------</span>
num_records = <span class="hljs-number">500</span>_000  

data = []
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(num_records):
    data.append((
        i,                                      <span class="hljs-comment"># id</span>
        fake.name(),                            <span class="hljs-comment"># name</span>
        fake.email(),                           <span class="hljs-comment"># email</span>
        fake.city(),                            <span class="hljs-comment"># city</span>
        fake.random_int(min=<span class="hljs-number">18</span>, max=<span class="hljs-number">80</span>),        <span class="hljs-comment"># age</span>
        fake.boolean()                          <span class="hljs-comment"># is_active</span>
    ))

columns = [<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"email"</span>, <span class="hljs-string">"city"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"is_active"</span>]

df = spark.createDataFrame(data, columns)

csv_path = <span class="hljs-string">"faker_csv_dataset"</span>
parquet_path = <span class="hljs-string">"faker_parquet_dataset"</span>

df.write.mode(<span class="hljs-string">"overwrite"</span>).csv(csv_path, header=<span class="hljs-literal">True</span>)
df.write.mode(<span class="hljs-string">"overwrite"</span>).parquet(parquet_path)

<span class="hljs-comment"># Define schema explicitly</span>
custom_schema = StructType([
    StructField(<span class="hljs-string">"id"</span>, IntegerType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"name"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"email"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"city"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"age"</span>, IntegerType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"is_active"</span>, BooleanType(), <span class="hljs-literal">True</span>)
])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">measure_time</span>(<span class="hljs-params">description, file_format, path, read_func</span>):</span>
    start = time.time()
    <span class="hljs-keyword">if</span> file_format == <span class="hljs-string">"csv"</span>:
        df = spark.read.format(file_format).schema(custom_schema).option(<span class="hljs-string">"header"</span>, <span class="hljs-literal">True</span>).load(path)
    <span class="hljs-keyword">else</span>:
        df = spark.read.format(file_format).load(path)
    result = read_func(df)
    elapsed = time.time() - start
    print(<span class="hljs-string">f"<span class="hljs-subst">{description}</span> | <span class="hljs-subst">{file_format.upper()}</span> Time: <span class="hljs-subst">{elapsed:<span class="hljs-number">.2</span>f}</span> sec"</span>)
    <span class="hljs-keyword">return</span> elapsed

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_all_columns</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.count()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_selected_columns</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.select(<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>).count()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_with_filter</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.filter(<span class="hljs-string">"age &gt; 40 AND is_active = true"</span>).count()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_count_only</span>(<span class="hljs-params">df</span>):</span> <span class="hljs-keyword">return</span> df.count()

scenarios = [
    (<span class="hljs-string">"Full Scan"</span>, read_all_columns),
    (<span class="hljs-string">"Column Projection"</span>, read_selected_columns),
    (<span class="hljs-string">"Filtered Read"</span>, read_with_filter),
    (<span class="hljs-string">"Count Only"</span>, read_count_only)
]

results = []
<span class="hljs-keyword">for</span> desc, func <span class="hljs-keyword">in</span> scenarios:
    csv_time = measure_time(desc, <span class="hljs-string">"csv"</span>, csv_path, func)
    parquet_time = measure_time(desc, <span class="hljs-string">"parquet"</span>, parquet_path, func)
    results.append((desc, csv_time, parquet_time, csv_time/parquet_time))

print(<span class="hljs-string">"\n=== Performance Summary ==="</span>)
print(<span class="hljs-string">f"<span class="hljs-subst">{<span class="hljs-string">'Scenario'</span>:&lt;<span class="hljs-number">20</span>}</span><span class="hljs-subst">{<span class="hljs-string">'CSV (sec)'</span>:&lt;<span class="hljs-number">15</span>}</span><span class="hljs-subst">{<span class="hljs-string">'Parquet (sec)'</span>:&lt;<span class="hljs-number">15</span>}</span><span class="hljs-subst">{<span class="hljs-string">'Speedup'</span>}</span>"</span>)
<span class="hljs-keyword">for</span> desc, csv_t, parq_t, speedup <span class="hljs-keyword">in</span> results:
    print(<span class="hljs-string">f"<span class="hljs-subst">{desc:&lt;<span class="hljs-number">20</span>}</span><span class="hljs-subst">{csv_t:&lt;<span class="hljs-number">15.2</span>f}</span><span class="hljs-subst">{parq_t:&lt;<span class="hljs-number">15.2</span>f}</span><span class="hljs-subst">{speedup:<span class="hljs-number">.2</span>f}</span>x"</span>)
</code></pre>
<p>The time taken to read both the CSV and Parquet files under different scenarios is summarized below.</p>
<pre><code class="lang-python">=== Performance Summary ===
Scenario            CSV (sec)      Parquet (sec)  Speedup
Full Scan           <span class="hljs-number">1.69</span>           <span class="hljs-number">0.35</span>           <span class="hljs-number">4.90</span>x
Column Projection   <span class="hljs-number">0.30</span>           <span class="hljs-number">0.11</span>           <span class="hljs-number">2.58</span>x
Filtered Read       <span class="hljs-number">0.38</span>           <span class="hljs-number">0.19</span>           <span class="hljs-number">2.02</span>x
Count Only          <span class="hljs-number">0.10</span>           <span class="hljs-number">0.09</span>           <span class="hljs-number">1.07</span>x
</code></pre>
<p>Although both files contain the same dataset, the CSV file size is approximately <strong>31 MB</strong>, whereas the Parquet file size is only <strong>15 MB</strong> - roughly a <strong>50% reduction</strong>. This size difference stems from Parquet’s <strong>columnar storage layout, dictionary encoding, and compression</strong>. As dataset volume scales, this difference becomes more pronounced, with Parquet offering significant <strong>storage efficiency</strong> and <strong>I/O performance improvements</strong> compared to CSV.</p>
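<p>If you’d like to verify these numbers yourself, here is a minimal sketch (assuming the <code>faker_csv_dataset</code> and <code>faker_parquet_dataset</code> directories written by the earlier snippet) that sums the part-file sizes of each dataset directory:</p>
<pre><code class="lang-python">import os

def dir_size_mb(path):
    """Sum the sizes of all part files under a dataset directory."""
    total = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )
    return total / (1024 * 1024)

print(f"CSV size:     {dir_size_mb('faker_csv_dataset'):.2f} MB")
print(f"Parquet size: {dir_size_mb('faker_parquet_dataset'):.2f} MB")
</code></pre>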
<h3 id="heading-what-is-parquet">What is Parquet?</h3>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.</div>
</div>

<h3 id="heading-storage-layout-models">Storage layout models</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754120234076/d9afbf5b-670d-4f6a-ba1d-c7c7a206a9be.png" alt="Storage Layout Models" class="image--center mx-auto" /></p>
<p>As illustrated in the above diagram, <strong>row-oriented storage</strong> organizes data using <em>horizontal partitioning</em>, where all values of a row are stored contiguously. In contrast, <strong>column-oriented storage</strong> applies <em>vertical partitioning</em>, storing all values of a column contiguously. Additionally, modern formats such as Parquet adopt a <strong>hybrid approach</strong>, combining the advantages of both - organizing data in <strong>row groups</strong> (horizontal segmentation) while storing each column within those groups in a <strong>columnar layout</strong> for optimized compression and query performance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754120890221/eb16f747-43da-4e0b-aaf3-60f6eda0356d.png" alt class="image--center mx-auto" /></p>
<p><strong>Row-oriented storage</strong> is typically better suited for <strong>OLTP (Online Transaction Processing) workloads</strong>, where the system handles numerous small, transactional operations across different rows. In such systems, <strong>insert operations</strong> can simply append new rows to the end of the dataset, while <strong>update operations</strong> can directly locate the target row and modify the corresponding column values in place. This design optimizes for <strong>fast row-level writes and updates</strong> common in transactional systems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754121002668/d7ee189e-988c-445c-91a9-81d56324ac6a.png" alt class="image--center mx-auto" /></p>
<p><strong>Column-oriented storage</strong> is generally better suited for <strong>OLAP (Online Analytical Processing) workloads</strong>, which involve large-scale operations on a subset of columns. Unlike row stores, columnar formats are not ideal for OLTP scenarios, because inserting a new row requires updating multiple column segments stored in different locations - resulting in a <strong>fragmented memory access pattern</strong> and higher write overhead.</p>
<p>In OLAP workloads, however, this columnar design provides significant advantages:</p>
<ul>
<li><p><strong>Projection pushdown</strong> allows the query engine to read only the relevant columns instead of scanning the entire dataset (see the snippet after this list).</p>
</li>
<li><p><strong>Compression efficiency</strong> is improved because similar data values are stored adjacently within each column.</p>
</li>
</ul>
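<p>A quick way to observe projection pushdown on the dataset generated earlier is to inspect Spark’s physical plan; the sketch below assumes the same Spark session (plan text abridged and version-dependent). The <code>ReadSchema</code> entry shows that only the selected columns are read from disk:</p>
<pre><code class="lang-python">df = spark.read.parquet("faker_parquet_dataset")

# Only id and name should appear in the scan's ReadSchema
df.select("id", "name").explain()
# Plan excerpt: ... FileScan parquet ... ReadSchema: struct&lt;id:bigint,name:string&gt;
</code></pre>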
<h3 id="heading-fragmented-memory-access">Fragmented Memory access</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754121350411/2fcd1908-74b7-40b7-81f9-3d0497696f12.png" alt class="image--center mx-auto" /></p>
<p>As illustrated in the diagram above, when a query requires only <strong>columns A and C</strong>, a <strong>row-oriented format</strong> still necessitates reading entire rows, leading to a <strong>fragmented memory access pattern</strong> and introducing potential overhead. In contrast, a <strong>columnar format</strong> stores each column’s values contiguously, enabling the query engine to retrieve the required columns in a <strong>linear, sequential access pattern</strong>. This results in more efficient <strong>I/O utilization</strong> and faster query performance for column-specific analytical workloads.</p>
<p>That said, a <strong>pure columnar model</strong> is not always optimal for <strong>row reconstruction</strong>. Since values for each column are stored separately, reconstructing full rows often requires <strong>scanning multiple column segments</strong> and merging them during query execution. This can introduce additional <strong>CPU and memory overhead</strong>, particularly in workloads that frequently require complete row materialization.</p>
<p>This is where the <strong>hybrid storage model</strong> comes into play, combining the advantages of both <strong>row-oriented</strong> and <strong>column-oriented</strong> approaches.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754121949407/17b4aea2-d9d2-4201-87e0-f7fef3dc46f9.png" alt class="image--center mx-auto" /></p>
<p>A Parquet dataset on disk is not necessarily represented as a single physical file. Instead, it is typically organized as a <strong>directory structure</strong>, where the <strong>logical dataset</strong> is defined by the root directory. This root directory contains multiple <strong>Parquet part files</strong>.</p>
<pre><code class="lang-markdown">dataset<span class="hljs-emphasis">_root/
  ├── part-00000-uuid.parquet
  ├── part-00001-uuid.parquet
  └── part-00002-uuid.parquet</span>
</code></pre>
<h3 id="heading-parquet-data-organization">Parquet data organization</h3>
<p><img src="https://media.datacamp.com/cms/ad_4nxcuuincavq5rqwc42rsxrqtf_hrepxa5zaohmvbkyjdivivu2p79s8pkbiov5ws85byacezrthjzpkg_uk-b1gybmog8fszuf_edkdle1j36eixnmhqb7unprq4emw4phm__zrp.png" alt="Parque file internal structure." /></p>
<p><strong>Directory → Part Files → Row Groups → Column Chunks → Pages</strong></p>
<p>A single Parquet file can contain <strong>multiple Row Groups</strong>, each holding a horizontal slice of the dataset. The number and size of Row Groups within a file are determined by the <strong>configured Row Group size</strong> and the amount of data written.</p>
<h3 id="heading-row-groups"><strong>Row Groups</strong></h3>
<ul>
<li><p><strong>Horizontal partition</strong> of data inside a file.</p>
</li>
<li><p>Each Row Group contains <strong>all columns</strong> for a subset of rows.</p>
</li>
<li><p>Default size ≈ <strong>128 MB</strong> (configurable - see the sketch after this list).</p>
</li>
<li><p>Optimized for <strong>parallel reads</strong>: Each Row Group can be read independently.</p>
</li>
</ul>
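<p>As an illustrative sketch of controlling Row Group size at write time - shown here with <strong>PyArrow</strong>, assuming it is installed (Spark exposes a similar knob via the Hadoop configuration) - smaller Row Groups yield more, finer-grained skippable units at the cost of some per-group overhead:</p>
<pre><code class="lang-python">import pyarrow as pa
import pyarrow.parquet as pq

# A toy table: 500,000 rows, two columns
table = pa.table({"id": list(range(500_000)), "age": [25] * 500_000})

# Ask the writer for ~50,000 rows per Row Group instead of the default
pq.write_table(table, "example.parquet", row_group_size=50_000)

print(pq.ParquetFile("example.parquet").num_row_groups)  # 10
</code></pre>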
<h3 id="heading-column-chunks"><strong>Column Chunks</strong></h3>
<ul>
<li><p><strong>Vertical partition</strong> of data inside a Row Group.</p>
</li>
<li><p>Each Column Chunk contains the values of <strong>a single column</strong> for all rows in that Row Group.</p>
</li>
<li><p><strong>Statistics for each column chunk</strong>:</p>
<ul>
<li><p>Minimum value</p>
</li>
<li><p>Maximum value</p>
</li>
<li><p>Null count</p>
</li>
</ul>
</li>
<li><p>Enables:</p>
<ul>
<li><p><strong>Projection pushdown</strong> (read only the required columns); combined with the statistics above, the reader can also skip Row Groups that cannot match the filter conditions</p>
</li>
<li><p><strong>Better compression</strong> (similar values are adjacent)</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-pages"><strong>Pages</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754125750300/c93896cb-431f-43a4-901e-a8eb5c428c34.png" alt class="image--center mx-auto" /></p>
<p>The output above demonstrates that the inspected column chunk contains <strong>three pages</strong>, utilizing <strong>dictionary encoding</strong> wherever applicable.</p>
<ul>
<li><p><strong>Smallest unit of storage</strong> in a Column Chunk.</p>
</li>
<li><p>Types of pages:</p>
<ul>
<li><p><strong>Data Pages</strong>: Contain actual column values (can use encodings like Dictionary, RLE, Delta).</p>
</li>
<li><p><strong>Dictionary Pages</strong>: Contain unique values for dictionary encoding.</p>
</li>
<li><p><strong>Index Pages</strong>: Optional, for faster lookups.</p>
</li>
</ul>
</li>
<li><p>Default page size ≈ <strong>1 MB</strong>.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754124561357/45a19ddf-7704-4c6a-bd71-1212e7f64bb4.png" alt class="image--center mx-auto" /></p>
<p>Upon inspecting one of the Parquet part files using the <strong>parquet-cli</strong> tool, we observe that the file contains a <strong>single Row Group</strong> with approximately <strong>49,000 records</strong>. The Row Group metadata also includes <strong>column-level statistics</strong>, such as the <strong>minimum</strong> and <strong>maximum</strong> values for each column.</p>
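<p>The same footer metadata can be inspected programmatically as well. A minimal sketch using <strong>PyArrow</strong> (assuming it is installed; the part file name below is hypothetical):</p>
<pre><code class="lang-python">import pyarrow.parquet as pq

pf = pq.ParquetFile("faker_parquet_dataset/part-00000.parquet")  # hypothetical part file name
print(pf.metadata.num_row_groups, pf.metadata.num_rows)

rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    # min/max/null_count are the statistics the reader uses for skipping
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, stats.min, stats.max, stats.null_count)
</code></pre>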
<p>For example, if the <strong>Age</strong> column in this file has a recorded maximum value of <strong>70</strong>, and a query requests only records where <code>Age &gt; 70</code>, the query engine can determine—based on these statistics—that no matching rows exist in this file. As a result, the engine can <strong>skip reading this file entirely</strong>, avoiding unnecessary I/O. This optimization, known as <strong>predicate pushdown</strong>, is one of the key reasons why Parquet performs so well for analytical workloads.</p>
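<p>Predicate pushdown is also visible in Spark’s physical plan. As an illustrative sketch against the dataset generated earlier (exact plan text varies by Spark version):</p>
<pre><code class="lang-python">df = spark.read.parquet("faker_parquet_dataset")

# The age predicate is handed down to the Parquet reader, which can use
# Row Group statistics to skip Row Groups that cannot possibly match
df.filter("age &gt; 70").explain()
# Plan excerpt: ... PushedFilters: [IsNotNull(age), GreaterThan(age,70)] ...
</code></pre>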
<h3 id="heading-encoding-schemes">Encoding Schemes</h3>
<h3 id="heading-dictionary-encoding"><strong>Dictionary Encoding</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754126077173/81d2cad5-dff9-4756-aed7-382b8b961252.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>In the example above, the <strong>Column Chunk for</strong> <code>city</code> includes a <strong>Dictionary Page</strong> that stores the unique city names present in the Row Group. All subsequent Data Pages within this column reference these values through <strong>dictionary indexes</strong>, rather than storing the full strings repeatedly. This approach significantly reduces storage size and improves read performance.</p>
</li>
<li><p><strong>How it works</strong>:</p>
<ul>
<li><p>Stores a <strong>dictionary</strong> of unique values once</p>
</li>
<li><p>Replaces actual values with <strong>integer indexes</strong> into the dictionary</p>
</li>
</ul>
</li>
<li><p><strong>When used</strong>: Columns with <strong>low cardinality</strong> (few distinct values)</p>
</li>
<li><p><strong>Pros</strong>: Significant space saving</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Original Values: [London, Paris, London, New York, Paris]
  Dictionary: {0: London, 1: Paris, 2: New York}
  Encoded Indexes: [0, 1, 0, 2, 1]
</code></pre>
</li>
</ul>
<h3 id="heading-run-length-encoding-rle"><strong>Run-Length Encoding (RLE)</strong></h3>
<ul>
<li><p><strong>How it works</strong>: Compresses <strong>consecutive repeated values</strong> as a single value + count</p>
</li>
<li><p><strong>When used</strong>: Columns with <strong>runs of repeated values</strong></p>
</li>
<li><p><strong>Pros</strong>: Excellent for sorted / repeated data</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Original Values: [A, A, A, A, B, B]
  Encoded: (A,4), (B,2)
</code></pre>
</li>
</ul>
<h3 id="heading-delta-encoding-delta-binary-packed"><strong>Delta Encoding (Delta Binary Packed)</strong></h3>
<ul>
<li><p><strong>How it works</strong>:</p>
<ul>
<li><p>Stores the <strong>difference (delta)</strong> between consecutive values</p>
</li>
<li><p>Often combined with bit packing for integers</p>
</li>
</ul>
</li>
<li><p><strong>When used</strong>: Numeric columns where values are sorted or increment gradually (see the example after this list)</p>
</li>
</ul>
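<p>For illustration, delta encoding stores the first value and then only the (small) differences between consecutive values, which can subsequently be bit packed tightly:</p>
<pre><code class="lang-markdown">Original Values: [100, 103, 105, 110]
Stored: first = 100, deltas = [+3, +2, +5]
</code></pre>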
<h3 id="heading-bit-packing-boolean-packing"><strong>Bit Packing / Boolean Packing</strong></h3>
<ul>
<li><p><strong>How it works</strong>: Stores multiple boolean values or small integers in a <strong>single byte or word</strong> by packing bits tightly.</p>
</li>
<li><p><strong>When used</strong>: Boolean or low-cardinality integer fields</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Boolean Values: [T, F, T, T, F, F, T, F]
  Encoded as: 10110010 (binary)
</code></pre>
</li>
</ul>
<h3 id="heading-plain-encoding"><strong>Plain Encoding</strong></h3>
<ul>
<li><p><strong>How it works</strong>: Stores values in their raw form (uncompressed binary representation).</p>
</li>
<li><p><strong>When used</strong>:</p>
<ul>
<li>As a fallback when no encoding benefits are possible</li>
</ul>
</li>
<li><p><strong>Pros</strong>: Fast decode speed</p>
</li>
<li><p><strong>Cons</strong>: Larger size if values are repetitive</p>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-markdown">  Values: [25, 30, 35, 40]
  Encoded: 25 30 35 40 (no transformation)
</code></pre>
</li>
</ul>
<p>Parquet’s <strong>hybrid columnar format</strong> makes it compact, efficient, and fast for <strong>analytical workloads</strong>. With features like <strong>projection pushdown, predicate pushdown, and advanced encodings</strong>, it minimizes storage and speeds up queries.</p>
<p>Thus, Parquet has become the <strong>default storage format for modern open table formats</strong> such as <strong>Apache Iceberg</strong>, <strong>Delta Lake</strong>, and <strong>Apache Hudi</strong>, thanks to its efficiency, scalability, and compatibility with analytical workloads.</p>
<p>I hope this article added value to your understanding of Parquet internals. <strong>Thanks for reading!</strong> 🚀</p>
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://www.youtube.com/watch?v=1j8SdS7s_NY">The Parquet Format and Performance Optimization Opportunities Boudewijn Braams (Databricks)</a></p>
</li>
<li><p><a target="_blank" href="https://dev.to/alexmercedcoder/all-about-parquet-part-06-encoding-in-parquet-optimizing-for-storage-4hh3">Encoding in Parquet | Optimizing for Storage</a></p>
</li>
<li><p>LLM Models - GPT / Claude</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Bored of Prompt Engineering? Try DSPy]]></title><description><![CDATA[With the advent of LLMs in the recent few years, we stumbled upon something called Prompt Engineering - let’s speak a few words about it to set the context right and then let’s venture into DSPy

What is Prompt Engineering?
Prompt Engineering is the ...]]></description><link>https://blog.praghadeesh.com/bored-of-prompt-engineering-try-dspy</link><guid isPermaLink="true">https://blog.praghadeesh.com/bored-of-prompt-engineering-try-dspy</guid><category><![CDATA[Programming Blogs]]></category><category><![CDATA[dspy]]></category><category><![CDATA[#PromptEngineering]]></category><category><![CDATA[llm]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Sun, 13 Jul 2025 08:37:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/tikhtH3QRSQ/upload/473d6bcc9b6dc19574dcaeac0f647064.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the advent of LLMs in recent years, we stumbled upon something called Prompt Engineering - let’s spend a few words on it to set the context right, and then venture into DSPy.</p>
<p><img src="https://apipie.ai/docs/img/Integrations/DSPy/dspy_logo.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-what-is-prompt-engineering">What is Prompt Engineering?</h2>
<p>Prompt Engineering is the art of designing the instructions we give to a language model to get the desired result. Language models don’t really understand the user’s intentions; they just predict the next word based on the provided prompt, so prompting them right becomes crucial.</p>
<p>Though there are different types of prompts and engineering associated with them, it usually comprises the following steps:</p>
<ul>
<li><p>Choosing the right instruction</p>
</li>
<li><p>Providing examples</p>
</li>
<li><p>Specifying the format</p>
</li>
<li><p>Testing and refining</p>
</li>
</ul>
<p>Well, if you’ve worked with prompts for some time, you’ll have quickly become acquainted with their limitations and the tedious trial-and-error process of crafting the right prompt.</p>
<h2 id="heading-challenges-with-prompt-engineering">Challenges with Prompt Engineering</h2>
<ul>
<li><p>Writing the prompts, tweaking them with trial and error approaches</p>
</li>
<li><p>Prompts can grow big and heavy, spanning thousands of words (expensive)</p>
</li>
<li><p>Heavily coupled to specific language models, often not LM agnostic (a prompt may work with GPT, but will it work with LLaMA?)</p>
</li>
</ul>
<p>What if a tool or framework could solve some or all of the above problems? That is DSPy!</p>
<h2 id="heading-programming-not-prompting-whats-dspy">Programming not Prompting, What’s DSPy?</h2>
<p>DSPy is an open source declarative framework in Python that lets you define the steps of an LLM workflow as modular components, compose them into pipelines, and optimize prompts with feedback loops. This shifts the narrative from working with LLMs being mostly about prompting to thinking of it more like a program.</p>
<p>Just define the inputs and outputs, and the rest will be taken care of by DSPy:</p>
<ul>
<li><p>Declarative approach</p>
</li>
<li><p>Optimizers</p>
</li>
<li><p>Composable chains</p>
</li>
<li><p>Open source and Flexible</p>
</li>
</ul>
<h2 id="heading-core-components-of-dspy">Core components of DSPy</h2>
<p>The core components of DSPy are</p>
<ul>
<li><p>Signature</p>
</li>
<li><p>Modules</p>
</li>
<li><p>Adapters</p>
</li>
<li><p>Optimizer</p>
</li>
</ul>
<p>We can look at these in more detail as we walk through the example.</p>
<h3 id="heading-configuring-the-llm">Configuring the LLM</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dspy
dspy.settings.configure(lm=dspy.LM(<span class="hljs-string">"openai/gpt-4o-mini"</span>))
</code></pre>
<p>In the above piece of code, we are just referring to the language model we want to use. This assumes that you have the dspy package installed on your machine and that your OpenAI API key is already set as an environment variable for the code to pick up.</p>
<h3 id="heading-lets-define-signature">Let’s define signature</h3>
<p>A <strong>Signature</strong> in DSPy is like a <strong>contract</strong> that clearly describes:</p>
<p>✅ <strong>What inputs your module expects</strong><br />✅ <strong>What outputs it should produce</strong></p>
<p>Think of it as a <em>blueprint</em> that tells DSPy and the language model:</p>
<ul>
<li><p><em>Here are the fields you need to fill in.</em></p>
</li>
<li><p><em>Here’s what the output should look like.</em></p>
</li>
</ul>
<p>Signatures can be class-based or inline:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Class based Signature</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SentimentClassifier</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""Classify the sentiment of a text."""</span>

    text: str = dspy.InputField(desc=<span class="hljs-string">"input text to classify sentiment"</span>)
    sentiment: int = dspy.OutputField(
        desc=<span class="hljs-string">"sentiment, the higher the more positive"</span>, ge=<span class="hljs-number">0</span>, le=<span class="hljs-number">10</span>
    )

<span class="hljs-comment"># Inline Signature</span>
str_signature = dspy.make_signature(<span class="hljs-string">"text -&gt; sentiment"</span>)
</code></pre>
<p>As we can see in the above code snippet, we have created a Signature with input and output fields and constraints on what they can be; this enforces the behaviour of the LLM.</p>
<h3 id="heading-modules">Modules</h3>
<p>A Module encapsulates the logic for interacting with LMs; the simplest module that DSPy provides is <code>dspy.Predict</code>. Multiple modules can be chained together and used as a bigger module.</p>
<p>There are different types of modules that DSPy provides us out of the box</p>
<ul>
<li><p><code>dspy.Predict</code> - Simple LLM interaction, often the building block for other complex modules</p>
</li>
<li><p><code>dspy.ChainOfThought</code> - Asks the LM to add reasoning as part of the final answer; this modifies the signature</p>
</li>
<li><p><code>dspy.ProgramOfThought</code> - Asks the LM to output code, whose execution results will dictate the response.</p>
</li>
<li><p><code>dspy.ReAct</code> - An agent that can use tools to implement the given signature.</p>
</li>
</ul>
<pre><code class="lang-python">predict = dspy.Predict(SentimentClassifier) 

output = predict(text=<span class="hljs-string">"I am feeling pretty happy!"</span>)
print(<span class="hljs-string">f"The sentiment is: <span class="hljs-subst">{output.sentiment}</span>"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752388540878/98941db3-c436-425b-975c-a838a5ac17e1.png" alt class="image--center mx-auto" /></p>
<p>In the above code snippet, we have used the most fundamental module <code>dspy.Predict</code>. Let’s look at the prompt it has generated</p>
<pre><code class="lang-python">dspy.inspect_history(n=<span class="hljs-number">1</span>)
</code></pre>
<p>The above snippet prints the prompt that was generated by DSPy.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752394800268/59d6819e-55fc-46e6-b3dd-908fae1a9046.png" alt class="image--center mx-auto" /></p>
<p>Let’s try the same with <code>dspy.ChainOfThought</code> Module and see what it does to the signature</p>
<pre><code class="lang-python">cot = dspy.ChainOfThought(SentimentClassifier)

output = cot(text=<span class="hljs-string">"I am feeling pretty happy!"</span>)
print(output)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752395459965/acae1bbd-86e3-4534-9598-b87602bd5f0f.png" alt class="image--center mx-auto" /></p>
<p>Let’s look at the prompt generated</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752395512364/d590e369-2f37-423e-ab30-121226d37b8c.png" alt class="image--center mx-auto" /></p>
<p>As we can see in the above prompt, when used with the <code>dspy.ChainOfThought</code> module, the signature has been extended with another output field called reasoning, which was not user-defined. The reasoning field provides a justification of how the model arrived at a specific output for the user query.</p>
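<p>Since modules are composable, they can also be combined into a bigger module. Below is a minimal illustrative sketch (the two-step pipeline and the inline <code>"text -&gt; summary"</code> signature are hypothetical, reusing the <code>SentimentClassifier</code> signature from above):</p>
<pre><code class="lang-python">import dspy

class SummarizeThenClassify(dspy.Module):
    """Hypothetical composite module: summarize first, then classify sentiment."""

    def __init__(self):
        super().__init__()
        self.summarize = dspy.Predict("text -&gt; summary")
        self.classify = dspy.ChainOfThought(SentimentClassifier)

    def forward(self, text: str):
        summary = self.summarize(text=text).summary
        return self.classify(text=summary)

pipeline = SummarizeThenClassify()
output = pipeline(text="The delivery was late, but the support team was fantastic!")
print(output.sentiment)
</code></pre>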
<p>I hope the above gives a primer on the DSPy framework; let’s discuss Adapters and Optimizers with a full-fledged working example in an upcoming blog.</p>
<h3 id="heading-references">References:</h3>
<p><a target="_blank" href="https://www.deeplearning.ai/short-courses/dspy-build-optimize-agentic-apps/">DSPy: Build and Optimize Agentic Apps</a></p>
<p><a target="_blank" href="https://dspy.ai/learn/">DSPy: Learn</a></p>
]]></content:encoded></item><item><title><![CDATA[Understanding Single Tenant and Multi Tenant architecture ✨]]></title><description><![CDATA[Have you ever wondered about the difference between having a blog hosted in Ghost Instance and a blogging platform like Hashnode? 🤔
Have you ever wondered about the difference in living in a villa and an apartment? 🤔

The difference between the abo...]]></description><link>https://blog.praghadeesh.com/understanding-single-tenant-and-multi-tenant-architecture</link><guid isPermaLink="true">https://blog.praghadeesh.com/understanding-single-tenant-and-multi-tenant-architecture</guid><category><![CDATA[architecture]]></category><category><![CDATA[software development]]></category><category><![CDATA[SaaS]]></category><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Tue, 24 Sep 2024 17:50:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/RfY1OQqlT3U/upload/3714ca2cf880a0e9b068db32933a639d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever wondered about the difference between having a blog hosted in <strong>Ghost Instance</strong> and a blogging platform like <strong>Hashnode</strong>? 🤔</p>
<p>Have you ever wondered about the difference between living in a <strong>villa</strong> and an <strong>apartment</strong>? 🤔</p>
<p><img src="https://www.greentechbuilders.in/uploads/media/untitled-design-1-610ce60b0d658.jpg" alt="Villas Vs Apartments : Things You Need To Know Before Choosing One" /></p>
<p>The difference between the above pretty much sums up <em>Single Tenant and Multi Tenant architecture, commonly stumbled upon in the software ecosystem</em>.</p>
<p>When comparing the architecture of a <strong>villa</strong> and an <strong>apartment</strong>, think of it in terms of space, independence, and shared resources.</p>
<ul>
<li><p>🏘️ <strong>Villa Architecture</strong>: A villa is like a <strong>single-tenant</strong> setup in software architecture. It offers <strong>complete independence</strong> in terms of space, design, and customization. You have your own private land, no shared walls, and complete control over modifications, similar to how a single-tenant system allows a user full control over their environment. Villas also provide more space, privacy, and autonomy but come with higher maintenance responsibilities.</p>
</li>
<li><p>🏢 <strong>Apartment Architecture</strong>: An apartment, on the other hand, is akin to <strong>multi-tenant</strong> architecture. While you have your own private living space (your unit), you share walls, common areas, infrastructure, and amenities with other tenants. It’s more cost-effective, easier to maintain, and designed to accommodate multiple residents in a shared environment, just like how multi-tenant systems are designed to serve multiple users on shared infrastructure efficiently.</p>
</li>
</ul>
<p><img src="https://media.graphassets.com/sGAxyuKWToGN5MfgQdlO" alt="What is Multi-Tenancy and Why Do You Need a Multi-Tenant Architecture? |  Hygraph" /></p>
<h3 id="heading-single-tenant-architecture"><strong>Single-Tenant Architecture</strong></h3>
<p>In a <strong>single-tenant architecture</strong>, each customer or tenant has their own dedicated software instance, with a separate application instance and database environment. Technically, the software is replicated for each customer, ensuring full data isolation and customization at the cost of increased resource consumption.</p>
<p>For instance, a media company running a popular blog might self-host Ghost to ensure <strong>complete ownership</strong> of their content, customize the blog’s performance, and manage sensitive data directly. This setup is preferred if they have <strong>specific privacy requirements</strong> or need to make <strong>extensive customizations</strong> that wouldn’t be possible in a shared multi-tenant environment.</p>
<p><img src="https://149842033.v2.pressablecdn.com/wp-content/uploads/2024/02/ghost-logo.png" alt="Ghost Blogging Platform: My Experience and Review - uiCookies" /></p>
<h4 id="heading-key-technical-characteristics"><strong>Key Technical Characteristics:</strong></h4>
<ul>
<li><p><strong>Dedicated Infrastructure:</strong> Each tenant has an isolated application and database stack, typically running on separate virtual machines or containers.</p>
</li>
<li><p><strong>Data Isolation:</strong> Complete separation of tenant data with no shared storage or compute resources.</p>
</li>
<li><p><strong>Customizability:</strong> High degree of flexibility in terms of software configurations, security rules, and application logic.</p>
</li>
<li><p><strong>Security:</strong> Since each tenant operates in an isolated environment, security risks are minimized due to lack of shared resources.</p>
</li>
</ul>
<hr />
<h3 id="heading-pros-of-single-tenant-architecture"><strong>Pros of Single-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Enhanced Security:</strong></p>
<ul>
<li><p>Full isolation ensures that each tenant's data is siloed from others. This prevents cross-tenant breaches that can occur in shared environments.</p>
</li>
<li><p>Attack vectors like shared database vulnerabilities are eliminated. Customers can configure their own security policies, firewalls, and encryption standards.</p>
</li>
</ul>
</li>
<li><p><strong>Full Customization:</strong></p>
<ul>
<li><p>Each instance can be highly tailored to the needs of the tenant, allowing custom features, unique workflows, and specific data structures.</p>
</li>
<li><p>Flexibility to modify backend configurations, database schemas, and even run custom plugins without affecting other customers.</p>
</li>
</ul>
</li>
<li><p><strong>Performance Stability:</strong></p>
<ul>
<li><p>No sharing of computational resources means consistent performance. High availability can be assured through isolated scaling and resource allocation.</p>
</li>
<li><p>Performance tuning for each tenant becomes easier. Tenants can deploy performance-enhancing services like caching, autoscaling, or load balancing without worrying about other tenants overloading the system.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-cons-of-single-tenant-architecture"><strong>Cons of Single-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Higher Costs:</strong></p>
<ul>
<li><p>The dedicated resources for each tenant make this architecture more expensive, as providers need to maintain separate instances for each customer.</p>
</li>
<li><p>Infrastructure costs can rise exponentially as each instance requires its own VM, storage, and network configurations. The provider also incurs higher operational overhead for managing, monitoring, and scaling individual instances.</p>
</li>
</ul>
</li>
<li><p><strong>Complex Maintenance and Upgrades:</strong></p>
<ul>
<li><p>Every tenant’s instance must be updated separately, which increases operational complexity, especially for bug fixes, security patches, and version upgrades.</p>
</li>
<li><p>Providers have to carefully manage version control and deployment pipelines for each instance. Automation tools like Ansible or Terraform become critical for managing infrastructure at scale.</p>
</li>
</ul>
</li>
<li><p><strong>Inefficient Resource Utilization:</strong></p>
<ul>
<li><p>Resources are allocated on a per-tenant basis, which often leads to underutilization.</p>
</li>
<li><p>CPU, memory, and storage resources might be wasted if a tenant doesn’t use their full capacity. This becomes inefficient compared to pooling resources across multiple tenants.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-multi-tenant-architecture"><strong>Multi-Tenant Architecture</strong></h3>
<p>In a <strong>multi-tenant architecture</strong>, multiple tenants share a single instance of the software and database. Although the resources are shared, each tenant’s data is segregated logically, usually at the database or application layer. I would consider <strong>SaaS applications</strong> an ideal candidate for multi-tenant architecture.</p>
<p>A typical example is <strong>Shopify</strong>, where thousands of e-commerce stores share the same application instance, but each store has its own segregated data and custom configurations. Shopify scales efficiently by pooling resources across all tenants.</p>
<p><img src="https://img.intertoons.com/wp-content/uploads/2024/08/Shopify-ecommerce-platform.png.webp" alt="Top 10 Fashion Websites on Shopify - Intertoons Internet Services Pvt.Ltd." /></p>
<h4 id="heading-key-technical-characteristics-1"><strong>Key Technical Characteristics:</strong></h4>
<ul>
<li><p><strong>Shared Infrastructure:</strong> All tenants share a single instance of the software, often including the application server, database, and compute resources.</p>
</li>
<li><p><strong>Data Segregation:</strong> Data is logically separated either at the database level (e.g., separate tables per tenant) or within a single database using tenant-specific tags (e.g., a <code>tenant_id</code> field in each record), often combined with techniques such as RLS (Row Level Security); see the sketch after this list.</p>
</li>
<li><p><strong>Elastic Resource Utilization:</strong> Resources (CPU, memory, storage) are pooled and dynamically allocated based on tenant needs.</p>
</li>
<li><p><strong>Scalability:</strong> It’s easier for the provider to scale the system because they only need to scale one instance rather than multiple isolated instances.</p>
</li>
</ul>
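<p>As a minimal sketch of logical data segregation by <code>tenant_id</code> (using an in-memory SQLite table purely for illustration), all tenants share one table, yet every query is scoped to the requesting tenant:</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "globex", 25.0), (3, "acme", 5.5)],
)

def orders_for_tenant(tenant_id: str):
    # Shared table, logically separated rows: every query is scoped by tenant_id
    return conn.execute(
        "SELECT id, amount FROM orders WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()

print(orders_for_tenant("acme"))    # [(1, 10.0), (3, 5.5)]
print(orders_for_tenant("globex"))  # [(2, 25.0)]
</code></pre>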
<hr />
<h3 id="heading-pros-of-multi-tenant-architecture"><strong>Pros of Multi-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Cost Efficiency:</strong></p>
<ul>
<li><p>By sharing infrastructure, the cost per tenant is significantly reduced. Providers can optimize hardware and software resources across a larger number of customers.</p>
</li>
<li><p>Tenants share common resources like load balancers, databases, and application servers, leading to reduced operational costs and more efficient scaling.</p>
</li>
</ul>
</li>
<li><p><strong>Simplified Maintenance:</strong></p>
<ul>
<li><p>Updates, bug fixes, and patches can be rolled out to all tenants at once, reducing the complexity of version management.</p>
</li>
<li><p>CI/CD pipelines are streamlined as the provider needs to manage only one instance. DevOps tools like Kubernetes can be used to automate rolling updates and deployments.</p>
</li>
</ul>
</li>
<li><p><strong>Scalability:</strong></p>
<ul>
<li><p>Multi-tenant architectures scale horizontally by adding more tenants to the same instance. Resources can be dynamically allocated based on load, allowing better handling of peak usage.</p>
</li>
<li><p>With autoscaling features in cloud platforms (e.g., AWS, Google Cloud), the provider can elastically scale the system to handle increased traffic without requiring separate infrastructure for each tenant.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-cons-of-multi-tenant-architecture"><strong>Cons of Multi-Tenant Architecture</strong></h3>
<ol>
<li><p><strong>Limited Customization:</strong></p>
<ul>
<li><p>Since multiple tenants share the same application and database, there is less flexibility in customization. Changes made to the software affect all tenants.</p>
</li>
<li><p>Tenant-specific configurations are typically limited to front-end settings or user-level preferences, with minimal ability to alter core application logic or database schemas.</p>
</li>
</ul>
</li>
<li><p><strong>Security Risks:</strong></p>
<ul>
<li><p>Although tenants’ data is logically separated, a security vulnerability in the shared infrastructure could expose data across tenants.</p>
</li>
<li><p>A poorly configured shared database or a misconfigured security policy can lead to data leakage across tenants.</p>
</li>
</ul>
</li>
<li><p><strong>Performance Fluctuations:</strong></p>
<ul>
<li><p>Resource usage spikes from one tenant can affect the performance of other tenants sharing the same infrastructure.</p>
</li>
<li><p>Even with resource limits, a heavy-load tenant can degrade performance for others. Providers might need to consider performance tuning and load balancing to handle such cases.</p>
</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-key-differences"><strong>Key differences</strong></h3>
<p>The decision between single-tenant and multi-tenant architecture depends on your business and technical requirements, particularly in the areas of cost, security, and scalability.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Single-Tenant</td><td>Multi-Tenant</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Customization</strong></td><td>Customization completely in control</td><td>Limited Customization</td></tr>
<tr>
<td><strong>Cost</strong></td><td>Higher - dedicated and isolated infrastructure</td><td>Lower - common and shared resources</td></tr>
<tr>
<td><strong>Security</strong></td><td>Complete data isolation</td><td>Shared infrastructure risk</td></tr>
<tr>
<td><strong>Scalability</strong></td><td>Low - Vertical scaling per instance</td><td>High - Horizontal scaling across tenants</td></tr>
<tr>
<td><strong>Resource Utilization</strong></td><td>Isolated instances</td><td>Pooled resources, efficient use</td></tr>
<tr>
<td><strong>Maintenance</strong></td><td>Pretty complex as the number of instances increases</td><td>Simple - Centralized updates and management</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-summary">☺️ <strong>Summary</strong></h3>
<p><strong>Single-tenant architecture</strong> provides maximum control, customization, and security, making it ideal for industries with strict compliance requirements (e.g., healthcare, banking). However, it comes at a higher cost and greater complexity in terms of maintenance and scaling.</p>
<p><strong>Multi-tenant architecture</strong>, on the other hand, is highly scalable, cost-effective, and easier to maintain, making it the go-to choice for most SaaS providers serving a broad range of customers with standard needs.</p>
]]></content:encoded></item><item><title><![CDATA[Advancing RAG with unstructured.io]]></title><description><![CDATA[Hello All, This is Praghadeesh back to writing blogs after a while (I lost my previously hosted Ghost Instance with no backups and had to start from scratch 😕). In this blog, let's explore a bit more on RAG by trying to work on some complex PDFs lev...]]></description><link>https://blog.praghadeesh.com/advancing-rag-with-unstructuredio</link><guid isPermaLink="true">https://blog.praghadeesh.com/advancing-rag-with-unstructuredio</guid><category><![CDATA[Unstructured.io]]></category><category><![CDATA[multivectorretriever]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[langchain]]></category><category><![CDATA[openai]]></category><category><![CDATA[#GoogleGemini]]></category><dc:creator><![CDATA[Praghadeesh T K S]]></dc:creator><pubDate>Sat, 06 Jul 2024 16:56:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1720272381721/39527dd1-1e8d-40eb-95f0-673eb1482134.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello All, This is Praghadeesh back to writing blogs after a while (I lost my previously hosted Ghost Instance with no backups and had to start from scratch 😕). In this blog, let's explore a bit more on RAG by trying to work on some complex PDFs leveraging the capabilities of unstructured.io and Langchain's MultiVectorRetriever this time.</p>
<h3 id="heading-what-is-unstructuredio">What is unstructured.io?</h3>
<p>Unstructured.io is an open source project that provides tools to work on diverse sources of documents such as PDF, HTML and so on, and helps us streamline the data processing workflow for LLMs. It's more of an ETL tool for Gen AI use cases. It comes in three different offerings:</p>
<ul>
<li><p>Serverless API</p>
</li>
<li><p>Azure/AWS Marketplace offering</p>
</li>
<li><p>Self hostable solution</p>
</li>
</ul>
<h3 id="heading-what-is-rag-and-why-to-use-unstructuredio-with-rag">What is RAG and why to use unstructured.io with RAG?</h3>
<p>If the title of the blog interested you and you are already here reading it, you probably know what RAG is all about. In oversimplified terms, it's just the art of injecting context into LLMs, where the goal is to help them answer questions that are beyond the LLM's training data. I believe this might be a perfect analogy: it's like an open book exam, where you try to find the relevant content from the book and make sense out of it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720270575165/c542acc3-4b99-44fa-b54e-686d04ca1c01.png" alt="Open Book Scene from the Tamil Movie Nanban" class="image--center mx-auto" /></p>
<p>Cool, but how does unstructured help here? The process of RAG becomes complex when we try to deal with diverse contents such as Tables, Images, Vector Diagrams, Formulae and so on. Unstructured.io helps us work with some of these data and makes our job a bit easier; the scope of this blog is limited to handling data in tabular format in complex PDFs.</p>
<h3 id="heading-working-with-complex-pdfs">Working with Complex PDFs</h3>
<p>Complex PDFs may involve Financial Reports, Scientific Research papers, Technical Reference Documents, Engineering Datasheets and so on. In this blog, let's try dealing with a datasheet for an electrical component called the LM317, a linear voltage regulator.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720271544155/655bd0e9-b5a2-4088-9aa2-665cd119d683.png" alt="LM317 Datasheet" class="image--center mx-auto" /></p>
<p>The above is an example of how the content of the <a target="_blank" href="https://www.ti.com/lit/ds/symlink/lm317.pdf?ts=1720240009145&amp;ref_url=https%253A%252F%252Fwww.google.com%252F">datasheet</a> looks; it has multiple pages with such tables and vector diagrams, where extracting data without losing quality might not be possible with traditional RAG.</p>
<p><strong>Semi Structured RAG with Multi Vector Retriever</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720282218262/e898b4e5-8347-43ac-8062-13284ceb427b.png" alt="Semi Structured RAG" class="image--center mx-auto" /></p>
<ul>
<li><p>The idea here is to extract text and table chunks separately as shown above using unstructured</p>
</li>
<li><p>Create a summarization chain and generate summary for texts and tables</p>
</li>
<li><p>Ingest the text and table summary with corresponding embeddings into the vector store</p>
</li>
<li><p>Ingest the Raw chunks into the docstore or memorystore</p>
</li>
<li><p>Query against the summary embeddings, retrieve the corresponding raw chunks from the docstore associated with the matched summaries in the vectorstore, and pass the chunks to the LLM to make sense out of them</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Note: This blog covers a high level implementation of the code to get the context right. The details of unstructured capabilities and code breakdown will be coverted in the upcoming blogs.</div>
</div>

<p><strong>Partitioning the PDF document using unstructured</strong></p>
<pre><code class="lang-python">unstruct_client = UnstructuredClient(
    api_key_auth=os.getenv(<span class="hljs-string">"UNSTRUCTURED_API_AUTH_KEY"</span>)
)

filename = <span class="hljs-string">"lm317.pdf"</span>

<span class="hljs-keyword">with</span> open(filename, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy=<span class="hljs-string">"hi_res"</span>,
    hi_res_model_name=<span class="hljs-string">"yolox"</span>,
    skip_infer_table_types=[],
    pdf_infer_table_structure=<span class="hljs-literal">True</span>
)

<span class="hljs-keyword">try</span>:
    resp = unstruct_client.general.partition(req)
    pdf_elements = dict_to_elements(resp.elements)
<span class="hljs-keyword">except</span> SDKError <span class="hljs-keyword">as</span> e:
    print(e)
</code></pre>
<p>In the above part of the code, the partitioning of the PDF is executed. Unstructured simplifies the preprocessing of structured and unstructured documents for downstream tasks, irrespective of what type of file content is provided as the source. When partitioned, the result is a list of Element objects.</p>
<p>The below is an example of what the partition output looks like; the Elements can be of type <code>Title, NarrativeText, Image, Table, ListItem, Header, Footer</code> and so on.</p>
<pre><code class="lang-json">{
       <span class="hljs-attr">"type"</span>:<span class="hljs-string">"Title"</span>,
       <span class="hljs-attr">"element_id"</span>:<span class="hljs-string">"d8ecdee23702fdb35f98390141100d13"</span>,
       <span class="hljs-attr">"text"</span>:<span class="hljs-string">"from 1.25 V to 37 V"</span>,
       <span class="hljs-attr">"metadata"</span>:{
          <span class="hljs-attr">"filetype"</span>:<span class="hljs-string">"application/pdf"</span>,
          <span class="hljs-attr">"languages"</span>:[
             <span class="hljs-string">"eng"</span>
          ],
          <span class="hljs-attr">"page_number"</span>:<span class="hljs-number">1</span>,
          <span class="hljs-attr">"filename"</span>:<span class="hljs-string">"lm317.pdf"</span>
       }
    },
    {
       <span class="hljs-attr">"type"</span>:<span class="hljs-string">"ListItem"</span>,
       <span class="hljs-attr">"element_id"</span>:<span class="hljs-string">"eb105b9f3e577473acac7ba394cea3c7"</span>,
       <span class="hljs-attr">"text"</span>:<span class="hljs-string">"Output current greater than 1.5 A • • Thermal overload protection • Output safe-area compensation"</span>,
       <span class="hljs-attr">"metadata"</span>:{
          <span class="hljs-attr">"filetype"</span>:<span class="hljs-string">"application/pdf"</span>,
          <span class="hljs-attr">"languages"</span>:[
             <span class="hljs-string">"eng"</span>
          ],
          <span class="hljs-attr">"page_number"</span>:<span class="hljs-number">1</span>,
          <span class="hljs-attr">"parent_id"</span>:<span class="hljs-string">"d8ecdee23702fdb35f98390141100d13"</span>,
          <span class="hljs-attr">"filename"</span>:<span class="hljs-string">"lm317.pdf"</span>
       }
    },
</code></pre>
<p><strong>Chunking the elements obtained after partitioning</strong></p>
<p>The partitions created are then chunked using the chunking strategy - chunk_by_title.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The <code>by_title</code> chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.</div>
</div>

<p>The chunks are categorized as table chunks and text chunks respectively, and a summary chain is created using the Google Gemini Pro model, which helps us create a list of table summaries and text summaries.</p>
<pre><code class="lang-python">chunks = chunk_by_title(pdf_elements,max_characters=<span class="hljs-number">4000</span>,new_after_n_chars=<span class="hljs-number">3800</span>, combine_text_under_n_chars=<span class="hljs-number">2000</span>)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Element</span>(<span class="hljs-params">BaseModel</span>):</span>
    type: str
    text: Any

<span class="hljs-comment"># Categorize by type</span>
categorized_elements = []
<span class="hljs-keyword">for</span> element <span class="hljs-keyword">in</span> chunks:
    <span class="hljs-keyword">if</span> <span class="hljs-string">"unstructured.documents.elements.Table"</span> <span class="hljs-keyword">in</span> str(type(element)):
        categorized_elements.append(Element(type=<span class="hljs-string">"table"</span>, text=str(element)))
    <span class="hljs-keyword">elif</span> <span class="hljs-string">"unstructured.documents.elements.CompositeElement"</span> <span class="hljs-keyword">in</span> str(type(element)):
        categorized_elements.append(Element(type=<span class="hljs-string">"text"</span>, text=str(element)))

<span class="hljs-comment"># Tables</span>
table_elements = [e <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> categorized_elements <span class="hljs-keyword">if</span> e.type == <span class="hljs-string">"table"</span>]
print(len(table_elements))

<span class="hljs-comment"># Text</span>
text_elements = [e <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> categorized_elements <span class="hljs-keyword">if</span> e.type == <span class="hljs-string">"text"</span>]
print(len(text_elements))

<span class="hljs-comment"># Prompt</span>
prompt_text = <span class="hljs-string">"""You are an assistant tasked with summarizing tables and text. \ 
Give a concise summary of the table or text. Table or text chunk: {element} """</span>
prompt = ChatPromptTemplate.from_template(prompt_text)

<span class="hljs-comment"># Summary chain</span>
model = ChatGoogleGenerativeAI(model=<span class="hljs-string">"gemini-pro"</span>)
summarize_chain = {<span class="hljs-string">"element"</span>: <span class="hljs-keyword">lambda</span> x: x} | prompt | model | StrOutputParser()

<span class="hljs-comment"># Apply to tables</span>
tables = [i.text <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> table_elements]
table_summaries = summarize_chain.batch(tables, {<span class="hljs-string">"max_concurrency"</span>: <span class="hljs-number">5</span>})

<span class="hljs-comment"># Apply to texts</span>
texts = [i.text <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> text_elements]
text_summaries = summarize_chain.batch(texts, {<span class="hljs-string">"max_concurrency"</span>: <span class="hljs-number">1</span>})
</code></pre>
<p><strong>Adding the Summaries and Documents to Vector Store and Doc Store</strong></p>
<p>The summaries are added to the vector store (ChromaDB in this case) and the raw chunks are added to the docstore both mapped with a uid.</p>
<pre><code class="lang-python"><span class="hljs-comment"># The vectorstore to use to index the child chunks</span>
vectorstore = Chroma(collection_name=<span class="hljs-string">"summaries"</span>, embedding_function=FastEmbedEmbeddings())

<span class="hljs-comment"># The storage layer for the parent documents</span>
store = InMemoryStore()
id_key = <span class="hljs-string">"doc_id"</span>

<span class="hljs-comment"># The retriever (empty to start)</span>
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

<span class="hljs-comment"># Add texts</span>
doc_ids = [str(uuid.uuid4()) <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    <span class="hljs-keyword">for</span> i, s <span class="hljs-keyword">in</span> enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

<span class="hljs-comment"># Add tables</span>
table_ids = [str(uuid.uuid4()) <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    <span class="hljs-keyword">for</span> i, s <span class="hljs-keyword">in</span> enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
</code></pre>
<p><strong>Creating the answer chain</strong></p>
<p>As a final process, the RAG chain is created and the query is passed as an input to the RAG chain.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Prompt template</span>
template = <span class="hljs-string">"""Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""</span>
prompt = ChatPromptTemplate.from_template(template)

<span class="hljs-comment"># LLM</span>
model = ChatGoogleGenerativeAI(model=<span class="hljs-string">"gemini-pro"</span>)

<span class="hljs-comment"># RAG pipeline</span>
chain = (
    {<span class="hljs-string">"context"</span>: retriever, <span class="hljs-string">"question"</span>: RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
</code></pre>
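<p>The chain can then be invoked with a plain question, for example (an illustrative invocation):</p>
<pre><code class="lang-python">response = chain.invoke("What is the output voltage range of the LM317?")
print(response)
</code></pre>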
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720281281257/c7b8f2ab-26de-4ff0-bb12-d84f02e8b541.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720284824135/a248e398-cb1c-4aa0-b89f-39cbe9164140.png" alt class="image--center mx-auto" /></p>
<p>As we can see above, the <strong><em>LLM Chain is able to provide us with accurate results from the tables present in the datasheet of LM317 Linear Voltage Regulator.</em></strong></p>
<p>References<br />1. <a target="_blank" href="https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb">Semi Structured RAG Cookbook</a><br />2. <a target="_blank" href="https://docs.unstructured.io/welcome">Unstructured IO Documentation</a></p>
]]></content:encoded></item></channel></rss>