In [6]:
Copied!
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
documents = SimpleWebPageReader(html_to_text=True).load_data(
["https://www.thoughtworks.com/en-in/insights/blog/data-strategy/building-an-amazon-com-for-your-data-products"]
)
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
documents = SimpleWebPageReader(html_to_text=True).load_data(
["https://www.thoughtworks.com/en-in/insights/blog/data-strategy/building-an-amazon-com-for-your-data-products"]
)
In [3]:
Copied!
documents[0]
documents[0]
Out[3]:
Document(id_='https://www.thoughtworks.com/en-in/insights/blog/data-strategy/building-an-amazon-com-for-your-data-products', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='[ ![Thoughtworks](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/thoughtworks-logo.svg) ](/en-in "Thoughtworks")\n\nMenu\n\nClose\n\n * [What we do ](/en-in/what-we-do "What we do")\n\n * [ Go to overview ](/en-in/what-we-do)\n * ### Services\n\n * [ Artificial Intelligence ](/en-in/what-we-do/ai)\n * [ Cloud ](/en-in/what-we-do/cloud)\n * [ Customer Experience and Products ](/en-in/what-we-do/customer-experience-product-design)\n * [ Data and Analytics ](/en-in/what-we-do/data)\n * [ Managed Services ](/en-in/what-we-do/digital-application-management-and-operations)\n * [ Modernization ](/en-in/what-we-do/modernization)\n * [ Platforms ](/en-in/what-we-do/platforms)\n\n * [Who we work with ](/en-in/clients "Who we work with")\n\n * [ Go to overview ](/en-in/clients)\n * [Automotive ](/en-in/clients/automotive "Automotive")\n * [Healthcare and Life Sciences ](/en-in/clients/healthcare "Healthcare and Life Sciences")\n * [Public Sector ](/en-in/clients/public-sector "Public Sector")\n * [Cleantech, Energy and Utilities ](/en-in/clients/cleantech-energy-utilities "Cleantech, Energy and Utilities")\n * [Media and Publishing ](/en-in/clients/media-publishing "Media and Publishing")\n * [Retail and E-commerce ](/en-in/clients/retail-ecommerce "Retail and E-commerce")\n * [Financial Services and Insurance ](/en-in/clients/financial-services-insurance "Financial Services and Insurance")\n * [Not-for-profit ](/en-in/clients/not-for-profit "Not-for-profit")\n * [Travel and Transport ](/en-in/clients/travel-transport "Travel and Transport")\n\n * [Insights ](/en-in/insights "Insights")\n\n * [ Go to overview ](/en-in/insights)\n * Loading\n\n###\n\n * ### Resource Hubs\n\n * [ Technology \n\nEnterprise technology and engineering excellence\n\n](/en-in/insights/technology)\n\n * [ Business \n\nBusiness and industry insights for digital leaders\n\n](/en-in/insights/business)\n\n * [ Culture \n\nExplore what it means to be a Thoughtworker\n\n](/en-in/insights/culture)\n\n * ### Publications and Tools\n\n * [ Technology Radar \n\nAn opinionated guide to today\'s technology landscape\n\n](/en-in/radar)\n\n * [ Perspectives \n\nA no-nonsense publication for digital leaders\n\n](/en-in/perspectives)\n\n * [ Digital Fluency Model \n\nA model to help you build a resilient business\n\n](/en-in/digital-fluency)\n\n * [ Decoder \n\nThe business execs\' A-Z guide to technology\n\n](/en-in/insights/decoder)\n\n * [ Looking Glass \n\nBringing the tech-led business changes into focus\n\n](/en-in/insights/looking-glass)\n\n * ### All Insights\n\n * [ Articles \n\nIn-depth insights to help your business grow\n\n](/en-in/insights/articles)\n\n * [ Blogs \n\nExpert advice on strategy, design, engineering, and careers\n\n](/en-in/insights/blog)\n\n * [ Books \n\nExplore our extensive library to keep learning\n\n](/en-in/insights/books)\n\n * [ Podcasts \n\nConversations on the latest in business and tech\n\n](/en-in/insights/podcasts)\n\n * [Careers ](/en-in/careers "Careers")\n\n * [ Go to overview ](/en-in/careers)\n * [Application Process \n\nWhat to expect as you interview with us\n\n](/en-in/careers/our-process "Application Process")\n\n * [Consultant Life \n\nLearn what life is like as a Thoughtworker\n\n](/en-in/careers/consultant-life "Consultant Life")\n\n * [Thoughtworks India graduates hiring ](/en-in/careers/graduates "Thoughtworks India graduates hiring")\n * [Search Jobs \n\nFind open positions in your region\n\n](/en-in/careers/jobs "Search Jobs")\n\n * [Stay Connected \n\nSign up for our monthly newsletter\n\n](/en-in/careers/access "Stay Connected")\n\n * [Learning and Development \n\nExplore how we support career growth\n\n](/en-in/careers/learning-and-development "Learning and Development")\n\n * [Benefits \n\nSee how we take care of our people\n\n](/en-in/careers/benefits "Benefits")\n\n * [About ](/en-in/about-us "About")\n\n * [ Go to overview ](/en-in/about-us)\n * [Our Purpose ](/en-in/about-us/our-purpose "Our Purpose")\n * [Diversity, Equity and Inclusion ](/en-in/about-us/diversity-and-inclusion "Diversity, Equity and Inclusion")\n * [Our History ](/en-in/about-us/history "Our History")\n * [Our Leaders ](/en-in/about-us/leaders "Our Leaders")\n * [Social Change ](/en-in/about-us/social-change "Social Change")\n * [News ](/en-in/about-us/news "News")\n * [Partnerships ](/en-in/about-us/partnerships "Partnerships")\n * [Sustainability ](/en-in/about-us/sustainability "Sustainability")\n * [Conferences and Events ](/en-in/about-us/events "Conferences and Events")\n * [Our Brand ](/en-in/about-us/brand "Our Brand")\n * [Awards and Recognition ](/en-in/about-us/awards-recognition "Awards and Recognition")\n\n * [Investors ](https://investors.thoughtworks.com/ "Investors")\n * [Contact ](/en-in/contact-us "Contact")\n\nSearch\n\nClose\n\n![](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/in.svg) India | English\n\n * ![Australia flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/au.svg) Australia\n\n[English](/en-au/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![Brazil flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/br.svg) Brazil\n\n[English](/en-br/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English") | [Português](/pt-br "Português")\n\n * ![Canada flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/ca.svg) Canada\n\n[English](/en-ca/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![Chile flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/cl.svg) Chile\n\n[English](/en-cl/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English") | [ Español](/es-cl " Español")\n\n * ![China flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/cn.svg) China\n\n[Hong Kong SAR (English)](/en-cn/insights/blog/data-strategy/building-an-\namazon-com-for-your-data-products "Hong Kong SAR \\(English\\)") | [Mainland\n(Chinese)](/zh-cn "Mainland \\(Chinese\\)")\n\n * ![Ecuador flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/ec.svg) Ecuador\n\n[English](/en-ec/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English") | [ Español](/es-ec " Español")\n\n * ![Germany flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/de.svg) Germany\n\n[English](/en-de/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English") | [Deutsch](/de-de "Deutsch")\n\n * ![India flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/in.svg) India\n\n[English](/en-in/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![Singapore flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/sg.svg) Singapore\n\n[English](/en-sg/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![Spain flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/es.svg) Spain\n\n[English](/en-es/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English") | [ Español](/es-es " Español")\n\n * ![Thailand flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/th.svg) Thailand\n\n[English](/en-th/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![United Kingdom flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/gb.svg) United Kingdom\n\n[English](/en-gb/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![United States flag](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/us.svg) United States\n\n[English](/en-us/insights/blog/data-strategy/building-an-amazon-com-for-your-\ndata-products "English")\n\n * ![Worldwide icon](/etc.clientlibs/thoughtworks/clientlibs/clientlib-site/resources/images/global.svg) Worldwide\n\n[English](/insights/blog/data-strategy/building-an-amazon-com-for-your-data-\nproducts "English")\n\n![](/content/dam/thoughtworks/images/photography/banner-\nimage/insights/in_banner_blogs.jpg)\n![]()\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n# Building An “Amazon.com” For Your Data Products\n\nSurfacing reliability SLIs and SLOs can boost adoption. Here’s how.\n\n[ Blogs Back ](/en-in/insights/blog) [ Blogs Back ](/en-in/insights/blog)\n\n![Social share button](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/share-fill.svg)\n\nClose\n\n[ Data strategy ](https://www.thoughtworks.com/insights/topic/data-strategy) [\nData mesh ](https://www.thoughtworks.com/insights/topic/data-mesh) [ Blog\n](/en-in/insights/blog)\n\nBy\n\n[Barr Moses](/en-in/profiles/b/barr-moses) ,\n\n[Manisha Jain](/en-in/profiles/m/manisha-jain) and\n\n[Pablo Porto](/en-in/profiles/p/pablo-porto)\n\nPublished: June 20, 2023\n\n![Customer 360 Data\nProduct](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n![Customer 360 Data\nProduct](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nHave you ever come across an internal [data\nproduct](https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/modern-\ndata-engineering-playbook/data-as-a-product) and side-eyed it like it’s your\nkid’s prom date? While it _seems_ like it fits the requirements, you don’t\nquite trust it — who knows where the data in this shifty table has been. Will\nit be reliable and safe even after you turn your focus elsewhere? Will the\nschema stay true?\n\nThis project is your baby; you just can’t risk it. So, just to be safe you\ntake the extra time to recreate the dataset.\n\n## Data products and trustworthiness\n\nAccording to Zhamak Dehgahi, data products should be discoverable,\naddressable, trustworthy, self-describing, interoperable and secure. In our\nexperience, most data products only support one or two use cases. That’s a\nlost opportunity experienced by too many data teams, especially those with\ndecentralized organizational structures or implementing [data\nmesh](https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-\nmesh-it-up/).\n\n![Data product characteristics as originally defined by Zhamak Dehghani.\n](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nData product characteristics as originally defined by Zhamak Dehghani.\n\n![Data product characteristics as originally defined by Zhamak Dehghani.\n](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nData product characteristics as originally defined by Zhamak Dehghani.\n\nIn the focus on building data trust with business stakeholders, it’s easy to\nlose sight of the importance of also building trust with data teams across\ndifferent domains. However, a data product must be trustworthy if it’s to\nencourage the reuse of data products. This is what ultimately **separates data\nmesh from data silo.**\n\nThe data product is trustworthy if data consumers are confident in the\naccuracy and reliability of the data. Data products should be transparent with\nregards to information quality metrics and performance promises.\n\nCreating a central marketplace or catalog of internal data products is a great\nfirst step to raising awareness, but more is needed to convince skeptical data\nconsumers to actually start using them.\n\nFor this, we can take a page out of Amazon.com’s playbook. Amazon provides an\nincredible amount of detail to help consumers purchase products from unknown\nthird-parties. Take the example of something as simple as a wrench:\n\n[\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png)\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg) I’d buy this wrench.\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png)\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg) I’d buy this wrench.\n](https://www.amazon.com/Amazon-Brand-Denali-8-Inch-\nAdjustable/dp/B091BLK385/ref=sr_1_1_ffob_sspa?crid=39GIJHE50YBB1&keywords=wrench&qid=1681395714&sprefix=wrench%2Caps%2C70&sr=8-1-spons&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE1RDdMRDJXTFMxWEkmZW5jcnlwdGVkSWQ9QTAzMzY4NDQzT0NYSFNPR1A3OFZOJmVuY3J5cHRlZEFkSWQ9QTAxODYxODQxMVZDUzkyNlM4TFFRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ&th=1)\n\nIt’s not just a _wrench_ — it’s an adjustable Denali, 7.7 inch, 4.4 ounce,\nrust resistant steel, adjustable wrench for repairs, maintenance and general\nuse, covered by a limited lifetime warranty. Oh, and here are similar products\nand reviews from users like yourself.\n\nData teams and data product owners need to be as capable of marketing data\nproducts as they are at building them. Otherwise, you’re not going to see the\nadoption levels that justify the value of your data initiative.\n\nThe central “store” for your data products needs to include not just\ninformation about the data, but information about the context of how it can be\nused. In other words, it needs to provide metrics such as uptime or data\nfreshness; these are commonly referred to as service level objectives (SLO)\n\n \n \n\nThoughtworks has helped create one of the more [advanced deployments of Monte\nCarlo — ](https://www.thoughtworks.com/en-th/insights/blog/data-strategy/dev-\nexperience-data-mesh-platform)a data observability platform that monitors the\nhealth and quality of data —[ within a data mesh\nimplementation](https://www.thoughtworks.com/en-th/insights/blog/data-\nstrategy/dev-experience-data-mesh-platform).\n\nIn this post, we will explore the process of implementation and go further by\nexploring what else is possible.\n\n## \n \nWhere to start: Identifying reusable data products\n\nThe two best ways to fail at creating valuable, reusable data products are to\ndevelop them without any sense of who they are for and to make them more\ncomplicated than they need to be.\n\nOne of the best ways to succeed is by involving business and product\nleadership and identifying the most valuable and shared use cases.\nThoughtworks, for example, often identifies potential data products by working\nbackwards from the use case using the [Jobs to be done\n(JTBD)](https://jtbd.info/2-what-is-jobs-to-be-done-jtbd-796b82081cca)\nframework created by Clayton Christensen.\n\n![Example JTBD framework for a Customer 360 data product. Image courtesy of\nthe\nauthors.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_4.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nExample JTBD framework for a Customer 360 data product. Image courtesy of the\nauthors.\n\n![Example JTBD framework for a Customer 360 data product. Image courtesy of\nthe\nauthors.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_4.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nExample JTBD framework for a Customer 360 data product. Image courtesy of the\nauthors.\n\nAnother strategy is to evaluate the [data\nlineage](https://www.montecarlodata.com/blog-data-lineage/) within your\ncurrent environment. It’s likely that your tables will follow some sort of\nPareto distribution where 20% will have 80% of the queries run against them\n(or power 80% of the most-visited dashboards).\n\nFor example, if the table customer_accounts is constantly being queried by\nmarketing, finance, support and other domains, that can be taken as a signal\nthat building a data product that consolidates the necessary information into\na full 360 view may have shared utility.\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_5.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_5.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n## Second step: Creating data product SLOs\n\n##\n\nA key part of data product thinking is keeping the consumers at the center and\nconsidering what provides the most value for them. The only way to ensure we\nare delivering high-quality data products is to identify those consumers,\nunderstand their requirements and codify their expectations within a [SLO/SLI\nframework](https://www.thoughtworks.com/en-us/insights/articles/data-mesh-in-\npractice-product-thinking-and-development).\n\nYou can think of SLOs as measures that remove uncertainty surrounding the data\nand serve as a primary way to define trustworthiness for its consumers.\n\nAs explained in [Zhamak’s Data Mesh\nbook](https://www.oreilly.com/library/view/data-mesh/9781492092384/), in\ncontrast to previous approaches to data management, data mesh introduces a\nfundamental shift in that the owners of data products must communicate and\nguarantee an acceptable level of quality and trust‐worthiness as it is an\nimportant characteristic of the data product. This means cleansing and running\nautomated data integrity tests or data quality monitors at the point the data\nproducts are created.\n\nIf SLOs are breached, the data product team must be notified so they can take\nremediation measures. Like a typical business contract, data product SLOs will\nlikely evolve over time based on changing circumstances.\n\nThoughtworks uses a discovery exercise during its [data mesh\nacceleration](https://martinfowler.com/articles/data-mesh-accelerate-\nworkshop.html#DiscoveringDataProducts) workshop on product usage patterns.\nThis helps teams collectively brainstorm and understand usage, expectations,\ntrade-offs and business impact. The outcomes of the exercise are then used to\ndetermine the various SLOs that need to be set for individual products.\n\n![Product usage pattern exercise template. Courtesy of\nThoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nProduct usage pattern exercise template. Courtesy of Thoughtworks.\n\n![Product usage pattern exercise template. Courtesy of\nThoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nProduct usage pattern exercise template. Courtesy of Thoughtworks.\n\n## Third step: Implementing the SLOs\n\n##\n\nDuring the implementation phase of the data product, the data product team\nwill start by defining the metrics (SLIs) used to measure the SLO.\n\nOne common SLI for data products is freshness. In the example from the\nprevious section, the exercise may reveal the marketing team relies heavily on\na particular dashboard that supports the monitoring of daily campaign and\npurchasing behaviors, which means the data needs to be updated every day.\n\nThe customer service team, on the other hand, may require hourly updates to\nbetter engage with customers in real time. In this scenario, it is almost\ncertainly more efficient to build the data product to be updated hourly to\nserve both consumer groups rather than build two different data products. The\nmarketing team isn’t going to complain about having data that is more\nfrequently updated than they requested after all!\n\nSLIs are typically expressed as a percentage over a period of time. In the\nexample presented earlier, 99% freshness over an hourly interval is the SLI in\nplace for the Customer 360 data product.\n\nIn our example, the team has decided to track data freshness checks based on\nthe processing timestamp attribute present in the dataset that is served by\nthe data product: processing_timestamp. To do this, they start by defining a\n[monitor as code](https://docs.getmontecarlo.com/docs/monitors-as-code) that\nwill become part of the data product which will support the implementation of\nthe freshness SLO:\n\n \n \n namespace: customer-domain\n montecarlo:\n freshness:\n - description: Customer 360 Data Product Freshness Monitor\n name:\xa0 Freshness - Customer 360 Data Product\n table: analytics:prod.customer_360_dp.customers\n freshness_threshold: 240\n schedule:\n type: fixed\n interval_minutes: 240\n start_time: "2022-09-15T01:00:00"\n \n\nThe data team can then automate the deployment of this monitor via the CI/CD\npipeline using the Monte Carlo CLI:\n\n \n \n montecarlo monitors apply --namespace customer-domain\n \n\nThis ensures the monitor to support the SLO is implemented and deployed every\ntime there is a change via the CI/CD pipeline. The monitor as code\nfunctionality improves the experience of the data product developer in\nmaintaining and deploying these monitors at scale using version control \n\nThe stakeholder exercise may also reveal that the Customer 360 data product\nshould not contain deleted rows in the final table as customers will be marked\nas active or inactive rather than removed entirely. To ensure this, a custom\nvolume SLI can be set to monitor and ensure the data product follows this\nbehavior.\n\nFinally, data product users need to be alerted whenever any changes are made\nto the schema of any tables within or upstream of the data product. This is\nbecause such changes could break processes downstream; there could be new\nfields that can enable new use cases. This can be covered by an automated\nschema monitor which sends alerts via the appropriate communication channel.\n\n## Going beyond basic SLOs\n\n##\n\nSo far we have covered three basic dimensions that can be used as SLOs. There\nare several other dimensions improving data product trust such as accuracy and\navailability. These and others are described in the [Implementing Service\nLevel Objectives book](https://www.oreilly.com/library/view/implementing-\nservice-level/9781492076803/).\n\nMore advanced SLOs can better validate data product quality and encourage\nwider use throughout the organization.\n\nFor example, let\'s imagine the data in our Customer 360 data product is not\ncomplete. Perhaps our stakeholder exercise revealed the channel and region\nwhere the customer buys the product is important for the marketing team’s\ndigital advertising decisions while the customer service team cares deeply\nthat every customer has a profile in the system.\n\nWe could use field health monitors on relevant fields within the data product\nsuch as region and purchase_channel to surface the number of anomalies over a\ncertain time period on the attributes the marketing team needs to segment\nusers. If any of these fields experience anomalous NULL rates or values\noutside the typical distribution, remediations can be launched to ensure\ncompliance with stated SLOs. Similarly, we could place field health monitors\non the account_id field to ensure it is never NULL so that the data product\nperforms to the customer service team’s standards.\n\nDeploying field health monitors has the added benefit of profiling the data,\nwhich can provide additional context that helps encourage adoption for those\nnot as familiar with the data or the data product.\n\nWhat the field profile feature looks like:\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_7.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_7.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nLet’s look at another possible SLO related to data quality. Consider a\npurchase order data product tracking the purchases/transactions made by the\ncustomer. This data product is used as a source for Customer 360 data product\nto understand the purchase patterns of the customer based on a\npurchase_timestamp.\n\n \nUsing a [dimension distribution\nmonitor](https://docs.getmontecarlo.com/docs/understanding-dimension-tracking-\nmonitors), we can identify a potential anomaly when an active customer does\nnot have any purchases made in the recent timeline, highlighting the lack of\ndata trust/quality on the upstream purchase order data product.\n\n### Other indicators to build data trust\n\n###\n\nAt this point, we have reassured any potential data product users that there\nare no freshness, volume, schema, or data quality issues that will prevent\nthem from benefiting from its use. But what other information can we surface\nthat will speak to the data product’s trustworthiness?\n\nOne idea is to go beyond describing the data itself to surfacing information\non its level of support and consumption. To harken back to our Denali wrench\nexample, the Amazon.com page doesn’t just describe the product itself, it also\nincludes information on the lifetime warranty. Netflix doesn’t just tell its\nviewers the plot of the movie, it also has a list of the top ten most popular.\n\nThe data product equivalents of this are:\n\n * **Total tests or custom monitors** : If there are more dbt tests across a pipeline or it has more Monte Carlo custom monitors set, this indicates more granular support and reliability in depth.\n\n * **Coverage percentage:** Data products typically involve a series of complex, interdependent operations upstream. Data moves and is transformed from table to table. Understanding that a data product has basic data monitoring coverage across its [data lineage](https://www.montecarlodata.com/blog-data-lineage/) helps further build trust.\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_8.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_8.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n * **Average time to response/fixed:** The overview above, currently at the domain level, highlights important metrics to consider when monitoring the health of the data products. Similar to the stability metrics in the DORA 4 key metrics framework, the metrics shown in this overview, “Time to response” and “Time to Fixed,” indicate how long it takes a data product team to spot and recover from any type of incident that could lead to breaching the SLOs. Faster response and fix times indicate data product stability and highlights the maturity of the supporting data product teams thus increasing the trustworthiness over time. \n\n * **Key asset score:** An all too common story is when a member of the data team leverages a seemingly ideal table or data product as part of their task, only later to find out it’s been deprecated or an older version. Monte Carlo\'s [Key Asset Score, calculated by the ](https://docs.getmontecarlo.com/docs/key-assets-importance-score)reads and writes and the downstream consumption on each dataset part of the data product, can give data product users (and re-users) confidence the asset is safe to use. It can also be helpful for data product owners to measure their success, in a data mesh context, based on the satisfaction and growth of their data product consumers.\n\n## Fourth step: Monitoring and visualizing data product SLO health\n\n##\n\nThe data product teams select what SLOs their data products guarantee, and\nultimately they are responsible for the satisfaction of their data products’\nconsumers. To succeed on this, they need the right tools to monitor and track\nthe SLOs over time.\n\nMonte Carlo\'s notification mechanism enables this by notifying the data\nproduct teams on any SLO breach incident. To improve the developer experience,\nthese notifications can also be defined as\n[code](https://docs.getmontecarlo.com/docs/notifications-as-code) in the\nlatest version of Monte Carlo and be included as part of the CI/CD pipeline.\n\nMonte Carlo also provides functionality to extract some or all of this\nmonitoring metadata via APIs to publish them in catalogs like\n[Collibra](https://www.collibra.com/us/en), [dataworld](https://data.world),\nor [Atlan](https://atlan.com). This is critical for making data products\ndiscoverable. It’s also where all of the work your team has done to create and\nautomatically monitor SLOs and SLIs comes together and is put on display.\n\nData product owners and data platform teams can leverage these APIs to\nvisualize the health of the data products in the marketplace via custom\nintegrations similar to the solution shared in a [past\nwebinar](https://vimeo.com/765878759).\n\n![Service Example - \\(4\\) Check and show service levels Delivering data\nproduct health information as part of the user\nexperience](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_9.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\n![Service Example - \\(4\\) Check and show service levels Delivering data\nproduct health information as part of the user\nexperience](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_9.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nFinally, if you are using dbt for modeling and transforming for your data,\n[Monte Carlo offers a dbt integration\n](https://docs.getmontecarlo.com/docs/dbt-integration)that automates the\nincident creation on every dbt test failure. This provides a holistic view of\nincidents created due to data quality tests failing for a data served by the\ndata product, provides our data quality health of the data product and also\neases debugging. By enabling this integration, the team can leverage Monte\nCarlo’s notification channel to also receive alerts on data quality issues.\n\nTo implement this, the data product team can run the dbt data quality test as\npart of their data pipeline and upload the results to Monte Carlo with a\nsimple CLI command.\n\n \n \n > dbt test\n > montecarlo import dbt-run \\\n --manifest ./target/manifest.json\xa0 \\\n --run-results ./target/run_results.json \\\n --project-name customer-360-data-product\n \n\n## Putting it all together\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_10.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nSource: Montecarlo\n\n![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_10.png)\n\n![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/pause-icon.svg)\n![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/play-icon.svg)\n\nSource: Montecarlo\n\nThe data mesh principles, especially the data as a product concept can create\ntremendous business value.\n\nDefining SLOs will help build reliable data products that fit business user\nneeds and surfacing their value will create the level of data trust required\nfor data driven organizations to thrive.\n\nUltimately, the more product context you provide for data consumers and team\nmembers across the organization, the more efficiencies and value you will be\nable to derive from a “build once use many times” approach. Good luck!\n\n## _Appendix:_\n\nData Mesh Accelerated workshop formulated by Paulo Caroli as explained in this\narticle <https://martinfowler.com/articles/data-mesh-accelerate-workshop.html>\n\nhelps teams and organizations accelerate their Data Mesh transformation, by\nunderstanding their current state and exploring what the next steps will look\nlike.\n\nDisclaimer: The statements and opinions expressed in this article are those of\nthe author(s) and do not necessarily reflect the positions of Thoughtworks.\n\n## Related blogs\n\n[\n![](/content/dam/thoughtworks/images/photography/abstract/insights/blog/abs_blogs_006.jpg)\n![]()\nData mesh Data Mesh in practice: Getting off to the right start Learn more\n](/en-in/insights/articles/data-mesh-in-practice-getting-off-to-the-right-\nstart)\n\n[\n![](/content/dam/thoughtworks/images/photography/abstract/insights/blog/abs_blogs_057.jpg)\n![]()\nData strategy Data Mesh in practice: Organizational operating model Learn more\n](/en-in/insights/articles/data-mesh-in-practice-organizational-operating-\nmodel)\n\n[\n![](/content/dam/thoughtworks/images/photography/abstract/insights/blog/abs_blogs_001.jpg)\n![]()\nData engineering Data Mesh at Glovo Learn more ](/en-in/insights/blog/data-\nengineering/data-mesh-at-glovo)\n\n## How can you achieve faster growth?\n\n[ Connect with us ](/en-in/contact-us)\n\nCompany\n\n * [About us](/en-in/about-us)\n * [What we do](/en-in/what-we-do)\n * [Partnerships](/en-in/about-us/partnerships)\n * [Who we work with](/en-in/clients)\n * [News](/en-in/about-us/news)\n * [Diversity, Equity and Inclusion](/en-in/about-us/diversity-and-inclusion)\n * [Careers](/en-in/careers)\n * [Investors](https://investors.thoughtworks.com/)\n * [Contact us](/en-in/contact-us)\n\nInsights\n\n * [Articles](/en-in/insights/articles)\n * [Blogs](/en-in/insights/blog)\n * [Books](/en-in/insights/books)\n * [Podcasts](/en-in/insights/podcasts)\n\nSite info\n\n * [Privacy policy](/en-in/about-us/privacy-policy)\n * [Accessibility statement](/en-in/about-us/accessibility)\n * [Modern slavery statement](/content/dam/thoughtworks/documents/guide/tw_guide_modern_slavery_statement.pdf)\n * [Corporate Social Responsibility Policy](/content/dam/thoughtworks/documents/guide/tw_guide_csrpolicy_india.pdf)\n * [Policy of Equal Opportunity, Non-Discrimination and Anti-Harassment at the Workplace](/content/dam/thoughtworks/documents/guide/tw_guide_policy_of%20_equal_opportunity_non_discrimination_anti_harassment_india.pdf)\n * [Code of conduct](/content/dam/thoughtworks/documents/guide/tw_guide_code_of_conduct_en.pdf)\n * [Integrity helpline](https://integrity.thoughtworks.com)\n\nConnect with us\n\n[ ](https://www.linkedin.com/company/thoughtworks "Link to Thoughtworks\nLinkedin page") [ ](https://www.facebook.com/Thoughtworks "Link to\nThoughtworks Facebook page") [ ](https://www.twitter.com/thoughtworks "Link to\nThoughtworks Twitter account") [ ](javascript: "Link to Thoughtworks China\nWeChat subscription account QR code")\n\n[×](javascript:void\\(0\\);) WeChat\n\n![QR code to Thoughtworks China WeChat subscription\naccount](/etc.clientlibs/thoughtworks/clientlibs/clientlib-\nsite/resources/images/wechat_qr_code.jpg)\n\n[ ](https://www.youtube.com/user/thoughtworks "Link to Thoughtworks Youtube\npage") [ ](https://www.instagram.com/thoughtworks/ "Link to Thoughtworks\nInstagram page")\n\n© 2024 Thoughtworks, Inc.\n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
In [5]:
Copied!
documents[0]
documents[0]
Out[5]:
Document(id_='83e760c3-aa01-433a-a603-019d12eee223', embedding=None, metadata={'page_label': '1', 'file_name': 'llama2.pdf', 'file_path': '/Users/samvardhan/Desktop/DataEngineer/opensearch_rag/data_pdf/llama2.pdf', 'file_type': 'application/pdf', 'file_size': 13661300, 'creation_date': '2024-04-21', 'last_modified_date': '2024-04-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Llama 2 : Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗Louis Martin†Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom∗\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and fine-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-\nsource models. We provide a detailed description of our approach to fine-tuning and safety\nimprovements of Llama 2-Chat in order to enable the community to build on our work and\ncontribute to the responsible development of LLMs.\n∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n†Second author\nContributions for all the authors can be found in Section A.1.arXiv:2307.09288v2 [cs.CL] 19 Jul 2023', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
In [7]:
Copied!
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
# creates a persistant index to disk
client = QdrantClient(url="http://localhost:6333")
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
# creates a persistant index to disk
client = QdrantClient(url="http://localhost:6333")
In [8]:
Copied!
from llama_index.core.node_parser import SentenceSplitter
text_parser = SentenceSplitter(
chunk_size=1024,
)
text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
cur_text_chunks = text_parser.split_text(doc.text)
text_chunks.extend(cur_text_chunks)
doc_idxs.extend([doc_idx] * len(cur_text_chunks))
from llama_index.core.node_parser import SentenceSplitter
text_parser = SentenceSplitter(
chunk_size=1024,
)
text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
cur_text_chunks = text_parser.split_text(doc.text)
text_chunks.extend(cur_text_chunks)
doc_idxs.extend([doc_idx] * len(cur_text_chunks))
In [9]:
Copied!
from llama_index.core.schema import TextNode, IndexNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
node = TextNode(
text=text_chunk,
)
src_doc = documents[doc_idxs[idx]]
node.metadata = src_doc.metadata
nodes.append(node)
from llama_index.core.schema import TextNode, IndexNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
node = TextNode(
text=text_chunk,
)
src_doc = documents[doc_idxs[idx]]
node.metadata = src_doc.metadata
nodes.append(node)
In [10]:
Copied!
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="avsolatorio/GIST-Embedding-v0")
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="avsolatorio/GIST-Embedding-v0")
In [11]:
Copied!
for node in nodes:
node_embedding = embed_model.get_text_embedding(
node.get_content(metadata_mode="all")
)
node.embedding = node_embedding
for node in nodes:
node_embedding = embed_model.get_text_embedding(
node.get_content(metadata_mode="all")
)
node.embedding = node_embedding
In [12]:
Copied!
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama2", request_timeout=30.0)
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama2", request_timeout=30.0)
In [13]:
Copied!
from llama_index.core import Settings
from llama_index.core import ServiceContext, set_global_service_context
service_context = ServiceContext.from_defaults(
llm=llm, embed_model=embed_model
)
from llama_index.core import Settings
from llama_index.core import ServiceContext, set_global_service_context
service_context = ServiceContext.from_defaults(
llm=llm, embed_model=embed_model
)
/var/folders/d8/2pt8r3f50tq3863jc_l0zz0r0000gn/T/ipykernel_34872/183125863.py:4: DeprecationWarning: Call to deprecated class method from_defaults. (ServiceContext is deprecated, please use `llama_index.settings.Settings` instead.) -- Deprecated since version 0.10.0. service_context = ServiceContext.from_defaults(
In [14]:
Copied!
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
from qdrant_client import models
client = qdrant_client.QdrantClient(location=":memory:")
client.recreate_collection(
collection_name="my_collection",
vectors_config={
"text-dense": models.VectorParams(
size=768,
distance=models.Distance.COSINE,
)
},
sparse_vectors_config={
"text-sparse": models.SparseVectorParams(
index=models.SparseIndexParams()
)
},
)
vector_store = QdrantVectorStore(
collection_name="my_collection", client=client, enable_hybrid=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
from qdrant_client import models
client = qdrant_client.QdrantClient(location=":memory:")
client.recreate_collection(
collection_name="my_collection",
vectors_config={
"text-dense": models.VectorParams(
size=768,
distance=models.Distance.COSINE,
)
},
sparse_vectors_config={
"text-sparse": models.SparseVectorParams(
index=models.SparseIndexParams()
)
},
)
vector_store = QdrantVectorStore(
collection_name="my_collection", client=client, enable_hybrid=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
In [15]:
Copied!
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context, service_context=service_context
)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context, service_context=service_context
)
In [16]:
Copied!
vector_store.add(nodes)
vector_store.add(nodes)
Out[16]:
['b0052b8f-5349-4113-9341-dea28a8d639d', '9afd200b-904d-4e3b-bc61-5fb1a2a13604', '85bb9d30-39a5-4253-aad9-261788f57e6b', 'a078d100-e6ca-469f-83ab-eeaf3770dcb7', 'c978814f-6496-4012-a090-d0bcd125c670', 'c4d39fe1-753a-419d-a32d-8b3ae7bd9b70', '21a2f676-25c2-4e24-8cfe-0a2aaf4399dc', 'c2168d3c-ac69-492e-ac2f-790dc048a2d8', 'd8567608-3d53-4887-a026-362b3db112be', '60a8462a-3b24-4eeb-a03a-681ea00359c5', '89834f7f-d02f-4aa3-b178-f6038b33c556']
In [17]:
Copied!
query_str = "how is the author of the article Building An “Amazon.com” For Your Data Products"
query_embedding = embed_model.get_query_embedding(query_str)
query_str = "how is the author of the article Building An “Amazon.com” For Your Data Products"
query_embedding = embed_model.get_query_embedding(query_str)
In [18]:
Copied!
from llama_index.core.vector_stores import (
VectorStoreQuery,
VectorStoreQueryResult,
)
query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"
vector_store_query = VectorStoreQuery(
query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())
from llama_index.core.vector_stores import (
VectorStoreQuery,
VectorStoreQueryResult,
)
query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"
vector_store_query = VectorStoreQuery(
query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())
Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench: [ ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench. ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench. ](https://www.amazon.com/Amazon-Brand-Denali-8-Inch- Adjustable/dp/B091BLK385/ref=sr_1_1_ffob_sspa?crid=39GIJHE50YBB1&keywords=wrench&qid=1681395714&sprefix=wrench%2Caps%2C70&sr=8-1-spons&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE1RDdMRDJXTFMxWEkmZW5jcnlwdGVkSWQ9QTAzMzY4NDQzT0NYSFNPR1A3OFZOJmVuY3J5cHRlZEFkSWQ9QTAxODYxODQxMVZDUzkyNlM4TFFRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ&th=1) It’s not just a _wrench_ — it’s an adjustable Denali, 7.7 inch, 4.4 ounce, rust resistant steel, adjustable wrench for repairs, maintenance and general use, covered by a limited lifetime warranty. Oh, and here are similar products and reviews from users like yourself. Data teams and data product owners need to be as capable of marketing data products as they are at building them. Otherwise, you’re not going to see the adoption levels that justify the value of your data initiative. The central “store” for your data products needs to include not just information about the data, but information about the context of how it can be used. In other words, it needs to provide metrics such as uptime or data freshness; these are commonly referred to as service level objectives (SLO) Thoughtworks has helped create one of the more [advanced deployments of Monte Carlo — ](https://www.thoughtworks.com/en-th/insights/blog/data-strategy/dev- experience-data-mesh-platform)a data observability platform that monitors the health and quality of data —[ within a data mesh implementation](https://www.thoughtworks.com/en-th/insights/blog/data- strategy/dev-experience-data-mesh-platform). In this post, we will explore the process of implementation and go further by exploring what else is possible. ## Where to start: Identifying reusable data products The two best ways to fail at creating valuable, reusable data products are to develop them without any sense of who they are for and to make them more complicated than they need to be. One of the best ways to succeed is by involving business and product leadership and identifying the most valuable and shared use cases. Thoughtworks, for example, often identifies potential data products by working backwards from the use case using the [Jobs to be done (JTBD)](https://jtbd.info/2-what-is-jobs-to-be-done-jtbd-796b82081cca) framework created by Clayton Christensen. ![Example JTBD framework for a Customer 360 data product. Image courtesy of the authors.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_4.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Example JTBD framework for a Customer 360 data product. Image courtesy of the authors. ![Example JTBD framework for a Customer 360 data product. Image courtesy of the authors.
In [19]:
Copied!
from llama_index.core.schema import NodeWithScore, TextNode
from llama_index.core.schema import NodeWithScore, TextNode
In [20]:
Copied!
from typing import Optional
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
score: Optional[float] = None
if query_result.similarities is not None:
score = query_result.similarities[index]
nodes_with_scores.append(NodeWithScore(node=node, score=score))
from typing import Optional
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
score: Optional[float] = None
if query_result.similarities is not None:
score = query_result.similarities[index]
nodes_with_scores.append(NodeWithScore(node=node, score=score))
In [21]:
Copied!
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List
class VectorDBRetriever(BaseRetriever):
"""Retriever over a qdrant vector store."""
def __init__(
self,
vector_store: 'QdrantVectorStore', # Assuming QdrantVectorStore is defined elsewhere
embed_model: Any,
query_mode: str = "default",
similarity_top_k: int = 2,
) -> None:
"""Initialize parameters."""
self._vector_store = vector_store
self._embed_model = embed_model
self._query_mode = query_mode
self._similarity_top_k = similarity_top_k
super().__init__()
def _retrieve(self, query_bundle: QueryBundle) -> List['NodeWithScore']:
"""Retrieve documents based on the query."""
query_embedding = self._embed_model.get_query_embedding(
query_bundle.query_str
)
vector_store_query = VectorStoreQuery(
query_embedding=query_embedding,
similarity_top_k=self._similarity_top_k,
mode=self._query_mode,
)
query_result = self._vector_store.query(vector_store_query)
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
score = query_result.similarities[index] if query_result.similarities is not None else None
nodes_with_scores.append(NodeWithScore(node=node, score=score))
return nodes_with_scores
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List
class VectorDBRetriever(BaseRetriever):
"""Retriever over a qdrant vector store."""
def __init__(
self,
vector_store: 'QdrantVectorStore', # Assuming QdrantVectorStore is defined elsewhere
embed_model: Any,
query_mode: str = "default",
similarity_top_k: int = 2,
) -> None:
"""Initialize parameters."""
self._vector_store = vector_store
self._embed_model = embed_model
self._query_mode = query_mode
self._similarity_top_k = similarity_top_k
super().__init__()
def _retrieve(self, query_bundle: QueryBundle) -> List['NodeWithScore']:
"""Retrieve documents based on the query."""
query_embedding = self._embed_model.get_query_embedding(
query_bundle.query_str
)
vector_store_query = VectorStoreQuery(
query_embedding=query_embedding,
similarity_top_k=self._similarity_top_k,
mode=self._query_mode,
)
query_result = self._vector_store.query(vector_store_query)
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
score = query_result.similarities[index] if query_result.similarities is not None else None
nodes_with_scores.append(NodeWithScore(node=node, score=score))
return nodes_with_scores
Compare Result¶
In [22]:
Copied!
hybrid_retriever = VectorDBRetriever(vector_store, embed_model, query_mode="hybrid", similarity_top_k=2)
sparse_retriever = VectorDBRetriever(vector_store, embed_model, query_mode="sparse", similarity_top_k=2)
hybrid_retriever = VectorDBRetriever(vector_store, embed_model, query_mode="hybrid", similarity_top_k=2)
sparse_retriever = VectorDBRetriever(vector_store, embed_model, query_mode="sparse", similarity_top_k=2)
In [23]:
Copied!
def execute_and_compare(query_str: str):
hybrid_response = hybrid_retriever.retrieve(QueryBundle(query_str=query_str))
sparse_response = sparse_retriever.retrieve(QueryBundle(query_str=query_str))
print("Hybrid Results:")
for result in hybrid_response:
print(f"Text: {result.node.get_content()}, Score: {result.score}")
print("\nSparse Results:")
for result in sparse_response:
print(f"Text: {result.node.get_content()}, Score: {result.score}")
def execute_and_compare(query_str: str):
hybrid_response = hybrid_retriever.retrieve(QueryBundle(query_str=query_str))
sparse_response = sparse_retriever.retrieve(QueryBundle(query_str=query_str))
print("Hybrid Results:")
for result in hybrid_response:
print(f"Text: {result.node.get_content()}, Score: {result.score}")
print("\nSparse Results:")
for result in sparse_response:
print(f"Text: {result.node.get_content()}, Score: {result.score}")
In [29]:
Copied!
query_str = "what is Data products?"
execute_and_compare(query_str)
query_str = "what is Data products?"
execute_and_compare(query_str)
Hybrid Results: Text: [Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) ![Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Have you ever come across an internal [data product](https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/modern- data-engineering-playbook/data-as-a-product) and side-eyed it like it’s your kid’s prom date? While it _seems_ like it fits the requirements, you don’t quite trust it — who knows where the data in this shifty table has been. Will it be reliable and safe even after you turn your focus elsewhere? Will the schema stay true? This project is your baby; you just can’t risk it. So, just to be safe you take the extra time to recreate the dataset. ## Data products and trustworthiness According to Zhamak Dehgahi, data products should be discoverable, addressable, trustworthy, self-describing, interoperable and secure. In our experience, most data products only support one or two use cases. That’s a lost opportunity experienced by too many data teams, especially those with decentralized organizational structures or implementing [data mesh](https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to- mesh-it-up/). ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. In the focus on building data trust with business stakeholders, it’s easy to lose sight of the importance of also building trust with data teams across different domains. However, a data product must be trustworthy if it’s to encourage the reuse of data products. This is what ultimately **separates data mesh from data silo.** The data product is trustworthy if data consumers are confident in the accuracy and reliability of the data. Data products should be transparent with regards to information quality metrics and performance promises. Creating a central marketplace or catalog of internal data products is a great first step to raising awareness, but more is needed to convince skeptical data consumers to actually start using them. For this, we can take a page out of Amazon.com’s playbook. Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench: [ ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench. ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench., Score: 0.8020849742832907 Sparse Results: Text: [Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) ![Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Have you ever come across an internal [data product](https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/modern- data-engineering-playbook/data-as-a-product) and side-eyed it like it’s your kid’s prom date? While it _seems_ like it fits the requirements, you don’t quite trust it — who knows where the data in this shifty table has been. Will it be reliable and safe even after you turn your focus elsewhere? Will the schema stay true? This project is your baby; you just can’t risk it. So, just to be safe you take the extra time to recreate the dataset. ## Data products and trustworthiness According to Zhamak Dehgahi, data products should be discoverable, addressable, trustworthy, self-describing, interoperable and secure. In our experience, most data products only support one or two use cases. That’s a lost opportunity experienced by too many data teams, especially those with decentralized organizational structures or implementing [data mesh](https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to- mesh-it-up/). ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. In the focus on building data trust with business stakeholders, it’s easy to lose sight of the importance of also building trust with data teams across different domains. However, a data product must be trustworthy if it’s to encourage the reuse of data products. This is what ultimately **separates data mesh from data silo.** The data product is trustworthy if data consumers are confident in the accuracy and reliability of the data. Data products should be transparent with regards to information quality metrics and performance promises. Creating a central marketplace or catalog of internal data products is a great first step to raising awareness, but more is needed to convince skeptical data consumers to actually start using them. For this, we can take a page out of Amazon.com’s playbook. Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench: [ ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench. ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench., Score: 0.8020849742832907
In [34]:
Copied!
query_str = "How to Create data product SLOs?"
execute_and_compare(query_str)
query_str = "How to Create data product SLOs?"
execute_and_compare(query_str)
Hybrid Results: Text: This helps teams collectively brainstorm and understand usage, expectations, trade-offs and business impact. The outcomes of the exercise are then used to determine the various SLOs that need to be set for individual products. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ## Third step: Implementing the SLOs ## During the implementation phase of the data product, the data product team will start by defining the metrics (SLIs) used to measure the SLO. One common SLI for data products is freshness. In the example from the previous section, the exercise may reveal the marketing team relies heavily on a particular dashboard that supports the monitoring of daily campaign and purchasing behaviors, which means the data needs to be updated every day. The customer service team, on the other hand, may require hourly updates to better engage with customers in real time. In this scenario, it is almost certainly more efficient to build the data product to be updated hourly to serve both consumer groups rather than build two different data products. The marketing team isn’t going to complain about having data that is more frequently updated than they requested after all! SLIs are typically expressed as a percentage over a period of time. In the example presented earlier, 99% freshness over an hourly interval is the SLI in place for the Customer 360 data product. In our example, the team has decided to track data freshness checks based on the processing timestamp attribute present in the dataset that is served by the data product: processing_timestamp. To do this, they start by defining a [monitor as code](https://docs.getmontecarlo.com/docs/monitors-as-code) that will become part of the data product which will support the implementation of the freshness SLO: namespace: customer-domain montecarlo: freshness: - description: Customer 360 Data Product Freshness Monitor name: Freshness - Customer 360 Data Product table: analytics:prod.customer_360_dp.customers freshness_threshold: 240 schedule: type: fixed interval_minutes: 240 start_time: "2022-09-15T01:00:00" The data team can then automate the deployment of this monitor via the CI/CD pipeline using the Monte Carlo CLI: montecarlo monitors apply --namespace customer-domain This ensures the monitor to support the SLO is implemented and deployed every time there is a change via the CI/CD pipeline. The monitor as code functionality improves the experience of the data product developer in maintaining and deploying these monitors at scale using version control The stakeholder exercise may also reveal that the Customer 360 data product should not contain deleted rows in the final table as customers will be marked as active or inactive rather than removed entirely. To ensure this, a custom volume SLI can be set to monitor and ensure the data product follows this behavior. Finally, data product users need to be alerted whenever any changes are made to the schema of any tables within or upstream of the data product. This is because such changes could break processes downstream; there could be new fields that can enable new use cases. This can be covered by an automated schema monitor which sends alerts via the appropriate communication channel. ## Going beyond basic SLOs ## So far we have covered three basic dimensions that can be used as SLOs. There are several other dimensions improving data product trust such as accuracy and availability. These and others are described in the [Implementing Service Level Objectives book](https://www.oreilly.com/library/view/implementing- service-level/9781492076803/). More advanced SLOs can better validate data product quality and encourage wider use throughout the organization. For example, let's imagine the data in our Customer 360 data product is not complete., Score: 0.7936968558435114 Sparse Results: Text: This helps teams collectively brainstorm and understand usage, expectations, trade-offs and business impact. The outcomes of the exercise are then used to determine the various SLOs that need to be set for individual products. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ## Third step: Implementing the SLOs ## During the implementation phase of the data product, the data product team will start by defining the metrics (SLIs) used to measure the SLO. One common SLI for data products is freshness. In the example from the previous section, the exercise may reveal the marketing team relies heavily on a particular dashboard that supports the monitoring of daily campaign and purchasing behaviors, which means the data needs to be updated every day. The customer service team, on the other hand, may require hourly updates to better engage with customers in real time. In this scenario, it is almost certainly more efficient to build the data product to be updated hourly to serve both consumer groups rather than build two different data products. The marketing team isn’t going to complain about having data that is more frequently updated than they requested after all! SLIs are typically expressed as a percentage over a period of time. In the example presented earlier, 99% freshness over an hourly interval is the SLI in place for the Customer 360 data product. In our example, the team has decided to track data freshness checks based on the processing timestamp attribute present in the dataset that is served by the data product: processing_timestamp. To do this, they start by defining a [monitor as code](https://docs.getmontecarlo.com/docs/monitors-as-code) that will become part of the data product which will support the implementation of the freshness SLO: namespace: customer-domain montecarlo: freshness: - description: Customer 360 Data Product Freshness Monitor name: Freshness - Customer 360 Data Product table: analytics:prod.customer_360_dp.customers freshness_threshold: 240 schedule: type: fixed interval_minutes: 240 start_time: "2022-09-15T01:00:00" The data team can then automate the deployment of this monitor via the CI/CD pipeline using the Monte Carlo CLI: montecarlo monitors apply --namespace customer-domain This ensures the monitor to support the SLO is implemented and deployed every time there is a change via the CI/CD pipeline. The monitor as code functionality improves the experience of the data product developer in maintaining and deploying these monitors at scale using version control The stakeholder exercise may also reveal that the Customer 360 data product should not contain deleted rows in the final table as customers will be marked as active or inactive rather than removed entirely. To ensure this, a custom volume SLI can be set to monitor and ensure the data product follows this behavior. Finally, data product users need to be alerted whenever any changes are made to the schema of any tables within or upstream of the data product. This is because such changes could break processes downstream; there could be new fields that can enable new use cases. This can be covered by an automated schema monitor which sends alerts via the appropriate communication channel. ## Going beyond basic SLOs ## So far we have covered three basic dimensions that can be used as SLOs. There are several other dimensions improving data product trust such as accuracy and availability. These and others are described in the [Implementing Service Level Objectives book](https://www.oreilly.com/library/view/implementing- service-level/9781492076803/). More advanced SLOs can better validate data product quality and encourage wider use throughout the organization. For example, let's imagine the data in our Customer 360 data product is not complete., Score: 0.7936968558435114
Sparse_retriever¶
In [25]:
Copied!
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
sparse_retriever, service_context=service_context
)
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
sparse_retriever, service_context=service_context
)
In [28]:
Copied!
query_str = "what is Data products?"
response = query_engine.query(query_str)
print(str(response))
query_str = "what is Data products?"
response = query_engine.query(query_str)
print(str(response))
Based on the context information provided, a data product can be defined as a centralized marketplace or catalog of internal data assets that are discoverable, addressable, trustworthy, self-describing, interoperable, and secure. The data product is designed to raise awareness and convince skeptical data consumers to actually start using internal data products. By providing an incredible amount of detail, such as information quality metrics and performance promises, data products can help build trust with data consumers and encourage the reuse of data products. In other words, a data product is a curated collection of internal data assets that are designed to be easily discoverable, accessible, and reusable across different domains and use cases. It provides a clear and consistent understanding of the data assets, their characteristics, and how they can be used, which helps build trust with data consumers and encourages them to adopt the data products for their own use cases.
In [32]:
Copied!
print(response.source_nodes[0].get_content())
print(response.source_nodes[0].get_content())
[Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) ![Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Have you ever come across an internal [data product](https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/modern- data-engineering-playbook/data-as-a-product) and side-eyed it like it’s your kid’s prom date? While it _seems_ like it fits the requirements, you don’t quite trust it — who knows where the data in this shifty table has been. Will it be reliable and safe even after you turn your focus elsewhere? Will the schema stay true? This project is your baby; you just can’t risk it. So, just to be safe you take the extra time to recreate the dataset. ## Data products and trustworthiness According to Zhamak Dehgahi, data products should be discoverable, addressable, trustworthy, self-describing, interoperable and secure. In our experience, most data products only support one or two use cases. That’s a lost opportunity experienced by too many data teams, especially those with decentralized organizational structures or implementing [data mesh](https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to- mesh-it-up/). ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. In the focus on building data trust with business stakeholders, it’s easy to lose sight of the importance of also building trust with data teams across different domains. However, a data product must be trustworthy if it’s to encourage the reuse of data products. This is what ultimately **separates data mesh from data silo.** The data product is trustworthy if data consumers are confident in the accuracy and reliability of the data. Data products should be transparent with regards to information quality metrics and performance promises. Creating a central marketplace or catalog of internal data products is a great first step to raising awareness, but more is needed to convince skeptical data consumers to actually start using them. For this, we can take a page out of Amazon.com’s playbook. Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench: [ ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench. ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench.
In [35]:
Copied!
response_1 = query_engine.query("How to Create data product SLOs?")
print(str(response_1))
response_1 = query_engine.query("How to Create data product SLOs?")
print(str(response_1))
To create SLOs for a data product, follow these steps: 1. Identify the purpose of the data product: What is the main goal of the data product? What problems does it solve? Who are its target users? 2. Determine the metrics that will be used to measure success: Based on the purpose of the data product, identify the key performance indicators (KPIs) that will be used to evaluate its success. For example, freshness, accuracy, availability, completeness, etc. 3. Set specific and measurable targets for each metric: Define specific target values for each metric, such as 95% freshness rate or 99.9% accuracy rate. 4. Establish a monitoring and alert system: Implement a system to continuously monitor the metrics and alert stakeholders when targets are not met. 5. Review and adjust SLOs regularly: Regularly review the SLOs and adjust them as necessary based on changes in the business or technology landscape. 6. Communicate SLOs to stakeholders: Share the SLOs with stakeholders, including developers, product owners, and executives, to ensure everyone is aligned and working towards the same goals. 7. Use SLOs as a baseline for evaluating progress: Use the SLOs as a basis for evaluating the progress of the data product and identifying areas for improvement. 8. Consider advanced SLOs: In addition to basic SLOs, consider implementing more advanced SLOs such as accuracy, availability, completeness, and other metrics that can help validate data product quality and encourage wider use throughout the organization. By following these steps, you can create effective SLOs for your data product that will help ensure it meets the needs of its users and stakeholders, and continues to improve over time.
In [36]:
Copied!
print(response_1.source_nodes[0].get_content())
print(response_1.source_nodes[0].get_content())
This helps teams collectively brainstorm and understand usage, expectations, trade-offs and business impact. The outcomes of the exercise are then used to determine the various SLOs that need to be set for individual products. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ## Third step: Implementing the SLOs ## During the implementation phase of the data product, the data product team will start by defining the metrics (SLIs) used to measure the SLO. One common SLI for data products is freshness. In the example from the previous section, the exercise may reveal the marketing team relies heavily on a particular dashboard that supports the monitoring of daily campaign and purchasing behaviors, which means the data needs to be updated every day. The customer service team, on the other hand, may require hourly updates to better engage with customers in real time. In this scenario, it is almost certainly more efficient to build the data product to be updated hourly to serve both consumer groups rather than build two different data products. The marketing team isn’t going to complain about having data that is more frequently updated than they requested after all! SLIs are typically expressed as a percentage over a period of time. In the example presented earlier, 99% freshness over an hourly interval is the SLI in place for the Customer 360 data product. In our example, the team has decided to track data freshness checks based on the processing timestamp attribute present in the dataset that is served by the data product: processing_timestamp. To do this, they start by defining a [monitor as code](https://docs.getmontecarlo.com/docs/monitors-as-code) that will become part of the data product which will support the implementation of the freshness SLO: namespace: customer-domain montecarlo: freshness: - description: Customer 360 Data Product Freshness Monitor name: Freshness - Customer 360 Data Product table: analytics:prod.customer_360_dp.customers freshness_threshold: 240 schedule: type: fixed interval_minutes: 240 start_time: "2022-09-15T01:00:00" The data team can then automate the deployment of this monitor via the CI/CD pipeline using the Monte Carlo CLI: montecarlo monitors apply --namespace customer-domain This ensures the monitor to support the SLO is implemented and deployed every time there is a change via the CI/CD pipeline. The monitor as code functionality improves the experience of the data product developer in maintaining and deploying these monitors at scale using version control The stakeholder exercise may also reveal that the Customer 360 data product should not contain deleted rows in the final table as customers will be marked as active or inactive rather than removed entirely. To ensure this, a custom volume SLI can be set to monitor and ensure the data product follows this behavior. Finally, data product users need to be alerted whenever any changes are made to the schema of any tables within or upstream of the data product. This is because such changes could break processes downstream; there could be new fields that can enable new use cases. This can be covered by an automated schema monitor which sends alerts via the appropriate communication channel. ## Going beyond basic SLOs ## So far we have covered three basic dimensions that can be used as SLOs. There are several other dimensions improving data product trust such as accuracy and availability. These and others are described in the [Implementing Service Level Objectives book](https://www.oreilly.com/library/view/implementing- service-level/9781492076803/). More advanced SLOs can better validate data product quality and encourage wider use throughout the organization. For example, let's imagine the data in our Customer 360 data product is not complete.
Hybrid Retrievel¶
In [39]:
Copied!
from llama_index.core.query_engine import RetrieverQueryEngine
hybrid_query_engine = RetrieverQueryEngine.from_args(
hybrid_retriever, service_context=service_context
)
from llama_index.core.query_engine import RetrieverQueryEngine
hybrid_query_engine = RetrieverQueryEngine.from_args(
hybrid_retriever, service_context=service_context
)
In [40]:
Copied!
query_str = "what is Data products?"
response = hybrid_query_engine.query(query_str)
print(str(response))
query_str = "what is Data products?"
response = hybrid_query_engine.query(query_str)
print(str(response))
Based on the provided context, a data product can be defined as a centralized marketplace or catalog of internal data assets that are discoverable, addressable, trustworthy, self-describing, interoperable, and secure. It is important to create a data product that addresses the characteristics originally defined by Zhamak Dehghani, such as discoverability, addressability, trustworthiness, self-description, interoperability, and security. By doing so, data teams can build trust with business stakeholders and encourage the reuse of data products across different domains. Additionally, creating a central marketplace or catalog of internal data products can help raise awareness and convince skeptical data consumers to start using them.
In [41]:
Copied!
print(response.source_nodes[0].get_content())
print(response.source_nodes[0].get_content())
[Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) ![Customer 360 Data Product](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_1.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Have you ever come across an internal [data product](https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/modern- data-engineering-playbook/data-as-a-product) and side-eyed it like it’s your kid’s prom date? While it _seems_ like it fits the requirements, you don’t quite trust it — who knows where the data in this shifty table has been. Will it be reliable and safe even after you turn your focus elsewhere? Will the schema stay true? This project is your baby; you just can’t risk it. So, just to be safe you take the extra time to recreate the dataset. ## Data products and trustworthiness According to Zhamak Dehgahi, data products should be discoverable, addressable, trustworthy, self-describing, interoperable and secure. In our experience, most data products only support one or two use cases. That’s a lost opportunity experienced by too many data teams, especially those with decentralized organizational structures or implementing [data mesh](https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to- mesh-it-up/). ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. ![Data product characteristics as originally defined by Zhamak Dehghani. ](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_2.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Data product characteristics as originally defined by Zhamak Dehghani. In the focus on building data trust with business stakeholders, it’s easy to lose sight of the importance of also building trust with data teams across different domains. However, a data product must be trustworthy if it’s to encourage the reuse of data products. This is what ultimately **separates data mesh from data silo.** The data product is trustworthy if data consumers are confident in the accuracy and reliability of the data. Data products should be transparent with regards to information quality metrics and performance promises. Creating a central marketplace or catalog of internal data products is a great first step to raising awareness, but more is needed to convince skeptical data consumers to actually start using them. For this, we can take a page out of Amazon.com’s playbook. Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench: [ ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench. ![](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_3.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) I’d buy this wrench.
In [42]:
Copied!
response_1 = hybrid_query_engine.query("How to Create data product SLOs?")
print(str(response_1))
response_1 = hybrid_query_engine.query("How to Create data product SLOs?")
print(str(response_1))
Creating SLOs (Service Level Objectives) for a data product involves several steps: 1. Identify the usage patterns of the data product: Understand how the data product is being used and what are the key metrics that are important to measure. This can be done through surveys, interviews, or by analyzing usage patterns. 2. Define the SLOs: Based on the usage patterns identified in step 1, define the SLOs that are relevant to the data product. These could include things like freshness, accuracy, completeness, and availability. 3. Express the SLOs as percentages: Once the SLOs have been defined, express them as percentages over a period of time. For example, "The data product will be 95% fresh within an hourly interval." 4. Monitor and measure the SLOs: Use monitoring tools and techniques to track the SLOs and measure their compliance. This can be done through automated checks or manual audits. 5. Adjust the SLOs as needed: Based on the measurements taken, adjust the SLOs as needed to ensure they are achievable and meaningful. 6. Communicate the SLOs to stakeholders: Share the SLOs with stakeholders to ensure everyone is aware of what is expected from the data product. 7. Continuously improve: Use the measurements taken to continuously improve the data product and increase its trustworthiness. It's important to note that SLOs are not static and may need to be adjusted over time as the data product evolves and the needs of the stakeholders change.
In [43]:
Copied!
print(response_1.source_nodes[0].get_content())
print(response_1.source_nodes[0].get_content())
This helps teams collectively brainstorm and understand usage, expectations, trade-offs and business impact. The outcomes of the exercise are then used to determine the various SLOs that need to be set for individual products. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ![Product usage pattern exercise template. Courtesy of Thoughtworks.](/content/dam/thoughtworks/images/infographic/Tw_illustration_blog_montecarlo_6.png) ![Pause](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/pause-icon.svg) ![Play](/etc.clientlibs/thoughtworks/clientlibs/clientlib- site/resources/images/play-icon.svg) Product usage pattern exercise template. Courtesy of Thoughtworks. ## Third step: Implementing the SLOs ## During the implementation phase of the data product, the data product team will start by defining the metrics (SLIs) used to measure the SLO. One common SLI for data products is freshness. In the example from the previous section, the exercise may reveal the marketing team relies heavily on a particular dashboard that supports the monitoring of daily campaign and purchasing behaviors, which means the data needs to be updated every day. The customer service team, on the other hand, may require hourly updates to better engage with customers in real time. In this scenario, it is almost certainly more efficient to build the data product to be updated hourly to serve both consumer groups rather than build two different data products. The marketing team isn’t going to complain about having data that is more frequently updated than they requested after all! SLIs are typically expressed as a percentage over a period of time. In the example presented earlier, 99% freshness over an hourly interval is the SLI in place for the Customer 360 data product. In our example, the team has decided to track data freshness checks based on the processing timestamp attribute present in the dataset that is served by the data product: processing_timestamp. To do this, they start by defining a [monitor as code](https://docs.getmontecarlo.com/docs/monitors-as-code) that will become part of the data product which will support the implementation of the freshness SLO: namespace: customer-domain montecarlo: freshness: - description: Customer 360 Data Product Freshness Monitor name: Freshness - Customer 360 Data Product table: analytics:prod.customer_360_dp.customers freshness_threshold: 240 schedule: type: fixed interval_minutes: 240 start_time: "2022-09-15T01:00:00" The data team can then automate the deployment of this monitor via the CI/CD pipeline using the Monte Carlo CLI: montecarlo monitors apply --namespace customer-domain This ensures the monitor to support the SLO is implemented and deployed every time there is a change via the CI/CD pipeline. The monitor as code functionality improves the experience of the data product developer in maintaining and deploying these monitors at scale using version control The stakeholder exercise may also reveal that the Customer 360 data product should not contain deleted rows in the final table as customers will be marked as active or inactive rather than removed entirely. To ensure this, a custom volume SLI can be set to monitor and ensure the data product follows this behavior. Finally, data product users need to be alerted whenever any changes are made to the schema of any tables within or upstream of the data product. This is because such changes could break processes downstream; there could be new fields that can enable new use cases. This can be covered by an automated schema monitor which sends alerts via the appropriate communication channel. ## Going beyond basic SLOs ## So far we have covered three basic dimensions that can be used as SLOs. There are several other dimensions improving data product trust such as accuracy and availability. These and others are described in the [Implementing Service Level Objectives book](https://www.oreilly.com/library/view/implementing- service-level/9781492076803/). More advanced SLOs can better validate data product quality and encourage wider use throughout the organization. For example, let's imagine the data in our Customer 360 data product is not complete.