Saturday, 8 February 2025

GitHub Codespaces: A Fast-Track to Development with Minimal Setup

Do you like coding but hate the scaffolding and prep work?

As a developer, I often spend a considerable amount of time setting up development environments and project scaffolding before I even write a single line of code. Configuring dependencies, installing tools, and making sure everything runs smoothly across different machines can be tedious. If you find this prep work time-consuming and constraining, then...

Enter GitHub Codespaces 

GitHub Codespaces is a cloud-based development environment that lets you start coding instantly, right in your browser, without the hassle of setting up a local machine!

Whether you’re working on an open-source project, collaborating with a team, or quickly prototyping an idea, Codespaces provides a streamlined workflow with minimal scaffolding.

Why GitHub Codespaces?

  1. Instant Development Environments
    With a few clicks, you get a fully configured development environment in the cloud. No need to install dependencies manually—just launch a Codespace, and it’s ready to go.

  2. Pre-configured for Your Project
    Codespaces can use Dev Containers (.devcontainer.json) to define dependencies, extensions, and runtime settings. This means every team member gets an identical setup, reducing "works on my machine" issues.

  3. Seamless GitHub Integration
    Since Codespaces runs directly on GitHub, pushing, pulling, and collaborating on repositories is effortless. No need to clone and configure repositories locally.

  4. Access from Anywhere
    You can code from a browser, VSCode desktop, or even an iPad, making it an excellent option for developers who switch devices frequently.

  5. Powerful Compute Resources
    Codespaces provides scalable cloud infrastructure, so even resource-intensive projects can run smoothly without overloading your local machine.

A Real-World Example

Imagine you’re starting a new Streamlit project for the Streamlit Community Cloud. Normally, you’d:

  • Install Streamlit and other packages
  • Set up a virtual environment
  • Configure dependencies
  • Ensure all team members have the same setup

With GitHub Codespaces, you can define everything in a requirements.txt and a .devcontainer.json file and launch your environment in seconds. No more worrying about mismatched Python versions or missing dependencies: just open a browser and start coding.

See below how I obtained this coding environment to build a Weather Streamlit app quickly and for FREE using the Streamlit Community Cloud.




All in one browser tab: GitHub, the browser edition of VS Code, and access to a free machine on Streamlit Community Cloud, with GitHub Codespaces providing the development environment.

To see the above app visit https://click-weather.streamlit.app/
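
For illustration, here is a minimal sketch of what such a Streamlit weather app might look like. This is a hedged example, not the actual click-weather code: it assumes the free Open-Meteo forecast API and hard-codes a single location.

# app.py - minimal weather demo (illustrative sketch only)
import requests
import streamlit as st

st.title("Simple Weather App")

# Hard-coded coordinates for the example (London); a real app would let the user choose
latitude, longitude = 51.5074, -0.1278

# Open-Meteo offers a free, keyless current-weather endpoint
url = "https://api.open-meteo.com/v1/forecast"
params = {"latitude": latitude, "longitude": longitude, "current_weather": True}

response = requests.get(url, params=params, timeout=10)
weather = response.json().get("current_weather", {})

st.metric("Temperature (°C)", weather.get("temperature", "n/a"))
st.metric("Wind speed (km/h)", weather.get("windspeed", "n/a"))

With a requirements.txt listing streamlit and requests, this is all a Codespace needs to launch the app with streamlit run app.py.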

Final Thoughts

I think GitHub Codespaces is a game-changer for modern development. It eliminates the friction of setting up environments, making collaboration effortless and speeding up development cycles. If you haven’t tried it yet, spin up a Codespace for your next project; you might never go back to a traditional setup on your laptop.

There is another tool I want to look at, one that does all the scaffolding automatically with AI: the IDE called 'Windsurf' from Codeium. But that's another blog post.



Friday, 27 September 2024

Understanding the Contextual Awareness of data

    Image attribution: Thomas Nordwest, CC BY-SA 4.0, via Wikimedia Commons

  • What sources does this data come from?
  • How was this data collected in the first place?
  • Do you clearly understand the structure and format of your data?
  • What specific business objectives does this data help you achieve?
  • Do you know whether there are relationships between this dataset and others, and if so, which?
  • What limitations or biases may exist in this data?
  • How frequently is this data updated or refreshed? Is it stale?
  • What context is important to consider when trying to understand this data?
  • What questions are you trying to answer with this data?
    The above questions can help us achieve data zen, or the contextual awareness we all seek.

    Contextual Awareness of data refers to the ability to understand and interpret data within its relevant situation, including the source, structure, and intended use. It is about the 'world of interest' the data tries to describe. It goes beyond merely collecting data; it encompasses a comprehensive understanding of what the data represents, where it originates, and how it relates to other datasets. This awareness is crucial for organisations to derive meaningful insights, as data without context can lead to misinterpretation and ineffective decision-making.

    The importance of contextual awareness cannot be overstated. Without a clear understanding of the data’s context, organisations risk making decisions based on incomplete or misleading information. For instance, data collected from disparate sources may appear accurate in isolation but may contradict one another when analysed together. By fostering contextual awareness, organisations can ensure they ask the right questions and identify relevant insights, ultimately driving strategic initiatives and operational efficiency.

    Furthermore, contextual awareness enhances data quality assessment and improves the overall data governance framework. When organisations know the context of their data, when they know the semantics or the meaning of their data, they can better assess its reliability and relevance. This clarity allows for more informed decision-making processes, ensuring that data insights are actionable and aligned with business objectives. In today's data-driven landscape, cultivating contextual awareness is not just useful; it is essential for achieving a competitive edge and fostering a data-driven culture within the organisation.

    Several tools and practices, such as conceptual data models, data profiling, talking to colleagues, data catalogs, and ontologies, can help you on the journey of understanding your data. A small example of quick data profiling is sketched below.
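
    As a hedged illustration of the data-profiling step, here is a minimal Python sketch using pandas. The file name and the 'updated_at' column are invented for the example.

    import pandas as pd

    # Hypothetical dataset; replace with your own file
    df = pd.read_csv("customers.csv")

    # Structure and format: column names, dtypes, and row counts
    df.info()

    # Basic statistics reveal ranges, outliers, and suspicious values
    print(df.describe(include="all"))

    # Completeness: how many values are missing per column?
    print(df.isna().sum())

    # Freshness: when was the data last refreshed? (assumes an 'updated_at' column exists)
    if "updated_at" in df.columns:
        print(pd.to_datetime(df["updated_at"]).max())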

    So ask loads of questions the next time you see that data set!


    Tuesday, 9 January 2024

    Unraveling Data Architecture: Data Fabric vs. Data Mesh

    In this first post of this year, I would like to talk about modern data architectures and specifically about two prominent models, Data Fabric and Data Mesh. Both are potential solutions for organisations working with complex data and database systems.
    While both try to make our data lives better and to bring us the abstraction we need when working with data, they differ in their approach.
    But when would be best to use one over the other? Let's try to understand that.
    Definitions of Data Fabric and Data Mesh
    Some definitions first. I came up with these after reading up on the internet and in various books on the subject.
    Data Fabric is a centralised data architecture focused on creating a unified data environment. A data fabric integrates different data sources and provides seamless data access, governance, and management, often via tools and automation. Keywords to remember from the Data Fabric definition are centralised, unified, and seamless data access.

    Data Mesh is a paradigm shift, a completely different way of doing things. It is based on domains, most likely inspired by Domain-Driven Design (DDD), and captures data products within domains by decentralising data ownership (autonomy) and access. That is, the domains own the data and are themselves responsible for creating and maintaining their own data products. The responsibility is distributed. Keywords to take away from Data Mesh are decentralised, domain ownership, autonomy, and data products.

    Criteria for Choosing between Data Fabric and Data Mesh

    Data Fabric might be preferred when there is a need for centralised governance and control over data entities, or when a unified view is needed across all database systems and data sources in the organisation or its departments. Transactional database workloads, where consistency and integrity in data operations are paramount, are very well suited to this type of data architecture.

    Data Mesh can be more suitable for organisational cultures or departments where scalability and agility are a priority. Because of its domain-driven design, Data Mesh might be a better fit for organisations or departments that are decentralised and innovative, and that require their business units to decide swiftly and independently how to handle their data. Analytical workloads and other big data workloads may be better suited to Data Mesh architectures.

    Ultimately, the decision between these data architectures hinges on the data-processing workload and the alignment of diverse data sources. There is no universal solution applicable to all scenarios, no one size fits all. Organisations, and departments within organisations, operate within unique cultural and environmental contexts, often necessitating thorough research, proofs of concept, and pattern evaluation to identify the optimal architectural fit.

    Remember, in the realm of data architecture, the data workload reigns supreme - it dictates the design.

    Friday, 24 November 2023

    Using vector databases for context in AI


    Image generated by AI using OpenAI's DALL·E


    In the realm of Artificial Intelligence (AI), understanding and retaining context stands as a pivotal factor for decision-making and enhanced comprehension. Vector databases are foundational pillars for encapsulating your own data so it can be used in conjunction with AI and LLMs, empowering these systems to absorb and retain intricate contextual information.

    Understanding Vector Databases

    Vector databases are specialised data storage systems engineered to efficiently manage and retrieve vectorised data, also known as embeddings. These databases store information in a vector format, where each data entity is represented as a multidimensional numerical vector encapsulating various attributes and relationships, thus preserving rich context. That is, text, video, or audio is translated into numbers with many attributes in a multidimensional space, and mathematics is then used to calculate the proximity between these vectors. Loosely speaking, that is what a neural network in an LLM does: it computes proximity (similarity) between vectors, a bit like how our brains do pattern recognition. The vector database is simply where the vectors are stored. Without a vector database, under architectures like RAG, it is impossible to bring your own data or context into an LLM app; all the model will know is what it was trained on from the public internet. Vector databases enable you to bring your own data to AI.
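
    To make the idea of proximity concrete, here is a small, hedged Python sketch using NumPy. The three-dimensional vectors are toy values invented for the example; real embeddings typically have hundreds or thousands of dimensions.

    import numpy as np

    def cosine_similarity(a, b):
        # Similarity of direction between two vectors: close to 1.0 means semantically similar
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy embeddings; in practice these come from an embedding model
    cat = np.array([0.9, 0.1, 0.0])
    kitten = np.array([0.85, 0.15, 0.05])
    invoice = np.array([0.0, 0.2, 0.95])

    print(cosine_similarity(cat, kitten))   # high: semantically close
    print(cosine_similarity(cat, invoice))  # low: semantically distant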

    Examples of Vector Databases

    Several platforms offer vector databases, such as Pinecone, Faiss by Facebook, Annoy, Milvus, and Elasticsearch with dense vector support. These databases cater to diverse use cases, offering functionalities tailored to handle vast amounts of vectorised information, be it images, text, audio, or other complex data types.

    Importance in AI Context

    Within the AI landscape, vector databases play a pivotal role in serving specific data and context to AI models. In particular, in the Retrieval-Augmented Generation (RAG) architecture, where retrieval of relevant information is an essential part of content generation, vector databases act as repositories, storing precomputed embeddings of your own private data. These embeddings encode the semantic and contextual essence of your data, facilitating efficient retrieval in your AI apps and bots. Bring a vector database to your AI apps, agents, or chatbots and they will speak your data!
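
    As a hedged illustration of the retrieval step in RAG, here is a minimal in-memory sketch in Python. The embed() function is a crude word-hashing stand-in for a real embedding model, and the documents are invented; a real system would query a vector database rather than NumPy arrays.

    import numpy as np

    def embed(text):
        # Stand-in for a real embedding model: hash words into a fixed-size vector
        vec = np.zeros(64)
        for word in text.lower().split():
            vec[hash(word.strip(".,!?")) % 64] += 1.0
        return vec / (np.linalg.norm(vec) or 1.0)

    # Private documents that the base LLM has never seen
    documents = [
        "Our refund policy allows returns within 30 days",
        "The on-call rota rotates every Monday at 09:00",
        "Invoices are payable within 45 days of issue",
    ]
    doc_vectors = [embed(d) for d in documents]

    # Retrieval: find the stored document closest to the user's question
    question = "How many days do customers have for returns?"
    q_vec = embed(question)
    scores = [float(np.dot(q_vec, v)) for v in doc_vectors]
    context = documents[int(np.argmax(scores))]

    # The retrieved context is then inserted into the prompt sent to the LLM
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    print(prompt)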

    Advantages for Organisations and AI Applications

    Organisations can harness the power of vector databases within Retrieval-Augmented Generation (RAG) architectures to elevate their AI applications and enable them to use organisation-specific data:

    1. Enhanced Contextual Understanding: By leveraging vector databases, AI models grasp nuanced contextual information, enabling more informed decision-making and more precise content generation based on specific and private organisational context.

    2. Improved Efficiency in Information Retrieval: Vector databases expedite the retrieval of pertinent information by enabling similarity searches based on vector representations, augmenting the speed and accuracy of AI applications.

    3. Scalability and Flexibility: These databases offer scalability and flexibility, accommodating diverse data types and expanding corpora, essential for the evolving needs of AI-driven applications.

    4. Optimised Resource Utilisation: Vector databases streamline resource utilisation by efficiently storing and retrieving vectorised data, thus optimising computational resources and infrastructure.

    Closing Thoughts

    In the AI landscape, where the comprehension of context is paramount, vector databases emerge as linchpins, fortifying AI systems with the capability to retain and comprehend context-rich information. Their integration within Retrieval-Augmented Generation (RAG) architectures not only elevates AI applications but also empowers organisations to glean profound insights, fostering a new era of context-driven AI innovation from data.

    In essence, the power vested in vector databases will reshape the trajectory of AI, propelling it toward unparalleled contextualisation and intelligent decision-making based on in-house, organisation-owned data.

    But the enigma persists: What precisely will be the data fuelling the AI model?

    Sunday, 16 April 2023

    VS Code container development

    If you're a software developer, you know how important it is to have a development environment that is flexible, efficient, and easy to use. PyCharm is a popular IDE (Integrated Development Environment) for Python developers, but there are other options out there that may suit your needs better. One such option is Visual Studio Code, or VS Code for short.

    After using PyCharm for a while, I decided to give VS Code a try, and I was pleasantly surprised by one of its features: the remote container development extension. This extension allows you to develop your code in containers, with no footprint on your local machine at all. This means that you can have a truly ephemeral solution, enabling abstraction to the maximum.

    So, how does it work? First, you need to create two files: a Dockerfile and a devcontainer.json file. These files should be located in a hidden .devcontainer folder at the root location of any of your GitHub projects.

    The Dockerfile is used to build the container image that will be used for development. Here's a sample Dockerfile that installs Python3, sudo, and SQLite3:


    # Base image
    FROM ubuntu:20.04

    # Avoid interactive prompts from apt during the image build
    ARG DEBIAN_FRONTEND=noninteractive

    # Install Python 3, sudo, and SQLite3
    RUN apt-get update -y && \
        apt-get install -y python3 sudo sqlite3

    The devcontainer.json file is used to configure the development environment in the container. Here's a sample devcontainer.json file that sets the workspace folder to "/workspaces/alpha", installs the "ms-python.python" extension, and forwards port 8000:

    {
        "name": "hammer",
        "build": {
            "context": ".",
            "dockerfile": "./Dockerfile"
        },
        "workspaceFolder": "/workspaces/alpha",
        "extensions": [
            "ms-python.python"
        ],
        "forwardPorts": [
            8000
        ]
    }


    Once you have these files ready, you can clone your GitHub code down to a Visual Studio Code container volume. Here's how to do it:

    1. Start Visual Studio Code
    2. Make sure you have the "Remote Development" extension installed and enabled
    3. Go to the "Remote Explorer" view from the Activity Bar on the left
    4. Click "Clone Repository in Container Volume" at the bottom left
    5. In the Command Palette, choose "Clone a repository from GitHub in a Container Volume" and pick your GitHub repo.

    That's it! You are now tracking your code inside a container volume, built from a Dockerfile that is itself tracked on GitHub, together with all the environment-specific extensions you require for development.

    The VS Code remote container development extension is a powerful tool for developers who need a flexible, efficient, and easy-to-use development environment. By using containers, you can create an ephemeral solution that allows you to abstract away the complexities of development environments and focus on your code. If you're looking for a new IDE or just want to try something different, give VS Code a try with the remote container development extension.