Developer & Contributor Guide
This section combines developer setup, architecture overview, and migration information for developers working with LiuEmbeddings.
Developer Setup
Prerequisites
- Python 3.8+
- pip package manager
- Git for version control
Local Development Installation
Clone the repository and install in development mode:
git clone https://github.com/himanshuclub88/liuembeddings.git
cd liuembeddings
pip install -e ".[dev]"
Running Tests
Run the test suite to verify your environment:
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=liuembeddings
# Run specific test file
pytest tests/test_embeddings.py -v
Code Quality Tools
Format and validate code:
# Format code
black liuembeddings/
# Check style
flake8 liuembeddings/
# Type checking
mypy liuembeddings/
# All checks at once
black liuembeddings/ && flake8 liuembeddings/ && mypy liuembeddings/
Architecture Overview
Core Modules
config.py
- Central configuration management
- Model registry and metadata
- Default values for all settings
- Logging configuration
- Key Class: LiuConfig
logger.py
- Logging infrastructure
- Setup and configuration
- Consistent logging format
- Log level management
embeddings.py
- HuggingFace embeddings wrapper
- Universal Sentence Encoder support
- Multiple model support
- Model caching and batch processing
- Key Class: LiuEmbeddings
vectorstore.py
- ChromaDB integration
- Vector storage and retrieval
- CRUD operations
- Metadata filtering and batch operations
- Key Class: LiuVectorStore
utils.py
- Text processing utilities
- Text splitting with overlap
- Text cleaning and normalization
- Batch generators
- Key Functions: split_text(), clean_text()
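The idea of text splitting with overlap can be pictured with a small standalone sketch. This is not the library's split_text() implementation, whose signature is not shown in this guide. The function name split_with_overlap and the parameters chunk_size and overlap are assumptions chosen for illustration, and the sketch splits on characters rather than tokens.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character chunks; consecutive chunks share `overlap` characters.

    Hypothetical helper for illustration only; not the liuembeddings split_text() API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.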
Module Dependencies
config.py
↓
logger.py (depends on config)
↓
embeddings.py (depends on logger, config)
vectorstore.py (depends on logger, config, embeddings)
utils.py (depends on logger, config)
↓
__init__.py (depends on all above)
Migration Guide
Overview
liuembeddings v2.0.0 is a major upgrade: it replaces the TensorFlow Hub backend (Universal Sentence Encoder, USE) with Hugging Face Sentence-Transformers, giving faster and more lightweight embeddings from modern models.
What Changed
| Area | v1.x (Old) | v2.0.0 (New) |
|---|---|---|
| Backend | TensorFlow Hub (USE) | Hugging Face Sentence-Transformers |
| Dependencies | tensorflow, tensorflow_hub | sentence-transformers, torch |
| Default Model | USE | MiniLM |
| Performance | Slow, large dependencies | Faster, smaller, and GPU/CPU friendly |
| API | Mostly the same | Same API, different model loading mechanism |
| Supported Models | Only USE / USE-Large | MiniLM, MPNet, E5, BGE |
| Model Cache | TensorFlow in-memory | Hugging Face in-memory |
New Dependencies
Uninstall the TensorFlow-related packages, then install Sentence-Transformers:
pip uninstall tensorflow tensorflow-hub -y
pip install sentence-transformers
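After changing the installed packages, a quick check can confirm which backend modules are still present. The sketch below is a hypothetical helper, not part of liuembeddings. It uses only the standard library and the import names of the packages listed in the What Changed table (tensorflow-hub is imported as tensorflow_hub, sentence-transformers as sentence_transformers).

```python
import importlib.util

# Import names, not pip names.
OLD_BACKEND = ["tensorflow", "tensorflow_hub"]
NEW_BACKEND = ["sentence_transformers", "torch"]


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def dependency_report() -> dict:
    """Map each old and new backend module to whether it is currently installed."""
    return {name: is_installed(name) for name in OLD_BACKEND + NEW_BACKEND}


if __name__ == "__main__":
    for name, present in dependency_report().items():
        print(f"{name}: {'installed' if present else 'not installed'}")
```

After a clean migration, the two TensorFlow entries would be expected to report as not installed and the other two as installed.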
Model List in v2.0.0
| Model Name | ID | Dimension | Size (MB) | Accuracy |
|---|---|---|---|---|
| MiniLM | sentence-transformers/all-MiniLM-L6-v2 | 384 | 22 | 0.78 |
| MPNetBase | sentence-transformers/all-mpnet-base-v2 | 768 | 420 | 0.82 |
| USE | intfloat/e5-base-v2 | 768 | 300 | 0.84 |
| USEL | BAAI/bge-base-en-v1.5 | 1024 | 1024 | 0.86 |
Code Migration Example
Old (v1.x)
from liuembeddings.embeddings import LiuEmbeddings
embedder = LiuEmbeddings("USE")
vec = embedder.embed_query("Hello world")
New (v2.0.0)
from liuembeddings.embeddings import LiuEmbeddings
embedder = LiuEmbeddings("USE") # or "MiniLM" for speed
vec = embedder.embed_query("Hello world")
✅ Same API, just use one of the new supported model names.
Config Changes
Old - LiuConfig.AVAILABLE_MODELS contained only TensorFlow Hub models (USE, USE-Large).
New - LiuConfig.AVAILABLE_MODELS now includes modern transformer models with metadata:
{
"MiniLM": {...},
"MPNetBase": {...},
"USE": {...},
"USEL": {...}
}
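To make the shape of such a registry concrete, here is a minimal sketch of a name-to-metadata mapping with a lookup helper. It is a stand-in, not the actual LiuConfig.AVAILABLE_MODELS: the field names "id" and "dimension" and the function resolve_model are assumptions, only two entries are shown, and their values are taken from the model list in this guide.

```python
# Illustrative stand-in for a model registry. Field names are assumptions;
# the remaining entries (USE, USEL) are omitted here.
AVAILABLE_MODELS = {
    "MiniLM": {
        "id": "sentence-transformers/all-MiniLM-L6-v2",
        "dimension": 384,
    },
    "MPNetBase": {
        "id": "sentence-transformers/all-mpnet-base-v2",
        "dimension": 768,
    },
}


def resolve_model(name: str) -> dict:
    """Look up a model's metadata; on failure, report the valid names."""
    try:
        return AVAILABLE_MODELS[name]
    except KeyError:
        valid = ", ".join(sorted(AVAILABLE_MODELS))
        raise ValueError(f"Unknown model {name!r}; expected one of: {valid}") from None


print(resolve_model("MiniLM")["dimension"])  # 384
```

Failing with the list of valid names is a deliberate choice in this sketch: after a migration, a stale model name is the most likely mistake, and the error message then shows what to change it to.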
Removed TensorFlow Environment Settings
In v1.x we had:
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
This is no longer needed. ✅ All models now run on PyTorch through Sentence-Transformers, on either GPU or CPU.
Benefits of v2.0.0
- 🚀 30–50× faster installation (no TensorFlow)
- 🧠 Higher accuracy via modern transformer encoders
- 💾 Smaller dependency footprint
- 🔄 Same API for easy migration
- 🧩 Future support for multilingual and domain-specific models
Verification
Run a quick embedding test:
from liuembeddings import LiuEmbeddings
embedder = LiuEmbeddings("E5Base")
print(len(embedder.embed_query("Migration complete!")))
The output should be 768, the embedding dimension, with no TensorFlow log messages.
Migration Summary
| Action | Description |
|---|---|
| 🧹 Uninstall TensorFlow | pip uninstall tensorflow tensorflow-hub |
| 📦 Install Transformers | pip install sentence-transformers |
| 🔄 Update Code | Change "USE-Large" → "E5Base" or "MiniLM" |
| ⚙️ No API Change | Keep using embed_query, embed_documents, embed_documents_batch |
Testing Guidelines
Writing Tests
Create tests in the tests/ directory following the existing patterns:
import pytest
from liuembeddings import LiuEmbeddings, LiuVectorStore

def test_embeddings_basic():
    embedder = LiuEmbeddings(model_name="USE")
    result = embedder.embed_query("test")
    assert isinstance(result, list)
    assert len(result) > 0
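Beyond type and length checks, a test can also assert a semantic property, for example that a related sentence scores higher than an unrelated one. The sketch below shows only the comparison helper, in pure Python. The helper cosine_similarity and the test name are hypothetical, and the vectors are hand-written stand-ins; in a real test they would come from embed_query().

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def test_related_scores_higher():
    # Stand-in vectors; a real test would call embedder.embed_query(...) here.
    query = [1.0, 0.0, 1.0]
    related = [0.9, 0.1, 0.8]
    unrelated = [0.0, 1.0, 0.0]
    assert cosine_similarity(query, related) > cosine_similarity(query, unrelated)
```

A relative comparison like this tends to be more robust across models than asserting an exact similarity value, since absolute scores differ from one encoder to another.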
Test Coverage
Maintain high code coverage (target: >80%):
pytest tests/ --cov=liuembeddings --cov-report=html
Performance Tips
- Batch Processing - Use embed_documents_batch() for large document sets
- Model Selection - Choose MiniLM for speed, MPNet for accuracy
- Chunk Size - Balance context and precision (typical: 400-500 tokens)
- Metadata Filtering - Filter before similarity search when possible
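The batch-processing tip can be illustrated with a generic sketch of the underlying pattern: slice the inputs into fixed-size batches and embed one batch at a time. This is not how embed_documents_batch() is implemented. The names batched, embed_in_batches, and fake_embed are hypothetical, and the default batch size of 32 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Call `embed_fn` once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder so the sketch runs without downloading a model; in
# practice embed_fn could be a method such as embedder.embed_documents.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

Batching bounds peak memory by the batch size rather than the corpus size, while still giving the model several inputs per call.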
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/your-feature)
- Make your changes and add tests
- Run code quality checks
- Commit with clear messages
- Push to your fork and create a Pull Request
Package Structure
liuembeddings/
├── liuembeddings/
│ ├── __init__.py
│ ├── embeddings.py
│ ├── vectorstore.py
│ ├── utils.py
│ ├── config.py
│ └── logger.py
├── tests/
│ ├── test_embeddings.py
│ └── test_vectorstore.py
├── docs/
├── setup.py
├── requirements.txt
└── README.md