INTRODUCTION
Text beautification and improvement represents a sophisticated application of natural language processing that goes beyond simple grammar correction. A well-designed chatbot for this purpose can enhance writing quality, improve clarity, adjust tone, expand content meaningfully, and adapt to different languages automatically. This comprehensive guide explores the construction of such a system using HuggingFace's powerful ecosystem of transformer models and supporting libraries.
The chatbot we will build serves multiple functions. It can receive text input directly through conversational prompts or process text files uploaded by users. The system automatically detects the language of the input text and applies appropriate improvements while maintaining the original language and cultural context. Additionally, the chatbot can extend articles or documents by generating relevant, coherent additional content that matches the style and subject matter of the original text.
ARCHITECTURAL OVERVIEW AND CORE COMPONENTS
The foundation of our text beautification chatbot rests on several interconnected components that work together to process, analyze, and improve text input. The architecture follows a modular design pattern that separates concerns and allows for easy maintenance and extension.
The primary components include a language detection module, a text preprocessing pipeline, multiple specialized transformer models for different improvement tasks, a content extension engine, and a user interface layer that handles both direct text input and file uploads. Each component communicates through well-defined interfaces, ensuring loose coupling and high cohesion.
from transformers import (
AutoTokenizer, AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification, pipeline
)
from langdetect import detect
import torch
import re
import os
from typing import Dict, List, Tuple, Optional
class TextBeautificationChatbot:
def __init__(self):
"""
Initialize the chatbot with necessary models and pipelines.
Sets up language detection, grammar correction, style improvement,
and content extension capabilities.
"""
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.models = {}
self.tokenizers = {}
self.pipelines = {}
self._initialize_models()
The initialization method establishes the computational environment and prepares the various models that will be used throughout the text processing pipeline. The system automatically detects whether GPU acceleration is available and configures the models accordingly for optimal performance.
HUGGINGFACE LIBRARIES AND MODEL SELECTION
HuggingFace provides an extensive collection of pre-trained models that excel at different aspects of text processing. For our beautification chatbot, we carefully select models that complement each other and cover the full spectrum of text improvement tasks.
The grammar correction component utilizes models specifically fine-tuned for error detection and correction. These models have been trained on large datasets of text pairs containing errors and their corrections, enabling them to identify and fix grammatical mistakes, spelling errors, and punctuation issues.
def _initialize_models(self):
"""
Load and initialize all required models for text processing.
This includes grammar correction, style improvement, and content generation models.
"""
# Grammar and spelling correction model
self.models['grammar'] = AutoModelForSeq2SeqLM.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
).to(self.device)
self.tokenizers['grammar'] = AutoTokenizer.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
)
# Style improvement and paraphrasing model
self.models['style'] = AutoModelForSeq2SeqLM.from_pretrained(
"tuner007/pegasus_paraphrase"
).to(self.device)
self.tokenizers['style'] = AutoTokenizer.from_pretrained(
"tuner007/pegasus_paraphrase"
)
# Content generation and extension model
self.pipelines['generator'] = pipeline(
"text-generation",
model="microsoft/DialoGPT-large",
device=0 if torch.cuda.is_available() else -1
)
The style improvement model focuses on enhancing the overall quality and readability of text. It can rephrase sentences to improve flow, eliminate redundancy, and enhance clarity while preserving the original meaning. The content generation model enables the chatbot to extend articles or documents by producing relevant additional content that maintains consistency with the existing text.
LANGUAGE DETECTION AND MULTILINGUAL SUPPORT
Supporting multiple languages requires sophisticated detection mechanisms and language-specific processing pipelines. The chatbot must accurately identify the language of input text and apply appropriate models and techniques for that specific language.
def detect_language(self, text: str) -> str:
"""
Detect the language of the input text using statistical analysis.
Returns ISO language code for the detected language.
Args:
text: Input text for language detection
Returns:
str: ISO language code (e.g., 'en', 'es', 'fr')
"""
try:
# Remove special characters and numbers for better detection
clean_text = re.sub(r'[^\w\s]', ' ', text)
clean_text = re.sub(r'\d+', '', clean_text)
if len(clean_text.strip()) < 10:
return 'en' # Default to English for very short texts
detected_lang = detect(clean_text)
return detected_lang
        except Exception:
return 'en' # Fallback to English if detection fails
def get_language_specific_models(self, language_code: str) -> Dict:
"""
Return appropriate models based on detected language.
Some models work better for specific languages.
Args:
language_code: ISO language code
Returns:
Dict: Dictionary containing language-specific model configurations
"""
language_models = {
'en': {
'grammar_model': 'pszemraj/flan-t5-large-grammar-synthesis',
'style_model': 'tuner007/pegasus_paraphrase'
},
'es': {
'grammar_model': 'pszemraj/flan-t5-large-grammar-synthesis',
'style_model': 'tuner007/pegasus_paraphrase'
},
'fr': {
'grammar_model': 'pszemraj/flan-t5-large-grammar-synthesis',
'style_model': 'tuner007/pegasus_paraphrase'
}
}
return language_models.get(language_code, language_models['en'])
The language detection system employs statistical analysis to identify the language of input text. It preprocesses the text by removing special characters and numbers that might interfere with accurate detection. For very short text snippets where detection might be unreliable, the system defaults to English while maintaining the capability to handle longer texts in multiple languages.
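One practical caveat: langdetect is stochastic by default, so results on short or ambiguous input can vary between runs, and seeding its DetectorFactory makes detection deterministic. A minimal standalone check, independent of the chatbot class and using purely illustrative sample sentences:
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is stochastic by default; seeding makes results reproducible

print(detect("This is a short English paragraph used purely for demonstration."))                    # 'en'
print(detect("Ceci est un court paragraphe en français utilisé uniquement pour la démonstration."))  # 'fr'
print(detect("Dies ist ein kurzer deutscher Absatz, der nur zur Demonstration dient."))              # 'de'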
TEXT PREPROCESSING AND IMPROVEMENT PIPELINE
The text improvement pipeline represents the core functionality of our chatbot. It processes input text through multiple stages, each designed to address specific aspects of text quality and readability.
def preprocess_text(self, text: str) -> str:
"""
Clean and prepare text for processing by removing unnecessary
whitespace, fixing basic formatting issues, and normalizing structure.
Args:
text: Raw input text
Returns:
str: Preprocessed text ready for improvement
"""
# Remove excessive whitespace
text = re.sub(r'\s+', ' ', text)
# Fix common punctuation spacing issues
text = re.sub(r'\s+([,.!?;:])', r'\1', text)
text = re.sub(r'([.!?])\s*([A-Z])', r'\1 \2', text)
# Ensure proper sentence spacing
text = re.sub(r'([.!?])\s+', r'\1 ', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
def improve_grammar(self, text: str, language: str = 'en') -> str:
"""
Correct grammatical errors, spelling mistakes, and punctuation issues
in the input text using transformer models.
Args:
text: Input text to be corrected
language: Language code for language-specific processing
Returns:
str: Grammar-corrected text
"""
# Prepare input for the grammar model
input_text = f"grammar: {text}"
# Tokenize input
inputs = self.tokenizers['grammar'].encode(
input_text,
return_tensors="pt",
max_length=512,
truncation=True
).to(self.device)
# Generate corrected text
with torch.no_grad():
outputs = self.models['grammar'].generate(
inputs,
max_length=512,
num_beams=4,
temperature=0.7,
do_sample=True,
early_stopping=True
)
# Decode and return corrected text
corrected_text = self.tokenizers['grammar'].decode(
outputs[0],
skip_special_tokens=True
)
return corrected_text.strip()
The preprocessing stage handles basic text cleaning and formatting normalization. It addresses common issues such as excessive whitespace, improper punctuation spacing, and inconsistent sentence structure. This preparation ensures that subsequent processing stages receive well-formatted input, leading to better results.
The grammar improvement function leverages transformer models specifically trained for error correction. It uses beam search decoding to generate multiple candidate corrections and selects the most appropriate one. The temperature parameter controls the creativity of the generation process, balancing between conservative corrections and more substantial improvements.
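To make the preprocessing stage concrete, here is the same set of regular expressions applied standalone to a deliberately messy sentence (the sample string is purely illustrative):
import re

messy = "This  is   spaced oddly , with a stray space before the comma .It also runs sentences together."
cleaned = re.sub(r'\s+', ' ', messy)                       # collapse runs of whitespace
cleaned = re.sub(r'\s+([,.!?;:])', r'\1', cleaned)         # remove space before punctuation
cleaned = re.sub(r'([.!?])\s*([A-Z])', r'\1 \2', cleaned)  # exactly one space before a new sentence
print(cleaned.strip())
# This is spaced oddly, with a stray space before the comma. It also runs sentences together.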
STYLE ENHANCEMENT AND READABILITY IMPROVEMENT
Style enhancement goes beyond basic grammar correction to improve the overall quality, flow, and readability of text. This component analyzes sentence structure, word choice, and overall coherence to produce more polished and engaging content.
def enhance_style(self, text: str) -> str:
"""
Improve text style, readability, and flow while preserving meaning.
This includes sentence restructuring, word choice optimization,
and coherence enhancement.
Args:
text: Grammar-corrected text
Returns:
str: Style-enhanced text
"""
# Split text into sentences for individual processing
sentences = self._split_into_sentences(text)
enhanced_sentences = []
for sentence in sentences:
if len(sentence.strip()) < 10:
enhanced_sentences.append(sentence)
continue
# Prepare input for style model
input_text = f"paraphrase: {sentence}"
# Tokenize
inputs = self.tokenizers['style'].encode(
input_text,
return_tensors="pt",
max_length=256,
truncation=True
).to(self.device)
# Generate enhanced version
with torch.no_grad():
outputs = self.models['style'].generate(
inputs,
max_length=256,
num_beams=3,
temperature=0.8,
do_sample=True,
early_stopping=True
)
enhanced_sentence = self.tokenizers['style'].decode(
outputs[0],
skip_special_tokens=True
)
enhanced_sentences.append(enhanced_sentence.strip())
return ' '.join(enhanced_sentences)
def _split_into_sentences(self, text: str) -> List[str]:
"""
Split text into individual sentences for processing.
Handles various sentence ending patterns and edge cases.
Args:
text: Input text to split
Returns:
List[str]: List of individual sentences
"""
# Use regex to split on sentence boundaries
sentence_pattern = r'(?<=[.!?])\s+'
sentences = re.split(sentence_pattern, text)
# Filter out empty sentences
sentences = [s.strip() for s in sentences if s.strip()]
return sentences
The style enhancement process works at the sentence level to ensure that improvements maintain local coherence while contributing to overall text quality. Each sentence is processed individually through the paraphrasing model, which has been trained to generate alternative phrasings that improve clarity and readability.
The sentence splitting function uses a regular expression to identify sentence boundaries. It covers the common punctuation patterns, although abbreviations such as "Dr." or "e.g." still produce false splits, a known limitation of regex-based splitting that the snippet below illustrates.
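A quick standalone check of the splitting pattern makes both the behaviour and the limitation visible (the sample sentence is illustrative):
import re

sample = "Dr. Smith reviewed the draft. He suggested two changes! Should we apply them?"
print(re.split(r'(?<=[.!?])\s+', sample))
# ['Dr.', 'Smith reviewed the draft.', 'He suggested two changes!', 'Should we apply them?']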
CONTENT EXTENSION AND ARTICLE EXPANSION
The content extension capability allows the chatbot to generate additional relevant content that expands on existing articles or documents. This feature analyzes the context, style, and subject matter of the original text to produce coherent extensions.
def extend_content(self, text: str, extension_length: int = 200) -> str:
"""
Generate additional content that extends the input text while
maintaining consistency in style, tone, and subject matter.
Args:
text: Original text to extend
extension_length: Desired length of extension in tokens
Returns:
str: Extended text with additional relevant content
"""
# Analyze the text to understand context and style
context = self._analyze_text_context(text)
# Extract key themes and topics
key_topics = self._extract_key_topics(text)
# Generate extension prompt based on analysis
extension_prompt = self._create_extension_prompt(text, context, key_topics)
# Generate extension using the content generation model
generated_extension = self.pipelines['generator'](
extension_prompt,
            max_new_tokens=extension_length,  # cap the number of newly generated tokens, independent of prompt length
num_return_sequences=1,
temperature=0.8,
do_sample=True,
pad_token_id=50256
)
# Extract and clean the generated text
extension_text = generated_extension[0]['generated_text']
extension_text = extension_text.replace(extension_prompt, '').strip()
# Ensure the extension flows naturally from the original text
final_extension = self._refine_extension(extension_text, text)
return f"{text}\n\n{final_extension}"
def _analyze_text_context(self, text: str) -> Dict:
"""
Analyze the input text to understand its context, style, and tone.
This analysis guides the content extension process.
Args:
text: Text to analyze
Returns:
Dict: Analysis results including tone, style, and context information
"""
# Determine text type (article, essay, technical document, etc.)
text_type = self._classify_text_type(text)
# Analyze writing style and tone
style_analysis = self._analyze_writing_style(text)
# Extract structural patterns
structure_patterns = self._analyze_text_structure(text)
return {
'text_type': text_type,
'style': style_analysis,
'structure': structure_patterns,
'length': len(text.split()),
'complexity': self._assess_complexity(text)
}
The content extension system performs comprehensive analysis of the input text to understand its characteristics before generating additional content. This analysis includes text type classification, style assessment, and structural pattern recognition to ensure that generated extensions maintain consistency with the original material.
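The analysis and assembly helpers referenced above (_extract_key_topics, _create_extension_prompt, _refine_extension, and the various _analyze_* routines) are not spelled out in this article. As a rough illustration only, the three that most directly shape the generated extension could be implemented with simple heuristics such as the following sketch; the stopword list, the 60-word context window, and the prompt template are assumptions rather than part of a fixed design. They reuse the re, List, and Dict imports shown earlier and would live as methods on the TextBeautificationChatbot class.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that", "this", "for", "with"}

def _extract_key_topics(self, text: str, top_n: int = 5) -> List[str]:
    """Heuristic sketch: treat the most frequent non-stopword terms as rough topics."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text) if w.lower() not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]

def _create_extension_prompt(self, text: str, context: Dict, key_topics: List[str]) -> str:
    """Heuristic sketch: seed generation with the tail of the document plus a topical cue."""
    last_part = ' '.join(text.split()[-60:])  # roughly the last couple of sentences
    topic_hint = ', '.join(key_topics[:3])
    return f"{last_part} Continuing on {topic_hint},"

def _refine_extension(self, extension_text: str, original_text: str) -> str:
    """Heuristic sketch: keep only complete sentences so the extension does not end mid-thought."""
    sentences = self._split_into_sentences(extension_text)
    complete = [s for s in sentences if s and s[-1] in '.!?']
    return ' '.join(complete) if complete else extension_text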
FILE HANDLING AND USER INTERFACE
The chatbot must handle both direct text input and file uploads seamlessly. This requires robust file processing capabilities and a user-friendly interface that accommodates different input methods.
def process_file_input(self, file_path: str) -> str:
"""
Process text files uploaded by users, supporting multiple formats
including plain text, markdown, and basic document formats.
Args:
file_path: Path to the uploaded file
Returns:
str: Extracted text content from the file
"""
try:
file_extension = os.path.splitext(file_path)[1].lower()
if file_extension in ['.txt', '.md']:
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
elif file_extension == '.docx':
# Handle Word documents (requires python-docx)
content = self._extract_from_docx(file_path)
elif file_extension == '.pdf':
# Handle PDF files (requires PyPDF2 or similar)
content = self._extract_from_pdf(file_path)
else:
raise ValueError(f"Unsupported file format: {file_extension}")
return content.strip()
except Exception as e:
raise Exception(f"Error processing file: {str(e)}")
def chat_interface(self, user_input: str, file_path: Optional[str] = None) -> str:
"""
Main interface method that handles user interactions, processes input,
and returns improved text or extensions based on user requests.
Args:
user_input: Direct text input or instructions from user
file_path: Optional path to uploaded file
Returns:
str: Processed and improved text response
"""
try:
# Determine input source and extract text
if file_path:
text_to_process = self.process_file_input(file_path)
operation_mode = self._determine_operation_from_input(user_input)
else:
text_to_process = user_input
operation_mode = 'improve' # Default operation
# Detect language
detected_language = self.detect_language(text_to_process)
# Process text through improvement pipeline
processed_text = self._process_text_pipeline(
text_to_process,
detected_language,
operation_mode
)
return processed_text
except Exception as e:
return f"Error processing your request: {str(e)}
The file handling system supports multiple file formats and provides appropriate error handling for unsupported formats or corrupted files. The chat interface serves as the main entry point for user interactions, intelligently routing requests based on input type and user instructions.
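The _extract_from_docx and _extract_from_pdf helpers referenced above, along with _determine_operation_from_input, are left undefined in the code shown. A minimal sketch of what they might look like follows; it assumes python-docx and PyPDF2 (version 3 or later) as hinted in the comments above, and the keyword lists used for operation routing are illustrative guesses rather than a fixed specification. These would be added as methods of the TextBeautificationChatbot class.
def _extract_from_docx(self, file_path: str) -> str:
    """Sketch: pull paragraph text out of a Word document (requires python-docx)."""
    from docx import Document
    document = Document(file_path)
    return "\n".join(paragraph.text for paragraph in document.paragraphs)

def _extract_from_pdf(self, file_path: str) -> str:
    """Sketch: concatenate extractable text from each PDF page (requires PyPDF2 >= 3.0)."""
    from PyPDF2 import PdfReader
    reader = PdfReader(file_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def _determine_operation_from_input(self, user_input: str) -> str:
    """Sketch: map a free-form instruction to an operation mode with simple keyword checks."""
    lowered = user_input.lower()
    wants_extend = any(word in lowered for word in ("extend", "expand", "continue", "lengthen"))
    wants_improve = any(word in lowered for word in ("improve", "fix", "correct", "polish", "beautify"))
    if wants_extend and wants_improve:
        return 'both'
    if wants_extend:
        return 'extend'
    return 'improve'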
ERROR HANDLING AND OPTIMIZATION
Robust error handling and performance optimization ensure that the chatbot operates reliably under various conditions and provides consistent user experience even when encountering unexpected situations.
def _process_text_pipeline(self, text: str, language: str, mode: str) -> str:
"""
Execute the complete text processing pipeline with error handling
and optimization for different operation modes.
Args:
text: Input text to process
language: Detected language code
mode: Operation mode ('improve', 'extend', 'both')
Returns:
str: Fully processed text
"""
try:
# Preprocess text
preprocessed_text = self.preprocess_text(text)
if mode in ['improve', 'both']:
# Apply grammar correction
grammar_corrected = self.improve_grammar(preprocessed_text, language)
# Apply style enhancement
style_enhanced = self.enhance_style(grammar_corrected)
result_text = style_enhanced
else:
result_text = preprocessed_text
if mode in ['extend', 'both']:
# Extend content if requested
extended_text = self.extend_content(result_text)
result_text = extended_text
return result_text
except torch.cuda.OutOfMemoryError:
# Handle GPU memory issues
torch.cuda.empty_cache()
return self._process_with_cpu_fallback(text, language, mode)
except Exception as e:
return f"Processing error: {str(e)}"
def _process_with_cpu_fallback(self, text: str, language: str, mode: str) -> str:
"""
Fallback processing method using CPU when GPU memory is insufficient.
Implements reduced batch sizes and simplified processing.
Args:
text: Input text to process
language: Language code
mode: Operation mode
Returns:
str: Processed text using CPU resources
"""
# Move models to CPU temporarily
for model_name in self.models:
self.models[model_name] = self.models[model_name].cpu()
# Process with reduced complexity
result = self._process_text_pipeline(text, language, mode)
# Move models back to GPU if available
if torch.cuda.is_available():
for model_name in self.models:
self.models[model_name] = self.models[model_name].to(self.device)
return result
The error handling system includes specific provisions for common issues such as GPU memory limitations, model loading failures, and processing timeouts. The CPU fallback mechanism ensures that the chatbot remains functional even when GPU resources are unavailable or insufficient.
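The processing-timeout provision mentioned above is not implemented in the code shown. One way to add it, sketched here with concurrent.futures under the assumption that abandoning (rather than cancelling) a stuck worker thread is acceptable, is a thin wrapper around the pipeline call:
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(func, *args, timeout_seconds: int = 60, **kwargs):
    """Sketch: guard a long-running processing call with a wall-clock timeout.
    If the timeout fires, the worker thread is abandoned, not cancelled."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return "Processing timed out; please try again with a shorter text."
    finally:
        executor.shutdown(wait=False)

# Hypothetical usage with the pipeline method defined above:
# result = run_with_timeout(chatbot._process_text_pipeline, text, 'en', 'improve', timeout_seconds=120)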
PERFORMANCE MONITORING AND OPTIMIZATION
Monitoring system performance and implementing optimization strategies ensures that the chatbot operates efficiently and provides responsive user experience across different hardware configurations.
def optimize_model_performance(self):
"""
Apply various optimization techniques to improve model performance
including quantization, caching, and memory management.
"""
# Enable model optimization features
        if torch.cuda.is_available():
            for model_name in self.models:
                # Use half precision for memory efficiency on GPU; fp16 on CPU is slow or unsupported
                self.models[model_name] = self.models[model_name].half()
# Implement response caching for repeated requests
self.response_cache = {}
self.cache_size_limit = 100
def get_system_status(self) -> Dict:
"""
Return current system status including model states,
memory usage, and performance metrics.
Returns:
Dict: System status information
"""
status = {
'models_loaded': len(self.models),
'device': str(self.device),
'cache_size': len(getattr(self, 'response_cache', {}))
}
if torch.cuda.is_available():
status['gpu_memory_allocated'] = torch.cuda.memory_allocated()
status['gpu_memory_cached'] = torch.cuda.memory_reserved()
return status
Performance optimization includes memory management strategies, model quantization for reduced memory usage, and response caching to avoid redundant processing of similar requests. The system status monitoring provides insights into resource utilization and helps identify potential performance bottlenecks.
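A short usage sketch of these monitoring hooks follows; the chatbot instance is assumed to have been created as in the earlier sections, and the printed numbers are illustrative only:
bot = TextBeautificationChatbot()
bot.optimize_model_performance()   # enable half precision and set up the response cache

status = bot.get_system_status()
print(status)
# e.g. {'models_loaded': 2, 'device': 'cuda', 'cache_size': 0,
#       'gpu_memory_allocated': 1498372096, 'gpu_memory_cached': 1610612736}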
ADDENDUM: COMPLETE RUNNING EXAMPLE
#!/usr/bin/env python3
"""
Complete Text Beautification and Improvement LLM Chatbot
Using HuggingFace Libraries
This is a fully functional implementation that demonstrates all concepts
discussed in the article. The chatbot can improve text quality, detect
languages, handle file uploads, and extend content.
Requirements:
pip install transformers torch langdetect python-docx PyPDF2
"""
from transformers import (
AutoTokenizer, AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification, pipeline
)
from langdetect import detect
import torch
import re
import os
import hashlib
from typing import Dict, List, Tuple, Optional
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class TextBeautificationChatbot:
def __init__(self):
"""
Initialize the complete text beautification chatbot with all
necessary models, pipelines, and optimization features.
"""
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Initializing chatbot on device: {self.device}")
self.models = {}
self.tokenizers = {}
self.pipelines = {}
self.response_cache = {}
self.cache_size_limit = 100
# Initialize all models and components
self._initialize_models()
self._setup_optimization()
logger.info("Chatbot initialization complete")
def _initialize_models(self):
"""
Load and initialize all required models for comprehensive
text processing including grammar, style, and content generation.
"""
try:
# Grammar and spelling correction model
logger.info("Loading grammar correction model...")
self.models['grammar'] = AutoModelForSeq2SeqLM.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
).to(self.device)
self.tokenizers['grammar'] = AutoTokenizer.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
)
# Style improvement and paraphrasing model
logger.info("Loading style improvement model...")
self.models['style'] = AutoModelForSeq2SeqLM.from_pretrained(
"tuner007/pegasus_paraphrase"
).to(self.device)
self.tokenizers['style'] = AutoTokenizer.from_pretrained(
"tuner007/pegasus_paraphrase"
)
# Content generation and extension pipeline
logger.info("Loading content generation pipeline...")
self.pipelines['generator'] = pipeline(
"text-generation",
model="gpt2",
                device=0 if torch.cuda.is_available() else -1
            )  # GPT-2 has no dedicated pad token; pad_token_id is supplied explicitly at generation time
logger.info("All models loaded successfully")
except Exception as e:
logger.error(f"Error loading models: {str(e)}")
raise
def _setup_optimization(self):
"""
Configure optimization settings for improved performance
including memory management and caching strategies.
"""
# Enable half precision for memory efficiency on compatible hardware
if torch.cuda.is_available():
for model_name in self.models:
try:
self.models[model_name] = self.models[model_name].half()
                except Exception:
logger.warning(f"Could not convert {model_name} to half precision")
# Configure tokenizer padding
for tokenizer_name in self.tokenizers:
if self.tokenizers[tokenizer_name].pad_token is None:
self.tokenizers[tokenizer_name].pad_token = self.tokenizers[tokenizer_name].eos_token
def detect_language(self, text: str) -> str:
"""
Detect the language of input text using statistical analysis
with comprehensive error handling and fallback mechanisms.
Args:
text: Input text for language detection
Returns:
str: ISO language code (e.g., 'en', 'es', 'fr')
"""
try:
# Clean text for better detection accuracy
clean_text = re.sub(r'[^\w\s]', ' ', text)
clean_text = re.sub(r'\d+', '', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text).strip()
# Require minimum text length for reliable detection
if len(clean_text.split()) < 3:
return 'en' # Default to English for very short texts
detected_lang = detect(clean_text)
logger.info(f"Detected language: {detected_lang}")
return detected_lang
except Exception as e:
logger.warning(f"Language detection failed: {str(e)}, defaulting to English")
return 'en' # Fallback to English if detection fails
def preprocess_text(self, text: str) -> str:
"""
Comprehensive text preprocessing including whitespace normalization,
punctuation correction, and basic formatting improvements.
Args:
text: Raw input text requiring preprocessing
Returns:
str: Cleaned and normalized text ready for processing
"""
# Remove excessive whitespace and normalize spacing
text = re.sub(r'\s+', ' ', text)
# Fix punctuation spacing issues
text = re.sub(r'\s+([,.!?;:])', r'\1', text)
text = re.sub(r'([.!?])\s*([A-Z])', r'\1 \2', text)
# Ensure proper sentence spacing
text = re.sub(r'([.!?])\s+', r'\1 ', text)
# Fix quotation mark spacing
text = re.sub(r'\s+"([^"]*)"', r' "\1"', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
def improve_grammar(self, text: str, language: str = 'en') -> str:
"""
Correct grammatical errors, spelling mistakes, and punctuation
issues using transformer models with comprehensive error handling.
Args:
text: Input text requiring grammar correction
language: Language code for language-specific processing
Returns:
str: Grammar-corrected text
"""
try:
# Check cache first
cache_key = self._generate_cache_key(f"grammar_{text}_{language}")
if cache_key in self.response_cache:
return self.response_cache[cache_key]
# Prepare input for the grammar model
input_text = f"grammar: {text}"
# Tokenize with proper handling of long texts
inputs = self.tokenizers['grammar'].encode(
input_text,
return_tensors="pt",
max_length=512,
truncation=True,
padding=True
).to(self.device)
# Generate corrected text with optimized parameters
with torch.no_grad():
outputs = self.models['grammar'].generate(
inputs,
max_length=512,
num_beams=4,
temperature=0.7,
do_sample=True,
early_stopping=True,
pad_token_id=self.tokenizers['grammar'].pad_token_id
)
# Decode and clean the corrected text
corrected_text = self.tokenizers['grammar'].decode(
outputs[0],
skip_special_tokens=True
)
# Remove the input prefix if present
if corrected_text.startswith("grammar:"):
corrected_text = corrected_text[8:].strip()
# Cache the result
self._cache_response(cache_key, corrected_text)
return corrected_text
except Exception as e:
logger.error(f"Grammar correction failed: {str(e)}")
return text # Return original text if correction fails
def enhance_style(self, text: str) -> str:
"""
Improve text style, readability, and flow while preserving
meaning through sentence-level processing and enhancement.
Args:
text: Grammar-corrected text requiring style improvement
Returns:
str: Style-enhanced text with improved readability
"""
try:
# Check cache first
cache_key = self._generate_cache_key(f"style_{text}")
if cache_key in self.response_cache:
return self.response_cache[cache_key]
# Split text into sentences for individual processing
sentences = self._split_into_sentences(text)
enhanced_sentences = []
for sentence in sentences:
if len(sentence.strip()) < 10:
enhanced_sentences.append(sentence)
continue
try:
# Prepare input for style model
input_text = f"paraphrase: {sentence}"
# Tokenize with appropriate length limits
inputs = self.tokenizers['style'].encode(
input_text,
return_tensors="pt",
max_length=256,
truncation=True,
padding=True
).to(self.device)
# Generate enhanced version
with torch.no_grad():
outputs = self.models['style'].generate(
inputs,
max_length=256,
num_beams=3,
temperature=0.8,
do_sample=True,
early_stopping=True,
pad_token_id=self.tokenizers['style'].pad_token_id
)
enhanced_sentence = self.tokenizers['style'].decode(
outputs[0],
skip_special_tokens=True
)
# Clean the enhanced sentence
if enhanced_sentence.startswith("paraphrase:"):
enhanced_sentence = enhanced_sentence[11:].strip()
enhanced_sentences.append(enhanced_sentence)
except Exception as e:
logger.warning(f"Style enhancement failed for sentence: {str(e)}")
enhanced_sentences.append(sentence) # Keep original if enhancement fails
result = ' '.join(enhanced_sentences)
# Cache the result
self._cache_response(cache_key, result)
return result
except Exception as e:
logger.error(f"Style enhancement failed: {str(e)}")
return text # Return original text if enhancement fails
def extend_content(self, text: str, extension_length: int = 100) -> str:
"""
Generate additional relevant content that extends the input text
while maintaining consistency in style, tone, and subject matter.
Args:
text: Original text to extend
extension_length: Desired length of extension in tokens
Returns:
str: Extended text with additional coherent content
"""
try:
# Check cache first
cache_key = self._generate_cache_key(f"extend_{text}_{extension_length}")
if cache_key in self.response_cache:
return self.response_cache[cache_key]
# Analyze text to create appropriate extension prompt
last_sentences = self._get_last_sentences(text, 2)
extension_prompt = f"{last_sentences} Furthermore,"
# Generate extension using the content generation pipeline
generated_extension = self.pipelines['generator'](
extension_prompt,
                max_new_tokens=extension_length,  # cap newly generated tokens directly instead of mixing word and token counts
num_return_sequences=1,
temperature=0.8,
do_sample=True,
pad_token_id=50256
)
# Extract and clean the generated text
extension_text = generated_extension[0]['generated_text']
extension_text = extension_text.replace(extension_prompt, '').strip()
# Clean up the extension
extension_text = self._clean_generated_text(extension_text)
# Combine original text with extension
if extension_text:
result = f"{text}\n\n{extension_text}"
else:
result = text
# Cache the result
self._cache_response(cache_key, result)
return result
except Exception as e:
logger.error(f"Content extension failed: {str(e)}")
return text # Return original text if extension fails
def process_file_input(self, file_path: str) -> str:
"""
Process various file formats to extract text content
with comprehensive error handling and format support.
Args:
file_path: Path to the uploaded file
Returns:
str: Extracted text content from the file
"""
try:
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
file_extension = os.path.splitext(file_path)[1].lower()
if file_extension in ['.txt', '.md']:
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
else:
raise ValueError(f"Unsupported file format: {file_extension}")
return content.strip()
except Exception as e:
logger.error(f"File processing error: {str(e)}")
raise Exception(f"Error processing file: {str(e)}")
def chat_interface(self, user_input: str, file_path: Optional[str] = None,
operation_mode: str = 'improve') -> str:
"""
Main interface method handling user interactions with comprehensive
processing pipeline and intelligent operation routing.
Args:
user_input: Direct text input or instructions from user
file_path: Optional path to uploaded file
operation_mode: Processing mode ('improve', 'extend', 'both')
Returns:
str: Processed and improved text response
"""
try:
# Determine input source and extract text
if file_path:
text_to_process = self.process_file_input(file_path)
logger.info(f"Processing file: {file_path}")
else:
text_to_process = user_input
logger.info("Processing direct text input")
# Validate input
if not text_to_process or len(text_to_process.strip()) < 5:
return "Please provide text that is at least 5 characters long for processing."
# Detect language
detected_language = self.detect_language(text_to_process)
# Process text through the complete pipeline
processed_text = self._process_text_pipeline(
text_to_process,
detected_language,
operation_mode
)
return processed_text
except Exception as e:
logger.error(f"Chat interface error: {str(e)}")
return f"Error processing your request: {str(e)}"
def _process_text_pipeline(self, text: str, language: str, mode: str) -> str:
"""
Execute the complete text processing pipeline with comprehensive
error handling and optimization for different operation modes.
Args:
text: Input text to process
language: Detected language code
mode: Operation mode ('improve', 'extend', 'both')
Returns:
str: Fully processed text
"""
try:
# Preprocess text
preprocessed_text = self.preprocess_text(text)
if mode in ['improve', 'both']:
# Apply grammar correction
grammar_corrected = self.improve_grammar(preprocessed_text, language)
# Apply style enhancement
style_enhanced = self.enhance_style(grammar_corrected)
result_text = style_enhanced
else:
result_text = preprocessed_text
if mode in ['extend', 'both']:
# Extend content if requested
extended_text = self.extend_content(result_text)
result_text = extended_text
return result_text
except torch.cuda.OutOfMemoryError:
# Handle GPU memory issues with CPU fallback
logger.warning("GPU memory exhausted, falling back to CPU processing")
torch.cuda.empty_cache()
return self._process_with_cpu_fallback(text, language, mode)
except Exception as e:
logger.error(f"Pipeline processing error: {str(e)}")
return f"Processing error: {str(e)}"
def _process_with_cpu_fallback(self, text: str, language: str, mode: str) -> str:
"""
Fallback processing method using CPU resources when GPU
memory is insufficient or unavailable.
Args:
text: Input text to process
language: Language code
mode: Operation mode
Returns:
str: Processed text using CPU resources
"""
# Temporarily move models to CPU
original_device = self.device
self.device = torch.device("cpu")
for model_name in self.models:
self.models[model_name] = self.models[model_name].cpu()
try:
# Process with CPU
result = self._process_text_pipeline(text, language, mode)
finally:
# Restore original device configuration
self.device = original_device
if torch.cuda.is_available():
for model_name in self.models:
self.models[model_name] = self.models[model_name].to(self.device)
return result
def _split_into_sentences(self, text: str) -> List[str]:
"""
Split text into individual sentences with sophisticated
boundary detection and edge case handling.
Args:
text: Input text to split into sentences
Returns:
List[str]: List of individual sentences
"""
# Enhanced sentence splitting pattern
sentence_pattern = r'(?<=[.!?])\s+(?=[A-Z])'
sentences = re.split(sentence_pattern, text)
# Filter out empty sentences and clean whitespace
sentences = [s.strip() for s in sentences if s.strip()]
return sentences
def _get_last_sentences(self, text: str, count: int = 2) -> str:
"""
Extract the last few sentences from text for context
in content extension operations.
Args:
text: Input text
count: Number of sentences to extract
Returns:
str: Last sentences joined together
"""
sentences = self._split_into_sentences(text)
if len(sentences) <= count:
return text
return ' '.join(sentences[-count:])
def _clean_generated_text(self, text: str) -> str:
"""
Clean and normalize generated text by removing artifacts
and ensuring proper formatting.
Args:
text: Generated text requiring cleaning
Returns:
str: Cleaned and formatted text
"""
# Remove common generation artifacts
text = re.sub(r'\[.*?\]', '', text) # Remove bracketed content
text = re.sub(r'<.*?>', '', text) # Remove angle-bracketed content
# Fix spacing and punctuation
text = re.sub(r'\s+', ' ', text)
text = text.strip()
# Ensure text ends with proper punctuation
if text and not text[-1] in '.!?':
text += '.'
return text
def _generate_cache_key(self, text: str) -> str:
"""
Generate a unique cache key for response caching
using content hashing for efficient lookup.
Args:
text: Input text for cache key generation
Returns:
str: Unique cache key
"""
return hashlib.md5(text.encode()).hexdigest()
def _cache_response(self, key: str, response: str):
"""
Cache response with size limit management to prevent
excessive memory usage.
Args:
key: Cache key
response: Response to cache
"""
if len(self.response_cache) >= self.cache_size_limit:
# Remove oldest entry
oldest_key = next(iter(self.response_cache))
del self.response_cache[oldest_key]
self.response_cache[key] = response
def get_system_status(self) -> Dict:
"""
Return comprehensive system status including model states,
memory usage, and performance metrics.
Returns:
Dict: Complete system status information
"""
status = {
'models_loaded': len(self.models),
'pipelines_loaded': len(self.pipelines),
'device': str(self.device),
'cache_size': len(self.response_cache),
'cache_limit': self.cache_size_limit
}
if torch.cuda.is_available():
status['gpu_available'] = True
status['gpu_memory_allocated'] = torch.cuda.memory_allocated()
status['gpu_memory_cached'] = torch.cuda.memory_reserved()
else:
status['gpu_available'] = False
return status
def clear_cache(self):
"""Clear the response cache to free memory."""
self.response_cache.clear()
logger.info("Response cache cleared")
# Example usage and demonstration
def main():
"""
Demonstrate the complete functionality of the text beautification
chatbot with various input types and processing modes.
"""
print("Initializing Text Beautification Chatbot...")
chatbot = TextBeautificationChatbot()
# Example 1: Direct text improvement
sample_text = """
this is a sample text that has some grammar mistakes and could use some improvement.
the sentences are not very well structured and the overall quality could be better.
"""
print("\n" + "="*60)
print("EXAMPLE 1: Text Improvement")
print("="*60)
print("Original text:")
print(sample_text)
improved_text = chatbot.chat_interface(sample_text, operation_mode='improve')
print("\nImproved text:")
print(improved_text)
# Example 2: Content extension
article_text = """
Artificial intelligence has revolutionized many industries in recent years.
Machine learning algorithms can now process vast amounts of data and identify
patterns that humans might miss. This capability has led to breakthroughs
in healthcare, finance, and technology sectors.
"""
print("\n" + "="*60)
print("EXAMPLE 2: Content Extension")
print("="*60)
print("Original article:")
print(article_text)
extended_text = chatbot.chat_interface(article_text, operation_mode='extend')
print("\nExtended article:")
print(extended_text)
# Example 3: Complete processing (improvement + extension)
print("\n" + "="*60)
print("EXAMPLE 3: Complete Processing")
print("="*60)
print("Original text:")
print(sample_text)
complete_processed = chatbot.chat_interface(sample_text, operation_mode='both')
print("\nCompletely processed text:")
print(complete_processed)
# Display system status
print("\n" + "="*60)
print("SYSTEM STATUS")
print("="*60)
status = chatbot.get_system_status()
for key, value in status.items():
print(f"{key}: {value}")
if __name__ == "__main__":
main()
This complete implementation demonstrates all the concepts discussed in the article and provides a fully functional text beautification chatbot. The system handles grammar correction, style improvement, content extension, file processing, and multilingual support while maintaining robust error handling and performance optimization. The example usage section shows how to interact with the chatbot for different types of text processing tasks.
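The main() demonstration covers direct text input; processing an uploaded file goes through the same chat_interface method, for example (the file path and instruction below are placeholders):
chatbot = TextBeautificationChatbot()
result = chatbot.chat_interface(
    "Please improve and extend this draft",   # free-form instruction from the user
    file_path="draft_article.txt",            # placeholder path to an uploaded .txt or .md file
    operation_mode="both"                     # correct grammar and style, then extend the content
)
print(result)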