INTRODUCTION
Text beautification and improvement represents a sophisticated application of natural language processing that goes beyond simple grammar correction. A well-designed chatbot for this purpose can enhance writing quality, improve clarity, adjust tone, expand content meaningfully, and adapt to different languages automatically. This comprehensive guide explores the construction of such a system using HuggingFace's powerful ecosystem of transformer models and supporting libraries.
The chatbot we will build serves multiple functions. It can receive text input directly through conversational prompts or process text files uploaded by users. The system automatically detects the language of the input text and applies appropriate improvements while maintaining the original language and cultural context. Additionally, the chatbot can extend articles or documents by generating relevant, coherent additional content that matches the style and subject matter of the original text.
ARCHITECTURAL OVERVIEW AND CORE COMPONENTS
The foundation of our text beautification chatbot rests on several interconnected components that work together to process, analyze, and improve text input. The architecture follows a modular design pattern that separates concerns and allows for easy maintenance and extension.
The primary components include a language detection module, a text preprocessing pipeline, multiple specialized transformer models for different improvement tasks, a content extension engine, and a user interface layer that handles both direct text input and file uploads. Each component communicates through well-defined interfaces, ensuring loose coupling and high cohesion.
from transformers import (
AutoTokenizer, AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification, pipeline
)
from langdetect import detect
import torch
import re
import os
from typing import Dict, List, Tuple, Optional
class TextBeautificationChatbot:
def __init__(self):
"""
Initialize the chatbot with necessary models and pipelines.
Sets up language detection, grammar correction, style improvement,
and content extension capabilities.
"""
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.models = {}
self.tokenizers = {}
self.pipelines = {}
self._initialize_models()
The initialization method establishes the computational environment and prepares the various models that will be used throughout the text processing pipeline. The system automatically detects whether GPU acceleration is available and configures the models accordingly for optimal performance.
HUGGINGFACE LIBRARIES AND MODEL SELECTION
HuggingFace provides an extensive collection of pre-trained models that excel at different aspects of text processing. For our beautification chatbot, we carefully select models that complement each other and cover the full spectrum of text improvement tasks.
The grammar correction component utilizes models specifically fine-tuned for error detection and correction. These models have been trained on large datasets of text pairs containing errors and their corrections, enabling them to identify and fix grammatical mistakes, spelling errors, and punctuation issues.
def _initialize_models(self):
"""
Load and initialize all required models for text processing.
This includes grammar correction, style improvement, and content generation models.
"""
# Grammar and spelling correction model
self.models['grammar'] = AutoModelForSeq2SeqLM.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
).to(self.device)
self.tokenizers['grammar'] = AutoTokenizer.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
)
# Style improvement and paraphrasing model
self.models['style'] = AutoModelForSeq2SeqLM.from_pretrained(
"tuner007/pegasus_paraphrase"
).to(self.device)
self.tokenizers['style'] = AutoTokenizer.from_pretrained(
"tuner007/pegasus_paraphrase"
)
# Content generation and extension model
self.pipelines['generator'] = pipeline(
"text-generation",
model="microsoft/DialoGPT-large",
device=0 if torch.cuda.is_available() else -1
)
The style improvement model focuses on enhancing the overall quality and readability of text. It can rephrase sentences to improve flow, eliminate redundancy, and enhance clarity while preserving the original meaning. The content generation model enables the chatbot to extend articles or documents by producing relevant additional content that maintains consistency with the existing text.
LANGUAGE DETECTION AND MULTILINGUAL SUPPORT
Supporting multiple languages requires sophisticated detection mechanisms and language-specific processing pipelines. The chatbot must accurately identify the language of input text and apply appropriate models and techniques for that specific language.
def detect_language(self, text: str) -> str:
"""
Detect the language of the input text using statistical analysis.
Returns ISO language code for the detected language.
Args:
text: Input text for language detection
Returns:
str: ISO language code (e.g., 'en', 'es', 'fr')
"""
try:
# Remove special characters and numbers for better detection
clean_text = re.sub(r'[^\w\s]', ' ', text)
clean_text = re.sub(r'\d+', '', clean_text)
if len(clean_text.strip()) < 10:
return 'en' # Default to English for very short texts
detected_lang = detect(clean_text)
return detected_lang
        except Exception:
return 'en' # Fallback to English if detection fails
def get_language_specific_models(self, language_code: str) -> Dict:
"""
Return appropriate models based on detected language.
Some models work better for specific languages.
Args:
language_code: ISO language code
Returns:
Dict: Dictionary containing language-specific model configurations
"""
language_models = {
'en': {
'grammar_model': 'pszemraj/flan-t5-large-grammar-synthesis',
'style_model': 'tuner007/pegasus_paraphrase'
},
'es': {
'grammar_model': 'pszemraj/flan-t5-large-grammar-synthesis',
'style_model': 'tuner007/pegasus_paraphrase'
},
'fr': {
'grammar_model': 'pszemraj/flan-t5-large-grammar-synthesis',
'style_model': 'tuner007/pegasus_paraphrase'
}
}
return language_models.get(language_code, language_models['en'])
The language detection system employs statistical analysis to identify the language of input text. It preprocesses the text by removing special characters and numbers that might interfere with accurate detection. For very short text snippets where detection might be unreliable, the system defaults to English while maintaining the capability to handle longer texts in multiple languages.
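One practical caveat: langdetect is stochastic by default, so results on short or ambiguous input can vary between runs, and seeding its DetectorFactory makes detection deterministic. A minimal standalone check, independent of the chatbot class and using purely illustrative sample sentences:
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is stochastic by default; seeding makes results reproducible

print(detect("This is a short English paragraph used purely for demonstration."))                    # 'en'
print(detect("Ceci est un court paragraphe en français utilisé uniquement pour la démonstration."))  # 'fr'
print(detect("Dies ist ein kurzer deutscher Absatz, der nur zur Demonstration dient."))              # 'de'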
TEXT PREPROCESSING AND IMPROVEMENT PIPELINE
The text improvement pipeline represents the core functionality of our chatbot. It processes input text through multiple stages, each designed to address specific aspects of text quality and readability.
def preprocess_text(self, text: str) -> str:
"""
Clean and prepare text for processing by removing unnecessary
whitespace, fixing basic formatting issues, and normalizing structure.
Args:
text: Raw input text
Returns:
str: Preprocessed text ready for improvement
"""
# Remove excessive whitespace
text = re.sub(r'\s+', ' ', text)
# Fix common punctuation spacing issues
text = re.sub(r'\s+([,.!?;:])', r'\1', text)
text = re.sub(r'([.!?])\s*([A-Z])', r'\1 \2', text)
# Ensure proper sentence spacing
text = re.sub(r'([.!?])\s+', r'\1 ', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
def improve_grammar(self, text: str, language: str = 'en') -> str:
"""
Correct grammatical errors, spelling mistakes, and punctuation issues
in the input text using transformer models.
Args:
text: Input text to be corrected
language: Language code for language-specific processing
Returns:
str: Grammar-corrected text
"""
# Prepare input for the grammar model
input_text = f"grammar: {text}"
# Tokenize input
inputs = self.tokenizers['grammar'].encode(
input_text,
return_tensors="pt",
max_length=512,
truncation=True
).to(self.device)
# Generate corrected text
with torch.no_grad():
outputs = self.models['grammar'].generate(
inputs,
max_length=512,
num_beams=4,
temperature=0.7,
do_sample=True,
early_stopping=True
)
# Decode and return corrected text
corrected_text = self.tokenizers['grammar'].decode(
outputs[0],
skip_special_tokens=True
)
return corrected_text.strip()
The preprocessing stage handles basic text cleaning and formatting normalization. It addresses common issues such as excessive whitespace, improper punctuation spacing, and inconsistent sentence structure. This preparation ensures that subsequent processing stages receive well-formatted input, leading to better results.
The grammar improvement function leverages transformer models specifically trained for error correction. It uses beam search decoding to generate multiple candidate corrections and selects the most appropriate one. The temperature parameter controls the creativity of the generation process, balancing between conservative corrections and more substantial improvements.
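To make the preprocessing stage concrete, here is the same set of regular expressions applied standalone to a deliberately messy sentence (the sample string is purely illustrative):
import re

messy = "This  is   spaced oddly , with a stray space before the comma .It also runs sentences together."
cleaned = re.sub(r'\s+', ' ', messy)                       # collapse runs of whitespace
cleaned = re.sub(r'\s+([,.!?;:])', r'\1', cleaned)         # remove space before punctuation
cleaned = re.sub(r'([.!?])\s*([A-Z])', r'\1 \2', cleaned)  # exactly one space before a new sentence
print(cleaned.strip())
# This is spaced oddly, with a stray space before the comma. It also runs sentences together.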
STYLE ENHANCEMENT AND READABILITY IMPROVEMENT
Style enhancement goes beyond basic grammar correction to improve the overall quality, flow, and readability of text. This component analyzes sentence structure, word choice, and overall coherence to produce more polished and engaging content.
def enhance_style(self, text: str) -> str:
"""
Improve text style, readability, and flow while preserving meaning.
This includes sentence restructuring, word choice optimization,
and coherence enhancement.
Args:
text: Grammar-corrected text
Returns:
str: Style-enhanced text
"""
# Split text into sentences for individual processing
sentences = self._split_into_sentences(text)
enhanced_sentences = []
for sentence in sentences:
if len(sentence.strip()) < 10:
enhanced_sentences.append(sentence)
continue
# Prepare input for style model
input_text = f"paraphrase: {sentence}"
# Tokenize
inputs = self.tokenizers['style'].encode(
input_text,
return_tensors="pt",
max_length=256,
truncation=True
).to(self.device)
# Generate enhanced version
with torch.no_grad():
outputs = self.models['style'].generate(
inputs,
max_length=256,
num_beams=3,
temperature=0.8,
do_sample=True,
early_stopping=True
)
enhanced_sentence = self.tokenizers['style'].decode(
outputs[0],
skip_special_tokens=True
)
enhanced_sentences.append(enhanced_sentence.strip())
return ' '.join(enhanced_sentences)
def _split_into_sentences(self, text: str) -> List[str]:
"""
Split text into individual sentences for processing.
Handles various sentence ending patterns and edge cases.
Args:
text: Input text to split
Returns:
List[str]: List of individual sentences
"""
# Use regex to split on sentence boundaries
sentence_pattern = r'(?<=[.!?])\s+'
sentences = re.split(sentence_pattern, text)
# Filter out empty sentences
sentences = [s.strip() for s in sentences if s.strip()]
return sentences
The style enhancement process works at the sentence level to ensure that improvements maintain local coherence while contributing to overall text quality. Each sentence is processed individually through the paraphrasing model, which has been trained to generate alternative phrasings that improve clarity and readability.
The sentence splitting function uses a regular expression to identify sentence boundaries. It covers the common punctuation patterns, although abbreviations such as "Dr." or "e.g." still produce false splits, a known limitation of regex-based splitting that the snippet below illustrates.
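A quick standalone check of the splitting pattern makes both the behaviour and the limitation visible (the sample sentence is illustrative):
import re

sample = "Dr. Smith reviewed the draft. He suggested two changes! Should we apply them?"
print(re.split(r'(?<=[.!?])\s+', sample))
# ['Dr.', 'Smith reviewed the draft.', 'He suggested two changes!', 'Should we apply them?']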
CONTENT EXTENSION AND ARTICLE EXPANSION
The content extension capability allows the chatbot to generate additional relevant content that expands on existing articles or documents. This feature analyzes the context, style, and subject matter of the original text to produce coherent extensions.
def extend_content(self, text: str, extension_length: int = 200) -> str:
"""
Generate additional content that extends the input text while
maintaining consistency in style, tone, and subject matter.
Args:
text: Original text to extend
extension_length: Desired length of extension in tokens
Returns:
str: Extended text with additional relevant content
"""
# Analyze the text to understand context and style
context = self._analyze_text_context(text)
# Extract key themes and topics
key_topics = self._extract_key_topics(text)
# Generate extension prompt based on analysis
extension_prompt = self._create_extension_prompt(text, context, key_topics)
# Generate extension using the content generation model
generated_extension = self.pipelines['generator'](
extension_prompt,
            max_new_tokens=extension_length,  # cap the number of newly generated tokens, independent of prompt length
num_return_sequences=1,
temperature=0.8,
do_sample=True,
pad_token_id=50256
)
# Extract and clean the generated text
extension_text = generated_extension[0]['generated_text']
extension_text = extension_text.replace(extension_prompt, '').strip()
# Ensure the extension flows naturally from the original text
final_extension = self._refine_extension(extension_text, text)
return f"{text}\n\n{final_extension}"
def _analyze_text_context(self, text: str) -> Dict:
"""
Analyze the input text to understand its context, style, and tone.
This analysis guides the content extension process.
Args:
text: Text to analyze
Returns:
Dict: Analysis results including tone, style, and context information
"""
# Determine text type (article, essay, technical document, etc.)
text_type = self._classify_text_type(text)
# Analyze writing style and tone
style_analysis = self._analyze_writing_style(text)
# Extract structural patterns
structure_patterns = self._analyze_text_structure(text)
return {
'text_type': text_type,
'style': style_analysis,
'structure': structure_patterns,
'length': len(text.split()),
'complexity': self._assess_complexity(text)
}
The content extension system performs comprehensive analysis of the input text to understand its characteristics before generating additional content. This analysis includes text type classification, style assessment, and structural pattern recognition to ensure that generated extensions maintain consistency with the original material.
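The analysis and assembly helpers referenced above (_extract_key_topics, _create_extension_prompt, _refine_extension, and the various _analyze_* routines) are not spelled out in this article. As a rough illustration only, the three that most directly shape the generated extension could be implemented with simple heuristics such as the following sketch; the stopword list, the 60-word context window, and the prompt template are assumptions rather than part of a fixed design. They reuse the re, List, and Dict imports shown earlier and would live as methods on the TextBeautificationChatbot class.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that", "this", "for", "with"}

def _extract_key_topics(self, text: str, top_n: int = 5) -> List[str]:
    """Heuristic sketch: treat the most frequent non-stopword terms as rough topics."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text) if w.lower() not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]

def _create_extension_prompt(self, text: str, context: Dict, key_topics: List[str]) -> str:
    """Heuristic sketch: seed generation with the tail of the document plus a topical cue."""
    last_part = ' '.join(text.split()[-60:])  # roughly the last couple of sentences
    topic_hint = ', '.join(key_topics[:3])
    return f"{last_part} Continuing on {topic_hint},"

def _refine_extension(self, extension_text: str, original_text: str) -> str:
    """Heuristic sketch: keep only complete sentences so the extension does not end mid-thought."""
    sentences = self._split_into_sentences(extension_text)
    complete = [s for s in sentences if s and s[-1] in '.!?']
    return ' '.join(complete) if complete else extension_text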
FILE HANDLING AND USER INTERFACE
The chatbot must handle both direct text input and file uploads seamlessly. This requires robust file processing capabilities and a user-friendly interface that accommodates different input methods.
def process_file_input(self, file_path: str) -> str:
"""
Process text files uploaded by users, supporting multiple formats
including plain text, markdown, and basic document formats.
Args:
file_path: Path to the uploaded file
Returns:
str: Extracted text content from the file
"""
try:
file_extension = os.path.splitext(file_path)[1].lower()
if file_extension in ['.txt', '.md']:
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
elif file_extension == '.docx':
# Handle Word documents (requires python-docx)
content = self._extract_from_docx(file_path)
elif file_extension == '.pdf':
# Handle PDF files (requires PyPDF2 or similar)
content = self._extract_from_pdf(file_path)
else:
raise ValueError(f"Unsupported file format: {file_extension}")
return content.strip()
except Exception as e:
raise Exception(f"Error processing file: {str(e)}")
def chat_interface(self, user_input: str, file_path: Optional[str] = None) -> str:
"""
Main interface method that handles user interactions, processes input,
and returns improved text or extensions based on user requests.
Args:
user_input: Direct text input or instructions from user
file_path: Optional path to uploaded file
Returns:
str: Processed and improved text response
"""
try:
# Determine input source and extract text
if file_path:
text_to_process = self.process_file_input(file_path)
operation_mode = self._determine_operation_from_input(user_input)
else:
text_to_process = user_input
operation_mode = 'improve' # Default operation
# Detect language
detected_language = self.detect_language(text_to_process)
# Process text through improvement pipeline
processed_text = self._process_text_pipeline(
text_to_process,
detected_language,
operation_mode
)
return processed_text
except Exception as e:
return f"Error processing your request: {str(e)}
The file handling system supports multiple file formats and provides appropriate error handling for unsupported formats or corrupted files. The chat interface serves as the main entry point for user interactions, intelligently routing requests based on input type and user instructions.
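The _extract_from_docx and _extract_from_pdf helpers referenced above, along with _determine_operation_from_input, are left undefined in the code shown. A minimal sketch of what they might look like follows; it assumes python-docx and PyPDF2 (version 3 or later) as hinted in the comments above, and the keyword lists used for operation routing are illustrative guesses rather than a fixed specification. These would be added as methods of the TextBeautificationChatbot class.
def _extract_from_docx(self, file_path: str) -> str:
    """Sketch: pull paragraph text out of a Word document (requires python-docx)."""
    from docx import Document
    document = Document(file_path)
    return "\n".join(paragraph.text for paragraph in document.paragraphs)

def _extract_from_pdf(self, file_path: str) -> str:
    """Sketch: concatenate extractable text from each PDF page (requires PyPDF2 >= 3.0)."""
    from PyPDF2 import PdfReader
    reader = PdfReader(file_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def _determine_operation_from_input(self, user_input: str) -> str:
    """Sketch: map a free-form instruction to an operation mode with simple keyword checks."""
    lowered = user_input.lower()
    wants_extend = any(word in lowered for word in ("extend", "expand", "continue", "lengthen"))
    wants_improve = any(word in lowered for word in ("improve", "fix", "correct", "polish", "beautify"))
    if wants_extend and wants_improve:
        return 'both'
    if wants_extend:
        return 'extend'
    return 'improve'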
ERROR HANDLING AND OPTIMIZATION
Robust error handling and performance optimization ensure that the chatbot operates reliably under various conditions and provides consistent user experience even when encountering unexpected situations.
def _process_text_pipeline(self, text: str, language: str, mode: str) -> str:
"""
Execute the complete text processing pipeline with error handling
and optimization for different operation modes.
Args:
text: Input text to process
language: Detected language code
mode: Operation mode ('improve', 'extend', 'both')
Returns:
str: Fully processed text
"""
try:
# Preprocess text
preprocessed_text = self.preprocess_text(text)
if mode in ['improve', 'both']:
# Apply grammar correction
grammar_corrected = self.improve_grammar(preprocessed_text, language)
# Apply style enhancement
style_enhanced = self.enhance_style(grammar_corrected)
result_text = style_enhanced
else:
result_text = preprocessed_text
if mode in ['extend', 'both']:
# Extend content if requested
extended_text = self.extend_content(result_text)
result_text = extended_text
return result_text
except torch.cuda.OutOfMemoryError:
# Handle GPU memory issues
torch.cuda.empty_cache()
return self._process_with_cpu_fallback(text, language, mode)
except Exception as e:
return f"Processing error: {str(e)}"
def _process_with_cpu_fallback(self, text: str, language: str, mode: str) -> str:
"""
Fallback processing method using CPU when GPU memory is insufficient.
Implements reduced batch sizes and simplified processing.
Args:
text: Input text to process
language: Language code
mode: Operation mode
Returns:
str: Processed text using CPU resources
"""
# Move models to CPU temporarily
for model_name in self.models:
self.models[model_name] = self.models[model_name].cpu()
# Process with reduced complexity
result = self._process_text_pipeline(text, language, mode)
# Move models back to GPU if available
if torch.cuda.is_available():
for model_name in self.models:
self.models[model_name] = self.models[model_name].to(self.device)
return result
The error handling system includes specific provisions for common issues such as GPU memory limitations, model loading failures, and processing timeouts. The CPU fallback mechanism ensures that the chatbot remains functional even when GPU resources are unavailable or insufficient.
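The processing-timeout provision mentioned above is not implemented in the code shown. One way to add it, sketched here with concurrent.futures under the assumption that abandoning (rather than cancelling) a stuck worker thread is acceptable, is a thin wrapper around the pipeline call:
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(func, *args, timeout_seconds: int = 60, **kwargs):
    """Sketch: guard a long-running processing call with a wall-clock timeout.
    If the timeout fires, the worker thread is abandoned, not cancelled."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return "Processing timed out; please try again with a shorter text."
    finally:
        executor.shutdown(wait=False)

# Hypothetical usage with the pipeline method defined above:
# result = run_with_timeout(chatbot._process_text_pipeline, text, 'en', 'improve', timeout_seconds=120)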
PERFORMANCE MONITORING AND OPTIMIZATION
Monitoring system performance and implementing optimization strategies ensures that the chatbot operates efficiently and provides responsive user experience across different hardware configurations.
def optimize_model_performance(self):
"""
Apply various optimization techniques to improve model performance
including quantization, caching, and memory management.
"""
# Enable model optimization features
        if torch.cuda.is_available():
            for model_name in self.models:
                # Use half precision for memory efficiency on GPU; fp16 on CPU is slow or unsupported
                self.models[model_name] = self.models[model_name].half()
# Implement response caching for repeated requests
self.response_cache = {}
self.cache_size_limit = 100
def get_system_status(self) -> Dict:
"""
Return current system status including model states,
memory usage, and performance metrics.
Returns:
Dict: System status information
"""
status = {
'models_loaded': len(self.models),
'device': str(self.device),
'cache_size': len(getattr(self, 'response_cache', {}))
}
if torch.cuda.is_available():
status['gpu_memory_allocated'] = torch.cuda.memory_allocated()
status['gpu_memory_cached'] = torch.cuda.memory_reserved()
return status
Performance optimization includes memory management strategies, model quantization for reduced memory usage, and response caching to avoid redundant processing of similar requests. The system status monitoring provides insights into resource utilization and helps identify potential performance bottlenecks.
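A short usage sketch of these monitoring hooks follows; the chatbot instance is assumed to have been created as in the earlier sections, and the printed numbers are illustrative only:
bot = TextBeautificationChatbot()
bot.optimize_model_performance()   # enable half precision and set up the response cache

status = bot.get_system_status()
print(status)
# e.g. {'models_loaded': 2, 'device': 'cuda', 'cache_size': 0,
#       'gpu_memory_allocated': 1498372096, 'gpu_memory_cached': 1610612736}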
ADDENDUM: COMPLETE RUNNING EXAMPLE
#!/usr/bin/env python3
"""
Complete Text Beautification and Improvement LLM Chatbot
Using HuggingFace Libraries
This is a fully functional implementation that demonstrates all concepts
discussed in the article. The chatbot can improve text quality, detect
languages, handle file uploads, and extend content.
Requirements:
pip install transformers torch langdetect python-docx PyPDF2
"""
from transformers import (
AutoTokenizer, AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification, pipeline
)
from langdetect import detect
import torch
import re
import os
import hashlib
from typing import Dict, List, Tuple, Optional
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class TextBeautificationChatbot:
def __init__(self):
"""
Initialize the complete text beautification chatbot with all
necessary models, pipelines, and optimization features.
"""
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Initializing chatbot on device: {self.device}")
self.models = {}
self.tokenizers = {}
self.pipelines = {}
self.response_cache = {}
self.cache_size_limit = 100
# Initialize all models and components
self._initialize_models()
self._setup_optimization()
logger.info("Chatbot initialization complete")
def _initialize_models(self):
"""
Load and initialize all required models for comprehensive
text processing including grammar, style, and content generation.
"""
try:
# Grammar and spelling correction model
logger.info("Loading grammar correction model...")
self.models['grammar'] = AutoModelForSeq2SeqLM.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
).to(self.device)
self.tokenizers['grammar'] = AutoTokenizer.from_pretrained(
"pszemraj/flan-t5-large-grammar-synthesis"
)
# Style improvement and paraphrasing model
logger.info("Loading style improvement model...")
self.models['style'] = AutoModelForSeq2SeqLM.from_pretrained(
"tuner007/pegasus_paraphrase"
).to(self.device)
self.tokenizers['style'] = AutoTokenizer.from_pretrained(
"tuner007/pegasus_paraphrase"
)
# Content generation and extension pipeline
logger.info("Loading content generation pipeline...")
self.pipelines['generator'] = pipeline(
"text-generation",
model="gpt2",
                device=0 if torch.cuda.is_available() else -1
            )  # GPT-2 has no dedicated pad token; pad_token_id is supplied explicitly at generation time
logger.info("All models loaded successfully")
except Exception as e:
logger.error(f"Error loading models: {str(e)}")
raise
def _setup_optimization(self):
"""
Configure optimization settings for improved performance
including memory management and caching strategies.
"""
# Enable half precision for memory efficiency on compatible hardware
if torch.cuda.is_available():
for model_name in self.models:
try:
self.models[model_name] = self.models[model_name].half()
                except Exception:
logger.warning(f"Could not convert {model_name} to half precision")
# Configure tokenizer padding
for tokenizer_name in self.tokenizers:
if self.tokenizers[tokenizer_name].pad_token is None:
self.tokenizers[tokenizer_name].pad_token = self.tokenizers[tokenizer_name].eos_token
def detect_language(self, text: str) -> str:
"""
Detect the language of input text using statistical analysis
with comprehensive error handling and fallback mechanisms.
Args:
text: Input text for language detection
Returns:
str: ISO language code (e.g., 'en', 'es', 'fr')
"""
try:
# Clean text for better detection accuracy
clean_text = re.sub(r'[^\w\s]', ' ', text)
clean_text = re.sub(r'\d+', '', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text).strip()
# Require minimum text length for reliable detection
if len(clean_text.split()) < 3:
return 'en' # Default to English for very short texts
detected_lang = detect(clean_text)
logger.info(f"Detected language: {detected_lang}")
return detected_lang
except Exception as e:
logger.warning(f"Language detection failed: {str(e)}, defaulting to English")
return 'en' # Fallback to English if detection fails
def preprocess_text(self, text: str) -> str:
"""
Comprehensive text preprocessing including whitespace normalization,
punctuation correction, and basic formatting improvements.
Args:
text: Raw input text requiring preprocessing
Returns:
str: Cleaned and normalized text ready for processing
"""
# Remove excessive whitespace and normalize spacing
text = re.sub(r'\s+', ' ', text)
# Fix punctuation spacing issues
text = re.sub(r'\s+([,.!?;:])', r'\1', text)
text = re.sub(r'([.!?])\s*([A-Z])', r'\1 \2', text)
# Ensure proper sentence spacing
text = re.sub(r'([.!?])\s+', r'\1 ', text)
# Fix quotation mark spacing
text = re.sub(r'\s+"([^"]*)"', r' "\1"', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
def improve_grammar(self, text: str, language: str = 'en') -> str:
"""
Correct grammatical errors, spelling mistakes, and punctuation
issues using transformer models with comprehensive error handling.
Args:
text: Input text requiring grammar correction
language: Language code for language-specific processing
Returns:
str: Grammar-corrected text
"""
try:
# Check cache first
cache_key = self._generate_cache_key(f"grammar_{text}_{language}")
if cache_key in self.response_cache:
return self.response_cache[cache_key]
# Prepare input for the grammar model
input_text = f"grammar: {text}"
# Tokenize with proper handling of long texts
inputs = self.tokenizers['grammar'].encode(
input_text,
return_tensors="pt",
max_length=512,
truncation=True,
padding=True
).to(self.device)
# Generate corrected text with optimized parameters
with torch.no_grad():
outputs = self.models['grammar'].generate(
inputs,
max_length=512,
num_beams=4,
temperature=0.7,
do_sample=True,
early_stopping=True,
pad_token_id=self.tokenizers['grammar'].pad_token_id
)
# Decode and clean the corrected text
corrected_text = self.tokenizers['grammar'].decode(
outputs[0],
skip_special_tokens=True
)
# Remove the input prefix if present
if corrected_text.startswith("grammar:"):
corrected_text = corrected_text[8:].strip()
# Cache the result
self._cache_response(cache_key, corrected_text)
return corrected_text
except Exception as e:
logger.error(f"Grammar correction failed: {str(e)}")
return text # Return original text if correction fails
def enhance_style(self, text: str) -> str:
"""
Improve text style, readability, and flow while preserving
meaning through sentence-level processing and enhancement.
Args:
text: Grammar-corrected text requiring style improvement
Returns:
str: Style-enhanced text with improved readability
"""
try:
# Check cache first
cache_key = self._generate_cache_key(f"style_{text}")
if cache_key in self.response_cache:
return self.response_cache[cache_key]
# Split text into sentences for individual processing
sentences = self._split_into_sentences(text)
enhanced_sentences = []
for sentence in sentences:
if len(sentence.strip()) < 10:
enhanced_sentences.append(sentence)
continue
try:
# Prepare input for style model
input_text = f"paraphrase: {sentence}"
# Tokenize with appropriate length limits
inputs = self.tokenizers['style'].encode(
input_text,
return_tensors="pt",
max_length=256,
truncation=True,
padding=True
).to(self.device)
# Generate enhanced version
with torch.no_grad():
outputs = self.models['style'].generate(
inputs,
max_length=256,
num_beams=3,
temperature=0.8,
do_sample=True,
early_stopping=True,
pad_token_id=self.tokenizers['style'].pad_token_id
)
enhanced_sentence = self.tokenizers['style'].decode(
outputs[0],
skip_special_tokens=True
)
# Clean the enhanced sentence
if enhanced_sentence.startswith("paraphrase:"):
enhanced_sentence = enhanced_sentence[11:].strip()
enhanced_sentences.append(enhanced_sentence)
except Exception as e:
logger.warning(f"Style enhancement failed for sentence: {str(e)}")
enhanced_sentences.append(sentence) # Keep original if enhancement fails
result = ' '.join(enhanced_sentences)
# Cache the result
self._cache_response(cache_key, result)
return result
except Exception as e:
logger.error(f"Style enhancement failed: {str(e)}")
return text # Return original text if enhancement fails
def extend_content(self, text: str, extension_length: int = 100) -> str:
"""
Generate additional relevant content that extends the input text
while maintaining consistency in style, tone, and subject matter.
Args:
text: Original text to extend
extension_length: Desired length of extension in tokens
Returns:
str: Extended text with additional coherent content
"""
try:
# Check cache first
cache_key = self._generate_cache_key(f"extend_{text}_{extension_length}")
if cache_key in self.response_cache:
return self.response_cache[cache_key]
# Analyze text to create appropriate extension prompt
last_sentences = self._get_last_sentences(text, 2)
extension_prompt = f"{last_sentences} Furthermore,"
# Generate extension using the content generation pipeline
generated_extension = self.pipelines['generator'](
extension_prompt,
                max_new_tokens=extension_length,  # cap newly generated tokens directly instead of mixing word and token counts
num_return_sequences=1,
temperature=0.8,
do_sample=True,
pad_token_id=50256
)
# Extract and clean the generated text
extension_text = generated_extension[0]['generated_text']
extension_text = extension_text.replace(extension_prompt, '').strip()
# Clean up the extension
extension_text = self._clean_generated_text(extension_text)
# Combine original text with extension
if extension_text:
result = f"{text}\n\n{extension_text}"
else:
result = text
# Cache the result
self._cache_response(cache_key, result)
return result
except Exception as e:
logger.error(f"Content extension failed: {str(e)}")
return text # Return original text if extension fails
def process_file_input(self, file_path: str) -> str:
"""
Process various file formats to extract text content
with comprehensive error handling and format support.
Args:
file_path: Path to the uploaded file
Returns:
str: Extracted text content from the file
"""
try:
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
file_extension = os.path.splitext(file_path)[1].lower()
if file_extension in ['.txt', '.md']:
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
else:
raise ValueError(f"Unsupported file format: {file_extension}")
return content.strip()
except Exception as e:
logger.error(f"File processing error: {str(e)}")
raise Exception(f"Error processing file: {str(e)}")
def chat_interface(self, user_input: str, file_path: Optional[str] = None,
operation_mode: str = 'improve') -> str:
"""
Main interface method handling user interactions with comprehensive
processing pipeline and intelligent operation routing.
Args:
user_input: Direct text input or instructions from user
file_path: Optional path to uploaded file
operation_mode: Processing mode ('improve', 'extend', 'both')
Returns:
str: Processed and improved text response
"""
try:
# Determine input source and extract text
if file_path:
text_to_process = self.process_file_input(file_path)
logger.info(f"Processing file: {file_path}")
else:
text_to_process = user_input
logger.info("Processing direct text input")
# Validate input
if not text_to_process or len(text_to_process.strip()) < 5:
return "Please provide text that is at least 5 characters long for processing."
# Detect language
detected_language = self.detect_language(text_to_process)
# Process text through the complete pipeline
processed_text = self._process_text_pipeline(
text_to_process,
detected_language,
operation_mode
)
return processed_text
except Exception as e:
logger.error(f"Chat interface error: {str(e)}")
return f"Error processing your request: {str(e)}"
def _process_text_pipeline(self, text: str, language: str, mode: str) -> str:
"""
Execute the complete text processing pipeline with comprehensive
error handling and optimization for different operation modes.
Args:
text: Input text to process
language: Detected language code
mode: Operation mode ('improve', 'extend', 'both')
Returns:
str: Fully processed text
"""
try:
# Preprocess text
preprocessed_text = self.preprocess_text(text)
if mode in ['improve', 'both']:
# Apply grammar correction
grammar_corrected = self.improve_grammar(preprocessed_text, language)
# Apply style enhancement
style_enhanced = self.enhance_style(grammar_corrected)
result_text = style_enhanced
else:
result_text = preprocessed_text
if mode in ['extend', 'both']:
# Extend content if requested
extended_text = self.extend_content(result_text)
result_text = extended_text
return result_text
except torch.cuda.OutOfMemoryError:
# Handle GPU memory issues with CPU fallback
logger.warning("GPU memory exhausted, falling back to CPU processing")
torch.cuda.empty_cache()
return self._process_with_cpu_fallback(text, language, mode)
except Exception as e:
logger.error(f"Pipeline processing error: {str(e)}")
return f"Processing error: {str(e)}"
def _process_with_cpu_fallback(self, text: str, language: str, mode: str) -> str:
"""
Fallback processing method using CPU resources when GPU
memory is insufficient or unavailable.
Args:
text: Input text to process
language: Language code
mode: Operation mode
Returns:
str: Processed text using CPU resources
"""
# Temporarily move models to CPU
original_device = self.device
self.device = torch.device("cpu")
for model_name in self.models:
self.models[model_name] = self.models[model_name].cpu()
try:
# Process with CPU
result = self._process_text_pipeline(text, language, mode)
finally:
# Restore original device configuration
self.device = original_device
if torch.cuda.is_available():
for model_name in self.models:
self.models[model_name] = self.models[model_name].to(self.device)
return result
def _split_into_sentences(self, text: str) -> List[str]:
"""
Split text into individual sentences with sophisticated
boundary detection and edge case handling.
Args:
text: Input text to split into sentences
Returns:
List[str]: List of individual sentences
"""
# Enhanced sentence splitting pattern
sentence_pattern = r'(?<=[.!?])\s+(?=[A-Z])'
sentences = re.split(sentence_pattern, text)
# Filter out empty sentences and clean whitespace
sentences = [s.strip() for s in sentences if s.strip()]
return sentences
def _get_last_sentences(self, text: str, count: int = 2) -> str:
"""
Extract the last few sentences from text for context
in content extension operations.
Args:
text: Input text
count: Number of sentences to extract
Returns:
str: Last sentences joined together
"""
sentences = self._split_into_sentences(text)
if len(sentences) <= count:
return text
return ' '.join(sentences[-count:])
def _clean_generated_text(self, text: str) -> str:
"""
Clean and normalize generated text by removing artifacts
and ensuring proper formatting.
Args:
text: Generated text requiring cleaning
Returns:
str: Cleaned and formatted text
"""
# Remove common generation artifacts
text = re.sub(r'\[.*?\]', '', text) # Remove bracketed content
text = re.sub(r'<.*?>', '', text) # Remove angle-bracketed content
# Fix spacing and punctuation
text = re.sub(r'\s+', ' ', text)
text = text.strip()
# Ensure text ends with proper punctuation
if text and not text[-1] in '.!?':
text += '.'
return text
def _generate_cache_key(self, text: str) -> str:
"""
Generate a unique cache key for response caching
using content hashing for efficient lookup.
Args:
text: Input text for cache key generation
Returns:
str: Unique cache key
"""
return hashlib.md5(text.encode()).hexdigest()
def _cache_response(self, key: str, response: str):
"""
Cache response with size limit management to prevent
excessive memory usage.
Args:
key: Cache key
response: Response to cache
"""
if len(self.response_cache) >= self.cache_size_limit:
# Remove oldest entry
oldest_key = next(iter(self.response_cache))
del self.response_cache[oldest_key]
self.response_cache[key] = response
def get_system_status(self) -> Dict:
"""
Return comprehensive system status including model states,
memory usage, and performance metrics.
Returns:
Dict: Complete system status information
"""
status = {
'models_loaded': len(self.models),
'pipelines_loaded': len(self.pipelines),
'device': str(self.device),
'cache_size': len(self.response_cache),
'cache_limit': self.cache_size_limit
}
if torch.cuda.is_available():
status['gpu_available'] = True
status['gpu_memory_allocated'] = torch.cuda.memory_allocated()
status['gpu_memory_cached'] = torch.cuda.memory_reserved()
else:
status['gpu_available'] = False
return status
def clear_cache(self):
"""Clear the response cache to free memory."""
self.response_cache.clear()
logger.info("Response cache cleared")
# Example usage and demonstration
def main():
"""
Demonstrate the complete functionality of the text beautification
chatbot with various input types and processing modes.
"""
print("Initializing Text Beautification Chatbot...")
chatbot = TextBeautificationChatbot()
# Example 1: Direct text improvement
sample_text = """
this is a sample text that has some grammar mistakes and could use some improvement.
the sentences are not very well structured and the overall quality could be better.
"""
print("\n" + "="*60)
print("EXAMPLE 1: Text Improvement")
print("="*60)
print("Original text:")
print(sample_text)
improved_text = chatbot.chat_interface(sample_text, operation_mode='improve')
print("\nImproved text:")
print(improved_text)
# Example 2: Content extension
article_text = """
Artificial intelligence has revolutionized many industries in recent years.
Machine learning algorithms can now process vast amounts of data and identify
patterns that humans might miss. This capability has led to breakthroughs
in healthcare, finance, and technology sectors.
"""
print("\n" + "="*60)
print("EXAMPLE 2: Content Extension")
print("="*60)
print("Original article:")
print(article_text)
extended_text = chatbot.chat_interface(article_text, operation_mode='extend')
print("\nExtended article:")
print(extended_text)
# Example 3: Complete processing (improvement + extension)
print("\n" + "="*60)
print("EXAMPLE 3: Complete Processing")
print("="*60)
print("Original text:")
print(sample_text)
complete_processed = chatbot.chat_interface(sample_text, operation_mode='both')
print("\nCompletely processed text:")
print(complete_processed)
# Display system status
print("\n" + "="*60)
print("SYSTEM STATUS")
print("="*60)
status = chatbot.get_system_status()
for key, value in status.items():
print(f"{key}: {value}")
if __name__ == "__main__":
main()
This complete implementation demonstrates all the concepts discussed in the article and provides a fully functional text beautification chatbot. The system handles grammar correction, style improvement, content extension, file processing, and multilingual support while maintaining robust error handling and performance optimization. The example usage section shows how to interact with the chatbot for different types of text processing tasks.
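The main() demonstration covers direct text input; processing an uploaded file goes through the same chat_interface method, for example (the file path and instruction below are placeholders):
chatbot = TextBeautificationChatbot()
result = chatbot.chat_interface(
    "Please improve and extend this draft",   # free-form instruction from the user
    file_path="draft_article.txt",            # placeholder path to an uploaded .txt or .md file
    operation_mode="both"                     # correct grammar and style, then extend the content
)
print(result)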