Processing 21 million international trade shipment records in under 60 minutes using a hybrid ML pipeline for HSN code classification and multilingual product attribute extraction at under $50 per million records.
85%+ product extraction accuracy, improved from a ~40% baseline through hierarchical pattern matching
350K+ records per minute sustained processing throughput at production scale
21 million records processed in a single batch in under 60 minutes
Fully offline, zero external API dependency
Global Trade
Global
Enterprise — 21M+ Record Production Dataset
Q4 2024 - Completed
International trade shipment descriptions arrive in multiple languages, inconsistent formats, and with missing qualifiers — making automated HSN code classification and structured product extraction extremely difficult at scale. The existing system achieved only ~40% product extraction accuracy due to over-segmentation of product names, missing key qualifiers, and training data misalignment with real-world shipment descriptions. Cloud API-based solutions were prohibitively expensive at the volumes required and could not meet offline deployment requirements.
We replaced a failing ~40% accuracy system with a production-grade hybrid ML pipeline processing 21 million trade records in under 60 minutes at 350K+ records per minute. The hierarchical product extraction engine eliminates over-segmentation, the NER-based attribute extractor handles multilingual inputs, and ONNX INT8 quantization delivers all of this fully offline at under $50 per million records — with zero external API dependency and no GPU infrastructure required.
Improved product extraction accuracy from ~40% to 85%+ through hierarchical pattern matching and transformer-based classification
Achieved ~80% attribute extraction coverage across Brand, Type, Processing, Grade, Form, and Origin fields
Maintained classification performance consistently across multilingual inputs without any external API dependencies
Processed 21 million trade records in under 60 minutes at 350K+ records per minute sustained throughput
Deployed fully offline at under $50 per million records, eliminating cloud inference costs entirely
ONNX INT8 quantization delivered production inference speed on CPU without GPU infrastructure
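The headline figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above and treats the 60-minute window and the $50 rate as upper bounds, so the derived values are a throughput floor and a cost ceiling rather than measured results. The variable names are illustrative.

```python
# Figures quoted in the case study; both limits are upper bounds.
TOTAL_RECORDS = 21_000_000
WINDOW_MINUTES = 60          # "under 60 minutes"
COST_PER_MILLION_USD = 50    # "under $50 per million records"

# Minimum sustained throughput needed to finish inside the window.
records_per_minute = TOTAL_RECORDS / WINDOW_MINUTES
records_per_second = records_per_minute / 60

# Cost ceiling for the whole batch at the quoted rate.
max_batch_cost_usd = TOTAL_RECORDS / 1_000_000 * COST_PER_MILLION_USD

print(f"{records_per_minute:,.0f} records/min")
print(f"{records_per_second:,.0f} records/s")
print(f"${max_batch_cost_usd:,.0f} maximum for the full batch")
```

Finishing 21 million records in under an hour requires at least 350,000 records per minute, which matches the 350K+ throughput figure, and the full batch costs at most $1,050 at the quoted rate.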
Single product descriptions like 'INDIAN GREEN COFFEE' were being incorrectly split into multiple separate entities, causing systematic extraction failures that drove baseline accuracy to ~40%.
Implemented hierarchical pattern matching that identifies and preserves compound product names as single entities, prioritizing longer, more specific patterns before attempting shorter generic matches.
Product extraction accuracy improved from ~40% to 85%+
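The longest-pattern-first idea can be illustrated with a minimal sketch. The pattern list, function names, and the regex-based approach are assumptions made for illustration. The case study does not describe how the production engine stores or matches its patterns.

```python
import re

# Hypothetical pattern inventory: compound product names alongside the
# shorter generic terms they contain.
PRODUCT_PATTERNS = [
    "COFFEE",
    "GREEN COFFEE",
    "INDIAN GREEN COFFEE",
    "RICE",
    "RICE BRAN OIL",
]

def build_matcher(patterns):
    """Compile a single alternation with longer patterns listed first.

    Python's regex alternation tries alternatives left to right, so sorting
    by length makes the most specific product name win over a generic one.
    """
    ordered = sorted(patterns, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

def extract_products(description, matcher):
    """Return non-overlapping product mentions, uppercased."""
    return [m.group(0).upper() for m in matcher.finditer(description)]

MATCHER = build_matcher(PRODUCT_PATTERNS)

print(extract_products("20 BAGS INDIAN GREEN COFFEE GRADE A", MATCHER))
print(extract_products("RICE BRAN OIL IN DRUMS", MATCHER))
```

Because matches are non-overlapping, 'INDIAN GREEN COFFEE' is returned as one entity instead of being split, and 'RICE BRAN OIL' is not truncated to the generic 'RICE'.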
Shipment descriptions contained mixed languages, shipping terminology, container details, unit inconsistencies, and arbitrary formatting — breaking any language-specific model trained on clean data.
Built a comprehensive preprocessing pipeline with automatic language detection, noise removal for shipping terms and container markers, unit normalization, and multilingual support feeding into cross-lingual extraction models.
Consistent extraction performance across multilingual datasets with no manual preprocessing required
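A rough sketch of the noise-removal and unit-normalization steps might look like the following. The shipping terms, unit aliases, and function name are illustrative assumptions, not the production lists. The container pattern assumes ISO 6346-style identifiers (four letters followed by seven digits). Language detection is left out because it would require a third-party library that the case study does not name.

```python
import re
import unicodedata

# Illustrative noise terms; a real list would be much longer.
SHIPPING_NOISE = re.compile(
    r"\b(?:FCL|LCL|STC|SAID TO CONTAIN|CONTAINER NO)\b\.?:?",
    re.IGNORECASE,
)

# Assumes ISO 6346-style container identifiers, e.g. MSKU1234567.
CONTAINER_ID = re.compile(r"\b[A-Z]{3}[UJZ]\d{7}\b")

# Hypothetical unit aliases mapped to one canonical spelling.
UNIT_ALIASES = {
    "KGS": "KG",
    "KILOS": "KG",
    "KILOGRAMS": "KG",
    "PCS": "PC",
    "PIECES": "PC",
}
UNIT_PATTERN = re.compile(r"\b(" + "|".join(UNIT_ALIASES) + r")\b")

def preprocess(description):
    """Normalize Unicode, strip shipping noise, and canonicalize units."""
    text = unicodedata.normalize("NFKC", description).upper()
    text = CONTAINER_ID.sub(" ", text)
    text = SHIPPING_NOISE.sub(" ", text)
    text = UNIT_PATTERN.sub(lambda m: UNIT_ALIASES[m.group(1)], text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("FCL STC: 500 kgs Indian Green Coffee MSKU1234567"))
```

The cleaned string can then be passed to the extraction models without manual preprocessing.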
Processing 21 million records through cloud API inference would be cost-prohibitive and violate secure offline deployment requirements for the enterprise environment.
Applied ONNX INT8 quantization to compress the models for CPU inference, and used joblib multiprocessing to parallelize batch processing, achieving production throughput without any cloud dependency.
21 million records processed in under 60 minutes at under $50 per million records, fully offline
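The batch-parallel pattern described above can be sketched as follows. The batch size, worker count, and function names are assumptions. `classify_batch` is a stand-in for inference with the quantized model, which is not shown here. joblib is a third-party package and is imported inside `run_parallel` so that the chunking helper can be used without it.

```python
from itertools import islice

def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch

def classify_batch(batch):
    """Placeholder for quantized-model inference on one batch.

    It returns a token count per record so the sketch runs without a model.
    """
    return [len(text.split()) for text in batch]

def run_parallel(records, batch_size=10_000, n_jobs=-1):
    """Process batches across worker processes and flatten the results."""
    # Third-party dependency: pip install joblib
    from joblib import Parallel, delayed

    results = Parallel(n_jobs=n_jobs)(
        delayed(classify_batch)(batch) for batch in chunked(records, batch_size)
    )
    return [item for batch in results for item in batch]

if __name__ == "__main__":
    sample = ["INDIAN GREEN COFFEE", "BASMATI RICE 25 KG"] * 5
    print(run_parallel(sample, batch_size=4, n_jobs=2))
```

Sending whole batches to each worker, rather than single records, keeps inter-process overhead small relative to inference time. The best batch size depends on the model and hardware and would need to be measured.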
Model outputs did not match the format and patterns in training data, creating systematic misclassification that could not be resolved by retraining without fixing the underlying alignment issue.
Audited and realigned all extraction patterns against actual training data examples, iteratively validating outputs against expected formats before finalising the production pipeline.
Systematic misclassification eliminated with consistent 85%+ accuracy across production volumes
Contact our ML engineering team to discover how hybrid AI pipelines can transform your international trade data operations.